AI4PublicPolicy Components pt.4
Our previous blog post on the main components of the AI4PublicPolicy platform described the XAI (eXplainable AI) and the Policy Explainability and Interpretation components. You can find the 3rd blog post on the platform components here.
In this fourth part of our analysis of the AI4PublicPolicy platform components, we shed light on three more components.
The objective of the Cyber-Defence component is to deliver defence strategies against AI threats, that is, attempts to sabotage AI models in ways that compromise their correct operation. In particular, its mechanisms primarily protect the AI systems against data poisoning and evasion attacks. In a data poisoning attack, an attacker injects purpose-built adversarial samples into the data collected from our data sources, so that the training data yields a skewed decision function. Ensuring the purity of the training data and improving the robustness of the learning algorithms are therefore the two main countermeasures against such adversaries at the training phase.
For the first countermeasure, the Cyber-Defence tool checks the data received from the various data sources and detects possible anomalies in the samples. Several detection strategies are available here, such as detection based on data provenance and detection based on XAI techniques. The second defence is a revisited adversarial training technique: it trains empirically robust models using the Fast Gradient Sign Method (FGSM) to generate the adversarial examples, at a significantly lower cost than training based on projected gradient descent (PGD). In general, at this point the defence tool trains the model on new inputs that carry adversarial perturbations together with their correct output labels, aiming to minimise the errors caused by adversarial data.
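To make the adversarial training idea more concrete, here is a minimal sketch of FGSM-based training for a plain logistic regression model, written with the Python standard library only. It is an illustration under our own assumptions, not the platform's implementation: the function names (`fgsm`, `train`, `accuracy`), the model type, and the default values for `eps`, `epochs` and `lr` are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    # Probability that sample x belongs to class 1.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    # One-step FGSM: shift every feature by eps along the sign of the
    # loss gradient. For logistic regression with cross-entropy loss,
    # dLoss/dx = (p - y) * w.
    err = predict(w, b, x) - y
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]


def train(data, eps=0.0, epochs=20, lr=0.1, seed=0):
    # Plain SGD. With eps > 0, each clean sample is accompanied by an
    # FGSM-perturbed copy that keeps the correct label (assumed setup).
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            batch = [(x, y)]
            if eps > 0:
                batch.append((fgsm(w, b, x, y, eps), y))
            for xs, ys in batch:
                err = predict(w, b, xs) - ys
                w = [wi - lr * err * xi for wi, xi in zip(w, xs)]
                b -= lr * err
    return w, b


def accuracy(w, b, data):
    # Share of samples whose thresholded prediction matches the label.
    hits = sum((predict(w, b, x) >= 0.5) == bool(y) for x, y in data)
    return hits / len(data)
```

Calling `train(data, eps=0.3)` on a list of `(features, label)` pairs returns the weights and bias of the adversarially trained model. FGSM needs a single gradient evaluation per sample, which is where the cost advantage over iterative PGD-based training comes from.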
The following figure shows the high-level architecture of the defensive process:
The Poisoning Detector collects data from the pilot sources, while the Evasion Training tool trains a policy model.
The Poisoning Detector exports sanitized training data, while the Evasion Training tool pairs adversarially perturbed inputs with their correct output labels to produce a robust policy model.
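As a rough illustration of what exporting sanitized training data can look like, the sketch below flags rows whose feature values are extreme outliers and keeps the rest. Note that this is a deliberately simple statistical stand-in (a modified z-score based on the median and the median absolute deviation), not the provenance-based or XAI-based detection described above. The function name `sanitize` and the threshold of 3.5 are our own assumptions.

```python
from statistics import median


def sanitize(samples, threshold=3.5):
    # Split numeric rows into (kept, flagged) using a per-feature
    # modified z-score: 0.6745 * |value - median| / MAD.
    columns = list(zip(*samples))
    medians = [median(col) for col in columns]
    mads = [median(abs(v - m) for v in col) for col, m in zip(columns, medians)]
    kept, flagged = [], []
    for row in samples:
        suspicious = False
        for value, m, mad in zip(row, medians, mads):
            if mad == 0:
                continue  # constant feature: no scale to compare against
            if 0.6745 * abs(value - m) / mad > threshold:
                suspicious = True
        (flagged if suspicious else kept).append(row)
    return kept, flagged
```

The `kept` list plays the role of the sanitized training data, while `flagged` rows could be handed to a human reviewer or to a stronger detector.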
The AutoML component selects, from a set of well-established algorithms, the optimum ones to realize the chain of AI processes. The mechanism behind this selection automatically extracts statistical meta-features from the datasets to perform an optimum mapping, while also considering all process steps and dependencies within an analysis pipeline/chain. The component consists of the following subcomponents:
In the Dataset Explorer subcomponent, a user selects the datasets that will serve as input to the AutoML tool and sets the parameters for data loading. Users can choose which columns to load or ignore and specify the target column that needs to be predicted. Additionally, the Dataset Explorer allows data transformation, such as featurization of dateTime values. Users can also choose to load data from a database or a raw CSV file.
After a dataset is selected, the user can preview it in a table, as well as in other graphical representations that help to better understand the data.
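The loading options described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, assuming CSV input with ISO 8601 timestamps. The function `load_dataset`, its parameter names, and the choice of derived date features (year, month, weekday, hour) are hypothetical and not taken from the platform.

```python
import csv
import io
from datetime import datetime


def load_dataset(csv_text, target, use_columns=None,
                 ignore_columns=(), datetime_columns=()):
    # Returns (features, targets): one dict of features per row plus the
    # raw value of the target column. Values are kept as strings, since
    # no type inference is attempted in this sketch.
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for name, value in row.items():
            if name == target or name in ignore_columns:
                continue
            if use_columns is not None and name not in use_columns:
                continue
            if name in datetime_columns:
                # Featurize the timestamp into simple numeric parts.
                ts = datetime.fromisoformat(value)
                record[name + "_year"] = ts.year
                record[name + "_month"] = ts.month
                record[name + "_weekday"] = ts.weekday()  # Monday == 0
                record[name + "_hour"] = ts.hour
            else:
                record[name] = value
        features.append(record)
        targets.append(row[target])
    return features, targets
```

The first few entries of `features` can then be rendered as the table preview, before any model training starts.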
A subcomponent which allows the user to choose the parameters of the AutoML Engine (defaults are used if nothing is selected) and to start the AutoML process. To customise the machine learning process, users can set the following parameters: the search algorithm, the general timeout, the iteration timeout, the algorithms filter, the early stopping conditions, and the number of cross-validations.
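One way to picture these parameters is as a small configuration object with defaults. The sketch below is an assumption on our side: the class name `AutoMLConfig`, the field names, and every default value are hypothetical, chosen only to mirror the list of parameters above.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class AutoMLConfig:
    # All defaults are illustrative placeholders, not platform values.
    search_algorithm: str = "random"
    general_timeout_s: int = 3600            # budget for the whole run
    iteration_timeout_s: int = 300           # budget for one candidate
    algorithms_filter: Optional[List[str]] = None   # None means no filter
    early_stopping_rounds: Optional[int] = None     # None means disabled
    cross_validations: int = 5

    def validate(self):
        # Reject combinations that cannot work before a run is started.
        if self.iteration_timeout_s > self.general_timeout_s:
            raise ValueError("iteration timeout exceeds general timeout")
        if self.cross_validations < 2:
            raise ValueError("at least 2 cross-validations are required")
        return self
```

Creating `AutoMLConfig()` without arguments corresponds to the case where the user selects nothing and the defaults apply, and `asdict(config)` gives a plain dictionary that is easy to pass on or store.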
The AutoML Engine is the subcomponent that takes the datasets and the user's parameters as input and finds the best-fitting models.
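The core loop of such an engine can be sketched as follows: every candidate algorithm is scored with k-fold cross-validation, and the candidates are ranked by their mean score. This is a toy illustration under our own assumptions. It simply tries a short list of candidates in order rather than running a real search algorithm, and the two classifiers as well as the function names `cross_val_accuracy` and `search` are hypothetical stand-ins.

```python
import time
from collections import Counter


class MajorityClass:
    name = "majority_class"

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestNeighbour:
    name = "1_nearest_neighbour"

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def closest(p):
            dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(p) for p in X]


def cross_val_accuracy(model_cls, X, y, folds):
    # Fold k holds out every sample whose index is k modulo `folds`.
    if len(X) < folds:
        raise ValueError("need at least one sample per fold")
    scores = []
    for k in range(folds):
        test_idx = set(range(k, len(X), folds))
        X_tr = [x for i, x in enumerate(X) if i not in test_idx]
        y_tr = [t for i, t in enumerate(y) if i not in test_idx]
        X_te = [x for i, x in enumerate(X) if i in test_idx]
        y_te = [t for i, t in enumerate(y) if i in test_idx]
        preds = model_cls().fit(X_tr, y_tr).predict(X_te)
        scores.append(sum(p == t for p, t in zip(preds, y_te)) / len(y_te))
    return sum(scores) / folds


def search(X, y, candidates, folds=5, general_timeout_s=60.0):
    # Score candidates until the overall time budget is used up, then
    # return (mean accuracy, name) pairs with the best candidate first.
    start = time.monotonic()
    results = []
    for cls in candidates:
        if time.monotonic() - start > general_timeout_s:
            break
        results.append((cross_val_accuracy(cls, X, y, folds), cls.name))
    results.sort(reverse=True)
    return results
```

A production engine would add the remaining controls from the parameter list, such as an iteration timeout per candidate and early stopping. They are left out here to keep the sketch short.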
The input consists of datasets that are managed by the Data Collection and Management component, such as a DB table (cloud) or a CSV file.
The output of the machine learning process is a Policy Model, which includes important details such as the algorithm name, the meta-parameters of the model, its accuracy, and the datasets used for training it. Additionally, the Policy Model provides information on feature importance, which ranks features based on their influence on the trained model.
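The details listed for the Policy Model map naturally onto a small record type. The sketch below is one possible shape, assuming a JSON-serialisable structure. The class name `PolicyModel`, the field names, and the two helper methods are hypothetical and only mirror the description above.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PolicyModel:
    algorithm: str
    meta_parameters: dict
    accuracy: float
    datasets: list
    # Maps a feature name to its importance score (higher = more influence).
    feature_importance: dict = field(default_factory=dict)

    def ranked_features(self):
        # Features ordered from most to least influential.
        return sorted(self.feature_importance.items(),
                      key=lambda item: item[1], reverse=True)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```

Here `ranked_features()` returns the feature importance ranking as a list of `(name, score)` pairs, and `to_json()` produces a form that can be stored or shown in a front end.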
The subcomponents and the pipeline for the AutoML component are shown in the following figure:
Text and Sentiment Analysis
The purpose of the Text and Sentiment Analysis component is to provide Municipalities with the sentiment of citizens’ feedback. To achieve that, the sentiment of each sentence/comment is evaluated using complex Machine Learning models, supported by robust and load-balanced data processing pipelines. Sentiment analysis models can capture polarity (positive, negative, neutral) as well as feelings and emotions (angry, happy, sad, etc.), urgency (urgent, not urgent), and even intentions (interested or not interested).
The input data is in raw text format. It is processed using Natural Language Processing to extract the sentiment and emotion of each sentence, and the resulting output is stored in an ElasticSearch index. When the ElasticSearch index is queried, the returned results are aggregated, which gives a clear visualisation of what was searched, usually in a front-end application.
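To give an idea of how such an aggregated query might look, the sketch below builds a search request body in the ElasticSearch Query DSL that matches a search phrase and groups the hits by sentiment value. The `match` query and the `terms` and `avg` aggregations are standard DSL constructs, but the field names `text` and `sentiment` and the function `build_sentiment_query` are our own assumptions about the index layout.

```python
import json


def build_sentiment_query(search_text, text_field="text",
                          sentiment_field="sentiment"):
    # Request body for the index's _search endpoint.
    return {
        "size": 0,  # return only the aggregations, not the individual hits
        "query": {"match": {text_field: search_text}},
        "aggs": {
            # Number of matching documents per sentiment value.
            "by_sentiment": {"terms": {"field": sentiment_field}},
            # Mean sentiment across all matching documents.
            "average_sentiment": {"avg": {"field": sentiment_field}},
        },
    }
```

The dictionary can be serialised with `json.dumps` and sent to the index, and a front-end application would then plot the returned buckets.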
The input consists of raw text and data sources from the pilots (structured and unstructured), such as citizens’ Tweets, comments from various platforms, complaints, reports, and surveys.
The output is the analysed, aggregated data in JSON format. Sentiment values are represented numerically: 0 for neutral, -1 for negative and 1 for positive.
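The numeric encoding makes the output easy to aggregate. Below is a minimal sketch of how per-comment labels could be mapped to -1, 0 and 1 and summarised as JSON. Only the three numeric values come from the description above. The record layout (a `label` key), the output keys, and the function name `aggregate` are hypothetical.

```python
import json
from collections import Counter

# Numeric representation of the polarity classes.
POLARITY = {"negative": -1, "neutral": 0, "positive": 1}


def aggregate(records):
    # Map each record's label to its numeric value, then summarise.
    scores = [POLARITY[r["label"]] for r in records]
    counts = Counter(scores)
    return {
        "total": len(scores),
        "counts": {str(v): counts.get(v, 0) for v in (-1, 0, 1)},
        "mean_sentiment": sum(scores) / len(scores) if scores else 0.0,
    }


def to_json(records):
    return json.dumps(aggregate(records), sort_keys=True)
```

A mean close to 1 indicates mostly positive feedback, a mean close to -1 mostly negative feedback, and a mean near 0 either neutral or evenly split opinions, which is why the per-class counts are reported alongside it.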
The Text Analyzer passes the raw text through a NER (named entity recognition) model to identify entities, tokenizes them into an array, and stores the resulting output in an ElasticSearch index along with the original text.
The following figure shows the Text Analysis flow:
The following figure shows the Sentiment Analyzer flow: