Adversarial AI Defences (part 2)

In the first part of our blog on Adversarial AI Defences, we introduced their generic categories.

To tailor our suggestions to each AI4PublicPolicy use case and model, we have created the following table outlining the most promising and effective defences. The table names specific attacks and defences from the current state of the art in Adversarial AI. Each defence is classified by the type of attack it counters and the algorithms that attack might affect. Not every defence in the table has been implemented in AI4PublicPolicy yet.

Although the number of adversarial attacks is growing and detecting them is becoming more difficult and computationally expensive, an increasing number of techniques can defend Machine Learning models against most of them, or at least mitigate their effects and reduce their probability of success.

These countermeasures can be divided into two sub-groups: Proactive Defences and Reactive Defences.

Proactive Defences

Proactive strategies aim to make the model more robust and resistant to adversarial attacks by changing either the way the model is structured or the way it is trained. A robust model can be achieved through several proactive strategies:

Adversarial Training

One of the most successful methods for creating a robust model is Adversarial Training. As the name suggests, this method includes correctly labelled adversarial examples in the training set. After training on both regular and adversarial examples, the model should be less vulnerable to adversarial attacks. Today this method achieves state-of-the-art robustness on many competitive benchmarks.
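
The idea can be illustrated with a minimal sketch, not the AI4PublicPolicy implementation. It assumes a logistic regression model, a toy two-blob dataset and the Fast Gradient Sign Method (FGSM) as the attack; the function names, the perturbation size eps and the learning rate are all hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    """Probability of class 1 under a linear (logistic regression) model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Move every feature by eps in the direction that increases the loss."""
    # For the logistic loss, d(loss)/d(x_i) = (p - y) * w_i.
    p = predict(w, b, x)
    return [xi + eps * math.copysign(1.0, (p - y) * wi) for xi, wi in zip(x, w)]

def train(data, eps=0.0, epochs=50, lr=0.1):
    """Plain SGD; when eps > 0, each step also trains on an adversarial copy."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            batch = [(x, y)]
            if eps > 0:
                # The adversarial example keeps the correct label y.
                batch.append((fgsm(w, b, x, y, eps), y))
            for xs, ys in batch:
                err = predict(w, b, xs) - ys
                w = [wi - lr * err * xi for wi, xi in zip(w, xs)]
                b -= lr * err
    return w, b

def accuracy(w, b, data, eps=0.0):
    """Accuracy on the data, optionally after attacking each point with FGSM."""
    hits = 0
    for x, y in data:
        x_eval = fgsm(w, b, x, y, eps) if eps > 0 else x
        hits += int((predict(w, b, x_eval) >= 0.5) == bool(y))
    return hits / len(data)

def make_toy_data(n_per_class=40, seed=0):
    """Two Gaussian blobs centred at (-2, -2) and (2, 2)."""
    rng = random.Random(seed)
    data = [([rng.gauss(c, 0.5), rng.gauss(c, 0.5)], y)
            for c, y in ((-2.0, 0), (2.0, 1)) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data
```

Calling train(make_toy_data(), eps=0.5) trains on each point together with its adversarial copy, and accuracy(w, b, data, eps=0.5) then measures how the resulting model holds up under the same attack. On a real model the adversarial examples would normally be produced by the attack library in use rather than by a hand-written gradient.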

Defensive Distillation

Defensive Distillation is a defensive technique that uses knowledge from a different Neural Network. Distillation makes it possible to transfer knowledge from a larger and more complex architecture to a smaller model. Training a model on the knowledge of a larger Neural Network can make it more robust and harder to fool with adversarial attacks.
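
In practice the knowledge is commonly passed on as soft labels: the first network's class probabilities, computed with a raised softmax temperature, become the training targets of the second network. The sketch below shows only that step, under the assumption of a three-class problem; teacher_logits and the temperature of 20 are hypothetical values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher_logits = [8.0, 2.0, 0.5]   # hypothetical teacher output for one input
hard = softmax(teacher_logits, temperature=1.0)    # almost one-hot
soft = softmax(teacher_logits, temperature=20.0)   # much flatter distribution
```

The flatter distribution also carries information about how the classes relate to each other, and the second model is trained to minimise soft_label_loss over the training set instead of a loss against hard labels.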


Denoising

Denoising is a defensive method that aims to reduce the perturbations introduced by adversarial attacks. It tends to lower the success rate of those attacks and can be applied at different stages: to the input before it is fed to the model, or to the features learned by the model.
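
As a minimal sketch of input-level denoising, assume the input is a one-dimensional sequence of feature values and that a median filter is an acceptable denoiser. Both are assumptions made for illustration, and median_denoise is a hypothetical helper.

```python
import statistics

def median_denoise(signal, window=3):
    """Replace every value by the median of its neighbourhood."""
    half = window // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [statistics.median(padded[i:i + window]) for i in range(len(signal))]

# An isolated spike, as a crude stand-in for a perturbation, is removed.
print(median_denoise([0.2, 0.2, 0.9, 0.2, 0.2]))
```

A median filter removes isolated spikes while leaving smooth regions largely untouched. Real adversarial perturbations are usually spread over many features, so stronger denoisers are typically needed in practice.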

Feature Squeezing

Feature Squeezing aims to reduce the dimensionality of the input space, which is usually larger than necessary. Attackers can exploit an excessive input space, so reducing it can lead to a more robust model. The defence drops any unnecessary features, which reduces the attacker's freedom to generate adversarial examples. The model's output on the squeezed input is then compared with its output on the original, unsqueezed input. This comparison helps identify adversarial examples: the inputs that show the largest difference between the two outputs are usually the result of some perturbation.
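
The comparison step can be sketched as follows. The sketch assumes features scaled to [0, 1] and uses bit-depth reduction as the squeezer, which is one common choice; toy_model, the threshold of 0.5 and the example inputs are hypothetical.

```python
import math

def squeeze_bit_depth(x, bits):
    """Quantise features in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in x]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def looks_adversarial(model, x, bits=1, threshold=0.5):
    """Flag x when squeezing it changes the model's output too much."""
    return l1_distance(model(x), model(squeeze_bit_depth(x, bits))) > threshold

def toy_model(x):
    """Hypothetical two-class model that is overly sensitive to small changes."""
    p = 1.0 / (1.0 + math.exp(-10.0 * (sum(x) - 1.5)))
    return [p, 1.0 - p]

clean = [0.0, 0.0, 1.0, 0.0]
perturbed = [0.2, 0.2, 1.0, 0.2]   # small nudges that flip the toy model

print(looks_adversarial(toy_model, clean))      # squeezing changes nothing
print(looks_adversarial(toy_model, perturbed))  # squeezing undoes the nudges
```

The threshold trades false alarms against missed attacks and would normally be calibrated on regular examples held out from training.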

Reactive Defences

Adversarial Detection

Adversarial Detection can be implemented in many different ways. One of the simplest is to train a small Neural Network to distinguish regular examples from adversarial ones. In some reported cases, Adversarial Detection works well even when the attacker is aware of the detector.
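
A heavily simplified sketch of such a detector is shown below. Instead of a small Neural Network on raw inputs it trains a single logistic neuron on one hand-picked feature, the roughness of the input. The feature, the ramp-shaped toy inputs and the alternating-sign perturbation are all assumptions made to keep the example short.

```python
import math

def roughness(x):
    """Mean absolute difference between neighbouring features."""
    return sum(abs(a - b) for a, b in zip(x, x[1:])) / (len(x) - 1)

def fit_detector(regular, adversarial, epochs=100, lr=0.5):
    """Fit a one-neuron logistic detector on the standardised roughness."""
    feats = [roughness(x) for x in regular + adversarial]
    labels = [0] * len(regular) + [1] * len(adversarial)
    mean = sum(feats) / len(feats)
    std = math.sqrt(sum((f - mean) ** 2 for f in feats) / len(feats)) or 1.0
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = (f - mean) / std
            err = 1.0 / (1.0 + math.exp(-(w * z + b))) - y
            w -= lr * err * z
            b -= lr * err
    # The returned function answers: does this input look adversarial?
    return lambda x: w * (roughness(x) - mean) / std + b > 0.0

def ramp(slope, eps=0.0, n=16):
    """Smooth ramp; eps > 0 adds an alternating-sign perturbation."""
    return [0.3 + slope * i + eps * (-1) ** i for i in range(n)]

regular = [ramp(s) for s in (0.005, 0.010, 0.015, 0.020)]
adversarial = [ramp(s, eps) for s, eps in
               ((0.005, 0.09), (0.010, 0.10), (0.015, 0.11), (0.020, 0.10))]
is_adversarial = fit_detector(regular, adversarial)
```

Inputs for which is_adversarial returns True would be rejected or sent for further inspection before they reach the protected model. A detector used in practice would learn its own features from the raw inputs or from the protected model's internal activations.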

Input Reconstruction

Input Reconstruction detects adversarial inputs and transforms them into regular examples before they are fed into the model. Implementing it requires some type of encoder that turns an adversarial input into a cleaned one. The encoder makes it very hard for the attacker to create generally effective adversarial examples: without knowing which type of encoder will be used, the attacker is forced to craft adversarial examples against multiple encoders.
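
The reconstruction step can be sketched with a linear stand-in for the encoder. The sketch assumes two-dimensional inputs whose clean examples lie close to a line, and it fits the encoder in closed form as the principal axis of the clean data; a real deployment would more likely train a non-linear autoencoder. All names and data here are hypothetical.

```python
import math

def fit_linear_autoencoder(clean_points):
    """Fit a 2-D -> 1-D -> 2-D linear autoencoder on clean data."""
    n = len(clean_points)
    mx = sum(p[0] for p in clean_points) / n
    my = sum(p[1] for p in clean_points) / n
    sxx = sum((p[0] - mx) ** 2 for p in clean_points) / n
    syy = sum((p[1] - my) ** 2 for p in clean_points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in clean_points) / n
    # Direction of largest variance of the clean data.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    def encode(p):
        return (p[0] - mx) * ux + (p[1] - my) * uy

    def decode(code):
        return (mx + code * ux, my + code * uy)

    return encode, decode

def reconstruct(p, encode, decode):
    """Pass the input through the encoder and back before it reaches the model."""
    return decode(encode(p))

# Hypothetical clean data lying on the line y = 2x.
clean = [(t / 10.0, 2.0 * t / 10.0) for t in range(-20, 21)]
encode, decode = fit_linear_autoencoder(clean)
perturbed = (1.0 - 0.2, 2.0 + 0.1)   # (1, 2) pushed off the clean data's line
restored = reconstruct(perturbed, encode, decode)
```

Because the encoder keeps only what it learned from clean data, the part of the perturbation that points away from the clean examples is discarded and restored lands back near (1, 2).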