Adversarial AI Defences (part 1)

In our previous blog on securing machine learning, we took an in-depth look at Adversarial AI and attacks. In this two-part blog series, we turn our attention to Adversarial Defences.

While early work in machine learning often assumed a closed and trusted environment, attacks against the machine learning process and the resulting models have received increased attention in recent years. Adversarial machine learning aims to protect the machine learning pipeline and ensure its safety at training, test and inference time. Defending machine learning models involves certifying and verifying model robustness, and hardening models with approaches such as pre-processing inputs, augmenting training data with adversarial samples, and leveraging runtime detection methods to flag any inputs that might have been modified by an adversary.

A growing number of methods have been proposed for defending against evasion attacks, and they can generally be categorised into:

Model hardening

This category of defence refers to techniques that produce a new classifier with better robustness properties.

Among the model hardening methods, a widely explored approach is to augment the training data of the classifier, e.g., with adversarial examples (so-called adversarial training) or other augmentation methods. Another approach is input data preprocessing, often using non-differentiable or randomised transformations, transformations reducing the dimensionality of the inputs, or transformations aiming to project inputs onto the “true” data manifold. Other model hardening approaches involve special types of regularisation during model training or modifying elements of the classifier’s architecture.
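As a concrete illustration of input preprocessing, the sketch below applies bit-depth reduction (a feature-squeezing style transform) to inputs before they reach the classifier. The function name and the 3-bit depth are illustrative assumptions, not a fixed recommendation.

```python
# A minimal sketch of one input-preprocessing defence: bit-depth reduction.
# The 3-bit depth is an assumed, illustrative setting.
import numpy as np

def reduce_bit_depth(x: np.ndarray, bits: int = 3) -> np.ndarray:
    """Quantise pixel values in [0, 1] to 2**bits levels, discarding the
    fine-grained perturbations an adversary typically relies on."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# Usage: squeeze inputs before passing them to the classifier.
x = np.random.rand(1, 28, 28)           # stand-in for a normalised image
x_squeezed = reduce_bit_depth(x, bits=3)
```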

Runtime detection

Runtime detection of adversarial samples extends the original classifier with a detector that checks whether a given input is adversarial or not.

A summary of some current approaches to these defences follows.

Adversarial training

One idea for defending against adversarial examples is simply to train a better classifier. An intuitive way to build a robust classifier is to include adversarial information in the training process, that is, adversarial training. For example, one may use a mixture of normal and adversarial examples in the training set for data augmentation, or mix the adversarial objective with the classification objective as a regulariser. Though this idea is promising, it is hard to reason about which attacks to train against and how heavily the adversarial component should be weighted.
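The sketch below shows the data-augmentation flavour of adversarial training, assuming a PyTorch image classifier: each batch is augmented with FGSM-generated adversarial examples and the two losses are mixed. The `model`, `loader`, `optimiser` and epsilon values are illustrative placeholders.

```python
# A minimal sketch of adversarial training with FGSM-generated examples.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Generate FGSM adversarial examples for a batch (x, y)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_training_epoch(model, loader, optimiser, eps=0.03):
    model.train()
    for x, y in loader:
        x_adv = fgsm(model, x, y, eps)        # craft adversarial counterparts
        optimiser.zero_grad()
        # Mix the normal and adversarial objectives; the 50/50 weighting is
        # an assumption for illustration.
        loss = 0.5 * F.cross_entropy(model(x), y) \
             + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimiser.step()
```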

Defensive distillation

Defensive distillation trains the classifier in such a way that it is nearly impossible for gradient-based attacks to generate adversarial examples directly on the network. Defensive distillation leverages distillation training techniques and hides the gradient between the pre-softmax layer (logits) and the softmax outputs. However, the defence is easy to bypass by adopting one of three strategies: choosing a more appropriate loss function; computing gradients directly from the pre-softmax layer rather than the post-softmax outputs; or attacking an easier-to-attack network first and transferring the adversarial examples to the distilled network. In a white-box attack, where the attacker knows the parameters of the defended network, it is very difficult to prevent adversaries from generating adversarial examples that defeat the defence.
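A minimal sketch of the distillation training step is shown below, assuming a teacher and a student network of the same architecture: the teacher's temperature-softened probabilities become soft labels for the distilled network. `teacher`, `student`, `optimiser` and the temperature T=20 are illustrative assumptions.

```python
# A minimal sketch of a defensive-distillation training step.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimiser, x, T=20.0):
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x) / T, dim=1)   # teacher's soft outputs
    log_probs = F.log_softmax(student(x) / T, dim=1)     # student at the same T
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# At inference the distilled network runs at T = 1, which flattens the
# softmax gradients a gradient-based attacker would otherwise follow.
```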

Detecting adversarial examples

Another idea for defence is to detect adversarial examples with hand-crafted statistical features or separate classification networks. For each attack generation method considered, a dedicated deep neural network classifier (a detector) is constructed to tell whether an input is normal or adversarial. Detectors are usually trained on both normal and adversarial examples. Detectors show good performance when the training and testing attack examples are generated by the same process and the perturbation is large enough, but they do not generalise well across different attack parameters and attack generation processes.
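The sketch below shows the separate-classification-network flavour of this defence: a small binary detector trained on a mix of normal and adversarial inputs. The architecture and the way the adversarial batch is produced (e.g. the FGSM helper sketched earlier) are illustrative assumptions.

```python
# A minimal sketch of a runtime detector: a binary classifier that flags
# adversarial inputs. The architecture is an assumed, illustrative choice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Detector(nn.Module):
    def __init__(self, in_features=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 2),            # class 0: normal, class 1: adversarial
        )

    def forward(self, x):
        return self.net(x)

def train_detector(detector, optimiser, x_normal, x_adv):
    """One training step on a batch of normal and adversarial inputs."""
    x = torch.cat([x_normal, x_adv])
    y = torch.cat([torch.zeros(len(x_normal)), torch.ones(len(x_adv))]).long()
    optimiser.zero_grad()
    loss = F.cross_entropy(detector(x), y)
    loss.backward()
    optimiser.step()
    return loss.item()

# At runtime, an input is flagged as adversarial when the detector predicts class 1.
```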