Securing Machine Learning: Adversarial AI & Attacks (part 2)

In the first part of our blog on Adversarial AI & Attacks, we explored the basics of the topic and the four primary types of adversarial attacks. In this second and final part, we will look at adversarial attacks based on how they act and on what kind of vulnerability they exploit.

At AI4PublicPolicy, we conducted a comprehensive study of the state of the art on adversarial attacks to provide the most appropriate suggestions for every use case and model in our project. The following table was created during this study and encompasses the most relevant and effective attacks, further classified by type and by the algorithms each one can affect.

Below are brief descriptions of the attacks shown in the table:

Fast Gradient Sign Method (FGSM)

This attack takes advantage of one of the fundamental weaknesses of neural networks: their largely linear behaviour. FGSM exploits this linearity by crafting a linear perturbation. Because activations grow linearly with the dimensionality of the model's weight vector, many infinitesimal changes to the input add up to one large change in the output, which is enough to throw the model off. FGSM works like Gradient Descent in reverse: it takes a single step in the direction of the sign of the loss gradient with respect to the input, maximising the loss with the smallest perturbation possible.
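The FGSM step can be sketched in a few lines. The toy linear model, its weights, and the squared-error loss below are illustrative assumptions (not any model from our project); the point is only that stepping along the sign of the input gradient increases the loss.

```python
import numpy as np

# Minimal FGSM sketch on a toy linear model (illustrative, not the
# project's models). Model: score = w . x, loss L = (w.x - y)^2.
# FGSM step: x_adv = x + eps * sign(dL/dx), which maximises the loss
# under an L-infinity budget eps.

def loss(w, x, y):
    return (w @ x - y) ** 2

def fgsm(w, x, y, eps):
    # Closed-form input gradient for the squared-error loss.
    grad_x = 2.0 * (w @ x - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
y = 0.0

x_adv = fgsm(w, x, y, eps=0.1)
# The perturbed input yields a strictly higher loss than the original.
print(loss(w, x, y) < loss(w, x_adv, y))  # True
```

In a real attack the input gradient comes from backpropagation through the network rather than a closed form, but the single signed step is the same.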

Jacobian-based Saliency Map Attack (JSMA)

This type of attack targets a specific class: it crafts a perturbed object specifically so that it is misclassified as the targeted class. JSMA attacks are usually white-box attacks, meaning the attacker has full knowledge of the model, its architecture, and the data used in training. The first step of a JSMA attack is building a saliency map. The attacker creates one by leveraging the forward derivatives of the model and then uses the map to craft an adversarial example aimed at one specific class. The saliency map identifies the features the model relies on to classify each class. Once the features of the target class have been identified, the attacker reinforces them in the adversarial example while suppressing the features that are negatively correlated with the target class.
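To make the saliency-map idea concrete, here is a minimal sketch for a toy linear multi-class model, where the forward derivative (Jacobian) of the logits is just the weight matrix. All names and values are illustrative assumptions; real JSMA computes the Jacobian of a neural network.

```python
import numpy as np

# Toy JSMA-style saliency map for a linear classifier (logits = W @ x),
# where d logits_c / d x_i = W[c, i]. Illustrative assumption only.

def saliency_map(W, target):
    J_t = W[target]                       # derivatives of the target logit
    J_rest = W.sum(axis=0) - W[target]    # summed derivatives of the others
    # A feature is salient when increasing it raises the target logit
    # (J_t > 0) while lowering the competing logits (J_rest < 0).
    return np.where((J_t > 0) & (J_rest < 0), J_t * np.abs(J_rest), 0.0)

W = np.array([[ 1.0, -0.5,  0.2],
              [-0.8,  1.2,  0.1],
              [ 0.3,  0.4, -1.0]])
x = np.array([0.2, 0.2, 0.2])
target = 0

S = saliency_map(W, target)
best = int(np.argmax(S))          # most salient feature to perturb
x_adv = x.copy()
x_adv[best] += 0.5                # push that feature toward the target class

print((W @ x_adv)[target] > (W @ x)[target])  # True: target logit grew
```

The iterative version perturbs one or two features at a time, recomputing the map after each step, until the model outputs the target class.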

Generative Model Base Method

This type of attack is based on the Generative Adversarial Network (GAN). A GAN is a combination of two neural networks, a Generator and a Discriminator, that act in opposition until they reach an equilibrium at which the Generator can reproduce data from the distribution of the target model's input space. The Discriminator's job is to judge whether a sample comes from the input space of the target model or from the Generator itself. GANs are used in adversarial attacks to craft adversarial examples that look more natural than those produced by FGSM-style methods.
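The two-network structure can be sketched as follows. Both "networks" below are toy one-layer maps with made-up shapes, purely to show how a Generator sample flows into the Discriminator; a real GAN alternates gradient updates between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural GAN sketch (illustrative shapes and names, not a trained
# model). The Generator maps random noise to a candidate sample; the
# Discriminator scores how likely a sample is to come from real data.

def generator(z, theta):
    # Toy one-layer "generator": an affine map from noise to sample space.
    return np.tanh(theta[0] @ z + theta[1])

def discriminator(x, phi):
    # Toy logistic "discriminator": probability that x is real data.
    return 1.0 / (1.0 + np.exp(-(phi[0] @ x + phi[1])))

theta = (rng.normal(size=(3, 2)), np.zeros(3))   # generator parameters
phi = (rng.normal(size=3), 0.0)                  # discriminator parameters

z = rng.normal(size=2)        # noise input
fake = generator(z, theta)    # crafted sample in the target input space
score = discriminator(fake, phi)

print(fake.shape, 0.0 < score < 1.0)  # (3,) True
```

Training alternates between the two: the Discriminator is updated to tell real from fake, the Generator to fool it, until the fakes are indistinguishable from real inputs.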

Universal Attacks

This type of attack aims at creating a single universal perturbation image that can be added to any natural image to cause its misclassification. Such a perturbation can even be reused across architectures, as universal perturbations have been shown to generalise very well across state-of-the-art classifiers. The existence of effective universal attacks highlights vulnerabilities that are intrinsic to neural networks: because their decision boundaries are high-dimensional, neural networks are very susceptible to adversarial examples that lie in the overlapping regions of the lower-dimensional representations of these boundaries.
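The accumulation loop behind universal perturbations can be sketched on a toy linear binary classifier (the published algorithm uses DeepFool-style minimal steps on deep networks). Everything here, including the loose perturbation budget, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Universal-perturbation sketch for a toy classifier sign(w . x).
# For each sample the running perturbation v fails to fool, add the
# minimal step that pushes that sample across the decision boundary,
# then clip v back into an L-infinity budget.

def predict(w, x):
    return np.sign(w @ x)

def universal_perturbation(w, X, eps, rounds=3):
    v = np.zeros(X.shape[1])
    for _ in range(rounds):
        for x in X:
            if predict(w, x + v) == predict(w, x):     # not fooled yet
                # Minimal L2 step to the hyperplane w.x = 0, plus 10% margin.
                v = v - 1.1 * (w @ (x + v)) / (w @ w) * w
                v = np.clip(v, -eps, eps)              # stay in budget
    return v

w = rng.normal(size=4)
X = rng.normal(size=(20, 4))
X = X[X @ w > 0]              # keep one class so a single shift can fool it
v = universal_perturbation(w, X, eps=10.0)  # loose budget for the toy

fooled = np.mean([predict(w, x + v) != predict(w, x) for x in X])
print(fooled)  # 1.0: one shared v flips every prediction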

Black-box adversaries

Once generated, adversarial examples can be used in two main attack settings. Black-box attacks occur when the malicious actor has no, or only limited, information about the specifics of the model and how it works.
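A common black-box tactic is to treat the model as an opaque oracle and estimate input gradients purely from queries. The hidden "model" below is an illustrative stand-in; the attacker only ever calls it and reads the returned score.

```python
import numpy as np

# Black-box sketch: gradients estimated by finite differences using
# score queries only, then used in an FGSM-style step. The weights
# inside model_score are hidden from the attacker (illustrative toy).

def model_score(x):
    w = np.array([0.5, -1.0, 2.0])     # unknown to the attacker
    return float(w @ x)

def estimate_gradient(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        grad[i] = (f(x + d) - f(x - d)) / (2 * h)  # two queries per dim
    return grad

x = np.array([1.0, 2.0, 0.5])
g = estimate_gradient(model_score, x)
x_adv = x - 0.1 * np.sign(g)           # push the score down

print(model_score(x_adv) < model_score(x))  # True
```

The query cost grows with the input dimension, which is why practical black-box attacks use smarter estimators or transfer from surrogate models.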

White-box adversaries

Conversely, in a white-box setting the attacker is assumed to possess full, or most of the required, knowledge about the ML model and its parameters. Transferability is an especially important property of adversarial examples: most adversarial examples can be transferred from one model to another. Consequently, an adversarial example generated and tested against one model can be used to attack other models, whether by the same malicious actor or by others. (Reference)
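Transferability can be demonstrated on two toy linear models with similar (but not identical) weights, standing in for two models trained on the same task; all values here are illustrative assumptions.

```python
import numpy as np

# Transferability sketch: an example crafted against surrogate model A
# (white-box) also flips the prediction of a different victim model B.
# Both models are illustrative toys with similar decision rules.

def score(w, x):
    return w @ x

w_a = np.array([0.5, -1.0, 2.0])        # surrogate: attacker knows this
w_b = np.array([0.6, -0.9, 1.8])        # victim: attacker never sees this

x = np.array([2.0, 0.5, 0.5])           # classified positive by both
x_adv = x - 1.0 * np.sign(w_a)          # FGSM step using A's weights only

# The example flips A's prediction, and transfers to B as well.
print(score(w_a, x) > 0 > score(w_a, x_adv))  # True
print(score(w_b, x) > 0 > score(w_b, x_adv))  # True
```

Transfer works here because the two weight vectors point in similar directions, which mirrors why independently trained models on the same data share adversarial directions.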

Grey-box adversaries

Unlike the previous two settings, grey-box attacks train a generative model to produce adversarial examples and only assume access to the target model during the training phase. This makes them more time-efficient and easier to integrate into adversarial defence algorithms. (Reference)
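The split between a query-heavy training phase and a query-free attack phase can be sketched as follows. The "generator" here is deliberately trivial (a single learned perturbation direction rather than a neural network), and the target model is an illustrative toy, but the access pattern is the point: queries during training, none at attack time.

```python
import numpy as np

rng = np.random.default_rng(2)

# Grey-box sketch: the attacker queries the target only while fitting
# a perturbation "generator", then attacks unseen inputs offline.

def target_score(x):
    w = np.array([1.0, -2.0, 0.5, 1.5])  # queried only during training
    return float(w @ x)

# --- Training phase: learn a score-decreasing direction from queries ---
def fit_generator(f, dim, n_probes=64, h=0.1):
    direction = np.zeros(dim)
    for _ in range(n_probes):
        d = rng.normal(size=dim)
        x0 = rng.normal(size=dim)
        if f(x0 + h * d) < f(x0):        # keep probes that lower the score
            direction += d
    return direction / np.linalg.norm(direction)

g_dir = fit_generator(target_score, dim=4)

# --- Attack phase: perturb a new input with NO further model access ---
x_new = rng.normal(size=4)
x_adv = x_new + 1.0 * g_dir

print(target_score(x_adv) < target_score(x_new))  # True
```

A real grey-box attack replaces the fixed direction with a trained generative network that maps each input to its own perturbation, but the query budget lives entirely in the training phase in both cases.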