Beyond the three aspects of related work that we discussed in the survey, we find the following two aspects to be relevant as well:
(1) other reasons for ML models to exhibit adversarial vulnerability
and
(2) other data-related attacks on machine learning models.
We briefly discuss them here.
Non-data-related reasons behind evasion attacks
A variety of hypotheses have been proposed regarding the reasons behind the adversarial vulnerability of ML systems, particularly for evasion attacks.
These include the data-related properties extensively discussed in this survey, as well as reasons related to the models themselves,
computational resources, and feature learning procedures. We discuss the latter three below.
❖ Model
When Szegedy et al. first discovered adversarial examples for visual models, they suspected that the high non-linearity of DNNs results in low-probability `pockets' of adversarial examples in the learned representation manifold.
They further hypothesized that while these pockets can be found by attack algorithms, the samples residing in them are distributed differently from normal samples and are thus hard to encounter when randomly sampling from the input space.
Goodfellow et al. instead hypothesize that
the locally linear behavior of common activation functions, such as ReLU and sigmoid in its non-saturating regime, causes high-dimensional neural networks to be vulnerable to adversarial perturbations.
To support this claim, they present the Fast Gradient Sign Method (FGSM), an attack that exploits the local linearity of the target classifier.
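For reference, FGSM takes a single step in the direction of the sign of the loss gradient (standard formulation; notation ours):
\[
x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\big( \nabla_x \mathcal{L}(\theta, x, y) \big),
\]
where $\mathcal{L}$ is the training loss, $\theta$ the model parameters, and $\epsilon$ the perturbation budget; if the classifier were exactly linear in $x$, this step would maximally increase the loss within an $\ell_\infty$ ball of radius $\epsilon$.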
Fawzi et al. also argue against the hypothesis of high non-linearity as the cause of adversarial examples.
They show that any classifier is susceptible to adversarial perturbations and claim that it is the low flexibility of the classifier relative to the difficulty of the classification task that results in vulnerability.
The lack of consensus on the primary causes of model vulnerability invites more studies on this topic.
Singla et al.
show that enforcing invariance to circular shifts (i.e., cyclic translations) in neural networks induces decision boundaries with smaller margins than those of standard fully connected networks,
which, in turn, reduces the adversarial robustness of the model.
Moosavi-Dezfooli et al. introduce universal,
input-agnostic perturbations that mislead the classifier on most inputs, and hypothesize that the vulnerability of a multi-class classifier to such perturbations is related to the geometry of its decision boundaries:
linear classifiers whose decision boundaries are nearly parallel to each other, and
nonlinear classifiers whose decision boundaries are curved in similar directions,
tend to be less robust, since a perturbation along a single shared direction can push many samples across the boundaries of different classes at once.
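Concretely, a universal perturbation can be characterized (in our notation, following the usual formulation) as a single vector $v$ of bounded norm that changes the prediction of the classifier $f$ for most inputs drawn from the data distribution $\mu$:
\[
\|v\|_p \le \xi
\quad \text{and} \quad
\mathbb{P}_{x \sim \mu}\big[ f(x + v) \ne f(x) \big] \ge 1 - \delta,
\]
where $\xi$ bounds the perturbation magnitude and $1 - \delta$ is the desired fooling rate.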
Tanay and Griffin conjecture that when the decision boundary learned by the classifier lies too close to (or is `tilted towards') the data manifold, rather than crossing it perpendicularly,
small perturbations suffice to move samples across the boundary and cause misclassification.
❖ Computational Resources
Bubeck et al.
use computational hardness arguments to show that the time required to learn a robust model can grow exponentially with the input dimension, making robust learning computationally intractable even when a robust classifier exists.
Hence, they attribute adversarial vulnerability to the computational limitations of current learning algorithms.
Degwekar et al.
further extend this line of work, exhibiting additional settings in which robust classifiers cannot be trained efficiently.
❖ Feature Learning
Ilyas et al.
show that adversarial vulnerability can be a consequence of a model exploiting well-generalizing but non-robust features,
i.e., features that are predictive of the label yet brittle under small perturbations and often incomprehensible to humans;
when the model is constrained to use only robust features, adversarial robustness increases together with the
interpretability of the learned features.
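To make this distinction concrete, a feature can be viewed as a function $f$ mapping inputs to real values; roughly paraphrasing the correlation-based formalization used in this line of work (our notation, binary labels $y \in \{-1, +1\}$), a feature is useful if it correlates with the label, and robustly useful if that correlation survives worst-case perturbations $\delta$ from an allowed set $\Delta(x)$:
\[
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[ y \cdot f(x) \big] \ge \rho
\qquad \text{vs.} \qquad
\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[ \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \Big] \ge \gamma .
\]
Non-robust features satisfy the former for some $\rho > 0$ but not the latter, which is precisely what an attacker can exploit.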
However,
Tsipras et al.
note that, as the features for achieving high accuracy may be different from the ones for achieving high robustness, robustness may be at odds with standard accuracy.
Rather than viewing adversarial vulnerability as a product of classifiers being overly sensitive to changes in spurious features,
Jacobsen et al. hypothesize that classifiers can instead be
overly insensitive to relevant semantic information: images with drastically different content can share nearly identical latent representations.
The authors introduce a new type of adversarial example that exploits such invariance, altering the semantic content of an image without changing the resulting prediction label.
While all these works propose possible reasons for adversarial vulnerabilities, they are orthogonal to our survey, which focuses particularly on the influence of training data.
Non-evasion attacks
Similar to evasion attacks, data poisoning and backdoor attacks aim to compromise model accuracy.
However, they achieve this by tampering with the training data to create deceptive model decision boundaries.
In addition, backdoor attacks also require perturbing the test instance to cause a misclassification.
This is achieved by introducing manipulated training data containing triggers that can later be activated at test time.
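As a minimal sketch (our notation; concrete attacks differ in how the trigger is embedded and how many samples are poisoned), the attacker augments the clean training set $D$ with triggered samples relabeled to a target class $y_t$:
\[
D' = D \cup \big\{ (x_i \oplus t,\; y_t) \big\}_{i=1}^{m},
\]
where $x \oplus t$ denotes stamping a trigger pattern $t$ onto input $x$; a model trained on $D'$ behaves normally on clean inputs but predicts $y_t$ whenever the trigger is present at test time.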
Goldblum et al. and
Cinà et al.
review recent literature on attack methodologies and countermeasures for both poisoning and backdoor attacks.
Both surveys find that existing research makes overly optimistic assumptions when designing and validating attack techniques, e.g., assuming knowledge of a large portion of the training data.
They advocate for researchers to test proposed methods in more realistic settings to better assess the potential threats.
Furthermore, they encourage exploration of the relationship between poisoning attacks and evasion attacks.
This could lead to the creation of attacks that produce less noticeable poisoning examples,
or defensive strategies that can safeguard models against both backdoor and evasion attacks.
In addition to undermining model accuracy,
adversarial attacks can also aim to breach the privacy and confidentiality of the training data.
In particular,
membership inference attacks
attempt to determine whether a specific data point was part of a model's training set.
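A simple illustrative baseline (not specific to the surveys cited here) thresholds the model's loss on the queried point, exploiting the tendency of models to fit training samples more closely than unseen ones:
\[
\mathcal{A}(x, y) = \mathbf{1}\big[ \ell(f_\theta(x), y) \le \tau \big],
\]
where the attack $\mathcal{A}$ declares $(x, y)$ a training member whenever the loss $\ell$ falls below a threshold $\tau$.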
Hu et al. present a comprehensive survey of existing research efforts on membership inference attacks.
They find that, similar to evasion attacks, the membership inference attack success rate decreases as
the number of training samples increases.
However, all these attacks are orthogonal to our survey, as we focus on adversarial evasion attacks.