Detecting adversarial examples in deep neural networks
Fabio Carrara, Fabrizio Falchi, Roberto Caldelli, Giuseppe Amato, Roberta Fumarola, Rudy Becarelli

Abstract

Deep learning has recently become state-of-the-art in many computer vision applications and in image classification in particular. It is now a mature technology that can be used in several real-life tasks. However, it is possible to create adversarial examples, containing changes unnoticeable to humans, which cause an incorrect classification by a deep convolutional neural network. This represents a serious threat to machine learning methods. In this paper we investigate the robustness of the representations learned by the fooled neural network. Specifically, we use a kNN classifier over the activations of hidden layers of the convolutional neural network in order to define a strategy for distinguishing between correctly classified authentic images and adversarial examples. The results show that hidden-layer activations can be used to detect incorrect classifications caused by adversarial attacks.
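
As an illustration of this detection idea, the sketch below extracts average-pooled activations from the last convolutional block of a pretrained CNN and reduces them with PCA, mirroring the pool5 + PCA features used in the experimental results further down this page. The torchvision VGG-16 backbone, the number of PCA components, and the placeholder batch are assumptions made for this example, not the paper's actual setup.

```python
# Illustrative sketch (not the paper's code): compute hidden-layer features
# for kNN-based adversarial detection. Backbone and PCA size are assumptions.
import torch
import torchvision.models as models
from sklearn.decomposition import PCA

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

def conv_features(images):
    """Average-pool the last convolutional feature map into one vector per image."""
    with torch.no_grad():
        fmap = model.features(images)        # (N, 512, 7, 7) for 224x224 inputs
    return fmap.mean(dim=(2, 3)).numpy()     # global average pooling -> (N, 512)

# Placeholder reference batch; in practice, correctly classified training images.
reference_images = torch.rand(128, 3, 224, 224)
reference_feats = conv_features(reference_images)

pca = PCA(n_components=64).fit(reference_feats)   # 64 components is an arbitrary choice
reference_feats_pca = pca.transform(reference_feats)
```

A query image is mapped through the same pipeline, and its nearest neighbors among the reference features drive the kNN decision described in the experimental results below.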

Paper (Preprint PDF, 2.3MB)

The paper was presented at the 16th International Workshop on Content-Based Multimedia Indexing (CBMI 2017).
@INPROCEEDINGS{2017-Carrara-CBMI,
	author={F. Carrara and F. Falchi and R. Caldelli and G. Amato and R. Fumarola and R. Becarelli},
	booktitle={2017 16th International Workshop on Content-Based Multimedia Indexing (CBMI)},
	title={Detecting adversarial examples in deep neural networks},
	year={2017},
	pages={1-7},
}

Slides


Additional resources

Experimental Results

In the following table, we report the scores assigned to adversarial images by our best approach (pool5 + PCA + DW-kNN).
From left to right, the columns report:

  • the adversarial image
  • its generation method (FGS or L-BFGS; a sketch of the FGS attack follows this list)
  • the original class of the image
  • the class the network is fooled to predict
  • the nearest neighbor image (in terms of L2 distance between average-pooled pool5 activations) belonging to the predicted class
  • the DW-kNN score s for the predicted class
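
For reference, the sketch below shows a one-step fast gradient sign (FGS) perturbation of the kind referred to in the list above. The backbone, the epsilon value, and the placeholder input are illustrative assumptions, not the settings used to generate the images in the table.

```python
# Illustrative one-step fast gradient sign (FGS) attack; epsilon and the
# torchvision VGG-16 backbone are assumptions made for this sketch.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

def fgs_attack(image, true_label, eps=0.007):
    """Perturb `image` along the sign of the gradient of the loss w.r.t. the input."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    adv = image + eps * image.grad.sign()     # one gradient-sign step
    return adv.clamp(0, 1).detach()           # keep pixel values in a valid range

x = torch.rand(1, 3, 224, 224)                # placeholder input image
y = torch.tensor([207])                       # placeholder ground-truth label
x_adv = fgs_attack(x, y)
```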

A low kNN score indicates that the adversarial example is correctly detected, while a high score means that our approach is wrongly confident about the CNN's prediction. The results show that high-scoring adversarial examples often share visual and semantic traits with the predicted (adversarial) class, which makes their detection more challenging.
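
The sketch below shows one plausible way to compute such a distance-weighted kNN (DW-kNN) score over the reduced features; the inverse-distance weighting, the value of k, and the decision threshold are assumptions made for illustration rather than the paper's exact formulation.

```python
# One plausible DW-kNN score for the class predicted by the CNN; the weighting
# scheme, k, and the threshold are assumptions made for this illustration.
import numpy as np

def dwknn_score(query_feat, ref_feats, ref_labels, predicted_class, k=10, eps=1e-8):
    """Fraction of distance-weighted votes that the k nearest reference
    images assign to the class predicted by the CNN."""
    dists = np.linalg.norm(ref_feats - query_feat, axis=1)    # L2 distances
    nn_idx = np.argsort(dists)[:k]                            # k nearest neighbors
    weights = 1.0 / (dists[nn_idx] + eps)                     # closer neighbors weigh more
    votes = (ref_labels[nn_idx] == predicted_class)
    return float(np.sum(weights * votes) / np.sum(weights))   # score s in [0, 1]

def looks_adversarial(query_feat, ref_feats, ref_labels, predicted_class, threshold=0.5):
    """Flag the input when the neighbors disagree with the CNN's prediction."""
    return dwknn_score(query_feat, ref_feats, ref_labels, predicted_class) < threshold
```

Here `ref_feats` and `ref_labels` would hold the reduced features and labels of correctly classified reference images.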

[Interactive results table: Adversarial Image | Generation Algorithm | Actual Class | Fooled Class | Nearest Neighbor | kNN score]

This work was partially supported by the Smart News (Social sensing for breaking news) project, co-funded by the Tuscany Region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.