Eigen-Distortions of Hierarchical Representations


A. Berardino, J. Ballé, V. Laparra and E. P. Simoncelli

To read the original paper, published in NIPS 2017, please click here.

Introduction


Human capabilities for recognizing complex visual patterns are believed to arise through a cascade of transformations, implemented by neurons in successive stages of the visual system. Several recent studies have suggested that representations of deep convolutional neural networks trained for object recognition can predict activity in areas of the primate ventral visual stream better than models constructed explicitly for that purpose (Yamins et al. [2014], Khaligh-Razavi and Kriegeskorte [2014]). On the other hand, several other studies have used synthesis techniques to generate images that indicate a profound mismatch between the sensitivity of these networks and that of human observers. Specifically, Szegedy et al. [2013] constructed image distortions, imperceptible to humans, that cause their networks to grossly misclassify objects. Similarly, Nguyen and Clune [2015] optimized randomly initialized images to achieve reliable recognition by a network, but found that the resulting 'fooling' images were uninterpretable by human viewers. Simpler networks, designed for texture classification and constrained to mimic the early visual system, do not exhibit such failures (Portilla and Simoncelli [2000]). These results have prompted efforts to understand why generalization failures of this type are so consistent across deep network architectures, and to develop more robust training methods to defend networks against attacks designed to exploit these weaknesses (Goodfellow et al. [2014]).

From the perspective of modeling human perception, these synthesis failures suggest that the representational spaces within deep neural networks deviate significantly from those of humans, and that methods for comparing representational similarity, based on fixed object classes and discrete sampling of the representational space, may be insufficient to expose these failures. Despite this, recent publications have proposed the use of deep networks trained on object recognition as models of human perception, explicitly employing their representations as perceptual metrics or loss functions (Henaff and Simoncelli [2016], Johnson et al. [2016], Dosovitskiy and Brox [2016]). If we are going to use such networks as models for human perception, we must reckon with this disparity.

Recent work has analyzed deep networks' robustness to visual distortions on classification tasks, as well as the similarity of the classification errors that humans and deep networks make in the presence of the same kind of distortion (Dodge and Karam [2017]). Here, we aim to accomplish something in the same spirit, but rather than testing on a set of hand-selected examples, we develop a model-constrained synthesis method for generating targeted test stimuli that can be used to compare the layer-wise representational sensitivity of a model to human perceptual sensitivity. Specifically, we utilize Fisher information to establish a model-derived prediction of sensitivity to local perturbations around a given natural image. For a given image, we compute the eigenvectors of the Fisher information matrix with the largest and smallest eigenvalues, corresponding to the model-predicted most- and least-noticeable image distortions, respectively. We then test the quality of these predictions by determining how well human observers can discriminate these same changes. We apply this method to six layers of VGG16 (Simonyan and Zisserman [2015]), a deep convolutional neural network (CNN) trained to classify objects, and compare the results to those derived from models explicitly trained to predict human sensitivity to image distortions: a generic 4-stage CNN, a fine-tuned version of VGG16, and a family of highly structured models explicitly constructed to mimic the physiology of the early human visual system.
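Concretely, the quantities involved can be written down directly. For a stochastic response model, the Fisher information matrix (FIM) is defined as below; under the simplifying assumption that model responses are perturbed by additive white Gaussian noise, it reduces to the Gram matrix of the model Jacobian:

```latex
% FIM of a stochastic response model r ~ p(r|x):
J[x] = \mathbb{E}_{r\mid x}\!\left[
         \left(\tfrac{\partial \log p(r\mid x)}{\partial x}\right)
         \left(\tfrac{\partial \log p(r\mid x)}{\partial x}\right)^{\!\top}
       \right]

% For r = f(x) + n, with n white Gaussian, this reduces to the
% Gram matrix of the Jacobian of the deterministic response f(x):
J[x] = \left(\tfrac{\partial f}{\partial x}\right)^{\!\top}
       \left(\tfrac{\partial f}{\partial x}\right)

% Eigen-distortions are the extremal directions of the quadratic
% form that approximates discriminability around the image x:
e_{\max} = \arg\max_{\|e\|=1} e^{\top} J[x]\, e , \qquad
e_{\min} = \arg\min_{\|e\|=1} e^{\top} J[x]\, e
```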

We find that the early layers of VGG16, a deep neural network optimized for object recognition, provide a better match to human perception than its later layers, and a better match than a generic 4-stage CNN trained on a database of human ratings of distorted image quality. On the other hand, we find that simple models of early visual processing, incorporating one or more stages of local gain control and trained on the same database of distortion ratings, provide substantially better predictions of human sensitivity than both the CNN and all layers of VGG16.
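For context, "local gain control" here refers to divisive normalization, in which each response is divided by a pooled measure of nearby activity. Below is a minimal single-stage sketch; the exponent, constant, and pooling size are illustrative placeholders, not the parameters fitted in the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_gain_control(image, sigma=0.1, p=2.0, size=5):
    # Divide each pixel's response by pooled activity in its spatial
    # neighborhood, so sensitivity adapts to local signal strength.
    pooled = uniform_filter(np.abs(image) ** p, size=size)
    return image / (sigma ** p + pooled) ** (1.0 / p)

normalized = local_gain_control(np.random.rand(64, 64))
```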

Predicting Discrimination Thresholds


Figure 1: Measuring and comparing model-derived predictions of image discriminability. Two models are applied to an image (depicted as a point x in the space of pixel values), producing response vectors rA and rB. Responses are assumed to be stochastic, and drawn from known distributions p(rA|x) and p(rB|x). The Fisher Information Matrices (FIM) of the models, JA[x] and JB[x], provide a quadratic approximation of the discriminability of distortions relative to an image (rightmost plot, colored ellipses). The extremal eigenvalues and eigenvectors of the FIMs (directions indicated by colored lines) provide predictions of the most and least visible distortions. We test these predictions by measuring human discriminability in these directions (colored points). In this example, the ratio of discriminability along the extremal eigenvectors is larger for model A than for model B, indicating that model A provides a better description of human perception of distortions (for this image).
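To make the procedure in Figure 1 concrete, here is a minimal sketch that computes eigen-distortions for a toy differentiable model via power iteration. The `model` below is a hypothetical stand-in, and the FIM is instantiated explicitly for brevity; for an actual network operating on full images, one would instead use Jacobian-vector products so the matrix is never formed.

```python
import numpy as np

def model(x):
    # Hypothetical stand-in for a differentiable image-computable model:
    # a pointwise nonlinearity followed by a small blur.
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(np.tanh(x), kernel, mode="same")

def jacobian(f, x, eps=1e-5):
    # Finite-difference Jacobian df/dx, shape (output_dim, input_dim).
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (f(xp) - y0) / eps
    return J

def eigen_distortions(f, x, n_iter=500):
    # Under additive white Gaussian response noise, the FIM is the Gram
    # matrix of the model Jacobian; its extremal eigenvectors are the
    # predicted most- and least-noticeable distortions.
    Jf = jacobian(f, x)
    F = Jf.T @ Jf

    v = np.random.randn(x.size)      # power iteration: largest eigenvector
    for _ in range(n_iter):
        v = F @ v
        v /= np.linalg.norm(v)
    lam_max = v @ F @ v              # Rayleigh quotient (v is unit norm)

    u = np.random.randn(x.size)      # power iteration on (lam_max*I - F);
    for _ in range(n_iter):          # since F is PSD, this converges to
        u = lam_max * u - F @ u      # the smallest eigenvector of F
        u /= np.linalg.norm(u)
    return v, u

x = np.random.rand(64)               # toy 1-D "image" for brevity
e_most, e_least = eigen_distortions(model, x)
```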


Potential Models of Perceptual Sensitivity

Below, we propose several different types of models of human perceptual sensitivity.
Click on a model to see a more detailed description of the motivation for testing it, along with the details of the model itself.
LEFT: We started by exploring whether representations within deep neural networks trained for object recognition could serve as models of human perceptual sensitivity.
RIGHT: We also optimized several different model architectures explicitly to predict human perceptual sensitivity (image quality assessment).

Deep Neural Networks trained for object recognition

Image Quality Assessment Models



Example Distortions
Click on any of the images below to see most- and least-noticeable Eigen-distortions for each of the models we tested.

Parrot

Hats

Bikes

Houses

Boats

Door
