Eigen-Distortions of Hierarchical Representations


Image Quality Assessment Models


Since we cannot probe the entire space of networks that achieve good results on object recognition, we instead probe a more general form of the latter question. Specifically, we train multiple models of differing architecture to predict human image quality ratings, and test their ability to generalize by measuring human sensitivity to their eigen-distortions.
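For readers who want the mechanics: the eigen-distortions of a model at an image are the extremal eigenvectors of the Fisher information matrix F = JᵀJ, where J is the Jacobian of the model response with respect to the image. Below is a minimal PyTorch sketch (function and parameter names are ours, assuming a differentiable model f); F is never formed explicitly, since Fv can be computed with one forward-mode and one reverse-mode pass.

```python
import torch
from torch.autograd.functional import jvp, vjp

def fisher_vector_product(f, x, v):
    _, Jv = jvp(f, x, v)   # forward-mode: J v
    _, Fv = vjp(f, x, Jv)  # reverse-mode: J^T (J v)
    return Fv

def max_eigendistortion(f, x, n_iter=100):
    # Power iteration on F yields the model-predicted most-noticeable
    # distortion direction; the least-noticeable direction can be found
    # by iterating on (lambda_max * I - F) instead.
    v = torch.randn_like(x)
    v /= v.norm()
    for _ in range(n_iter):
        v = fisher_vector_product(f, x, v)
        v /= v.norm()
    return v
```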

Generic Convolutional Neural Networks

We constructed a generic 4-layer convolutional neural network (CNN; 436,908 parameters; shown below). Within this network, each layer applies a bank of 5×5 convolution filters to the outputs of the previous layer (or, for the first layer, to the input image). The convolution responses are subsampled by a factor of 2 along each spatial dimension, and the number of filters at each layer is increased by the same factor to maintain a complete representation at each stage. Following each convolution, we employ batch normalization, in which all responses are divided by the standard deviation taken over all spatial positions and all layers, and over a batch of input images (Ioffe and Szegedy [2015]). Finally, outputs are rectified with a softplus nonlinearity. After training, the batch normalization factors are fixed to the global mean and variance computed across the entire training set.
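For concreteness, here is a minimal PyTorch sketch of such a network. The channel widths are illustrative placeholders (the paper's exact configuration yields the 436,908 parameters), and standard batch normalization is substituted for the paper's variant.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        # 5x5 filters, subsampled by a factor of 2 along each spatial dimension
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2),
        # standard batch normalization as a stand-in; the paper divides by a
        # standard deviation pooled over positions, layers, and the batch
        nn.BatchNorm2d(c_out),
        nn.Softplus(),  # rectifying nonlinearity
    )

# Four stages; channel count doubles as resolution halves.
cnn = nn.Sequential(
    conv_block(1, 16),
    conv_block(16, 32),
    conv_block(32, 64),
    conv_block(64, 128),
)
```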


Biologically Inspired models of the Visual Thalamus

We compare our generic CNN to a model reflecting the structure and computations of the lateral geniculate nucleus (LGN), the visual relay center of the thalamus. Previous results indicate that such models can successfully mimic human judgments of image quality (Laparra et al. [2017]). The full model (On-Off, shown below) is constructed from a cascade of linear filtering and nonlinear computational modules (local gain control and rectification). The first stage decomposes the image into two separate channels. Within each channel, the image is filtered by a difference-of-Gaussians (DoG) filter (2 parameters, controlling the spatial sizes of the Gaussians; the DoG filters in the On and Off channels are assumed to be of opposite sign). Following this linear stage, the outputs are normalized by two sequential stages of gain control, a known property of LGN neurons (Mante et al. [2008]). Filter outputs are first normalized by a local measure of luminance (2 parameters, controlling filter size and amplitude), and subsequently by a local measure of contrast (2 parameters, again controlling size and amplitude). Finally, the outputs of each channel are rectified by a softplus nonlinearity, for a total of 12 model parameters.
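A minimal sketch of the On-Off cascade follows. The parameter names and the exact divisive-normalization form are our assumptions, not the paper's; it is meant only to make the ordering of the stages concrete.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size, sigma):
    # Isotropic 2-D Gaussian filter, normalized to unit sum.
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def blur(x, sigma, size=21):
    return F.conv2d(x, gaussian_kernel(size, sigma), padding=size // 2)

def on_off(img, s_center, s_surround, s_lum, a_lum, s_con, a_con):
    # img: (1, 1, H, W) grayscale tensor in [0, 1].
    dog = blur(img, s_center) - blur(img, s_surround)  # center-surround filter
    responses = []
    for sign in (+1.0, -1.0):  # On and Off channels: opposite-signed filters
        z = sign * dog
        z = z / (1 + a_lum * blur(img, s_lum))      # luminance gain control
        z = z / (1 + a_con * blur(z.abs(), s_con))  # contrast gain control
        responses.append(F.softplus(z))             # rectification
    return torch.cat(responses, dim=1)
```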

To evaluate the necessity of each structural element of this model, we also test three reduced sub-models (Linear-Nonlinear (LN), Linear-Gain Control (LG), and Linear-Gain Control-Gain Control (LGG)), each obtained by removing stages from the full cascade above and each trained on the same data.


VGG16 Fine-tuned for Image Quality Assessment

Finally, we compare both of these models to a version of VGG16 (Simonyan and Zisserman [2015]) targeted at image quality assessment (VGG-IQA, shown below). This model computes a weighted mean squared error over all rectified convolutional layers of the VGG16 network (13 weight parameters in total), with the weights trained on the same perceptual data as the other models.
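A sketch of this distance using torchvision's pretrained VGG16 (variable names are ours); the 13 entries of w are the trainable weights, and the VGG filters themselves stay fixed.

```python
import torch
import torchvision.models as models

# Fixed, ImageNet-pretrained VGG16 feature stack: 13 convolutional
# layers, each followed by a ReLU.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def relu_activations(x):
    # Collect the output of every rectified convolutional layer.
    acts = []
    for layer in vgg:
        x = layer(x)
        if isinstance(layer, torch.nn.ReLU):
            acts.append(x)
    return acts

def vgg_iqa_distance(ref, dist, w):
    # Weighted mean squared error over the 13 rectified conv layers;
    # ref and dist are ImageNet-normalized (N, 3, H, W) batches.
    return sum(wk * torch.mean((a - b) ** 2)
               for wk, a, b in zip(w, relu_activations(ref),
                                   relu_activations(dist)))
```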

We trained all of the models on the TID2008 database, which contains a large set of original and distorted images, along with corresponding human ratings of perceived distortion (Ponomarenko et al. [2009]). Details of the optimization are included in the original paper.
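As an illustration only (the paper's actual objective may differ), one simple way to fit such a model is to treat the Euclidean distance between its responses to a reference and a distorted image as the predicted distortion, and regress that distance onto the human ratings.

```python
import torch

def training_loss(model, ref, dist, mos):
    # ref, dist: matched batches of reference / distorted images;
    # mos: the corresponding human ratings from TID2008.
    r, d = model(ref), model(dist)
    pred = (r - d).flatten(1).pow(2).sum(dim=1).sqrt()  # one distance per pair
    return torch.mean((pred - mos) ** 2)  # fit predicted distances to ratings
```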

Example Distortions
Click on any of the images below to see the most- and least-noticeable eigen-distortions for each of the models we tested.

Parrot

Hats

Bikes

Houses

Boats

Door
