Unsupervised learning of image manifolds with mutual information
D A Klindt, J Ballé, J Shlens and E P Simoncelli
Published in From Neuroscience to Artificially Intelligent Systems (NAISys), Nov 2020.
We propose a model layer implementing an overcomplete, convolutional linear expansion of the input signal, followed by divisive normalization (a form of local competition), and then a projection onto a low-dimensional embedding space. A topological organization in the embedding space is learned by maximizing a lower bound on the mutual information between the input and the representation (Oord, Li & Vinyals, 2018, arXiv). We also include a term that maximizes the marginal entropy of the normalized responses. Together, these terms encourage neurons that are equally utilized but have locally sparse responses within the embedding space. When model layers are stacked and the system is optimized end-to-end, it learns a representation of the tangent directions of the data manifold at increasing scales and levels of abstraction. The objective can also be combined with a supervised task loss.
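A minimal NumPy sketch of the two computations described above: divisive normalization of a linear expansion, and the InfoNCE lower bound on mutual information (Oord et al., 2018). The shapes, the normalization pool (all channels at a location), and the temperature are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def divisive_normalization(z, sigma=0.1):
    """Divide each unit's response by the pooled magnitude of all units
    at the same location (pooling scope is an assumption here).
    z: (batch, channels) linear responses."""
    pool = np.sqrt(sigma**2 + np.sum(z**2, axis=1, keepdims=True))
    return z / pool

def info_nce_lower_bound(u, v, temperature=0.1):
    """InfoNCE lower bound on I(u; v) for paired rows of u, v
    (batch, dim). Matched rows are positives; all other pairings
    in the batch act as negatives."""
    logits = (u @ v.T) / temperature                    # (batch, batch)
    m = logits.max(axis=1, keepdims=True)               # stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    n = len(u)
    # bound = log N + mean log-softmax score of the matched pairs (<= log N)
    return np.log(n) + np.mean(np.diag(log_probs))

# Toy forward pass: overcomplete expansion -> normalization -> 2-D embedding.
x = rng.normal(size=(8, 16))        # batch of inputs
W = rng.normal(size=(16, 64))       # overcomplete linear expansion (16 -> 64)
P = rng.normal(size=(64, 2))        # projection onto a 2-D embedding space
emb = divisive_normalization(x @ W) @ P
bound = info_nce_lower_bound(emb, emb)   # with identical views, bound <= log N
```

In training, `u` and `v` would come from two views (e.g. input and representation) and the bound would be maximized by gradient ascent; the marginal-entropy term acts on the normalized responses to keep all units in use.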
We train a model with 3 layers on both MNIST and CIFAR10. We fix a two-dimensional embedding space for visualization and, through a series of ablation experiments, demonstrate that: (1) Training with only the classification objective yields somewhat unstructured filters that are not organized in any discernible pattern; (2) Including the mutual information term produces more structured filters, but many of them lie in the same location in the embedding space, and thus do not fully utilize the capacity of the system; and (3) Additionally maximizing the marginal entropy of the normalized responses encourages full use of all neurons, and yields a solution with highly structured filters that approximately uniformly sample the data manifold, with clearly evident continuity of feature attributes. For example, the first layer contains oriented filters laid out topologically, similar to the orientation tuning maps found in primate V1.