Self-supervised learning of a visual texture representation for cortical area V2
N Parthasarathy and E P Simoncelli
Published in From Neuroscience to Artificially Intelligent Systems (NAISys), Nov 2020.
In this work, we develop a parametric functional model for V2. The model uses a first stage of oriented linear filters (corresponding to cortical area V1), consisting of both rectified units (simple cells) and pooled phase-invariant units (complex cells). These responses are provided as input to a V2 stage consisting of a set of learned convolutional filters, followed by half-wave rectification and pooling to generate V2 'complex cell' responses.
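To make the architecture concrete, here is a minimal PyTorch sketch of the two-stage cascade. The filter counts, kernel sizes, and pooling windows are illustrative assumptions, not the values used in the paper; the V1 filter bank is assumed to be a fixed set of quadrature-pair (even/odd phase) oriented filters supplied by the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class V1V2Model(nn.Module):
    """Sketch of the model: fixed V1 front end, single learned V2 stage.
    Hyperparameters here are hypothetical placeholders."""

    def __init__(self, v1_filters, n_v2_channels=32, v2_ksize=7, pool=4):
        super().__init__()
        # v1_filters: (n, 1, k, k) oriented filters, arranged as even-phase
        # filters followed by their odd-phase quadrature partners; held fixed.
        self.register_buffer("v1_filters", v1_filters)
        n_v1 = v1_filters.shape[0]
        # V1 output channels = on/off rectified simple cells (2 * n_v1)
        # plus phase-invariant complex cells (n_v1 // 2 quadrature pairs).
        n_v1_out = 2 * n_v1 + n_v1 // 2
        # Learned V2 convolutional filters (the only trained parameters).
        self.v2_conv = nn.Conv2d(n_v1_out, n_v2_channels, v2_ksize, padding="same")
        self.pool = pool

    def forward(self, img):
        # --- V1 stage (fixed) ---
        drive = F.conv2d(img, self.v1_filters, padding="same")
        # Rectified simple cells: separate on/off half-wave responses.
        simple = torch.cat([F.relu(drive), F.relu(-drive)], dim=1)
        # Complex cells: phase-invariant energy from quadrature pairs.
        even, odd = drive.chunk(2, dim=1)
        complex_ = torch.sqrt(even**2 + odd**2 + 1e-8)
        v1 = torch.cat([simple, complex_], dim=1)
        # --- V2 stage (learned) ---
        v2_simple = F.relu(self.v2_conv(v1))             # half-wave rectification
        v2_complex = F.avg_pool2d(v2_simple, self.pool)  # spatial pooling
        return v2_complex
```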
We optimize the filters in the V2 stage over a dataset of homogeneous texture images, using a novel learning objective that aims to separate texture image families in the V2 response space. Rather than use texture class labels as a supervision signal, we develop a more biologically plausible self-supervised objective function, inspired by contrastive learning methods developed in machine learning. The objective aims to maximize the distance between the distribution of V2 responses to each individual image and the distribution of responses across all images. We use this method to learn a single layer of V2 filters, but the layer-wise nature of the objective provides the potential to learn filters in multiple stages of a hierarchical model without requiring backpropagation of gradients across multiple layers.
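The sketch below shows one plausible, Fisher-style instantiation of this objective; the paper's exact loss may differ. It assumes each texture image contributes several V2 response samples (e.g., from spatial crops), and contrasts each image's response distribution against the pooled distribution over all images, normalized by the image's own spread. Because the loss is computed directly on V2 responses, gradients need only flow into the V2 stage, which is what makes the objective layer-local.

```python
import torch

def separation_loss(v2_responses):
    """Hypothetical instantiation of the self-supervised separation objective.

    v2_responses: (n_images, n_samples, d) V2 response vectors, with
    n_samples draws (e.g., spatial crops) per texture image.
    """
    per_image_mean = v2_responses.mean(dim=1)               # (n_images, d)
    global_mean = per_image_mean.mean(dim=0, keepdim=True)  # (1, d)
    # Between-image term: squared distance of each image's mean response
    # from the mean response across all images.
    between = ((per_image_mean - global_mean) ** 2).sum(dim=1)
    # Within-image term: spread of each image's samples about its own mean.
    within = ((v2_responses - per_image_mean.unsqueeze(1)) ** 2).sum(dim=2).mean(dim=1)
    # Maximize the between/within ratio, i.e., minimize its negative log.
    return -(torch.log(between + 1e-8) - torch.log(within + 1e-8)).mean()
```

Since this loss touches only the V2 responses, the same recipe could in principle be reapplied stage by stage to train deeper layers, consistent with the layer-wise training noted above.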
Our trained model successfully captures texture family invariances in a low-dimensional representation, surpassing the texture classification performance of deep supervised networks trained in small-data regimes. Moreover, we show that, relative to deep networks, the learned model exhibits stronger representational similarity to both the texture responses of neural populations (recorded in primate V2) and human texture perception.