Published in Computational and Systems Neuroscience (CoSyNe), Mar 2025.
Task-trained deep neural networks have emerged as leading models of neural responses in the primate visual system, especially for later stages of the ventral stream such as inferotemporal (IT) cortex. One major criticism of this approach is that the tasks used to train such networks rely on large numbers of labeled examples, and are thus not ecologically plausible. Recent approaches in representation learning circumvent the need for labeled examples, and match or surpass supervised learning methods on a variety of tasks. These approaches generally rely on supervision signals extracted from the data, rather than from human annotations, and are thus called "self-supervised". Many self-supervised learning methods aim to learn a representation that is (1) invariant to a set of transformations in the input space (often called augmentations) and (2) maximally discriminative across distinct input images. However, this training objective is not well aligned with known characteristics of visual perception: the transformations for which the network is encouraged to be invariant are generally quite perceptible to humans (Feather et al. 2023). Moreover, recent work has shown that the factorization of variability due to image transformations is more closely related to neural predictivity than the absence of such variability (Lindsey et al. 2024). In this work we present a novel self-supervised learning method (Yerxa et al. 2024) that trades off invariance (discarding information about the input transformation) against equivariance (maintaining information about the input transformation). Specifically, a representation is said to be equivariant with respect to some transformation of the inputs if the same transformation, applied to different inputs, results in the same change in the representation. We demonstrate that our equivariant learning approach produces representations that contain more "category orthogonal" (Hong et al. 2016) information, better factorize the sources of variability in the datasets, and better predict neural activity in visual area IT.
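The notion of equivariance used above can be illustrated with a toy example. The sketch below (our own illustration with a linear map and an additive shift as the "augmentation", not the method of the abstract) checks that the transformation induces the same change in the representation regardless of the input, whereas invariance would require that change to be zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "representation": f(x) = W @ x.
W = rng.standard_normal((4, 8))

# Input transformation ("augmentation"): a fixed shift in input space.
v = rng.standard_normal(8)
transform = lambda x: x + v

# Two distinct inputs.
x1 = rng.standard_normal(8)
x2 = rng.standard_normal(8)

# Change in the representation induced by the transformation.
d1 = W @ transform(x1) - W @ x1
d2 = W @ transform(x2) - W @ x2

# Equivariance (in the sense above): the same transformation, applied to
# different inputs, produces the same change in the representation.
assert np.allclose(d1, d2)  # both equal W @ v

# Invariance would instead require the change to be zero; e.g. project
# the v direction out of the rows of W so the shift becomes invisible.
W_inv = W - np.outer(W @ v, v) / (v @ v)
assert np.allclose(W_inv @ transform(x1), W_inv @ x1)
```

A linear map is equivariant to additive shifts for free; the interest of a learned trade-off is that nonlinear deep networks must be explicitly encouraged to preserve (or discard) transformation information in this structured way.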