Towards aligning artificial and biological vision systems with self-supervised representation learning

Nikhil Parthasarathy.

PhD thesis,
Jan 2024.

Download:
  • Reprint (pdf)

  • This dissertation investigates self-supervised learning methods for training deep neural network (DNN) models that better align with both human behavior and neural responses in early visual areas. We first propose VITO, an attention-guided contrastive video-pretraining method that improves dramatically on prior work, learning general, robust, and more human-aligned representations from natural video data. We specifically demonstrate that dynamic temporal content is required for the improved robustness and human alignment. We next explore a complementary line of work focused on improving the alignment of intermediate DNN representations with early visual areas. We first provide a simple demonstration that selectivity for visual texture can be learned by optimizing a single-layer objective in a biologically inspired architecture modeled on areas V1 and V2. We then refine and extend this study to a more general setting, capable of learning features simultaneously in a two-layer network, by leveraging a novel self-supervised layerwise complexity-matched learning paradigm. Our trained model provides better predictions of neural responses in early visual areas and, in particular, achieves state-of-the-art predictions for cortical area V2. Finally, we provide preliminary analyses probing the limitations of current regression-based evaluations for measuring alignment with neural responses. Taken together, this thesis lays the foundation for future research that uses learned DNNs to reveal new organizing principles for how selectivities are formed in visual hierarchies, with potential implications for both neuroscience and machine learning.