Perceptual evaluation of artificial visual recognition systems using geodesics

O J Hénaff, R L T Goris and E P Simoncelli

Published in Computational and Systems Neuroscience (CoSyNe), Feb 2016.

Recognition is a demanding visual task because image-domain renderings of any given object vary widely. Successful recognition requires representations that distinguish different items while exhibiting invariance (or at least "tolerance") to naturally occurring variations (DiCarlo & Cox 2007). A new class of artificial systems, deep neural networks, performs real-world visual recognition with surprising accuracy. However, the invariances of these representations have not been compared with those of the human visual system. Here, we propose a new synthesis methodology for comparing the invariances of machine representations to those of human vision. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images that follows a minimal-length (geodesic) path in the response space of the representation. We hypothesize that if the human visual system has the same invariances, then this sequence should also trace a minimal-length perceptual path. Candidate representations can thus be compared by assessing which one produces the perceptually shortest sequence. We apply this paradigm to the simple test case of a pair of translated images, and find that a current state-of-the-art model for object recognition generates unnatural-looking geodesic sequences, with easily discriminable successive frames. On the other hand, replacing the max-pooling operation in this network with a more physiologically plausible quadratic pooling leads to smoother and more natural sequences, in which successive frames are difficult to discriminate. This can be quantified more precisely with formal psychophysical measurements, but even informal viewing demonstrates that the latter sequence follows a perceptually "shorter" path, and thus that the associated model better captures this invariance of human vision.
Our method is general, and can be used to study transformations between arbitrary pairs of images, thus offering a thorough and principled means of comparing representations as models of biological vision.
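To make the synthesis idea concrete, the following is a minimal sketch (not the authors' implementation) of geodesic synthesis in a toy setting: a hypothetical differentiable "representation" `f` stands in for a network's response function, and gradient descent on the discrete path energy (the sum of squared response-space steps) deforms a pixel-linear interpolation between two fixed endpoint images into an approximate geodesic. All names and parameter choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a network's response function:
# a fixed random linear map followed by a pointwise nonlinearity.
W = rng.standard_normal((32, 16)) / 4.0

def f(x):
    """Map an 'image' vector x into the model's response space."""
    return np.tanh(W @ x)

def jacobian_T(x):
    """Transposed Jacobian of f at x: W^T diag(1 - tanh^2(Wx))."""
    return W.T * (1.0 - np.tanh(W @ x) ** 2)

def path_energy(path):
    """Discrete path energy: sum of squared response-space steps."""
    resp = np.array([f(x) for x in path])
    return float(np.sum((resp[1:] - resp[:-1]) ** 2))

def geodesic(x_a, x_b, n_steps=8, n_iters=500, lr=0.05):
    """Approximate a geodesic between x_a and x_b in f's response
    space by gradient descent on interior frames; endpoints fixed."""
    ts = np.linspace(0.0, 1.0, n_steps)
    path = [(1 - t) * x_a + t * x_b for t in ts]  # pixel-linear init
    for _ in range(n_iters):
        resp = [f(x) for x in path]
        for t in range(1, n_steps - 1):
            # Gradient of the two adjacent energy terms w.r.t. path[t]:
            # 2 J(x_t)^T (2 f(x_t) - f(x_{t-1}) - f(x_{t+1}))
            g = 2.0 * jacobian_T(path[t]) @ (
                2.0 * resp[t] - resp[t - 1] - resp[t + 1])
            path[t] = path[t] - lr * g
    return path
```

In this sketch the optimized sequence has lower response-space path energy than the pixel-domain interpolation it started from; with a real network, `f` and `jacobian_T` would be replaced by the model's forward pass and backpropagated gradients, and the resulting frame sequences could then be compared perceptually across candidate models.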