Metamers of the ventral stream
J Freeman and E P Simoncelli
Published in Computational and Systems Neuroscience (CoSyNe), (T-28), Feb 2010.
DOI: 10.3389/conf.fnins.2010.03.00053
This paper has been superseded by:
Starting from any prototype image, we generate stimuli that are matched in the responses of a simple model of extrastriate ventral computation. The model is based on measurements previously used to characterize visual texture (Portilla & Simoncelli, 2000). It decomposes an image with a bank of V1-like filters tuned for local orientation and spatial frequency, computing both simple- and complex-cell responses. Extrastriate responses are then computed by taking pairwise products among these V1 responses and averaging them within overlapping spatial regions that grow with eccentricity. Stimuli are generated by gradient descent, adjusting a random (white-noise) image until it matches the model responses of the original prototype. Previous work showed that the same statistics, averaged over an entire image, allow for the analysis and synthesis of homogeneous visual textures.
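As a rough illustration of the synthesis procedure described above, the sketch below matches pooled pairwise-product statistics by gradient descent. The function `v1_responses` and the `pooling_masks` windows are hypothetical stand-ins for the filter bank and the eccentricity-dependent pooling regions; this is a minimal PyTorch sketch under those assumptions, not the published implementation.

```python
import torch

def pooled_statistics(img, v1_responses, pooling_masks):
    """Pairwise products of V1-like responses, averaged within each pooling region.

    `v1_responses` (hypothetical) maps an image to a (channels, H, W) tensor of
    simple/complex-cell responses; each mask in `pooling_masks` (hypothetical) is
    an (H, W) window that sums to 1 and grows in size with eccentricity.
    """
    resp = v1_responses(img)
    stats = []
    for mask in pooling_masks:
        weighted = resp * mask                               # weight responses by the window
        corr = torch.einsum('chw,dhw->cd', weighted, resp)   # windowed pairwise products
        stats.append(corr.flatten())
    return torch.cat(stats)

def synthesize(prototype, v1_responses, pooling_masks, n_steps=2000, lr=0.01):
    """Adjust a white-noise image so its pooled statistics match the prototype's."""
    target = pooled_statistics(prototype, v1_responses, pooling_masks).detach()
    img = torch.randn_like(prototype, requires_grad=True)    # random starting image
    opt = torch.optim.SGD([img], lr=lr)                      # plain gradient descent
    for _ in range(n_steps):
        opt.zero_grad()
        diff = pooled_statistics(img, v1_responses, pooling_masks) - target
        (diff ** 2).sum().backward()                         # squared error in statistics
        opt.step()
    return img.detach()
```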
If this model accurately reflects representations in early extrastriate areas, then images synthesized to produce identical model responses should be metameric to a human observer. For each of several natural images and pooling-region sizes, we generate multiple samples that are statistically matched but otherwise as random as possible. We use a standard psychophysical task to measure observers' ability to discriminate between image samples, as a function of the rate at which the statistical pooling regions grow with eccentricity. When image samples are statistically matched within small pooling regions, observers perform at chance (50%), failing to notice substantial differences in the periphery. When images are matched within larger pooling regions, discriminability approaches 100%. We fit the psychometric function to estimate the pooling region over which each observer computes statistics. The result is consistent with receptive field sizes in macaque mid-ventral areas (particularly V2).
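The psychometric fit can be illustrated with a short sketch. The data values below are invented placeholders, not results from the study, and the Weibull-style function rising from 50% chance is one common choice for this kind of fit, stated here as an assumption rather than the exact function used.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data (illustrative only): proportion correct as a function of the
# rate ("scaling") at which pooling regions grow with eccentricity.
scaling = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
prop_correct = np.array([0.51, 0.55, 0.68, 0.85, 0.95, 0.98])

def psychometric(s, s_crit, beta):
    """Weibull-style function rising from chance (0.5) toward 1.0."""
    return 0.5 + 0.5 * (1.0 - np.exp(-(s / s_crit) ** beta))

(s_crit, beta), _ = curve_fit(psychometric, scaling, prop_correct, p0=[0.7, 2.0])
print(f"estimated critical scaling: {s_crit:.2f} (slope {beta:.2f})")
```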
Our model also fully instantiates a recently proposed explanation (Balas et al., 2009) of the phenomenon of "visual crowding", in which humans fail to recognize a peripheral target object surrounded by background clutter. In our model, crowding occurs because multiple objects fall within the same pooling region, so the model responses cannot uniquely identify the target object. We synthesize images that are metameric to classic crowding stimuli (e.g., groups of letters) and find that stimulus configurations that produce crowding yield synthesized images with jumbled, unidentifiable objects.
References:
Balas B, Nakano L, Rosenholtz R (2009). A summary-statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9(12):13, 1-18.
Portilla J, Simoncelli EP (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1), 49-70.