Perception Lecture Notes: Depth, Size, and Shape

Professor David Heeger

What you should know about this lecture

Stereo Vision

Stereopsis: Greek for "solid sight".

Hold up your two index fingers, one fairly close to your face and one as far away as you can reach. Fixate the more distant finger, and alternately view the scene with just your left eye and then just your right (closing the other eye each time). As you can see, the separation between the two fingers is different in your left eye's view than in your right eye's; their relative positions on the two retinae are disparate.

Binocular disparity is defined as the difference in the location of a feature between the right eye's and left eye's images. The amount of disparity depends on the depth (i.e., the difference between the distance to the object and the distance to the point of fixation), and hence it is a cue that the visual system uses to infer depth. Wheatstone (1838) was the first to figure this out. Before that, people were confused; they thought that having two eyes posed a problem, because they could not explain how you see only a single image when viewing the world with two eyes. Wheatstone correctly pointed out the advantage of having two eyes for seeing objects in 3D depth. However, disparity also depends on the distance to the fixation point, so disparities must be further interpreted using an estimate of the fixation distance.
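As a rough worked example of this geometry (the numbers below are assumed, not taken from the lecture), the vergence angle to a point at distance d is roughly I/d radians for an interocular separation I, and an object's disparity is approximately the difference between its vergence angle and that of the fixation point:

    # Rough numerical sketch of binocular disparity geometry (assumed values;
    # small-angle approximation, not an exact model of the eyes' optics).
    import math

    I = 0.065      # interocular separation in meters (a typical assumed value)
    d_fix = 1.0    # distance to the fixation point, in meters
    d_obj = 1.2    # distance to a second object, in meters (farther than fixation)

    # Disparity = difference between the vergence angles of the two points.
    disparity_rad = I * (1.0 / d_fix - 1.0 / d_obj)
    print(f"disparity ~ {math.degrees(disparity_rad):.3f} deg (uncrossed)")

    # The same 0.2 m depth difference at twice the viewing distance produces a
    # much smaller disparity, which is why disparity must be interpreted using
    # an estimate of the fixation distance.
    print(f"at 2 m fixation: ~ {math.degrees(I * (1/2.0 - 1/2.2)):.3f} deg")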

Horopter: imaginary 3D surface in the room in front of you that includes the object you are fixating on and all other points in 3D space that project to corresponding positions in the two retinae. The geometric horopter (the set of points with zero disparity) is a circle that includes the fixation point and the optical centers (lenses) of the two eyes.

Uncrossed disparity: An object farther away from you than the horopter has uncrossed disparity. You'd have to uncross (diverge) your eyes to fixate on it. It lies further to the right from the right eye's viewpoint than from the left eye's viewpoint.

Crossed disparity: An object closer than the horopter has crossed disparity. You'd have to cross (converge) your eyes to fixate on it. It lies further to the left from the right eye's perspective.

Stereoscope: One way to view stereo image pairs is to use a mirror stereoscope. If you put your face in front of a pair of angled mirrors, and put two slightly different pictures off to the sides, your left eye will see the left picture (E') and your right eye will view the right-hand picture (E).

Stereogram: A pair of images (such as E/E' above) that are viewed using a stereoscope (or a red-green anaglyph). The two images in a stereogram are slightly different, with features in one image shifted to slightly different positions in the other image. The shifts mimic differences which ordinarily would exist between the views of genuine 3D objects.

There are lots of ways to make and view stereograms. The basic concept is to present slightly different images to the two eyes. One way is to superimpose two half images, one in red and one in green. Viewed through red-green glasses, one eye sees the red image and the other eye sees the green image.

Stereograms have been part of popular culture in every generation since Wheatstone. Brewster stereoscopes (a different design, but the same concept) were popular around 1900 with photographed stereo pairs. 3D movies were viewed with red-green anaglyph glasses in the 1950s. Current 3D movies are usually viewed with polarized glasses instead of red-green ones so that the movies can be in color. Another recent technique is the "magic eye" autostereogram.

Random-dot stereogram: The random-dot stereogram was invented by Bela Julesz, a perceptual psychologist who was very influential over the past 30 years. In the example below, with anaglyph glasses you would see a square-shaped surface floating in depth in front of a background. Both the foreground square and the background have little dots painted on them in random locations.

This has important consequences. It indicates that stereopsis does not require recognizable monocular forms or any other depth cue: the visual system can compute depth from binocular disparity alone, matching the dots between the two eyes' images before the objects themselves are identified.

To construct a random-dot stereogram, you first place a bunch of dots randomly in an image and then make two copies of it. In one copy, shift a central square region to the left; in the other copy, shift the same square region to the right. This leaves a hole in each image (the strip uncovered where the square used to be). Fill the holes with new random dots. Why do you see it in 3D? The shift mimics the differences that ordinarily exist between the two views of a genuine 3D object. The extra dots (X and Y above) correspond to those parts of the background that one eye can see but that are occluded from the other eye's view by the foreground square.
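A minimal sketch of this construction in Python follows; the image size, the amount of shift, and the assignment of the two half-images to the two eyes are assumptions chosen only for illustration.

    # Minimal random-dot stereogram construction, following the recipe above.
    # Sizes, shift, and eye assignment are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    size, shift = 128, 4                        # image size and shift, in pixels
    base = rng.integers(0, 2, (size, size))     # random black/white dots

    left = base.copy()
    right = base.copy()

    r0, r1, c0, c1 = 32, 96, 32, 96             # central square region
    square = base[r0:r1, c0:c1].copy()

    # Shift the square in opposite directions in the two half-images.
    left[r0:r1, c0 + shift:c1 + shift] = square     # shifted right in the left eye's image
    right[r0:r1, c0 - shift:c1 - shift] = square    # shifted left in the right eye's image

    # Fill the "holes" uncovered by the shift with new random dots.
    left[r0:r1, c0:c0 + shift] = rng.integers(0, 2, (r1 - r0, shift))
    right[r0:r1, c1 - shift:c1] = rng.integers(0, 2, (r1 - r0, shift))

    # With this eye assignment the square has crossed disparity, so when the two
    # half-images are fused it should appear to float in front of the background;
    # swapping the two images reverses the depth.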

How does the visual system see depth in a random-dot stereogram? One hypothesis is that the visual system matches up features of similar shape, size, contrast, etc. to estimate disparity. But there can be many potential matches: in principle, each dot in one row of one half-image could match a large number of dots in the other half-image.

The problem of resolving this ambiguity is known as the problem of global stereopsis, because the brain must find the correct overall (global) set of matches; it can't just find a mate for each feature independently. Global stereopsis is not just an issue for random-dot stereograms: natural scenes (e.g., a tree full of leaves, a carpet) contain many similar features. The visual system "solves" the global stereopsis problem by using additional constraints. For example, nearby points in the image are usually at nearby positions in depth, and hence have nearly the same disparity.
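As a toy illustration of that constraint (a standard block-matching sketch from computer vision, not a claim about the brain's actual algorithm), matching small windows of dots rather than single dots implicitly assumes that nearby points share nearly the same disparity, and that assumption removes most of the ambiguity:

    # Toy window-based matching on a random-dot stereogram (illustrative only;
    # parameters are assumptions, and this is not a model of visual cortex).
    import numpy as np

    def match_disparity(left, right, row, col, max_disp=8, win=5):
        """Return the horizontal shift that best aligns a small window around
        (row, col) in the left image with the right image, by minimizing the
        sum of squared differences. Call it at interior points, away from the
        borders. A single dot matches many places; a window of dots matches few."""
        h = win // 2
        patch = left[row - h:row + h + 1, col - h:col + h + 1].astype(float)
        best_d, best_err = 0, np.inf
        for d in range(-max_disp, max_disp + 1):
            cand = right[row - h:row + h + 1, col - h + d:col + h + 1 + d].astype(float)
            err = np.sum((patch - cand) ** 2)
            if err < best_err:
                best_d, best_err = d, err
        return best_d

    # Using the left/right half-images from the sketch above, this returns about
    # -2 * shift inside the central square and 0 on the surrounding background.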

Autostereogram: The autostereogram is also known as a "magic eye" stimulus. The trick, as always, is to present slightly different images to the two eyes. The autostereogram works by using repetitive patterns. To see depth in an autostereogram, you need to either cross or diverge your eyes so that they fixate on two different repeats of the repetitive pattern; in this way, you effectively present two different images to the two eyes. A simple example is the wallpaper illusion: if you view vertically striped wallpaper and fixate one eye on one stripe and the other eye on another stripe, the stripes appear to pop out in depth in front of the wall. This is hard to do because you need to fixate a point that is effectively in front of (or behind) the picture while focusing/accommodating on the picture itself. Note that the perceived depth is reversed if you cross-fuse an autostereogram rather than diverge-fuse it.
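Here is a bare-bones sketch of that idea (my own illustration with assumed parameters, not a full "magic eye" algorithm, which handles hidden surfaces more carefully): start from random dots and make the horizontal repeat period slightly shorter wherever a hypothetical depth map says the surface is nearer.

    # Bare-bones autostereogram: a repeating random pattern whose repeat period
    # is locally modulated by an assumed depth map (nearer -> shorter repeat).
    import numpy as np

    rng = np.random.default_rng(1)
    height, width, period = 200, 300, 60
    depth = np.zeros((height, width), dtype=int)
    depth[60:140, 100:200] = 10                   # a raised rectangle, in pixels of shift

    img = np.zeros((height, width))
    for y in range(height):
        for x in range(width):
            rep = period - depth[y, x]            # local repeat period
            img[y, x] = img[y, x - rep] if x >= rep else rng.random()

    # Free-fusing two adjacent repeats of the pattern (crossing or diverging the
    # eyes by roughly one period) should make the rectangle stand out in depth.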

Stereoblindness: 10% of people are stereoblind. Some are totally stereoblind; some are blind only to crossed or only to uncrossed disparities. Some stereoblindness is caused by strabismus (wandering eye). If strabismus is not treated at a very early age (infancy), binocular vision never develops properly. Some people with strabismus end up with amblyopia (sometimes called lazy eye). Amblyopia is a cortical blindness: a general term for a visual deficit that has nothing to do with the optics or structure of the eye and retina. In amblyopia, the brain basically ignores the inputs from one eye. Other people with untreated strabismus end up as alternate fixators, who can see with either eye but never use both at the same time. That is, they first look at you with their left eye (while the right eye is diverged), then switch and look at you with their right eye (while the left eye is diverged). In either case, there is no binocular vision and no stereopsis.

Stereovision in the brain

If you record with a microelectrode from V1 neurons while an animal views oriented lines presented separately to the two eyes, and you vary the disparity between the lines, you find that some neurons are selective for particular disparities.

This neuron does not respond at all when a line is shown to one eye at a time. To get a response, the line must be presented simultaneously to both eyes, and it must have the correct orientation, direction of motion, and binocular disparity; in this case, the preferred disparity is about 1/2 deg of visual angle.

Neurons differ in the binocular disparities to which they are tuned. This is a histogram graphing the distribution of disparity preferences. Many neurons are tuned for zero disparity (on the horopter), but in addition there are neurons tuned for a range of crossed and uncrossed disparities. The picture that emerges is that for each location in the visual field, there is a collection of neurons selective for all different orientations, directions of motion, and binocular disparities. One way this might happen is for a neuron to have a binocular receptive field that sums inputs from slightly displaced positions in the two eyes.
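A schematic sketch of that last idea (a deliberate simplification with made-up receptive field parameters, not a fitted model of any real V1 neuron): the unit sums the output of a left-eye receptive field at one position and a right-eye receptive field displaced by the unit's preferred disparity, then passes the sum through an output nonlinearity.

    # Schematic disparity-selective unit: sums inputs from slightly displaced
    # receptive field positions in the two eyes (a simplification, not a real cell).
    import numpy as np

    def gabor(x, center, sigma=0.2, freq=4.0):
        """1D Gabor-like receptive field profile (positions x and center in deg)."""
        return np.exp(-(x - center) ** 2 / (2 * sigma ** 2)) * np.cos(2 * np.pi * freq * (x - center))

    def binocular_response(left_img, right_img, x, preferred_disparity=0.5):
        """Left-eye field at 0 deg plus right-eye field displaced by the preferred
        disparity, followed by half-squaring. Responds best when the right eye's
        stimulus is shifted by the preferred disparity relative to the left's."""
        drive = np.dot(gabor(x, 0.0), left_img) + np.dot(gabor(x, preferred_disparity), right_img)
        return max(drive, 0.0) ** 2

    x = np.linspace(-2.0, 2.0, 401)                          # positions in deg
    bar_left = np.where(np.abs(x) < 0.02, 1.0, 0.0)          # thin bright bar at 0 deg
    bar_right = np.where(np.abs(x - 0.5) < 0.02, 1.0, 0.0)   # same bar, 0.5 deg disparity
    print(binocular_response(bar_left, bar_right, x))        # strong response
    print(binocular_response(bar_left, np.roll(bar_right, 60), x))  # wrong disparity: weaker

    # Note: a purely linear sum like this still responds somewhat to one eye
    # alone; capturing the strict binocular requirement described above needs
    # additional nonlinear interactions between the two eyes' inputs.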

Fusion

You have two eyes, and hence two visual fields. A big question in the early 19th century (before Wheatstone) was: given that we have two eyes, why don't we always see two views of the world? Answer: the two images are combined in the brain to yield a single, unified perceptual experience.

Panum's fusional area is the range of disparities, or equivalently the range of depths in 3D space on either side of the horopter, over which the visual system can successfully fuse the two views. If the disparity is small enough, within Panum's fusional area, then the visual system succeeds in fusing the two views. If the disparity is too large, then the neurons in the brain cannot cope with it to create single vision and you get diplopia, suppression, or binocular rivalry.

Diplopia (double vision): Look at a distant object with both eyes open. While fixating that object, put your index finger about 6 inches in front of your face. You will see two index fingers (one from the left eye's image and one from the right).

Suppression: This is what normally happens when the retinal disparity is too big (outside of Panum's fusional area). One eye's view dominates. That one is perceived. And the other eye's view is suppressed from awareness.

Binocular rivalry: This is a phenomenon we experience when the two eyes' views are very different from one another. One eye's view dominates for several seconds and is then replaced by that of the other eye. For example, if a horizontal grating is presented to one eye and a vertical grating to the other, you might first see the horizontal grating for a few seconds, then a mixture, then the vertical grating for a few seconds, and so on. The phenomenon of binocular rivalry is of particular interest in studying consciousness/visual awareness because the physical stimuli (the two gratings) do not change, yet the conscious percept changes dramatically over time. Moreover, you have no conscious control over the percept; you cannot, by force of will, cause the percept to switch from one view to the other.

Summary: There are two ways to have single vision. (1) Small disparity yields fusion and stereopsis. (2) Large disparity often causes one eye's view to be suppressed. Binocular rivalry is a special case of suppression in which the suppression switches back and forth between the two views. Note that you can have stereopsis in part of the visual field, diplopia in another part of the visual field and rivalry/suppression in yet another part of the visual field, all simultaneously.

Binocular rivalry in the brain

Neural correlates of conscious perception can be measured experimentally using perceptual illusions in which the percept is dissociated from the physical stimulus. Binocular rivalry is an example of such a perceptual illusion that has been used for research on the neural correlates of consciousness. Binocular rivalry occurs when different visual stimuli are presented simultaneously to the two eyes. Typically, awareness of one or the other stimulus is suppressed, so that we are consciously aware of only one stimulus at a time, never both. One eye's view dominates consciousness for several seconds, only to be replaced by the other eye's view. What makes binocular rivalry so remarkable is that the perceptual experience fluctuates while the physical stimulus remains constant. Because of this dissociation, binocular rivalry presents a unique opportunity for studying the neural correlates of consciousness.

According to one idea, binocular rivalry occurs because neurons in the early stages in visual processing respond to the physical stimulus of each eye, whereas neurons in later stages of visual processing are switched on and off and cause the perceptual alternations. Somewhere between these early and later stages the neuronal signals conveying one of the two stimuli are suppressed, as if there was a “gate” to visual consciousness. V1 is the first place in the visual processing pathways of the brain where this "gate" could be located, because it is the first place where the signals from the two eyes come together.

Does such a gate exist? If so, what neurons in the brain have this gating function? Are the neurons localized in particular brain areas? Are they a particular cell type? Does the gating occur through modulation of the cells’ firing rate or some other component of their responses (e.g., spike timing, synchronous firing)? What are the neural circuits and neural computations that support the competition between the two stimuli?

Although we do not yet have firm answers to these questions, we have begun to study this in my lab with fMRI. We capitalized on an interesting aspect of the perceptual phenomenon: during a perceptual alternation, one typically perceives a traveling wave in which the dominance of one pattern emerges initially at one location and expands progressively as it renders the other pattern invisible. Note again that there is no physical change in the stimulus while this conscious perceptual change is taking place: it is all "in your head". This experiment established that waves of activity in primary visual cortex (V1) accompanied the perceptual changes during binocular rivalry. Relying on the fact that visual cortex is retinotopically organized (neurons at nearby locations in visual cortex respond to nearby locations in the visual field), we showed that neural activity propagated over subregions of visual cortex in a manner that correlated with the dynamic perceptual changes experienced during binocular rivalry.

This movie shows a demonstration of our results. Right, Example of the temporal sequence of an observer’s perceptual experience. Left, fMRI responses. Gray scale, anatomical image passing through the posterior occipital lobe, roughly perpendicular to the Calcarine sulcus. Yellow highlights, V1 gray matter regions exceeding 75% of the peak response. Pay attention to the left hemisphere. The lower lip of the Calcarine sulcus responds first followed by the upper lip of the sulcus. Then the activity in the lower lip of the sulcus subsides before the activity in the upper lip does so. This activity in the brain (left) corresponds nicely to the dynamics of the perceptual changes (right), given the retinotopic organization of V1 (lower lip of left-hemisphere Calcarine sulcus corresponds to upper-right quadrant of visual field and upper lip of left-hemisphere Calcarine sulcus corresponds to lower-right quadrant of visual field). The same would be evident in the right hemisphere, but in a different slice through the brain. Red curve, fMRI responses measured from a subregion of V1 (lower lip of the left hemisphere Calcarine sulcus) corresponding retinotopically to the upper-right quadrant of the stimulus annulus. Green curve, fMRI responses measured from a subregion of V1 (upper lip of the Calcarine sulcus) corresponding retinotopically to the lower-right quadrant of the stimulus annulus. Yellow again indicates when the responses in each subregion exceeded 75% of their respective peaks. The green curve is delayed in time and larger in amplitude than the red curve, as expected from a travelling wave.

Indeed, activity in a number of brain areas correlates with the perceptual alternations of binocular rivalry, including not only primary visual cortex but also higher-order visual cortical areas in the inferior temporal lobe, and areas in parietal and prefrontal cortex. It is likely that these different cortical areas play different roles in visual perception during binocular rivalry.

Pictorial Depth Cues

With only one eye open, you still see with a sense of depth, but there is inherent ambiguity between size and distance.

What does the visual system do to deal with this ambiguity? Your visual system relies on multiple cues for estimating/inferring distance, depth, and 3D shape. There is a large set of such cues: relative size, occlusion, cast shadows, shading, dynamic shadows (shadow motion), aerial perspective, linear perspective, texture perspective, and height within the image. Most of these are based on the concept illustrated above: the size of the retinal image of an object is proportional to the object's size, but inversely proportional to the distance to the object.
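A small numerical illustration of that relation (the sizes and distances below are arbitrary assumptions):

    # Visual angle subtended by an object: grows with object size, shrinks with
    # distance (values below are arbitrary, for illustration only).
    import math

    def visual_angle_deg(object_size_m, distance_m):
        return math.degrees(2 * math.atan(object_size_m / (2 * distance_m)))

    print(visual_angle_deg(1.8, 2.0))   # a 1.8 m person at 2 m: ~48.5 deg
    print(visual_angle_deg(1.8, 4.0))   # the same person at 4 m: ~25.4 deg

    # Doubling the distance roughly halves the visual angle, so retinal image
    # size by itself cannot distinguish a small near object from a large far one.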

The girl on the left is actually almost twice as far away from the observer as the man on the right. However, when the room is viewed through the peephole, the actual distances cannot be seen. Since you perceive the two people to be at the same distance from you, the one who subtends the larger visual angle appears larger.

Texture provides 3 cues about shape/distance:

Brightness of a surface depends on its orientation with respect to the light source. The visual system assumes that the light comes from above. Brighter patches appear to be tilted up facing the light.

The interpretation of shape from shading interacts with the interpretation of shape from contours. These two images have the same shading, but different bounding contours, and you see different shapes.

The 3 white squares are identical to one another and they are placed in exactly the same way with respect to the checkerboard grid underneath. Only the shadows differ, giving the impression that the square on the right is floating higher above the checkerboard.

Linear perspective is another monocular depth cue. The distance between the rails is constant in the 3D scene but gets smaller and smaller in the image. This is a cue for distance. The visual system uses this to compare the sizes of objects. The two lines are the same length but the one on top appears bigger because it is seen as being further away and the visual system is compensating for the perspective. This compensation for distance in interpreting size is known as "size constancy".

Analogous to brightness constancy and brightness contrast, or color constancy and color contrast, we also experience size constancy and size contrast. The size of an object is interpreted relative to the objects around it and in the context of the other cues (e.g., linear perspective) for size and distance. The man in the pictures above is physically the same size in both photos (measure him) and he appears normal in size on the left but tiny on the right. The center circles in the drawings below are the same size but the one on the left looks bigger because it is surrounded by small circles and the one on the right looks smaller because it is surrounded by large circles.

A central principle of object perception is that we see objects in a three-dimensional world. If there is an opportunity to interpret a drawing or an image as a three-dimensional object, we do. The visual system compensates for perspective in making judgements about size. It is striking that we are so unaware of this. We have a tendency to interpret shape and size in 3D, often unaware of the 2D size. The two table tops above have precisely the same two-dimensional shape on the page, except for a rigid rotation. Nobody believes this when they first look at the illusion. The illusion shows that we don't see the two-dimensional shape drawn on the page, but instead we see the three-dimensional shape of the object in space.

Monocular Physiological Cues

When we fixate an object, we typically accommodate to it, i.e., change the power of the lens in our eyes to bring that object into focus. The accommodative effort is a weak cue to depth. Once we've accommodated to that distance, objects that are much closer to or farther from us than that distance are out of focus on our retina. Thus, blur is a cue that objects are at a different distance than the accommodative distance, although the cue is ambiguous as to whether the objects are closer or more distant. Weaker still as depth cues (although theoretically useful) are the image distortions resulting from astigmatism (the cornea isn't a perfect sphere) and chromatic aberration (when yellow light from a given distance is in focus, blue light from that distance is out of focus).

Movement Cues

The next set of cues involves movement in the retinal image. There are two cases. If the observer moves through a stationary environment, the resulting retinal image motion is called motion parallax. Objects move at different speeds on your retina (for a particular speed of observer movement and choice of fixation object) depending on their distance from the observer.

Motion parallax movie
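As a rough sketch of that relation (small-angle approximation, with assumed numbers): for an observer translating sideways at speed v while holding fixation at distance D, a point at distance d moves across the retina at roughly v * (1/d - 1/D) radians per second.

    # Rough motion parallax sketch (small-angle approximation; numbers are assumed).
    import math

    def retinal_speed_deg_per_s(observer_speed, object_distance, fixation_distance):
        """Approximate retinal angular speed of a point for an observer translating
        sideways at observer_speed (m/s) while fixating at fixation_distance (m).
        Opposite signs correspond to opposite directions of image motion."""
        return math.degrees(observer_speed * (1.0 / object_distance - 1.0 / fixation_distance))

    # Walking sideways at 1 m/s while fixating an object 4 m away:
    print(retinal_speed_deg_per_s(1.0, 1.0, 4.0))   # near object (1 m): ~43 deg/s, against the movement
    print(retinal_speed_deg_per_s(1.0, 10.0, 4.0))  # far object (10 m): ~-8.6 deg/s, with the movement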

In the second case, the observer is stationary but the object is in motion (e.g., it is rotating and/or moving along a straight path relative to the observer). The resulting retinal velocities depend on the relative distance of each object feature from the observer, giving rise to the kinetic depth effect (the computation of depth in this case is called structure-from-motion).


Copyright © 2006, Department of Psychology, New York University
David Heeger