Perception Lecture Notes: Recognition

Object recognition is used for a variety of tasks: to recognize a particular type of object (a moose), a particular exemplar (this moose), to recognize it (the moose I saw yesterday) or to match it (the same as that moose).

Visual recognition seems effortless but it is actually extremely difficult. Is difficult because the images of an object differ hugely as the exemplar, viewpoint or other viewing condition is varied.

One of the impressive accomplishments of visual recognition is that we can recognize an object from many different viewpoints, even though the images of that object are very different from one another. We can also very rapidly catagorize an object (e.g., as a duck) even though there are many different kinds of ducks including drawings of ducks and rubber ducks.

even if you have never seen them before.

Recognition normally works well in spite of variations in lighting, shading, reflection, shadows, etc.

Is Face Recognition is Special?

Much of the research on recognition is focused on whether or not there is a specialized visual process for recognizing faces as opposed to other objects. For example, infants show a tendency to track a moving face at just 30 minutes of age.

Prosopagnosia (face blindness)

Check out Bill's face blindness home page. Bill didn't realize he was face blind until age 49, but relates lots of anecdotes from his early life suggesting that he was born with it. He remembers when he was 6 years old at the movies, telling a friend that it was really dumb for the bank robbers to cover only their faces because you could still see all the rest of them. He passed his mom walking on the street one day and didn't notice her. He was in the Navy for four days and freaked out because he couldn't tell anyone apart, given that they all had the same haircuts and wore the same uniforms.

Prosopagnosia is more common than previously believed, affecting about 2.5% of the population. In other words, you probably know somebody who suffers to some degree from face blindness. An article about face blindness appeared in the New York Times on July 18, 2006. Similar articles have appeared in The Times, Time Magazine, The Boston Globe, and The Economist. There is a website (www.faceblind.org) devoted to research prosopagnosia and face perception.

People don't usually just have a problem with only faces. Rather, it's often recognition tasks that involve distinguishing between similar objects that are all in the same catagory (e.g., birds, flowers), a point that we'll come back to later

Object agnosia: Morris Moscovich studied a patient with object agnosia (that couldn't recognize common objects), but had no trouble with faces.

This patient can describe pictures (like those above) verbally but cannot name them correctly. Download an example description by someone with visual (object) agnosia (3 Mb mpeg).

Visual agnosia is not that uncommon; it affects a number of people who suffer a stroke to visual areas of the brain. What was remarkable about the Moscovich's report is that the patient failed to recognize objects but could still easliy recognize faces. It takes a long time for us to find the 13 faces hidden in above picture (by Doolittle). But this patient sees only the faces (not the objects), and can point them all out immediately.

Double dissociation: some patients show a specific deficit for recognizing faces, other subjects show deficits for recognizing all other objects. This suggests that faces and objects are processed in separate, perhaps non-overlapping, brain areas.

Face inversion effect: This suggests that upright faces are processed differently.

Turn the page upside-down (or stand up and turn your head upside-down if you are working at a computer screen) so that the faces are now rightside-up... You don't see upside-down faces in the usual way.

Careful experiments support this demo that rightside-up faces and upside-down faces are processed differently. In a sequential matching task, on each trial subjects were shown an unfamilar face, followed by brief interval, and then followed by second face. Subject responded "same'' or "different''. There were two key results:

Normal subjects performed better (94% correct) for upright than for inverted (82% correct) faces. This implies specialized processing for faces.
Patient LH suffers from prosopaganosia. LH did relatively poorly compared to control subjects on both upright and inverted faces. But interestingly, he performed better (72% correct) for inverted than for upright (58% correct). A disproportionate impairment on upright relative to inverted faces implies an impairment of a face-specific processor, which is engaged by the upright but not the inverted faces, and used despite being disadvantageous.

Face selectivity in IT: Desimone recorded from neurons in area IT of monkey cortex. He found strong responses to faces.

Each panel plots the response of an IT neuron (firing rate over time) in response to each of the pictures. Top row:

strong response to intact face
scramble features randomly - very little response
another intact monkey face - strong response
blank out eyes - response is a bit weaker.
blank out mouth - again, a bit weaker
remove color from the picture - weaker response than with color
Bob Desimone's face
hand - complex stimulus but very weak response

Bottom row:

change angle of face systematically - response decreases systematically.
blank out eyes
toilet brush

This was a controversial result when it was first published. No matter how many manipulations you do, a determined skeptic can always claim that there's some kind of simple feature in there that's triggering the cell. E.g., there are lots of oriented contours in there. Could it be that the IT neuron is responding to these simple features like a V1 cell would? What "features" are most important? I believe that these neurons probably are involved in face recognition. But are they involved in recognizing/categorizing that it is a face, or in recognition of specific faces, or in recognition of facial expressions?

Functional (columnar) architecture for faces: Tanaka used optical imaging in IT to demonstrate columnar organization.

The figure below shows images of the same cortical area obtained for five different views of a doll face. Systematic movement of the (dark) activation spot with rotation of the face. Colors at the bottom show outlines of the activation spots for several different experiments (in different brains or hemispheres). Through this and other experiments, this supplies evidence that neurons in nearby columns respond to similar complex feature patterns.

fMRI and face selectivity: One human brain area responds most to faces. A neighboring brain area responds most to pictures of buildings and familiar places. Yet another brain area responds to pictures of objects that are neither faces nor places.

This figure is from a study in which subjects viewed pictures of faces in comparison to non-face pictures. Left: slices through each subject's brain showing a region in the ventral temporal lobe that responds to faces. Right: Amplitude of brain activity (numbers) in the "face area" in response to each picture. The response is greater for faces than for places (compare the two pictures outlined in red).

This figure is from a study in which subjects viewed pictures of buildings and familiar places. Left: slices through each subject's brain showing a region in the ventral temporal lobe that responds to faces. Right: Brain activity in the "place area". The response is greater for places and buildings than for other objects. The conclusion from these (and a number of other) fMRI experiments is that faces and objects are processed in separate brain areas, separated by several centimeters.

Role of expertise: Mike Tarr had subjects practice for many weeks to learn to make fine discriminations in a novel category of objects, that Tarr called Greebles. The Greebles had names (which the subjects learned), genders, and families.

At first, performance was like that for objects, e.g., no inversion effect. After lots of practice, performance was like that for faces, i.e., better at recognizing upright greebles.

Tarr followed up on this with an fMRI study. The left two columns are from before practice: when subjects viewed pictures of greebles there was no activity in the "face" area (small white outline). The right two colums were made after lots of practice: the brain activity shifted to the "face" area so that the brain activity was very similar for Greebles and for faces.

Human single-neuron electrophysiology: Patients with severe epilepsy are often studied in the hospital for a period of several days to carefully locate the abnormal brain tissue that is causing seizures, before removing any brain tissue. These patients undergo surgery to have some electrodes implanted in their brain to record the electrical activity of the brain. In some cases, it is possible to record action potentials from individual neurons. Then they sit in a hospital bed waiting for seizures to occur. When the seizures do occur, the electrical measurements from the implanted electrodes help to locate the diseased tissue very precisely. Once the location has been determined, the patients undergo another operation to have the electrodes removed along with the diseased brain tissue that is causing the seizures. While in the hospital with the electrodes implanted, in between seizures, some of these patients have agreed to participate in perceptual and cognitive neuroscience experiments. Hence, there are a handful of reports in the neuroscience journals of the responses of individual neurons in human brains.

Below is an example of the data from such an experiment. This particular neuron was in the medial temporal lobe and responded very selectively to pictures of Jennifer Aniston. Each panel shows a picture and below that is a graph of the neuron's response to that photo (firing rate over time). The two vertical dashed lines in each graph indicate when the photo was presented in time. The patient was shown hundreds of photos but the neuron responded only to Jennifer. It responded to many different photos of Jennifer, but not Jennifer when she was photographed with Brad Pitt. It appears to be the "Jennnifer Aniston" cell, the neuron in the brain that is responsible for recognizing Jennifer. An important caveat, however, is that this part of the brain is known to be involved in memory. This particular neuron be activated by some memory that is triggered by the photo (perhaps the memory of a "Friends" episode), and may have nothing per se to do with the recognition.

Summary

Evidence for brain areas that are functionally specialized for face recognition:

Infants show a tendency to track a moving face at just 30 minutes of age.
Face agnosia without object agnosia in some patients.
Object agnosia without face agnosia in other patients.
Inversion effect: normal subjects are better at recognizing upright than inverted faces.
Inverted inversion effect: a prosopagnosic was better at recognizing inverted than upright faces.
In fMRI experiments, faces "light up" a separate/neighboring brain region.
Face selective cells in inferior temporal cortex (IT) and in human medial temporal lobe (MTL).

Evidence that face recognition is not functionally specialized in a particular brain area:

Face columns are intermingled with other feature columns in most areas of the monkey brain where face-selectivity has been found.
Face agnosics often show impaired recognition/discrimination ability for non-face objects within a catagory (e.g., birds).
After lots of training with non-face stimuli (greebles), subjects exhibit face-like recognition performance (e.g., face inversion effect).
After lots of training with greebles, fMRI activity shifts from the "object" area to the "face" area.
The human MTL neurons that exhibit selective responses to individual faces may in fact be more involved in memory than they are in perception.

Reading and Letter Recognition

Letter recognition is, like face recognition, believed to be handled by special-purpose circuitry in the brain.

Object Representation

Object recognition is hard - it's amazing that we can do it at all. The image on the retina is vastly different for different viewpoints of the same object. The image is also vastly different under different lighting conditions (e.g., shadows) or when an object is partially occluded by other objects in a cluttered scene. Despite this sensory variability, we can rapidly recognize any familiar object from almost viewpoint an under almost any lighting conditions.

Recognition by parts: Biederman suggested that recognition works by decomposing objects into a fixed set of primitive shapes (Biederman suggests that there are 36 such primitives), that he calls geons.

Observe that many objects can be constructed by combining just a few of these primitives. The idea is that we first recognition each of the constituent geons and then the spatial relationship between them.

What is the evidence for Biederman's theory? One example is that we can no longer identify an object when we occlude the object's geons by obscuring their intersections. Do you find this compelling?

View-dependent recognition: This is an alternative theory. You store in your head a bunch of characteristic views (mental images) of objects. You recognize a new image by finding the closest match. That is, you don't use 3D shape to recognize objects. Only the 2D views of the objects

Each panel shows different views of the same objects. It's hard to see that they are all the same. Bülthoff/Edelman did an experiment to quantify this. They trained subjects on one set of views (like one of the panels above) of each object. After training, subjects looked at similar views of the training objects, different views of the training objects, and images of totally different objects. The task was simply to report on each trial if it was one of the training objects. Performance was good on similar views to those used in training, but quite bad for very different views.

Recognition in the Brain: Logothetis trained monkeys in an object recognition task like the Bülthoff and Edelman experiment (show object rotating a bit, test recognition of that object for similar views, different views, and distractor objects).

Top panel: View-selective responses were found in IT. Insets show many views of the same target. Firing rate reached a maximum for presentation of one particular object view (indicated by the box), and declined as the object was rotated away from this preferred view.

Bottom panel: Responses to distractors are much lower, even when the distractors contain features that are similar to the target, e.g., inverted V's indicated by dashed circles.

The conclusion from these studies and many others like them is that object recognition is viewpoint dependent.

Perceptual Organization

So far, we have been discussing recognition of isolated objects. Recognition of objects in real scenes is much more complicated. How do you segment the visual field into parts that correspond to entire objects? How do you integrate visual information across different patches of the visual field? How do you fill in the missing parts?

Segmentation: divide the visual field into distinct parts/objects. Grouping: integrate visual information across different patches (fill in the missing parts). You can't recognize an object until it is properly grouped/segmented.

Sometimes boundaries/edges are not evidenced by a simple change in retinal image intensity/color, even in real scenes. So the visual system must make an unconscious inference to "complete" the missing bits of the boundaries.

Some V2 neurons respond to illusory contours. In this example, the neuron responds to the orientation of the illusory contour, not the orientation of the real lines that define it. Each white dot in the graphs represents a spike. Each row of the graph plots neural activity over time for a particular stimulus orientation. Orientation varies from row to row. There were lots of spikes for the vertically oriented line (right panel) and lots of spikes for the vertical illusory contour in (left panel). Note: V1 neurons don't respond in this way. Rather, V1 neurons always respond to the orientation of the real lines, ignoring the orientation of the illusory contour.

Responses to illusory contours have also been measured in the human brain with fMRI.

Grouping and crowding:

Fixate on the plus sign at the top right and you have no trouble recognizing the "D". But fixate on the minus sign at the bottom right and you have great difficulty recognizing any of the 3 letters. It looks as if the features from each of the letters get scrambled with those from the other letters. This is called crowding. To explain crowding... [More Here]

The visual system sometimes struggles (trying multiple groupings and segmentations) to perceptually organize a stimulus, to figure out which of the features belong together (grouping) and which of them do not, resulting in a dynamic percept in which different organizations compete with one another.

Grouping is based on a number of visual features, first described by the Gestalt psychologists and known as the Gestalt laws. These laws include:

Similarity
Proximity
Common region
Good continuation
Symmetry
Closure
Common fate.

Regions can be defined by differences in brightness (an edge), differences in items on either side of the edge (e.g., the aligned line terminations of the illusory contours above), or by other differences in texture.

Impossible objects are images that have no globally consistent interpretation. Usually, impossible objects are just drawings with no consistent interpretation. But, one can construct real physical "sculptures" that when viewed from a certain angle look like impossible objects.

Many visual illusions rely on tricks with perceptual organization (segmentation, grouping, figure versus background, multiple possible perceptual organizations, impossible figures, etc.):

David Heeger