Auditory cortex: Located laterally, near the top of the temporal lobe.
Diagram of brain showing location of auditory cortex, Wernicke's and Broca's areas
The auditory cortex is critical for speech perception and language comprehension. Aphasia refers to the collective deficits in language comprehension and production that accompany brain damage. Damage to primary auditory cortex and/or the adjacent Wernicke's area causes Wernicke's aphasia, a disorder of language comprehension. Damage to Broca's area, located near motor cortex, causes Broca's aphasia, a disorder of speech production.
Speech analysis: To understand something about speech perception, one must begin with the elements of speech itself. Spoken languages consist of a sequence of discrete units. In English, the most familiar unit is the word, which you might think of as the thing you typically write surrounded by spaces. However, words can consist of more than one meaning-carrying segment. For example, "childhood" consists of "child" (young person) plus "hood" (the state of being). These meaning-carrying subunits are called morphemes.
Next, each morpheme can be split into a sequence of sounds, called phonemes. A phoneme is defined as the smallest unit of sound that, if changed, can potentially change a word's meaning. For example, the "i" sound in "hit" is a phoneme, because if you change it to "a" you get "hat". Finally, phonemes correspond to produced speech sounds (although the correspondence is complicated, and is studied by phonologists and phoneticians!). These speech sounds are characterized primarily by the manner in which they are produced. The primary distinction is between vowels (produced with no constriction of the airflow) and consonants (which involve a partial or total constriction of the airflow). All vowels are voiced, meaning that the vocal cords vibrate while producing the sound, so that the sound has a well-defined pitch and can be sung. Some consonants are voiced as well.
Vowels are described along a few main dimensions. First, one specifies the position of the highest part of the tongue when pronouncing the particular vowel sound (high, mid, or low in the mouth; front, near the teeth, or back). Second, the vowel sound may require the mouth to be wide open, or the lips to be closed and rounded (compare how you say "ah" vs. "oh"). Finally, some sounds involve a pair of vowel sounds pronounced smoothly in succession, known as a diphthong (e.g., the sound "ou" as in "house", which consists of "ah" followed by a long "oo").
There are many different kinds of consonants, depending on the way
in which they are articulated. Stop
consonants (or plosives) involve a complete closure of the air
stream, followed by an explosive release of air. These include the
English consonants p/b, t/d, k/hard-g. Each pair in that list differs
from the next in the place of articulation, i.e., where the air
constriction occurs (between the lips, tongue to teeth, or tongue to
soft palate). The two consonants in each pair differ in their voicing
(b/d/g are voiced: the vocal cords vibrate as soon as the air is
released). Another major class of consonants includes the fricatives
and sibilants. These involve an incomplete closure of the air stream,
resulting in a noisy rush of air through a small opening. Included are
f/v, s/z, and sh/zh (hush vs. azure). Again, the two sounds in each
pair differ as to whether they are voiced (v/z/zh) or not. The voiced
fricatives/sibilants can be sung, as they have a well-defined pitch
set by the vibrations of the vocal cords. Other consonantal sounds
include the laterals (l/r), glides (y/w), and nasals (m/n).
Spectrogram:
Reading a spectrogram
A spectrogram is a graph of frequency vs. time. Each row is a frequency band. The intensity displayed at any point in a sound spectrogram indicates the amplitude of that frequency band (specified by the vertical position) at that time (specified by the horizontal position). A spectrogram is computed by applying the Fourier transform to successive short windows of the sound; this is very similar to the frequency decomposition performed by the cochlea. Many of the phonetic distinctions discussed above are visible in a sound spectrogram. People have even attempted to train themselves to "read" the speech in spectrograms, although this turns out to be an extraordinarily difficult task. Nonetheless, several of the phonetic features discussed above are easily visible, as described next.
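Before turning to those features, here is a minimal sketch of the computation just described (the short-time Fourier transform), using the standard scipy.signal.spectrogram routine. The sample rate, window length, and the synthetic test signal are assumed illustrative values, not part of the original notes.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.signal import spectrogram

    fs = 16000                      # sample rate (Hz), an assumed value
    t = np.arange(0, 1.0, 1 / fs)

    # Synthetic "voiced" signal: a 150 Hz fundamental plus its harmonics,
    # standing in for a recorded utterance.
    f0 = 150.0
    x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))

    # Short-time Fourier transform: FFT of successive 25 ms windows.
    f, tt, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=200)

    plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # intensity in dB
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.title("Spectrogram: harmonics appear as horizontal bands")
    plt.show()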
Vowel sounds are voiced, as are some consonants that are extended in time (m, n, z, zh, v, etc.). For such sounds, the vocal cords are vibrating, producing a fundamental frequency f, and the sound consists of that fundamental together with integer multiples of it (f, 2f, 3f, 4f, ...). This voicing is visible in spectrograms (although not in the examples shown here or in your book, unfortunately) as a series of thin, horizontal bands corresponding to the separate harmonics. This means that prosody is also visible in a spectrogram (prosody is the extra-verbal information in speech, carried by the pitch contour, that indicates such things as word stress or whether a sentence is a statement, an imperative, or a question).
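To make the harmonic structure concrete, here is a small sketch that synthesizes a voiced sound as a fundamental plus its integer harmonics, with a rising pitch contour standing in for prosody. The fundamental frequency and contour are assumed example values.

    import numpy as np

    fs = 16000                          # sample rate (Hz), assumed
    t = np.arange(0, 0.5, 1 / fs)

    # Rising pitch contour: fundamental glides from 120 Hz to 180 Hz,
    # as it might at the end of a question.
    f0 = np.linspace(120.0, 180.0, t.size)
    phase = 2 * np.pi * np.cumsum(f0) / fs   # integrate frequency to get phase

    # Sum the fundamental and its integer multiples (f, 2f, 3f, ...).
    voiced = sum(np.sin(k * phase) / k for k in range(1, 8))
    voiced /= np.max(np.abs(voiced))         # normalize amplitude

Feeding this signal into the spectrogram sketch above shows exactly the stack of thin, rising harmonic bands described here.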
Stop consonants are visible as a complete cessation of airflow, and
hence of any sound. That is, there is a brief silence, visible as a
vertical, blank band in the spectrogram. Sibilants and fricatives
involve a rush of air through a near-constriction (with different
places of articulation for different consonants). These are visible in
the spectrogram as a wide band of frequencies (or "noise"), with
different bands for different consonants ("s" is higher in frequency
than "sh", for example).
Vowels are voiced, of course (unless you whisper). But the manner of articulation (how you shape your tongue and lips) filters the sound. This filtering shows up in the spectrum as three distinct peaks, typically between 800 Hz and 3.5 kHz. These peaks are called formants. The combination of the three spectral positions of the formants indicates which vowel has been spoken. The spectrograms (first artificial, then real) below depict a two-phoneme utterance (a stop consonant followed by a vowel):
Phonemes: formants and formant transitions
The formants (after the initial transient shifts) are the same in all examples, indicating that all three are the same vowel sound ("ah"). However, they are preceded by a formant transition. It turns out that the form of the formant transition (upward or downward for each formant, and where in the spectrum the transition arises) is one of the spectral features that listeners use to discriminate which stop consonant preceded the vowel. Thus, the difference between "ba", "da", and "ga" lies in the formant transitions.

Some formant transitions are very brief (10-50 msec), as in "ba" and "da"; others are relatively long, as in "pa" and "ga". The length of the formant transition and the time at which voicing begins following the stop indicate whether the stop consonant is voiced (b, d, g) or unvoiced (p, t, k). These distinctions have been studied perceptually by generating artificial spectrograms (such as those in the top half of the figure above) and asking listeners to identify the utterance. These artificial spectrograms were originally "played" on a machine called a vocoder.
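As an illustration of such stimuli, here is a minimal sketch in the spirit of sine-wave speech: three sinusoids stand in for the formants, and each glides linearly from an onset frequency to the steady vowel frequency over the first 50 ms. All frequency values are assumptions chosen for illustration, not parameters from the original experiments.

    import numpy as np

    fs = 16000                              # sample rate (Hz), assumed
    dur, trans = 0.4, 0.05                  # total duration, transition length (s)
    t = np.arange(0, dur, 1 / fs)

    def glide(f_onset, f_vowel):
        # Frequency track: linear transition for 50 ms, then steady vowel.
        f = np.where(t < trans,
                     f_onset + (f_vowel - f_onset) * t / trans,
                     f_vowel)
        phase = 2 * np.pi * np.cumsum(f) / fs
        return np.sin(phase)

    # Assumed steady "ah" formants, with rising onsets suggestive of "ba".
    ba_like = (glide(600, 700) + glide(900, 1200) + glide(2300, 2500)) / 3
    # Different onset directions for F2/F3 suggest "da" instead.
    da_like = (glide(600, 700) + glide(1600, 1200) + glide(2700, 2500)) / 3

Only the brief initial transitions differ between the two stimuli; the steady portions are identical, which is exactly the manipulation described above.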
Language learning impairment: Paula Tallal (at Rutgers University) has spent her career studying language learning disabilities, i.e., children who have difficulty learning to understand and produce language. Tallal has demonstrated that these children have difficulty with speech because they have deficits in the fast (tens of msec) temporal processing needed to distinguish brief formant transitions (like those in "ba" and "da").
Language learning impairment caused by deficit in fast temporal processing
In these experiments, subjects had to discriminate between two stimuli: either a high tone followed by a low tone, or a low tone followed by a high tone. If the tones are separated in time by more than half a second, then both normal and language-learning-impaired (LLI) subjects have no problem performing the task (100% correct). But for shorter separations (shorter inter-stimulus intervals), LLI children show a dramatic deficit in performance. Tallal believes that this is the cause of their language disability: because they can't hear the differences between rapidly changing sounds, they can't discriminate one formant transition from another, so they have trouble understanding and producing speech.
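A sketch of how such a trial might be generated is below. The tone frequencies, tone duration, and interval values are assumed stand-ins, not Tallal's actual parameters.

    import numpy as np

    fs = 16000                                  # sample rate (Hz), assumed
    tone_dur = 0.075                            # 75 ms tones, assumed
    f_low, f_high = 100.0, 300.0                # assumed "low" and "high" pitches

    def tone(freq):
        t = np.arange(0, tone_dur, 1 / fs)
        return np.sin(2 * np.pi * freq * t)

    def trial(order, isi):
        # order: "HL" (high then low) or "LH"; isi: inter-stimulus interval (s).
        first, second = (f_high, f_low) if order == "HL" else (f_low, f_high)
        gap = np.zeros(int(isi * fs))
        return np.concatenate([tone(first), gap, tone(second)])

    easy = trial("HL", isi=0.5)     # long gap: easy for all listeners
    hard = trial("HL", isi=0.01)    # 10 ms gap: hard for LLI children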
Tallal and Michael Merzenich (a neurobiologist at UCSF) developed a software system to help children with language learning disabilities. The software looks like a computer game, but it is really an auditory psychophysics experiment in disguise. The idea is to give the children lots of practice making threshold discriminations between sounds. With practice, their performance improves over time, resulting in better language comprehension and production. You can find out more about this company and their "Fast ForWord" software at the Scientific Learning Corporation website.
Cochlear implants: The cochlear implant is a wonderful example of how we can take the results of basic research, namely our understanding of how the peripheral auditory system (cochlea, 8th nerve) codes sound signals, and put it to use.
Photo of cochlear implant
Diagrams of how it is implanted
Several electrodes are mounted on a carefully designed support that is matched to the shape of the cochlea. The design of this support structure is critical because it places the electrodes very near the nerve cells. A computer decomposes the sound signal into its frequency components via the Fourier transform and sends the separate frequency components to the corresponding electrodes. In other words, it computes a spectrogram, mimicking the frequency decomposition performed by the cochlea. The implant then transforms the spectrogram into a series of current pulses for each of the stimulating electrodes.

This transformation into current pulses is based on what we know about the coding of information in the auditory nerve. Both the temporal and place codes are important for signaling pitch. Both the nerve firing rates and the number of active neurons are important for signaling loudness. The goal is to accurately replicate the neural code that would naturally be communicated along the 8th nerve. Note the need to maintain proper timing of information to within tens of msec to preserve formant transitions.
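Here is a minimal sketch of that processing chain in the style of a continuous-interleaved-sampling (CIS) strategy: band-pass filter the sound into a few channels, extract each channel's envelope, and use the envelopes to set per-electrode pulse amplitudes. The channel count, band edges, and pulse rate are assumed illustrative values, not the parameters of any real device.

    import numpy as np
    from scipy.signal import butter, lfilter, hilbert

    fs = 16000                                   # sample rate (Hz), assumed
    n_channels = 8                               # assumed electrode count
    edges = np.logspace(np.log10(200), np.log10(7000), n_channels + 1)

    def envelopes(sound):
        # One envelope per electrode: band-pass, then magnitude of the
        # analytic signal (Hilbert transform) as the envelope.
        env = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            band = lfilter(b, a, sound)
            env.append(np.abs(hilbert(band)))
        return np.array(env)                     # shape: (n_channels, n_samples)

    def pulse_amplitudes(sound, pulse_rate=900):
        # Sample each envelope at the pulse rate to set the current amplitude
        # of each electrode's pulse train (place code: channel -> electrode;
        # amplitude stands in for firing rate / loudness).
        env = envelopes(sound)
        step = int(fs / pulse_rate)
        return env[:, ::step]

The channel-to-electrode mapping implements the place code, while the pulse timing and amplitudes carry the temporal and loudness information described above.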
How well does it work? In some patients, cochlear implants restore speech perception nearly perfectly. But that is not the case for most patients (at least at this time). When it doesn't work well, it can be a detriment for some children, who can't hear well enough to succeed in the hearing community. Consequently, there is some controversy over implanting children. Discussion...
SF Chronicle article (9/23/2001) about a deaf father, mother, and daughter who gained hearing together through cochlear implants.