Sound texture perception via synthesis

J H McDermott, A Oxenham and E P Simoncelli

Published in Computational and Systems Neuroscience (CoSyNe), (I-74), Feb 2010.

DOI: 10.3389/conf.fnins.2010.03.00122

This paper has been superseded by:
Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis
J H McDermott and E P Simoncelli.
Neuron, vol.71(5), pp. 926--940, Sep 2011.

  • Official (pdf)
  • Supplementary Materials

  • Many natural sounds, such as those produced by rainstorms, fires, and swarms of insects, result from the superposition of many rapidly occurring acoustic events. We refer to these sounds as "auditory textures", and their temporal homogeneity suggests that their defining characteristics are statistical. To explore the statistics that might underlie the perception of natural sound textures, we designed an algorithm to synthesize sounds from statistics extracted from real sounds. The algorithm was inspired by those used to synthesize visual textures, in which statistical measurements from a photographic texture image are imposed on a sample of white noise (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000). Because we are interested in biologically plausible representations, we studied statistics of responses of a standard auditory filterbank that approximates the information available in the auditory nerve. Statistics were first measured from the subbands of a natural sound texture; the subbands of a noise sample were then adjusted using a gradient descent method until their statistics matched those measured in the original. If the imposed statistics capture the perceptually important properties of the texture in question, the synthesized result ought to sound like the original sound.
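    The measurement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a Butterworth filterbank stands in for the cochlear-like auditory filterbank, and the Hilbert transform is used to extract subband envelopes. The function names and band edges are illustrative choices, not from the paper.

    ```python
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def subband_envelopes(x, fs, band_edges):
        """Split a signal into bandpass subbands and return their Hilbert envelopes.

        A stand-in for the auditory filterbank: each (lo, hi) pair in Hz
        defines one subband of the input signal x sampled at fs Hz.
        """
        envs = []
        for lo, hi in band_edges:
            sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            subband = sosfiltfilt(sos, x)           # zero-phase bandpass filtering
            envs.append(np.abs(hilbert(subband)))   # analytic-signal envelope
        return np.array(envs)

    def marginal_stats(env):
        """Marginal statistics of one envelope: variance, skew, kurtosis."""
        mu, sd = env.mean(), env.std()
        z = (env - mu) / sd
        return np.array([sd ** 2, (z ** 3).mean(), (z ** 4).mean()])
    ```

    In a synthesis loop of the kind described, these statistics would be measured from a real texture recording and then imposed on the subbands of a noise sample, e.g. by gradient descent on the mismatch between measured and target values.
    
    
    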

    We found that simply matching the marginal statistics (variance, skew, kurtosis) of individual filter responses and their envelopes was generally insufficient to yield perceptually satisfactory results, producing compelling synthetic examples only for certain water sounds. We observed that many sound textures contained structure in frequency and time, evident in pair-wise envelope correlations (between different subbands, and between different time points within each band). Imposing these envelope correlations greatly improved the results, frequently producing synthetic textures that sounded natural and that listeners could reliably recognize. Sound signals that were successfully synthesized in this way included bubbling water, thunder, insect, frog, and bird choruses, applause, running animals, and frying eggs, among many others.
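    The two kinds of envelope correlation mentioned above can be computed straightforwardly. The sketch below assumes an array of subband envelopes (one row per band) and is illustrative only; the normalization and lag conventions are plausible choices, not necessarily those of the paper.

    ```python
    import numpy as np

    def envelope_correlations(envs):
        """Pairwise correlation matrix between subband envelopes.

        envs has shape (n_bands, n_samples); returns (n_bands, n_bands),
        capturing structure across frequency.
        """
        z = (envs - envs.mean(axis=1, keepdims=True)) / envs.std(axis=1, keepdims=True)
        return (z @ z.T) / envs.shape[1]

    def temporal_correlation(env, lags):
        """Autocorrelation of one envelope at the given sample lags,
        capturing structure across time within a band."""
        z = (env - env.mean()) / env.std()
        n = len(z)
        return np.array([(z[: n - k] * z[k:]).mean() for k in lags])
    ```

    In a synthesis procedure like the one described, these correlations would be measured from the original texture and imposed on the noise subbands alongside the marginal statistics.
    
    
    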

    Despite these successes, some synthesized sounds differed noticeably from the corresponding originals even though they had the same marginal statistics and envelope correlations. Examples of failures included sounds with abrupt broadband onsets, pitch-varying harmonic structure, or strong reverberation. These failures indicate that the statistics we imposed are insufficient to capture these sound qualities, and that the auditory system must rely on additional measurements. Our current efforts are directed towards identifying new statistics to account for these sound properties.

    Our results suggest that statistical representations could underlie sound texture perception, and that in many cases the auditory system may rely on fairly simple statistics. Although we lack definitive evidence that the precise set of statistics used in our model are instantiated in the auditory system, we note that they are of a form that could plausibly be computed with simple neural circuitry. Our method provides a means of testing the perceptual importance of such statistics, and of generating new forms of experimental stimuli that are precisely characterized, yet share important properties with real-world sounds.

