Sound texture perception via statistics of peripheral auditory representations

J H McDermott and E P Simoncelli

Published in Computational and Systems Neuroscience (CoSyNe), (III-95), Feb 2011.

This paper has been superseded by:
Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis
J H McDermott and E P Simoncelli.
Neuron, vol. 71(5), pp. 926-940, Sep 2011.


The sounds of rainstorms, fires, insect swarms, and galloping horses are superpositions of many similar acoustic events, and are distinguished by temporal homogeneity. We have proposed (McDermott, Oxenham, & Simoncelli, WASPAA, 2009) that the perception of such "sound textures" is mediated by statistics, that is, time-averages of the acoustic measurements made in the early auditory system. We have explored this hypothesis by synthesizing novel sounds with statistics matching those extracted from real-world textures, on the grounds that this should produce realistic texture examples if the statistics used are responsible for perception. Here we extend our statistical model to be compatible with the known physiological structure of the auditory system, and test the role of different statistics and representational properties with experiments on human listeners.

Natural sounds were processed with a cascade of two filter banks, representing cochlear channels and modulation bands computed from their compressed envelopes. We measured marginal moments and pairwise correlations of these filter responses, which capture spectral and temporal structure as well as sparsity. Our synthesis algorithm then imposed these statistics on samples of noise. Although the statistics in our model were not hand-tuned to specific natural sounds, their imposition produced compelling synthetic examples of a large set of real-world sound textures (available at http://www.cns.nyu.edu/~jhm/texture_examples/). Omitting any individual class of statistic audibly impaired the results. Moreover, sounds synthesized using representations qualitatively distinct from those in the auditory system (linear- instead of log-spaced filter banks, or without cochlear compression) generally did not resemble their real-world counterparts, indicating that successful synthesis depends on, and reflects, the use of a biologically plausible representation. The results show that relatively simple statistics can underlie sound texture percepts, and illustrate how sound textures and their synthesis can serve as engines for the investigation of audition.
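
The kinds of statistics named above (marginal moments and pairwise correlations of compressed envelopes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The filter banks are omitted and replaced by toy envelope data, the compression exponent of 0.3 is an assumed value, and all function names are hypothetical. The final step imposes only a mean and a variance on Gaussian noise by affine rescaling, a much simpler stand-in for a synthesis procedure that would have to match the full set of statistics.

```python
import math
import random


def compress(envelope, exponent=0.3):
    """Compressive nonlinearity on a non-negative envelope.
    The exponent is an assumed value, not taken from the abstract."""
    return [max(v, 0.0) ** exponent for v in envelope]


def marginal_moments(x):
    """Time-averaged mean, variance, skewness and kurtosis of one channel."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / n
    std = math.sqrt(var)
    skew = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurt = sum(c ** 4 for c in centered) / (n * var ** 2)
    return {"mean": mean, "variance": var, "skewness": skew, "kurtosis": kurt}


def correlation(x, y):
    """Pearson correlation between two equally long channels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)


def impose_mean_variance(noise, target_mean, target_var):
    """Affinely rescale noise so its mean and variance match the targets.
    Toy stand-in: higher moments and correlations are left unconstrained."""
    m = marginal_moments(noise)
    scale = math.sqrt(target_var / m["variance"])
    return [target_mean + scale * (v - m["mean"]) for v in noise]


# Toy data standing in for the envelopes of two neighbouring channels:
# a shared component plus a small independent component per channel.
rng = random.Random(0)
shared = [rng.random() for _ in range(2000)]
env_a = compress([s + 0.1 * rng.random() for s in shared])
env_b = compress([s + 0.1 * rng.random() for s in shared])

# "Measure" statistics on the toy envelopes.
target = marginal_moments(env_a)
cross_corr = correlation(env_a, env_b)

# "Impose" the first two moments of channel A on Gaussian noise.
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
matched = impose_mean_variance(noise, target["mean"], target["variance"])
matched_stats = marginal_moments(matched)
```

After the rescaling, `matched_stats["mean"]` and `matched_stats["variance"]` agree with `target` up to floating-point error, while skewness and kurtosis remain those of the original noise. This shows why matching the richer set of statistics calls for a more elaborate procedure than a single affine rescaling.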

