Visual Temporal Prediction: Representation, Estimation, and Modeling. Pierre-Etienne Fiquet. PhD thesis, Sep 2024.
We first describe the empirical structure of dynamic visual scenes and then develop a mathematical theory for exploiting that structure. The movement of observers and objects creates distinct temporal structure in visual signals, allowing future signals to be partially predicted from past ones. Motivated by group representation theory, we propose a method to discover and exploit the transformation structure of image sequences, and show that local phase measurements play a fundamental role. The proposed model extrapolates visual signals in a local polar representation, which is learned via next-frame prediction. This polar prediction model recovers simple transformations in synthetic datasets and scales to natural image sequences. The architecture is simple yet effective: it contains a single hidden stage with one nonlinearity, which factorizes the signal into slowly varying form and steadily advancing motion components. We demonstrate that polar prediction outperforms traditional approaches based on motion compensation and rivals conventional deep networks trained for next-frame prediction.
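To make the extrapolation mechanism concrete, the sketch below illustrates the phase-advancement idea in NumPy under simplifying assumptions: the Fourier atoms used as analysis/synthesis filters, and the toy translating signal, are illustrative stand-ins for the local filters that the model learns end-to-end, not the thesis implementation.

```python
import numpy as np

def polar_extrapolate(c_prev, c_curr):
    """Predict the next frame's complex coefficients by keeping the amplitude
    (slowly varying "form") and advancing the phase by the most recent phase
    difference (steady "motion")."""
    amplitude = np.abs(c_curr)
    phase_step = np.exp(1j * (np.angle(c_curr) - np.angle(c_prev)))
    return amplitude * np.exp(1j * np.angle(c_curr)) * phase_step

# --- toy usage on a translating 1D signal (filters are assumptions) ---
x = np.cos(np.linspace(0, 8 * np.pi, 256))           # a simple "frame"
shift = 3
frames = [np.roll(x, k * shift) for k in range(3)]    # translating sequence

# hypothetical complex-valued filters: Fourier atoms; the thesis instead
# learns a local filter bank directly from next-frame prediction
analysis = lambda f: np.fft.fft(f)                     # coefficients c_t
synthesis = lambda c: np.real(np.fft.ifft(c))          # linear reconstruction

c0, c1 = analysis(frames[0]), analysis(frames[1])
pred = synthesis(polar_extrapolate(c0, c1))

err = np.mean((pred - frames[2]) ** 2) / np.mean(frames[2] ** 2)
print(f"relative prediction error: {err:.2e}")         # ~0 for pure translation
```

For a pure circular translation the phase of each coefficient advances linearly in time, so this amplitude-preserving, phase-advancing extrapolation is exact; learned local filters extend the same principle to more general transformations in natural sequences.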
We then confront the inherent uncertainty of visual temporal prediction and develop a framework for learning and sampling the conditional density of the next frame given the past few observed frames. Casting prediction as a probabilistic inference problem is motivated by the need to cope with ambiguity in natural image sequences. We describe a regression-based framework that implicitly estimates the distribution of the next frame, effectively learning a conditional image density from high-dimensional signals. This is achieved with a simple resilience-to-noise objective: a deep neural network is trained to map past conditioning frames and a noisy observation of the next frame to an estimate of the denoised next frame. The network is trained over a range of noise levels without access to the noise level, i.e., it is blind and universal. This denoising objective has the desirable property of being local in the space of densities, and training across noise levels forces the network to extract information about the stable underlying distribution of probable next frames given the past conditioning. We consider synthetic image sequences composed of moving disks that occlude each other and demonstrate that trained networks can handle challenging cases of bifurcating temporal trajectories, effectively choosing one occlusion relationship or the other when the observation is ambiguous. Furthermore, local linear analysis of a network trained on natural image sequences reveals that the model automatically weights evidence by reliability: it integrates information from the past conditioning frames and the noisy observation, adapting to the amount of predictive information in the conditioning and to the noise level in the observation.
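The training objective can be sketched as follows; the small convolutional network, the tensor shapes, and the uniform noise-level distribution are illustrative assumptions standing in for the architecture and training setup used in the thesis.

```python
import torch
import torch.nn as nn

class BlindConditionalDenoiser(nn.Module):
    """Toy stand-in for the conditional denoiser: the input channels are the
    past conditioning frames concatenated with a noisy observation of the next
    frame; the output is an estimate of the clean next frame."""
    def __init__(self, n_cond=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_cond + 1, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, 3, padding=1),
        )

    def forward(self, cond, noisy_next):
        return self.net(torch.cat([cond, noisy_next], dim=1))

def training_step(model, optimizer, cond, next_frame, sigma_max=1.0):
    """One step of the resilience-to-noise objective: corrupt the next frame
    with Gaussian noise of a random level (never revealed to the network) and
    regress to the clean frame."""
    sigma = sigma_max * torch.rand(next_frame.shape[0], 1, 1, 1)
    noisy_next = next_frame + sigma * torch.randn_like(next_frame)
    loss = ((model(cond, noisy_next) - next_frame) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with random tensors standing in for image-sequence batches
model = BlindConditionalDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
cond = torch.randn(8, 3, 32, 32)        # three past conditioning frames
next_frame = torch.randn(8, 1, 32, 32)  # frame to be predicted
print(training_step(model, opt, cond, next_frame))
```

Drawing a fresh noise level for every example, without ever providing it to the network, is what makes the denoiser blind and universal in the sense described above.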
Finally, we discuss the implications of this work for understanding biological vision. Starting from the polar prediction model, we derive a circuit algorithm composed of local neural computations for predicting natural image sequences. The circuit is composed of canonical computational elements for which there is ample physiological evidence: its components resemble the normalized simple-cell and direction-selective complex-cell models of primate V1 neurons. Unlike the polar prediction model, this circuit algorithm does not impose a polar factorization; instead, it lets complex-cell-like units learn a combination of quadratic responses. Furthermore, we outline a method for gradually extracting slower and more abstract features by cascading this biologically plausible mechanism. These models offer a normative framework for understanding how the visual system represents sensory inputs in a form that simplifies temporal prediction. Together, our work on visual temporal prediction builds connections between computational modeling and brain sciences. These connections can guide the design and analysis of physiological and perceptual experiments, and can motivate further developments in machine learning.
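A minimal sketch of the two canonical stages named above is given below; the random filters, the normalization pool, and the quadratic readout weights are illustrative placeholders for quantities that the circuit learns, and the sketch is not the circuit algorithm itself.

```python
import numpy as np

def simple_cells(patch, filters, eps=1e-6):
    """Normalized simple-cell-like stage: linear filtering followed by
    divisive normalization across the filter population."""
    linear = filters @ patch                        # one response per filter
    norm = np.sqrt(eps + np.sum(linear ** 2))       # shared normalization signal
    return linear / norm

def complex_cells(responses, quad_weights):
    """Complex-cell-like stage: learned combinations of quadratic
    (energy-like) terms, rather than an imposed amplitude/phase split."""
    quadratic = np.outer(responses, responses).ravel()  # all pairwise products
    return quad_weights @ quadratic

# toy usage with random filters and weights standing in for learned ones
rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 64))        # 8 simple-cell filters on an 8x8 patch
quad_weights = rng.standard_normal((4, 64))   # 4 complex-cell readouts
patch = rng.standard_normal(64)

s = simple_cells(patch, filters)
c = complex_cells(s, quad_weights)
print(s.shape, c.shape)                       # (8,) (4,)
```

When the quadratic weights pair quadrature filter responses, these units reduce to the energy-model computation underlying the polar representation; letting the weights be learned relaxes that constraint, as described above.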