All organisms make temporal predictions, and their evolutionary fitness generally scales with the accuracy of these predictions. In visual perception, observer motion and continuous deformations of objects and textures imbue our visual input with distinct temporal structure, enabling partial prediction of future inputs from past ones. Inspired by recent hypotheses that primate visual representations support prediction by ``straightening'' the temporal trajectories of naturally occurring input, we formulate an architecture for image representation that facilitates prediction by linearizing the temporal trajectories of frames of natural video. Toward this goal, we note that many deformations can be described as linear advances in the phase of a complex-valued representation. The best-known example is the Fourier transform, whose complex coefficients have constant amplitude and linearly shifting phase under translational motion, but the concept generalizes to all compact commutative Lie groups. We train a network to optimize next-frame prediction over large natural video datasets and show that it achieves performance on par with (or better than) that of traditional motion compensation and conventional deep networks, while remaining interpretable and fast. The learned filters come in conjugate phase-shifted pairs and are selective for orientation and scale, reminiscent of the models used to describe the selectivity of primary visual cortex neurons. Our results demonstrate the potential of a principled video processing framework for modeling visual processing and, eventually, linking behavioral performance to neural computations.
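
As an illustration of the phase-advance property invoked above, the following minimal numpy sketch (not part of the paper; the signal, shift amount, and variable names are arbitrary choices for demonstration) checks the discrete Fourier shift theorem: a circular translation leaves coefficient amplitudes unchanged and advances their phases linearly with frequency.

\begin{verbatim}
import numpy as np

# Sketch only: verify that circular translation of a 1-D signal leaves
# Fourier amplitudes fixed and shifts phases linearly with frequency.
rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)   # arbitrary 1-D "frame"
shift = 3                    # translation in samples

X = np.fft.fft(x)
X_shifted = np.fft.fft(np.roll(x, shift))

# Amplitudes are unchanged by translation.
assert np.allclose(np.abs(X), np.abs(X_shifted))

# Phases advance by -2*pi*k*shift/N, a linear function of frequency k.
k = np.arange(N)
assert np.allclose(X_shifted, X * np.exp(-2j * np.pi * k * shift / N))
print("Constant amplitude, linearly advancing phase under translation.")
\end{verbatim}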