An important principle guiding our understanding of brain representations is that of efficient coding, in which neurons maximize the information they carry about external stimuli, subject to energetic or activity constraints (Barlow, 1961). While this idea has proven powerful in explaining various features of early sensory representations, technical limitations in estimating or bounding mutual information for complex distributions have hindered the use of these principles for learning abstract semantic features of the natural world, of the kind that can support planning, decisions, or actions. Here, we exploit recent advances in machine learning that allow estimation of mutual information for complex, high-dimensional densities to learn representations of natural images with an InfoMax objective, under traditional constraints on overall activity. We find that such learning yields semantically disentangled representations in which individual neural axes encode meaningful features of the sensory world in a combinatorial fashion. Moreover, although the model is trained only on static images, the resulting representation enables linear prediction over time for natural videos by straightening their temporal trajectories. Overall, our approach points to the general utility of information maximization as a basis for building useful world models, and to denoising objectives as a potential circuit-level mechanism supporting such learning.
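To make the objective concrete, the sketch below shows one common instantiation of this kind of InfoMax learning: an InfoNCE-style lower bound on the mutual information between an image and a noise-corrupted copy of itself, combined with a penalty on overall activity. This is a minimal illustrative sketch, not the implementation described above; the `encoder`, the Gaussian corruption (standing in for the denoising pairing), and all hyperparameters (`noise_std`, `temperature`, `activity_weight`) are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def infomax_loss(encoder, images, noise_std=0.1,
                 temperature=0.1, activity_weight=1e-3):
    """Hypothetical InfoMax objective: an InfoNCE lower bound on the
    mutual information between an image and a noisy copy, plus an
    L2 activity penalty standing in for a metabolic constraint."""
    # Two "views" of each image: the original and a noise-corrupted
    # copy (the corruption model is an assumption for illustration).
    noisy = images + noise_std * torch.randn_like(images)
    z_a, z_b = encoder(images), encoder(noisy)        # each (B, D)

    # Constraint on overall activity, applied before normalization.
    activity = activity_weight * (z_a.pow(2).mean() + z_b.pow(2).mean())

    # Cosine similarities between all clean/noisy pairs; matched
    # pairs lie on the diagonal of the (B, B) logit matrix.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(images.shape[0], device=images.device)

    # Cross-entropy against the diagonal is the negative InfoNCE
    # bound on I(z_a; z_b); minimizing it maximizes the bound.
    return F.cross_entropy(logits, targets) + activity
```

Any differentiable encoder can be plugged in; minimizing this loss with a standard optimizer maximizes the mutual-information bound subject to the activity constraint.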