Deep neural networks provide strong predictive models of primate visual cortex, but are typically trained with end-to-end objectives that require global backpropagation. We introduce ST-MMCR, a layerwise self-supervised learning scheme that trains successive stages of a convolutional hierarchy with local objectives matched to their receptive-field scale. The objective is based on maximum manifold capacity representations: temporally nearby views are pooled into compact, discriminable manifolds using projection, local spatial pooling, temporal pooling, and a nuclear-norm loss. This implements complexity matching through the architecture rather than through the separate hand-crafted augmentation streams used in previous work. Evaluated on macaque V1, V2, and V4 datasets and human-aligned object-classification benchmarks, ST-MMCR matches or exceeds architecture-matched supervised and self-supervised baselines in neural predictivity, approaches adversarially robust models, and improves out-of-distribution and human-aligned behavior when used as a visual front end.
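
The local objective described above (projection, local spatial pooling, temporal pooling, then a nuclear-norm loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `st_mmcr_loss_sketch`, the argument names, the use of average pooling, and the unit-norm step before temporal pooling are all assumptions, chosen in the style of maximum manifold capacity objectives, where the nuclear norm of the matrix of manifold centroids is maximized.

```python
import numpy as np

def st_mmcr_loss_sketch(features, proj, pool=2):
    """Hypothetical single-stage loss; all names here are illustrative.

    features: (N, T, H, W, C) activations for N clips of T temporally nearby frames.
    proj:     (C, D) linear projection matrix (assumed linear for simplicity).
    pool:     side length of the non-overlapping spatial pooling window (assumed).
    """
    n, t, h, w, _ = features.shape

    # 1) Projection: map each spatial location's channels to D dimensions.
    z = features @ proj                                    # (N, T, H, W, D)
    d = z.shape[-1]

    # 2) Local spatial pooling: average over non-overlapping pool x pool windows.
    hp, wp = h // pool, w // pool
    z = z[:, :, :hp * pool, :wp * pool]
    z = z.reshape(n, t, hp, pool, wp, pool, d).mean(axis=(3, 5))  # (N, T, hp, wp, D)

    # Assumption: embeddings are scaled to unit norm before temporal pooling,
    # so that tightly clustered views yield centroids with norm close to 1.
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)

    # 3) Temporal pooling: one centroid per clip and pooled location.
    centroids = z.mean(axis=1)                             # (N, hp, wp, D)

    # 4) Nuclear-norm loss: maximize the sum of singular values of the
    #    centroid matrix, i.e. minimize its negative.
    centroid_matrix = centroids.reshape(-1, d)
    return -np.linalg.norm(centroid_matrix, ord="nuc")
```

Under these assumptions, clips whose frames map to nearly the same embedding produce centroids close to unit norm and hence a lower (more negative) loss than clips whose frames are unrelated, which is the sense in which temporally nearby views are pulled into compact manifolds while the centroids of different clips are spread apart.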