Humans perceive and understand real-world spaces through a stream of visual observations. The ability to continuously maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply a longer context window but how spatial information is selected, organized, and retained over time.
We propose Spatial-TTT, a test-time training (TTT) approach to streaming, vision-based spatial intelligence that adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. We design a hybrid architecture that runs large-chunk fast-weight updates in parallel with sliding-window attention for efficient spatial video processing. We introduce a spatial-predictive mechanism in the TTT layers, using 3D spatiotemporal convolution to capture geometric correspondence and temporal continuity. We further construct a dataset with dense 3D spatial descriptions that guides the fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments show that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks.
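To make the architecture concrete, the following is a minimal PyTorch sketch of the ideas described above, not the authors' implementation: a TTT layer whose fast weights are a 3D spatiotemporal convolution, adapted once per large chunk of streaming video tokens via a self-supervised spatial-predictive (masked-reconstruction) inner loss, then applied as a residual read-out. The chunking scheme, inner loss, and hyperparameters (`inner_lr`, the dropout-style corruption) are illustrative assumptions; the parallel sliding-window attention branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTTTLayer(nn.Module):
    """Sketch of a TTT layer with 3D-convolutional fast weights (assumed design)."""

    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        # Fast weights: a small 3D conv over (time, height, width) that must
        # predict each chunk's features from a corrupted view, pushing it to
        # store geometric correspondence (space) and temporal continuity (time).
        self.fast_conv = nn.Conv3d(dim, dim, kernel_size=3, padding=1)
        self.inner_lr = inner_lr

    def inner_loss(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T, H, W) video-token features for one chunk.
        corrupted = F.dropout(feats, p=0.5)   # masked view (assumed corruption)
        pred = self.fast_conv(corrupted)      # spatial-predictive reconstruction
        return F.mse_loss(pred, feats)

    def update_fast_weights(self, feats: torch.Tensor) -> None:
        # One large-chunk test-time gradient step on the fast weights only;
        # all other ("slow") parameters stay frozen.
        loss = self.inner_loss(feats)
        grads = torch.autograd.grad(loss, self.fast_conv.parameters())
        with torch.no_grad():
            for p, g in zip(self.fast_conv.parameters(), grads):
                p -= self.inner_lr * g

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Read-out: apply the adapted fast weights as a residual memory.
        return feats + self.fast_conv(feats)


def stream(layer: SpatialTTTLayer, chunks):
    # Streaming loop: adapt on each incoming chunk, then read out. In the
    # hybrid design this branch would run in parallel with sliding-window
    # attention over recent tokens.
    outputs = []
    for chunk in chunks:
        layer.update_fast_weights(chunk)
        outputs.append(layer(chunk))
    return outputs


if __name__ == "__main__":
    layer = SpatialTTTLayer(dim=64)
    # Four chunks of 8 frames at 16x16 token resolution (illustrative shapes).
    chunks = [torch.randn(1, 64, 8, 16, 16) for _ in range(4)]
    outs = stream(layer, chunks)
    print(outs[-1].shape)  # torch.Size([1, 64, 8, 16, 16])
```

Because only the small fast-weight module receives gradients at test time, memory of the full video stream is compressed into a fixed parameter budget rather than a growing context window, which is what makes the per-chunk update compatible with unbounded streams.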