Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Fangfu Liu*, Diankun Wu*, Jiawei Chi*, Yimo Cai1, Yi-Hsin Hung1, Xumin Yu2, Hao Li3,
Han Hu2, Yongming Rao†,2, Yueqi Duan†,1
1Tsinghua University   2Tencent Hunyuan   3NTU
*Equal contribution   †Corresponding author
Spatial-TTT Teaser

Spatial-TTT maintains and updates spatial state from streaming video chunks, then answers spatial questions. Test-time training with fast weights acts as compact non-linear memory over long-horizon egocentric video.

Abstract

Humans perceive and understand real-world spaces through a stream of visual observations. The ability to maintain and update spatial evidence online, from potentially unbounded video streams, is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time.

We propose Spatial-TTT for streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. We design a hybrid architecture that runs large-chunk TTT updates in parallel with sliding-window attention for efficient spatial video processing. We introduce a spatial-predictive mechanism in TTT layers that uses 3D spatiotemporal convolution to capture geometric correspondence and temporal continuity. We further construct a dataset with dense 3D spatial descriptions to guide fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments show that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks.

Video

Spatial-TTT maintains and updates spatial evidence from streaming video, enabling accurate spatial reasoning over long-horizon egocentric observations.

Key Contributions

Spatial-TTT introduces test-time training for streaming spatial intelligence, using adaptive fast weights as compact non-linear memory to accumulate 3D evidence from unbounded video streams.

🧠

Fast Weights as Spatial Memory

We introduce TTT for streaming spatial understanding: adaptive fast weights function as compact non-linear memory, continuously encoding spatial evidence through online updates and enabling sublinear memory growth over unbounded video streams.
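The idea of a fast weight as online memory can be sketched in a few lines. The toy below updates a small linear fast weight with one gradient step per streamed token on a self-supervised reconstruction loss; the loss, learning rate, and dimensionality are illustrative assumptions, not the paper's exact formulation:

```python
# Minimal sketch of a test-time-training (TTT) fast-weight update.
# Illustrative only: a d x d linear fast weight W is trained online with a
# reconstruction loss ||W k - v||^2 per streamed (key, value) token pair.
# The loss, lr, and d are assumptions, not the paper's exact formulation.

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def ttt_update(W, k, v, lr=0.1):
    """One online gradient step on loss = ||W k - v||^2.

    The gradient w.r.t. W is 2 * (W k - v) k^T, so entry W[i][j] moves
    opposite to 2 * err[i] * k[j]. The state stays O(d^2) regardless of
    how many tokens stream in, which is the point of TTT-as-memory.
    """
    err = [p - t for p, t in zip(matvec(W, k), v)]
    return [[W[i][j] - lr * 2.0 * err[i] * k[j] for j in range(len(k))]
            for i in range(len(W))]

# Stream a few (key, value) pairs; the fast weight absorbs the mapping.
W = [[0.0, 0.0], [0.0, 0.0]]
stream = [([1.0, 0.0], [0.5, 0.0]), ([0.0, 1.0], [0.0, -0.5])] * 50
for k, v in stream:
    W = ttt_update(W, k, v)
```

After the stream, `W` approximates the key-to-value mapping it saw, even though no token was stored explicitly.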

📊

Spatial-Predictive Mechanism

Conventional TTT uses point-wise projections that ignore spatial structure. We apply depthwise 3D convolutions so fast weights learn predictive mappings between spatiotemporal contexts rather than isolated tokens, capturing geometric correspondence.
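To make the contrast with point-wise projections concrete, here is an unoptimized pure-Python depthwise 3D convolution: every channel is convolved with its own kernel over the (T, H, W) grid, so each output voxel aggregates its spatiotemporal neighborhood rather than a single token. Zero padding and the nested-list layout are illustrative choices:

```python
# Sketch of a depthwise spatiotemporal (3D) convolution, the kind of
# operation the spatial-predictive mechanism uses instead of point-wise
# projections. Pure Python and unoptimized; padding/layout are assumptions.

def depthwise_conv3d(x, kernels):
    """x: [C][T][H][W] input, kernels: [C][kt][kh][kw] (one kernel per
    channel, i.e. depthwise), zero padding, stride 1. Each output voxel
    mixes its spatiotemporal neighborhood within the same channel, so the
    layer sees local geometric and temporal context, not isolated tokens."""
    C, T = len(x), len(x[0])
    H, W = len(x[0][0]), len(x[0][0][0])
    kt, kh, kw = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    out = [[[[0.0] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]
    for c in range(C):
        for t in range(T):
            for i in range(H):
                for j in range(W):
                    s = 0.0
                    for dt in range(kt):
                        for di in range(kh):
                            for dj in range(kw):
                                tt, ii, jj = t + dt - pt, i + di - ph, j + dj - pw
                                if 0 <= tt < T and 0 <= ii < H and 0 <= jj < W:
                                    s += kernels[c][dt][di][dj] * x[c][tt][ii][jj]
                    out[c][t][i][j] = s
    return out
```

Because the kernel is shared per channel rather than per position, the parameter cost stays small while the receptive field spans both space and time.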

📚

Dense Scene Supervision

Existing spatial QA provides sparse, local supervision with weak gradient signals. We construct a dense scene-description dataset (global context, objects & counts, spatial relations) for rich supervision of fast-weight update dynamics.

Method Overview

Spatial-TTT pipeline

Overview of Spatial-TTT. The model uses a hybrid architecture that interleaves TTT layers with self-attention anchor layers at a 3:1 ratio. Within each TTT layer, sliding-window attention (SWA) and the TTT branch run in parallel with shared Q/K/V; the TTT branch applies a spatial-predictive mechanism with depthwise spatiotemporal convolution to capture geometric structure and temporal continuity.
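The 3:1 interleaving above can be sketched as a simple layer schedule; the layer labels and total depth here are illustrative assumptions, not the released model's configuration:

```python
# Illustrative layer schedule for the hybrid stack: TTT layers interleaved
# with full self-attention "anchor" layers at a 3:1 ratio. The labels and
# depth are assumptions for illustration only.

def hybrid_schedule(n_layers, ttt_per_anchor=3):
    """Return a list like ['ttt', 'ttt', 'ttt', 'attn', ...] where every
    (ttt_per_anchor + 1)-th layer is a full self-attention anchor."""
    return ['attn' if (i + 1) % (ttt_per_anchor + 1) == 0 else 'ttt'
            for i in range(n_layers)]

schedule = hybrid_schedule(8)  # three TTT layers, then one anchor, repeated
```

The anchor layers periodically re-mix information globally, while the TTT/SWA layers between them keep per-token cost bounded.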

Experimental Results

Spatial-TTT achieves state-of-the-art on video spatial benchmarks while scaling efficiently in memory and compute with input length.

VSI-Bench

General spatial understanding on egocentric video (ScanNet, ScanNet++, ARKitScenes). We report Accuracy (ACC) for multiple-choice and Mean Relative Accuracy (MRA) for numerical questions.

VSI-Bench comparison

VSI-SUPER: Recall & Count

Streaming spatial sensing on long-horizon video. VSR tests recall of temporal order of inserted objects; VSC tests object counting across extended sequences. Spatial-TTT maintains stable performance as video length increases via online fast-weight updates.

VSR VSC comparison

Memory & Compute Scaling

Peak decoding memory and theoretical TFLOPs vs input length (352×480). Spatial-TTT scales near-linearly; at 1024 frames it achieves over 40% reduction in both TFLOPs and memory compared to Qwen3-VL-2B.
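A back-of-the-envelope comparison shows why this scaling holds: full self-attention scores O(n²) token pairs, while sliding-window attention plus a constant-size TTT state costs O(n) per stream. The window size and per-token state cost below are made-up illustrative constants, not the paper's accounting:

```python
# Back-of-the-envelope token-interaction counts (illustrative constants,
# not the paper's FLOP accounting): full attention is quadratic in length n,
# while sliding-window attention plus a constant-size TTT state is linear.

def full_attn_pairs(n):
    return n * n  # every token attends to every token

def streaming_pairs(n, window=512, state_cost=64):
    # n tokens x (local window + constant fast-weight update per token)
    return n * (window + state_cost)

# The advantage of the streaming scheme widens as n grows.
ratios = [full_attn_pairs(n) / streaming_pairs(n)
          for n in (1_000, 10_000, 100_000)]
```

At short lengths the two are comparable; the gap opens as the stream gets longer, which matches the qualitative shape of the reported curves.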

Peak memory and TFLOPs

Citation

@article{spatialttt2026,
  title     = {Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training},
  author    = {Liu, Fangfu and Wu, Diankun and Chi, Jiawei and Cai, Yimo and Hung, Yi-Hsin and Yu, Xumin and Li, Hao and Hu, Han and Rao, Yongming and Duan, Yueqi},
  journal   = {arXiv preprint arXiv:2603.12255},
  year      = {2026}
}