LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu1, Hao Li2, Jiawei Chi1, Hanyang Wang1,3, Minghui Yang3, Fudong Wang3, Yueqi Duan1†
1Tsinghua University, 2Nanyang Technological University, 3Ant Group
ICCV 2025 🔥
† Indicates Corresponding Author
Teaser

LangScene-X: We propose LangScene-X, a unified model that generates RGB images, segmentation maps, and normal maps from sparse-view inputs, and reconstructs a 3D language field from them.

Abstract

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability.
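The Language Quantized Compressor (LQC) is only described at a high level above. Below is a minimal sketch of one way such a compressor could be built, assuming a VQ-VAE-style autoencoder over CLIP language features; the class name, dimensions, and loss weights are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageQuantizedCompressor(nn.Module):
    """Illustrative VQ-style autoencoder: compresses high-dimensional language
    embeddings (e.g., 512-d CLIP features) into low-dimensional quantized codes
    that are cheap to store per pixel or per Gaussian."""

    def __init__(self, feat_dim=512, latent_dim=3, codebook_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )

    def forward(self, feats):                        # feats: (N, feat_dim)
        z = self.encoder(feats)                      # continuous latent (N, latent_dim)
        dist = torch.cdist(z, self.codebook.weight)  # distance to each code (N, codebook_size)
        idx = dist.argmin(dim=1)
        z_q = self.codebook(idx)                     # nearest codebook entries
        z_st = z + (z_q - z).detach()                # straight-through estimator
        recon = self.decoder(z_st)
        return recon, z, z_q


def lqc_loss(recon, feats, z, z_q, beta=0.25):
    """Standard VQ-VAE objective: reconstruction + codebook + commitment terms."""
    rec = F.mse_loss(recon, feats)
    codebook = F.mse_loss(z_q, z.detach())
    commit = F.mse_loss(z, z_q.detach())
    return rec + codebook + beta * commit
```

Because the compressor is trained on large-scale image data rather than per scene, the same codebook can be reused across scenes, which is what enables cross-scene generalization without retraining.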

Method Overview

Pipeline of LangScene-X. Our model is composed of a TriMap Video Diffusion model that generates RGB, segmentation-map, and normal-map videos, an autoencoder (the Language Quantized Compressor) that compresses the language features, and a field constructor that reconstructs a 3DGS scene from the generated videos.
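For readers who prefer code to a figure, the sketch below outlines how the three stages described in the caption could be wired together. Every function name and signature here is a hypothetical placeholder for the corresponding pipeline component, not the actual LangScene-X API.

```python
from typing import Any, Callable, Dict, List

# Hypothetical glue code; each callable stands in for a stage from the pipeline figure.
def reconstruct_language_scene(
    sparse_views: List[Any],
    trimap_diffusion: Callable[[List[Any]], Dict[str, Any]],
    lqc_encode: Callable[[Any], Any],
    build_gaussian_field: Callable[..., Any],
) -> Any:
    # 1. TriMap video diffusion: expand the sparse inputs into 3D-consistent
    #    RGB, normal, and segmentation video frames.
    videos = trimap_diffusion(sparse_views)  # e.g., {"rgb": ..., "normal": ..., "seg": ...}

    # 2. Language Quantized Compressor: compress the per-segment language
    #    embeddings into compact codes attached to each segment.
    codes = lqc_encode(videos["seg"])

    # 3. Field construction: optimize a 3D Gaussian Splatting scene on the
    #    generated frames and align the language codes with its surface,
    #    yielding a scene that supports open-vocabulary queries.
    return build_gaussian_field(videos["rgb"], videos["normal"], codes)
```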

Video Demos from TriMap Video Diffusion

More Results from TriMap Video Diffusion

Activation Demos



BibTeX

@misc{liu2025langscenexreconstructgeneralizable3d,
    title={LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion},
    author={Fangfu Liu and Hao Li and Jiawei Chi and Hanyang Wang and Minghui Yang and Fudong Wang and Yueqi Duan},
    year={2025},
    eprint={2507.02813},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.02813}, 
}