Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Fangfu Liu1, Diankun Wu1, Yi Wei1, Yongming Rao2, Yueqi Duan1
1Tsinghua University, 2BAAI
CVPR 2024

Sherpa3D is a new text-to-3D framework that generates high-fidelity, diverse, and 3D-consistent results within 25 minutes.


Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by limited 3D data. In contrast, 2D diffusion models can be distilled to achieve excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues: text prompts alone fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.

Figure 1. Gallery of Sherpa3D: Blender renderings of various textured meshes from Sherpa3D, which generates high-fidelity, diverse, and multi-view consistent 3D content from input text prompts. Our method is also compatible with popular graphics engines.

Zero-shot 3D Generation

Iron Man in his state-of-the-art suit, confidently standing, looking ahead, ready for action.

Spaceship,futuristic design,sleek metal,glowing thrusters, flying in space.

A detailed and realistic 3D model of a vintage camera

Detailed portrait of a noble knight, full armor, intricate helmet design.

A DSLR photo of an adorable Corgi dog with a wagging tail.

A statue of a angel.

A futuristic battle robot, heavily armed, amidst a post-apocalyptic urban wasteland.

A 3D model of A Darth Vader helmet, highly detailed.

A cybernetic biomechanical arm, with a blend of organic and mechanical elements.

Hyper-realistic image of a snow leopard, capturing its camouflage and majestic stance.

Mesh Animations

The mesh generated by Sherpa3D can be animated with Mixamo.


Given a text prompt as input, we first prompt a 3D diffusion model to build a coarse 3D prior M encoded in the geometry model (e.g., DMTet). Next, we render the normal map of the mesh extracted from DMTet and derive two guiding strategies from M. Structural Guidance: we use a structural descriptor to compute salient geometric features that preserve geometric fidelity (e.g., avoiding the pockmarked-face problem). Semantic Guidance: we use a semantic encoder (e.g., CLIP) to extract high-level information that maintains 3D consistency (e.g., avoiding multi-face issues). Employing both types of guidance in the 2D lifting process, we use the normal map as the shape encoding for the 2D diffusion model and unleash its power to generate high-quality and diversified results with 3D coherence. Finally, we obtain the final 3D results via photorealistic rendering through appearance modeling. ("Everest's summit eludes many without Sherpa.")
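
To make the roles of the two guidance terms concrete, the following is a minimal, illustrative sketch and not the released implementation. It assumes that the structural descriptor can be approximated by a Sobel edge filter applied to a single-channel map, that the semantic encoder (e.g., CLIP) has already produced feature vectors for the rendered view and the coarse prior, and that both terms are added to the 2D lifting loss as weighted penalties. The function names and the weights `w_struct` and `w_sem` are hypothetical.

```python
import math

# Sobel kernels: an ASSUMED stand-in for the "structural descriptor".
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def edge_magnitude(img):
    """Gradient magnitude of a 2D grid (list of rows), interior pixels only."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def structural_loss(rendered, prior):
    """Mean squared difference between the edge maps of two maps."""
    e_r, e_p = edge_magnitude(rendered), edge_magnitude(prior)
    diffs = [(a - b) ** 2 for rr, rp in zip(e_r, e_p) for a, b in zip(rr, rp)]
    return sum(diffs) / len(diffs)


def semantic_loss(feat_rendered, feat_prior):
    """One minus the cosine similarity of two feature vectors.

    The vectors are assumed to come from a semantic encoder such as CLIP;
    computing them is outside the scope of this sketch.
    """
    dot = sum(a * b for a, b in zip(feat_rendered, feat_prior))
    norm = (math.sqrt(sum(a * a for a in feat_rendered))
            * math.sqrt(sum(b * b for b in feat_prior)))
    return 1.0 - dot / norm


def guided_loss(lifting_loss, rendered, prior, feat_rendered, feat_prior,
                w_struct=1.0, w_sem=1.0):
    """Hypothetical combined objective: 2D lifting loss plus both penalties."""
    return (lifting_loss
            + w_struct * structural_loss(rendered, prior)
            + w_sem * semantic_loss(feat_rendered, feat_prior))
```

In this sketch, a rendered view whose edges and features match the coarse prior adds no penalty, so the 2D diffusion model remains free to add detail, while a view that drifts in structure or semantics is pulled back toward the prior.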


      title={Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior},
      author={Fangfu Liu and Diankun Wu and Yi Wei and Yongming Rao and Yueqi Duan},
      journal={arXiv preprint arXiv:2312.06655},