Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Abstract

Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by limited 3D data. In contrast, distilling 2D diffusion models achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.

Figure 1. Gallery of Sherpa3D: Blender renderings of various textured meshes from Sherpa3D, which is able to generate high-fidelity, diverse, and multi-view consistent 3D content from input text prompts. Our method is also compatible with popular graphics engines.

Zero-shot 3D Generation

Iron Man in his state-of-the-art suit, confidently standing, looking ahead, ready for action.

Spaceship,futuristic design,sleek metal,glowing thrusters, flying in space.

A detailed and realistic 3D model of a vintage camera

Detailed portrait of a noble knight, full armor, intricate helmet design.

A DSLR photo of an adorable Corgi dog with a wagging tail.

A statue of a angel.

A futuristic battle robot, heavily armed, amidst a post-apocalyptic urban wasteland.

A 3D model of A Darth Vader helmet, highly detailed.

A cybernetic biomechanical arm, with a blend of organic and mechanical elements.

Hyper-realistic image of a snow leopard, capturing its camouflage and majestic stance.

Mesh Animations

The meshes generated by Sherpa3D can be animated with Mixamo.

Method

Given a text prompt as input, we first prompt a 3D diffusion model to build a coarse 3D prior M, encoded in a geometry model (e.g., DMTet). Next, we render the normal map of the mesh extracted from DMTet and derive two guiding strategies from M. Structural guidance: we use a structural descriptor to compute salient geometric features that preserve geometric fidelity (e.g., avoiding the pockmarked-face problem). Semantic guidance: we leverage a semantic encoder (e.g., CLIP) to extract high-level information that maintains 3D consistency (e.g., avoiding multi-face issues). Employing the two types of guidance in the 2D lifting process, we use the normal map as the shape encoding for the 2D diffusion model and unleash its power to generate high-quality and diversified results with 3D coherence. We then obtain the final 3D results via photorealistic rendering through appearance modeling. ("Everest's summit eludes many without Sherpa.")
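To make the role of the two guidance terms more concrete, the following is a minimal, illustrative sketch of how they might be added to a 2D lifting objective. It is not the authors' implementation. Several things here are assumptions made for the example: a Sobel edge filter stands in for the structural descriptor, a single-channel 2D grid stands in for a rendered normal map, plain feature vectors stand in for the output of a semantic encoder such as CLIP, and the names `lifting_loss`, `w_struct`, and `w_sem` are hypothetical.

```python
import math

# 3x3 Sobel kernels: an assumed stand-in for the structural descriptor.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def edge_descriptor(grid):
    """Sobel gradient magnitude at every interior pixel of a 2D grid."""
    height, width = len(grid), len(grid[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = gy = 0.0
            for j in range(3):
                for i in range(3):
                    value = grid[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * value
                    gy += SOBEL_Y[j][i] * value
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def structural_loss(rendered, prior):
    """Mean squared difference between the edge descriptors of two maps."""
    a, b = edge_descriptor(rendered), edge_descriptor(prior)
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def semantic_loss(feat_rendered, feat_prior):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(p * q for p, q in zip(feat_rendered, feat_prior))
    norm = (math.sqrt(sum(p * p for p in feat_rendered))
            * math.sqrt(sum(q * q for q in feat_prior)))
    return 1.0 - dot / max(norm, 1e-12)  # guard against zero-length vectors


def guided_loss(lifting_loss, rendered, prior, feat_rendered, feat_prior,
                w_struct=1.0, w_sem=1.0):
    """Hypothetical weighted sum of the lifting loss and both guidance terms."""
    return (lifting_loss
            + w_struct * structural_loss(rendered, prior)
            + w_sem * semantic_loss(feat_rendered, feat_prior))


# Toy inputs: a map with a vertical step edge versus a completely flat map.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flat = [[1.0] * 4 for _ in range(4)]
total = guided_loss(0.5, step, flat, [1.0, 0.0], [0.0, 1.0],
                    w_struct=0.1, w_sem=2.0)
```

In this sketch the structural term penalizes differences in edge structure between the current render and the coarse prior, while the semantic term penalizes drift in the encoder's feature space. In practice both terms would be computed on differentiable tensors across several camera views, which the sketch leaves out.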

BibTeX

@article{liu2023sherpa3d,
      title={Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior},
      author={Fangfu Liu and Diankun Wu and Yi Wei and Yongming Rao and Yueqi Duan},
      journal={arXiv preprint arXiv:2312.06655},
      year={2023}
}