Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

ECCV 2024

1Beihang University; 2Google; 3Megvii; 4University of California, Merced

CE3D is an interactive 3D scene editing framework that leverages an LLM to integrate dozens of different models.


Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing 3D scene editing schemes still have shortcomings that hinder interactive design. They typically adhere to fixed input patterns, which limits users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models, and they require intricate pipeline design to integrate these models into the 3D reconstruction process. To address these issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, centered around a large language model that accepts arbitrary textual input from users, interprets their intentions, and autonomously invokes the corresponding visual expert models. Furthermore, we design a scheme that uses Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design fully decouples 2D editing from 3D reconstruction, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse visual editing effects, with strong scene comprehension and multi-round dialogue capabilities.


We propose a novel editing paradigm that fully decouples the 2D model from the 3D reconstruction process. First, we learn a neural network that maps the different views of a 3D scene to plane atlases. Editing the 3D scene then becomes a set of operations performed within the 2D atlas space. Since the atlases inherently accommodate 2D visual models, no additional design is needed to adapt those models to the 3D reconstruction process, which ensures the flexible integration of multiple visual experts. Finally, by leveraging LLMs to parse user text input and manage the various visual models, our framework edits a 3D scene through chatting, which we refer to as Chat-Edit-3D (CE3D).
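
The decoupling described above can be illustrated with a toy sketch. This is not the CE3D implementation: the learned view-to-atlas network is replaced by fixed, hand-made UV lookup tables, the scene uses a single atlas, and the "2D visual model" is a simple colour inversion. The names `render_view`, `edit_atlas`, and `edit_scene` are hypothetical and chosen only for illustration. What the sketch shows is the key property: the edit touches only the 2D atlas, and every view re-rendered from the edited atlas picks up the same change.

```python
import numpy as np


def render_view(atlas, uv):
    """Look up atlas colours at per-pixel UV coordinates in [0, 1].

    Nearest-neighbour sampling stands in for the learned mapping
    (an assumption made to keep the sketch small).
    """
    h, w = atlas.shape[:2]
    cols = np.clip(np.rint(uv[..., 0] * (w - 1)).astype(int), 0, w - 1)
    rows = np.clip(np.rint(uv[..., 1] * (h - 1)).astype(int), 0, h - 1)
    return atlas[rows, cols]


def edit_atlas(atlas):
    """Placeholder for any 2D visual model: invert the colours."""
    return 1.0 - atlas


def edit_scene(atlas, uv_maps, edit_fn):
    """Edit once in atlas space, then re-render every view from the result."""
    edited = edit_fn(atlas)
    return [render_view(edited, uv) for uv in uv_maps]


# A random 8x8 RGB atlas and two 4x4 views that cover overlapping atlas regions.
rng = np.random.default_rng(0)
atlas = rng.random((8, 8, 3))
ys, xs = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4), indexing="ij")
uv_a = np.stack([xs * 0.5, ys * 0.5], axis=-1)
uv_b = np.stack([xs * 0.5 + 0.25, ys * 0.5], axis=-1)

views = edit_scene(atlas, [uv_a, uv_b], edit_atlas)
```

Because `edit_fn` receives only an ordinary image array, swapping in a different 2D model requires no change to the view mapping, which is the flexibility the paragraph above attributes to the atlas representation.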



      title={Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts}, 
      author={Shuangkang Fang and Yufeng Wang and Yi-Hsuan Tsai and Yi Yang and Wenrui Ding and Shuchang Zhou and Ming-Hsuan Yang},