Editing 3D Scenes via Text Prompts without Retraining

Shuangkang Fang, Yufeng Wang, Yi Yang, Yi-Hsuan Tsai, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang

Megvii Inc., Google Inc., UC Merced

Paper | Code (coming soon)


Numerous diffusion models have recently been applied to image synthesis and editing. However, editing 3D scenes remains in its early stages and poses various challenges, such as the need to design specific methods for different editing types, to retrain new models for different 3D scenes, and the absence of convenient human interaction during editing. To tackle these issues, we introduce a text-driven editing method, termed DN2N, which directly yields a NeRF model with universal editing capabilities, eliminating the need for retraining. Our method employs off-the-shelf text-based editing models for 2D images to modify the images of a 3D scene, followed by a filtering process that discards poorly edited images which disrupt 3D consistency. We then treat the remaining inconsistency as a noise-removal problem, which can be addressed by generating training data with similar perturbation characteristics. We further propose cross-view regularization terms that help the generalized NeRF model mitigate these perturbations. Our text-driven method lets users edit a 3D scene with their desired description, which is friendlier, more intuitive, and more practical than prior works. Empirical results show that our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material change, and style transfer. Most importantly, our method generalizes well: editing abilities are shared among a single set of model parameters without requiring a customized editing model for specific scenes, so novel views with editing effects can be inferred directly from user input.
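As a rough illustration of the filtering step mentioned above, poorly edited views could be scored against the rest of the batch in some feature space and outliers discarded. The embedding source (e.g. a CLIP-style encoder) and the threshold value are assumptions for this sketch, not the paper's actual filter:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def filter_edited_views(features, threshold=0.8):
    """Keep edited views whose features agree with the batch mean.

    `features`: list of 1-D embeddings of the edited images (the encoder
    choice is an assumption here). Views whose similarity to the mean
    embedding falls below `threshold` are treated as disrupting 3D
    consistency and dropped. Returns the indices of the retained views.
    """
    mean_feat = np.mean(np.stack(features), axis=0)
    return [i for i, f in enumerate(features)
            if cosine_sim(f, mean_feat) >= threshold]
```

The idea is only that mutually consistent edits cluster together in feature space; a view edited in a very different way falls far from the cluster mean and is removed before training.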


We first use a 2D editing model to perform preliminary editing on the images of a 3D scene. We then apply a designed content filter to remove images with poor editing results that cause significant 3D inconsistency. However, the remaining images may still contain 3D inconsistencies, which we regard as noise perturbations on the consistently edited images, arising from the inherently stochastic and diverse nature of the 2D editing model. We leverage this characteristic to create training data pairs: we generate image captions with the BLIP model and target captions via GPT, then apply minor perturbations associated with these captions to a 3D scene. The perturbed images thus serve as noisy inputs, while the unedited images serve as pseudo-ground truth. Based on this produced training data, we introduce two cross-view regularization terms during training, over the self and neighboring views, to improve 3D editing consistency. The former requires the NeRF model to produce consistent results for the same target view derived from two different source views, while the latter enforces that the overlapping pixel values between the target and adjacent views be approximately equal. Finally, both the perturbation dataset and the regularization terms are incorporated into the training of our generalizable NeRF model to enforce its 3D consistency.
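The two cross-view regularization terms described above can be sketched as simple pixel-space penalties. This is a minimal NumPy illustration assuming rendered images are arrays of shape (H, W, 3) and that an overlap mask between views is available; the loss weights and the exact form of the terms are assumptions, not the paper's implementation:

```python
import numpy as np

def self_view_loss(render_a, render_b):
    """Self-view term: the same target view rendered from two different
    source views should agree (mean-squared error over all pixels)."""
    return float(np.mean((render_a - render_b) ** 2))

def neighbor_view_loss(target, neighbor, overlap_mask):
    """Neighboring-view term: pixels visible in both the target view and
    an adjacent view should be approximately equal on their overlap.
    `overlap_mask` is (H, W, 1) with 1 where the views overlap."""
    diff = (target - neighbor) ** 2
    return float(np.sum(diff * overlap_mask) / (np.sum(overlap_mask) + 1e-8))

def cross_view_regularization(render_a, render_b, neighbor, overlap_mask,
                              w_self=1.0, w_nbr=1.0):
    """Combined regularization; the weights are illustrative assumptions."""
    return (w_self * self_view_loss(render_a, render_b)
            + w_nbr * neighbor_view_loss(render_a, neighbor, overlap_mask))
```

Both terms would be added to the usual NeRF photometric loss during training, so the model learns to average out the per-image perturbations rather than reproduce them.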








@article{fang2023dn2n,
    author  = {Fang, Shuangkang and Wang, Yufeng and Yang, Yi and Tsai, Yi-Hsuan and Ding, Wenrui and Zhou, Shuchang and Yang, Ming-Hsuan},
    title   = {Text-driven Editing of 3D Scenes without Retraining},
    journal = {arXiv},
    year    = {2023}
}