3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

CVPR 2026

1Nankai University    2Nanjing University    3Horizon Robotics    4D-Robotics

Teaser: 3D-Fixer creates a high-fidelity 3D scene from a single image via in-place completion.

Abstract

Compositional 3D scene generation from a single view requires simultaneous recovery of the layout and 3D assets, and is currently dominated by feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses via efficient network inference but generalize poorly to complex scenes, whereas the latter gain generalization through a divide-and-conquer strategy but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, built on a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the same location, which is cropped from the fragmented geometry produced by geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer explicitly uses the fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to dismantle the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments demonstrate that 3D-Fixer achieves state-of-the-art geometric accuracy, significantly outperforming baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process.

How it works


Given an input image of a scene, we segment it into multiple parts and use a multi-instance diffusion model, conditioned on those segments, to generate compositional 3D instances of the scene. These instances can be directly composed into a scene, and total processing takes as little as 40 seconds.
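Conceptually, the pipeline above is a segment-then-complete loop whose outputs compose without any pose solving, since each asset is generated in place. The following is a minimal Python sketch of that control flow; all function names and data records are illustrative stand-ins, not the released 3D-Fixer API.

```python
# Hypothetical sketch of the 3D-Fixer scene pipeline. The functions below
# are placeholders for real segmentation / diffusion / composition models.

def segment_instances(image):
    """Stand-in for instance segmentation: returns per-object crops."""
    # Pretend the input image contains three detected objects.
    return [{"id": i, "crop": f"crop_{i}"} for i in range(3)]

def generate_asset(instance):
    """Stand-in for the diffusion model completing one 3D instance in place."""
    # A real model would return a mesh; here we return a placeholder record.
    return {"id": instance["id"], "mesh": f"mesh_for_{instance['crop']}"}

def compose_scene(assets):
    """Assets are generated at their scene locations, so composition is a union."""
    return {"instances": assets}

def reconstruct(image):
    instances = segment_instances(image)
    assets = [generate_asset(inst) for inst in instances]
    return compose_scene(assets)

scene = reconstruct("input.png")
print(len(scene["instances"]))  # one completed asset per segmented object: 3
```

The key design point mirrored here is that `compose_scene` needs no 6DoF alignment step: because completion is conditioned on geometry already anchored in scene coordinates, the union of per-instance outputs is the scene.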

Interactive Results


Comparisons to Other Methods


Method Overview



Architecture of the 3D-Fixer pipeline and dataset. (Top) Scene Decomposition extracts instance-level partial geometry from the input. (Bottom-left) Progressive Completion generates the full asset via three stages: 1) The Coarse Structure Completer hallucinates topology within a loose bound; 2) The Fine Shape Refiner sharpens geometry within a fine boundary; and 3) The Occlusion-Aware 3D Texturer applies observation-aligned textures. (Bottom-right) Our ARSG-110K Dataset provides high-quality assets and rich scene compositions for training.
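To make the "loose bound" versus "fine boundary" distinction in the caption concrete, here is a small illustrative Python snippet: the coarse stage is given an inflated bounding box of the partial points (room to hallucinate occluded topology), while the fine stage shrinks to a tighter bound around the structure. The inflation factors are made-up values for illustration, not the paper's settings.

```python
# Illustrative coarse-to-fine bounds from a partial point cloud.
# Factors 1.5 / 1.1 are hypothetical, chosen only to show the idea.

def bbox(points):
    """Axis-aligned bounding box (lo, hi) of a list of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def inflate(box, factor):
    """Scale a box about its center by the given factor."""
    lo, hi = box
    center = [(l + h) / 2 for l, h in zip(lo, hi)]
    half = [(h - l) / 2 * factor for l, h in zip(lo, hi)]
    return (tuple(c - s for c, s in zip(center, half)),
            tuple(c + s for c, s in zip(center, half)))

partial = [(0, 0, 0), (1, 2, 1), (2, 1, 3)]    # visible surface points
coarse_bound = inflate(bbox(partial), 1.5)     # loose: room to hallucinate
fine_bound = inflate(bbox(partial), 1.1)       # tight: sharpen geometry
print(round(coarse_bound[0][0], 3), round(fine_bound[0][0], 3))  # -0.5 -0.1
```

The coarse completer can then fill in structure anywhere inside `coarse_bound`, and the refiner sharpens only within `fine_bound`, which is how boundary ambiguity under occlusion gets resolved progressively rather than in one shot.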

BibTeX

@inproceedings{yin20263dfixer,
  title={3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image},
  author={Yin, Ze-Xin and Liu, Liu and Wang, Xinjie and Sui, Wei and Su, Zhizhong and Yang, Jian and Xie, Jin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}