SGAM: Building a Virtual 3D World through Simultaneous Generation and Mapping

  • \(^1\) University of Illinois at Urbana-Champaign
  • \(^2\) Massachusetts Institute of Technology

NeurIPS 2022 · Poster Session I, Tuesday, Nov 29, 11 AM - 1 PM CST, Hall J #522


Our goal is to generate a large-scale 3D world from a single RGB-D frame at an initial pose. We present a new 3D scene generation framework that simultaneously generates sensor data at novel viewpoints and builds a 3D map. Our framework is illustrated in the diagram below. The GIF animation above is generated via SGAM with only the first RGB-D frame known.



We present simultaneous generation and mapping (SGAM), a novel 3D scene generation algorithm. Our goal is to produce a realistic, globally consistent 3D world on a large scale. Achieving this goal is challenging and goes beyond the capacities of existing 3D generation or video generation approaches, which fail to scale up to create large, globally consistent 3D scene structures. To tackle these challenges, we take a hybrid approach that integrates generative sensor modeling with 3D reconstruction. Our proposed approach is an autoregressive generative framework that simultaneously generates sensor data at novel viewpoints and builds a 3D map at each time step. Given an arbitrary camera trajectory, our method repeatedly applies this generation-and-mapping process for thousands of steps, allowing us to create a gigantic virtual world. Our model can be trained from RGB-D sequences without access to the complete 3D scene structure. The generated scenes are readily compatible with various interactive environments and rendering engines. Building upon the CLEVR dataset, we propose a large-scale 3D scene generation benchmark, the CLEVR-Infinite dataset, and demonstrate that our method generates consistent, realistic, and geometrically plausible scenes that compare favorably to existing view synthesis methods.
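The autoregressive generation-and-mapping loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`render_from_map`, `generate_frame`, `update_map`) and the dictionary-based map are hypothetical stand-ins for the actual generative sensor model and 3D mapping module.

```python
import numpy as np

def render_from_map(world_map, pose):
    # Stub: return the partial observation the map already explains at
    # `pose`. A real system would ray-cast or rasterize the 3D map.
    return world_map.get(pose)

def generate_frame(partial_obs, pose, rng):
    # Stub generative sensor model: completes the RGB-D frame
    # conditioned on the rendered partial observation. Here we simply
    # reuse the observation when available, else sample a placeholder.
    if partial_obs is not None:
        return partial_obs
    return rng.standard_normal((4, 8, 8))  # 4 channels: RGB + depth

def update_map(world_map, frame, pose):
    # Stub mapping step: fuse the generated RGB-D frame into the
    # global map so later steps stay consistent with it.
    world_map[pose] = frame
    return world_map

def sgam(initial_frame, trajectory, seed=0):
    """At each pose along the trajectory: render what the current map
    predicts, generate the missing sensor data, fuse it back into the
    map, and move on. Repeating this yields an ever-growing scene."""
    rng = np.random.default_rng(seed)
    world_map = {trajectory[0]: initial_frame}
    frames = [initial_frame]
    for pose in trajectory[1:]:
        partial = render_from_map(world_map, pose)
        frame = generate_frame(partial, pose, rng)
        world_map = update_map(world_map, frame, pose)
        frames.append(frame)
    return frames, world_map
```

Because each new frame is fused into the same global map, revisited viewpoints are rendered from the map rather than regenerated, which is what enforces global consistency over long trajectories.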


Results on our CLEVR-Infinite Dataset

The following videos show 3D scene examples sampled along different trajectory orders.

Results on our GoogleEarth-Infinite Dataset

The following are colored point-cloud visualizations of the global maps built from four generated 3D scenes.


The following GIFs show 3D scene examples sampled from different initial RGB-D images.



We thank Vlas Zyrianov for his feedback on our paper drafts. Our codebase is built on top of the VQGAN codebase; many thanks to Patrick Esser and Robin Rombach for making their code available.