We propose an online 3D Gaussian-based dense mapping framework that reconstructs photorealistic details from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps.
To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians to capture details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometry and texture, preserving intricate details while maintaining overall structural integrity.
Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability.
The Multi-level Occupancy Hash Voxels (MOHV) structure regularizes Gaussians to capture details across multiple levels of granularity. Given the target scale of the current camera view — which reflects the scale of the captured image in world space — MOHV determines whether a given position in world space is occupied by a Gaussian.
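As a concrete illustration, the sketch below shows one way such a structure could be realized: voxel sizes halve at each finer level, and each level keeps a hash set of occupied integer voxel coordinates. The class and method names are our own illustration, not the paper's implementation.

```python
import numpy as np

class MOHV:
    """Minimal multi-level occupancy hash voxels (illustrative sketch)."""

    def __init__(self, base_voxel_size: float, num_levels: int):
        # Level 0 is coarsest; each finer level halves the voxel size.
        self.voxel_sizes = [base_voxel_size / (2 ** l) for l in range(num_levels)]
        self.occupied = [set() for _ in range(num_levels)]  # one hash set per level

    def _key(self, position, level: int):
        # Quantize a world-space position to integer voxel coordinates.
        size = self.voxel_sizes[level]
        return tuple(int(np.floor(c / size)) for c in position)

    def level_for_scale(self, scale: float) -> int:
        # Pick the coarsest level whose voxels are no larger than the
        # target scale, i.e. just fine enough to resolve it.
        for level, size in enumerate(self.voxel_sizes):
            if size <= scale:
                return level
        return len(self.voxel_sizes) - 1  # finest level as a fallback

    def insert(self, position, scale: float) -> None:
        # Mark the voxel containing a newly added Gaussian as occupied
        # at the level matching the Gaussian's scale.
        level = self.level_for_scale(scale)
        self.occupied[level].add(self._key(position, level))

    def is_occupied(self, position, target_scale: float) -> bool:
        # A position counts as occupied if the query level, or any
        # coarser level, already covers it.
        level = self.level_for_scale(target_scale)
        return any(self._key(position, l) in self.occupied[l]
                   for l in range(level + 1))
```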
Leveraging our hierarchical structure, Gaussians are distributed in a coarse-to-fine manner: densely in detailed, close-view regions and sparsely in distant regions with simple textures or geometry. This avoids redundancy and keeps the number of Gaussians proportional to the scene's information content, enabling effective reconstruction of details across different levels.
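The sketch below illustrates this coarse-to-fine distribution on top of the MOHV sketch above. It assumes candidate world-space positions have already been proposed (the framework itself does this without depth maps) and approximates the target scale by the world-space footprint of one pixel at each candidate's distance; all names are illustrative.

```python
import numpy as np

def distribute_gaussians(mohv, candidate_positions, camera_center,
                         focal_length, pixel_size=1.0):
    """Add Gaussians only where the MOHV is not yet occupied (sketch)."""
    new_positions = []
    for p in candidate_positions:
        # Target scale: world-space footprint of one pixel at this distance,
        # so close, detailed regions map to fine levels and distant regions
        # with simple content map to coarse levels.
        dist = np.linalg.norm(np.asarray(p) - np.asarray(camera_center))
        target_scale = dist * pixel_size / focal_length
        # Skip positions already covered at this granularity, which keeps
        # the Gaussian count roughly proportional to scene information.
        if not mohv.is_occupied(p, target_scale):
            mohv.insert(p, target_scale)
            new_positions.append(p)
    return new_positions
```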
After Gaussians are distributed in world space, their optimization is guided by the selected views. The global consistency optimization module balances the refinement of newly observed local regions with the maintenance of global consistency across the entire scene.
To achieve this, we sample two sets of optimization views in addition to the current view. Local views are selected based on covisibility to enforce multi-view consistency in newly observed regions, while global views are sampled based on distance and reconstruction error to focus on under-optimized or recently observed areas, avoiding redundant optimization of well-reconstructed historical regions.
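A minimal sketch of this two-set sampling is shown below. Here `covisibility`, `distance`, and the per-keyframe `recon_error` field are hypothetical stand-ins for whatever covisibility measure, view distance, and running reconstruction error the system tracks, and the error-over-distance weighting is one plausible choice rather than the paper's exact rule.

```python
import numpy as np

def sample_optimization_views(keyframes, current, n_local, n_global, rng=None):
    """Select local (covisibility) and global (distance/error) view sets."""
    rng = rng or np.random.default_rng()
    # Local views: keyframes sharing the most covisible content with the
    # current view, enforcing multi-view consistency in new regions.
    ranked = sorted(keyframes, key=lambda kf: covisibility(kf, current),
                    reverse=True)
    local_views = ranked[:n_local]
    # Global views: sample the remaining keyframes with weights that grow
    # with reconstruction error and shrink with distance from the current
    # view, so well-reconstructed historical regions are rarely revisited.
    rest = ranked[n_local:]
    if not rest:
        return local_views, []
    weights = np.array([kf.recon_error / (1.0 + distance(kf, current))
                        for kf in rest], dtype=float)
    total = weights.sum()
    probs = weights / total if total > 0 else None  # uniform fallback
    idx = rng.choice(len(rest), size=min(n_global, len(rest)),
                     replace=False, p=probs)
    return local_views, [rest[i] for i in idx]
```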
Incrementally reconstructing photorealistic scenes from a monocular RGB input stream. (Video is compressed for web display.)
For more comparisons and results, please refer to our Supplementary Video. (Videos are compressed for web display.)