RemixFusion

We propose a mixed residual-based representation for dense RGB-D reconstruction of large-scale scenes, which preserves fine-grained details with relatively low memory and computational cost.
We propose a residual-based bundle adjustment technique that employs a tiny MLP for residual-based pose refinement. Compared to traditional BA, our method improves pose estimation in terms of both efficiency and robustness.
We have implemented an efficient online RGB-D dense reconstruction system that achieves robust, fine-grained, real-time reconstruction of large scenes over 1000 m² with an affordable GPU memory footprint.

The introduction of neural implicit representations has notably advanced online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, they substantially improve mapping completeness and memory efficiency. However, the lack of reconstruction detail and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprising an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such a mixed representation allows for detail-rich reconstruction within a bounded time and memory budget, in contrast to the over-smoothed results of purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. In addition, we adopt a local moving volume to factorize the whole mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art methods, including those based on either explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
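The residual-based map query can be illustrated with a minimal sketch: a coarse TSDF grid is sampled by trilinear interpolation and a small network adds a residual on top. Everything named here (`trilinear_sample`, `TinyResidualMLP`, `query_tsdf`, the residual bound `scale`) is a hypothetical illustration rather than the authors' implementation, and a plain untrained MLP on raw coordinates stands in for the hash embedding and decoders.

```python
import numpy as np

def trilinear_sample(grid, pts):
    """Trilinearly interpolate a dense (X, Y, Z) grid at voxel coordinates pts (N, 3)."""
    upper = np.array(grid.shape) - 1
    pts = np.clip(pts, 0.0, upper)                          # keep queries inside the grid
    i0 = np.minimum(np.floor(pts).astype(int), upper - 1)   # lower corner of each cell
    t = pts - i0                                            # fractional offset in [0, 1]
    out = np.zeros(len(pts))
    for dx in (0, 1):
        wx = t[:, 0] if dx else 1.0 - t[:, 0]
        for dy in (0, 1):
            wy = t[:, 1] if dy else 1.0 - t[:, 1]
            for dz in (0, 1):
                wz = t[:, 2] if dz else 1.0 - t[:, 2]
                out += wx * wy * wz * grid[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
    return out

class TinyResidualMLP:
    """Hypothetical stand-in for the implicit module; weights are random, not trained."""
    def __init__(self, hidden=16, scale=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 1.0, (3, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 1.0, (hidden, 1))
        self.scale = scale  # assumed bound on the residual magnitude

    def __call__(self, pts):
        h = np.tanh(pts @ self.w1 + self.b1)
        return self.scale * np.tanh(h @ self.w2)[:, 0]

def query_tsdf(coarse_grid, residual_net, pts):
    """Final TSDF = coarse low-frequency value + high-frequency residual."""
    return trilinear_sample(coarse_grid, pts) + residual_net(pts)
```

In this sketch the coarse grid carries most of the signal, so the network only has to model a small, bounded correction; bounding the residual is an assumption made here for illustration.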

Method Overview. (a) Given RGB-D inputs, pose estimation is based on frame-to-model randomized optimization on a scalable moving volume, which provides the initial pose estimate. (b) Based on the initial poses, a global MLP outputs residuals for multi-view consistent pose refinement, using the rendering loss and geometric loss, which are backpropagated through the global reconstruction model. (c) For reconstruction, RemixFusion consists of a coarse TSDF grid, which records the low-frequency scene structure, and an implicit neural map comprising a hash embedding and tiny decoders, which encode the high-frequency geometric details. TSDF and RGB residuals are decoded from these embeddings and added to the coarse grid to recover the final reconstruction. The residual-based BA and mapping run in parallel with the front-end tracking. The residual designs in both pose estimation and reconstruction ensure efficiency and accuracy.
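The idea in (b) of refining a pose through a residual, rather than overwriting the pose itself, can be sketched as follows. This is only an illustration under stated assumptions: the residual is taken to be a 6-vector (axis-angle rotation plus translation) applied by left-multiplication, and the function names `so3_exp` and `apply_pose_residual` are hypothetical. In the described system such a residual would be produced by the global MLP; here it is passed in directly.

```python
import numpy as np

def so3_exp(omega):
    """Rodrigues' formula: axis-angle vector (3,) to rotation matrix (3, 3)."""
    theta = np.linalg.norm(omega)
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near zero rotation
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def apply_pose_residual(T_init, delta):
    """Compose an initial 4x4 pose with a small correction delta = (omega, t).

    Assumption: the correction is applied on the left, T_refined = dT @ T_init.
    """
    dT = np.eye(4)
    dT[:3, :3] = so3_exp(delta[:3])
    dT[:3, 3] = delta[3:]
    return dT @ T_init
```

A zero residual leaves the tracked pose unchanged, so under this parametrization the refinement starts from the front-end estimate and only has to learn small corrections.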

Qualitative Results. Gallery of 3D reconstruction and camera tracking on the dining sequence of BS3D, the building2 sequence of FastCaMo-Large, and the office sequence from our self-captured data. These three sequences are composed of 5572, 7259, and 8656 images, which correspond to over 1000 m², 200 m², and 180 m², respectively. The colorized trajectory indicates the estimated poses from the beginning (red) to the end (blue). Zoom-in comparisons are marked with red rectangles. Our method achieves the most accurate and robust performance in real time, while other approaches fail or exhibit severe drift.

Representative Image. We present RemixFusion, a residual-based RGB-D framework that leverages both explicit and implicit representations for large-scale online dense reconstruction. RemixFusion supports real-time fine-grained reconstruction in a memory-efficient way. It uses only 9.8 GB of GPU memory at 12 FPS for the roughly 400 m² reconstruction above, while other methods [Johari et al. 2023; Tang et al. 2023; Zhu et al. 2022] struggle with both tracking and reconstruction in real time. Traditional explicit methods fail on this scene. GS-ICP SLAM [Ha et al. 2024] is the SOTA 3DGS-based SLAM. The average reconstruction and tracking results on the BS3D dataset, as well as the system FPS and GPU memory usage on the above scene, are shown on the right, illustrating that RemixFusion obtains better performance and efficiency. RemixFusion-lite denotes the lightweight version and achieves decent performance at about 25 FPS.

BibTeX

@article{lan2025remixfusion,
  title={RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction},
  author={Lan, Yuqing and Zhu, Chenyang and Zhi, Shuaifeng and Zhang, Jiazhao and Wang, Zhoufeng and Yi, Renjiao and Wang, Yijie and Xu, Kai},
  journal={ACM Transactions on Graphics},
  volume={45},
  number={1},
  pages={1--19},
  year={2025},
  publisher={ACM New York, NY}
}

RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction
