BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Yuqing Lan1, Chenyang Zhu1, Zhirui Gao1, Jiazhao Zhang3, Yihan Cao1, Renjiao Yi1, Yijie Wang1,†, Kai Xu1,2,†
1National University of Defense Technology    2Xiangjiang Laboratory     3Peking University      
† Corresponding Authors

Real-world Online Demo

Contributions

  • We propose a novel reconstruction-free paradigm for online open-vocabulary 3D object detection, which models structural object layouts with desirable running and memory efficiency.
  • We propose a multi-view box fusion technique based on particle filtering using random optimization, enabling real-time and consistent 3D bounding box detection.
  • We have implemented an efficient system of online open-vocabulary 3D object detection. Extensive experiments validate the superior performance and robustness of our method in various challenging scenarios.

Abstract

Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D bounding box detection, coupled with CLIP to capture open-vocabulary semantics of the detected objects. To fuse the bounding boxes detected across different views into a unified result, we employ an association module that establishes correspondences across views and an optimization module that fuses the 3D bounding boxes of the same instance predicted in multiple views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method generalizes well to various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
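The association step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes axis-aligned boxes in a hypothetical `(xmin, ymin, zmin, xmax, ymax, zmax)` format, whereas the actual boxes may be oriented, and the function names `iou_3d` and `associate` as well as the threshold `iou_thresh=0.3` are illustrative assumptions. Each new proposal is matched to the global box it overlaps most, or left unmatched so that it can start a new object.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical box format (an assumption of this sketch):
# axis-aligned (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned box."""
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Volume intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def associate(
    global_boxes: Sequence[Box],
    proposals: Sequence[Box],
    iou_thresh: float = 0.3,  # illustrative value, not from the paper
) -> List[Optional[int]]:
    """For each proposal, return the index of the best-overlapping
    global box, or None if no overlap reaches iou_thresh."""
    matches: List[Optional[int]] = []
    for p in proposals:
        best_idx, best_iou = None, iou_thresh
        for i, g in enumerate(global_boxes):
            overlap = iou_3d(g, p)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        matches.append(best_idx)
    return matches


# Two objects already in the global map, two proposals from a new keyframe.
global_boxes = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0), (5.0, 5.0, 5.0, 6.0, 6.0, 6.0)]
proposals = [(0.1, 0.0, 0.0, 1.1, 1.0, 1.0), (10.0, 10.0, 10.0, 11.0, 11.0, 11.0)]
matches = associate(global_boxes, proposals)
```

In this toy example the first proposal overlaps the first global box and is associated with it, while the second proposal overlaps nothing and would be added as a new object.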

Method Overview

Method Overview of BoxFusion. (a) Given online RGB-D images with camera poses, we use Cubify Anything to generate bounding box proposals for each keyframe, then project them onto the image planes to obtain image crops, from which CLIP extracts open-vocabulary semantics. (b) We employ an association module that performs 3D NMS between the global boxes and the proposals from the new keyframe, removing redundant boxes and associating those belonging to the same object. (c) Given the associated candidate boxes of an object, we adopt random optimization based on particle filtering to fuse them into a single multi-view consistent box, using the IoU of the convex hulls of the projected box corners. In this way, our method can efficiently detect fine-grained objects in real time without reconstruction.
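Step (c) can be sketched as a particle-filter-style random search. This is a simplified illustration under stated assumptions, not the authors' algorithm: it scores particles by the mean axis-aligned 3D IoU with the candidate boxes instead of the IoU of convex hulls of projected box corners, and the box format, the function names `score` and `fuse_boxes`, and all hyperparameters (`n_particles`, `n_iters`, `sigma`, the 0.8 decay) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical box format: axis-aligned (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, ...]


def iou_3d(a: Box, b: Box) -> float:
    """Volume IoU of two axis-aligned boxes; inverted boxes score 0."""
    inter = 1.0
    for axis in range(3):
        lo, hi = max(a[axis], b[axis]), min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def score(box: Box, candidates: Sequence[Box]) -> float:
    """Mean IoU of a box with all candidates (stand-in for the paper's
    projected convex-hull IoU)."""
    return sum(iou_3d(box, c) for c in candidates) / len(candidates)


def fuse_boxes(
    candidates: Sequence[Box],
    n_particles: int = 64,
    n_iters: int = 20,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[Box, float]:
    """Fuse candidate boxes of one object into a single box."""
    rng = random.Random(seed)
    # Initialise particles on the candidates themselves (cycled).
    particles: List[Box] = [
        tuple(candidates[i % len(candidates)]) for i in range(n_particles)
    ]
    best: Box = tuple(candidates[0])
    best_score = score(best, candidates)
    for _ in range(n_iters):
        # Perturb every particle with Gaussian noise on all six parameters.
        particles = [tuple(v + rng.gauss(0.0, sigma) for v in p) for p in particles]
        weights = [score(p, candidates) for p in particles]
        top = max(range(n_particles), key=weights.__getitem__)
        if weights[top] > best_score:
            best, best_score = particles[top], weights[top]
        # Resample particles in proportion to their weights.
        if sum(weights) > 0.0:
            particles = rng.choices(particles, weights=weights, k=n_particles)
        else:
            particles = [best] * n_particles
        sigma *= 0.8  # shrink the search radius over time
    return best, best_score


# Three slightly shifted single-view estimates of the same object.
candidates = [
    (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
    (0.1, 0.0, 0.0, 1.1, 1.0, 1.0),
    (-0.1, 0.0, 0.0, 0.9, 1.0, 1.0),
]
fused_box, fused_score = fuse_boxes(candidates)
```

Because the best particle is replaced only when a strictly higher score is found, the fused box never scores worse than the first candidate it starts from. A fixed seed keeps the sketch reproducible.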

Experimental Results

Visual Comparisons

[Side-by-side image comparisons, two sets: Ours vs. OnlineAnySeg, Ours vs. SpatialLM, Ours vs. FCAF]

Note: Our method does not rely on dense reconstruction; the mesh is shown here only for visualization.

Online Demos on Public Datasets

BibTeX

@article{lan2025boxfusion,
  title={BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion},
  author={Lan, Yuqing and Zhu, Chenyang and Gao, Zhirui and Zhang, Jiazhao and Cao, Yihan and Yi, Renjiao and Wang, Yijie and Xu, Kai},
  journal={arXiv preprint arXiv:2506.15610},
  year={2025}
}

Acknowledgements

We are deeply grateful for the invaluable support provided by Yuanhong Yu and Sida Peng from the State Key Lab of CAD & CG, Zhejiang University.