RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding


Teaser figure: qualitative examples of novel category segmentation (backpack, keyboard, pillow, ladder), novel category heatmaps (paper, keyboard, shoes, trash can), and grounded 3D reasoning.

Abstract

TL;DR: We propose a lightweight and scalable regional point-language contrastive learning framework for open-world 3D scene understanding.

We propose a lightweight and scalable Regional Point-Language Contrastive learning framework, namely RegionPLC, for open-world 3D scene understanding, aiming to identify and recognize open-set objects and categories. Specifically, based on our empirical studies, we introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models, yielding high-quality, dense region-level language descriptions without human 3D annotations. Subsequently, we devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning from dense regional language supervision. We carry out extensive experiments on ScanNet, ScanNet200, and nuScenes datasets, and our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation, respectively, while offering greater scalability and lower resource demands. Furthermore, our method can be seamlessly integrated with language models to enable open-ended grounded 3D reasoning without extra task-specific training.

Approach

Regional 3D-Language Association

We begin by conducting a comprehensive examination of various 2D foundation models (e.g., image captioning, dense captioning, and detection models) along with visual prompting techniques, assessing their capability to generate region-level 3D-language pairs. Based on our examination, we propose a 3D-aware supplementary-oriented fusion strategy to alleviate ambiguities and conflicts encountered when combining paired 3D-language data from multiple 2D models, ultimately delivering high-quality dense region-level 3D-language pairs.
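
To make the association step concrete, below is a minimal Python sketch of the supplementary-oriented fusion idea: 2D region masks are lifted into 3D using depth and camera pose, and caption pairs whose 3D footprints overlap heavily are merged while spatially complementary ones are kept as separate supervision. Function names such as back_project and sfusion, the voxel-based overlap measure, and the threshold value are illustrative assumptions, not the released RegionPLC implementation.

import numpy as np

def back_project(mask_2d, depth, intrinsics, pose):
    """Lift the pixels of a 2D region mask into world-space 3D points."""
    v, u = np.nonzero(mask_2d)
    z = depth[v, u]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    fx, fy, cx, cy = intrinsics
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ pose.T)[:, :3]  # camera frame -> world frame

def overlap(a, b, voxel=0.05):
    """Approximate overlap of two 3D point sets via shared voxels."""
    va = {tuple(p) for p in np.floor(a / voxel).astype(int)}
    vb = {tuple(p) for p in np.floor(b / voxel).astype(int)}
    return len(va & vb) / max(min(len(va), len(vb)), 1)

def sfusion(pairs, thresh=0.5):
    """Fuse region-level 3D-language pairs from multiple 2D models.

    Pairs whose 3D footprints barely overlap are kept as complementary
    supervision; heavily overlapping pairs are merged so that different
    2D models do not produce conflicting descriptions of one region.
    """
    fused = []
    for pts, caption in pairs:
        for entry in fused:
            if overlap(pts, entry["pts"]) > thresh:
                entry["pts"] = np.concatenate([entry["pts"], pts])
                entry["captions"].append(caption)
                break
        else:
            fused.append({"pts": pts, "captions": [caption]})
    return [(e["pts"], ". ".join(e["captions"])) for e in fused]

Merging rather than discarding overlapping captions is what makes the fusion supplementary-oriented: descriptions produced by different 2D models complement one another instead of competing for the same 3D region.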

Point-discriminative Region-aware Contrastive Learning

We introduce a region-aware point-discriminative contrastive loss that prevents the optimization of point-wise embeddings from being disturbed by nearby points from unrelated semantic categories, enhancing the discriminativeness of learned point-wise embeddings. The region-aware design further normalizes the contribution of multiple region-level 3D-language pairs, regardless of their region sizes, making feature learning more robust.
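
The objective can be sketched in a few lines of PyTorch. Every point is contrasted against all region captions (point-discriminative), and per-point losses are averaged within each region before averaging across regions (region-aware), so large regions do not dominate small ones. Variable names, the temperature value, and the use of cross-entropy over cosine logits are assumptions for illustration rather than the exact released loss.

import torch
import torch.nn.functional as F

def region_point_contrastive_loss(point_feats, text_feats, region_ids, tau=0.07):
    """point_feats: (N, D) per-point embeddings from the 3D backbone.
    text_feats:  (R, D) caption embeddings, one per 3D-language region.
    region_ids:  (N,) index of the region each point belongs to (-1 = none).
    """
    mask = region_ids >= 0
    feats = F.normalize(point_feats[mask], dim=-1)
    texts = F.normalize(text_feats, dim=-1)
    targets = region_ids[mask]

    # Point-discriminative: each point is classified against ALL captions,
    # so nearby points from unrelated regions act as negatives instead of
    # being averaged into a shared region embedding.
    logits = feats @ texts.t() / tau                       # (M, R)
    per_point = F.cross_entropy(logits, targets, reduction="none")

    # Region-aware: average within each region first, so every region
    # contributes equally to the loss regardless of its size.
    num_regions = text_feats.shape[0]
    region_loss = torch.zeros(num_regions, device=per_point.device)
    region_loss.scatter_add_(0, targets, per_point)
    counts = torch.bincount(targets, minlength=num_regions).clamp(min=1)
    return (region_loss / counts).mean()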

Quantitative Results

Base-annotated Open-World Segmentation


Annotation-free Open-World Segmentation


Citation

@article{yang2023regionplc,
  title={RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding},
  author={Yang, Jihan and Ding, Runyu and Wang, Zhe and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2304.00962},
  year={2023}
}