MUVA: A New Large-Scale Benchmark for Multi-view Amodal Instance Segmentation in the Shopping Scenario

1 National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing 100871, China
2 AiFi Inc., California 94010, United States
3 Beijing Academy of Artificial Intelligence, Beijing 100084, China
*Corresponding author.


International Conference on Computer Vision (ICCV), 2023

Abstract

Amodal Instance Segmentation (AIS) endeavors to accurately deduce complete object shapes that are partially or fully occluded. However, the inherent ill-posed nature of single-view datasets poses challenges in determining occluded shapes. A multi-view framework may help alleviate this problem, as humans often adjust their perspective when encountering occluded objects. At present, this approach has not yet been explored by existing methods and datasets. To bridge this gap, we propose a new task called Multi-view Amodal Instance Segmentation (MAIS) and introduce the MUVA dataset, the first MUlti-View AIS dataset that takes the shopping scenario as instantiation. MUVA provides comprehensive annotations, including multi-view amodal/visible segmentation masks, 3D models, and depth maps, making it the largest image-level AIS dataset in terms of both the number of images and instances. Additionally, we propose a new method for aggregating representative features across different instances and views, which demonstrates promising results in accurately predicting occluded objects from one viewpoint by leveraging information from other viewpoints. Besides, we also demonstrate that MUVA can benefit the AIS task in real-world scenarios.

Targeting the ill-posed problem in the AIS task

Fig. 1: Comparison of the impact of ill-posed problems on amodal prediction in single-view and multi-view input settings. (a) In single-view input, ambiguity arises due to multiple candidates for the occluded object. (b) Multi-view input helps alleviate ambiguity and improves amodal prediction accuracy.

Dataset Generation Pipeline

Fig. 2: The pipeline of dataset generation. (a) For each object, 2D images are captured from six directions: up, down, left, right, front, and back. 3D artists then use the collected images to reconstruct the 3D models. (b) For each scene, 3D models are randomly selected and placed in varying numbers and orientations. (c) For each scene, six views are used to capture the data, including the RGB images and various annotations.
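
To illustrate how the amodal and visible masks provided for each view relate to one another, the following is a minimal sketch that computes a per-instance occlusion rate. It assumes masks are available as boolean NumPy arrays of equal shape; the function name `occlusion_rate` and the array layout are hypothetical and do not describe MUVA's actual file format or tooling.

```python
import numpy as np


def occlusion_rate(amodal_mask, visible_mask):
    """Fraction of an instance's amodal (full) shape that is hidden.

    Hypothetical helper: both masks are assumed to be boolean arrays
    of the same shape covering a single instance in a single view.
    """
    amodal = np.asarray(amodal_mask, dtype=bool)
    # The visible region is by definition a subset of the amodal region.
    visible = np.asarray(visible_mask, dtype=bool) & amodal
    amodal_area = amodal.sum()
    if amodal_area == 0:
        # Instance is not present in this view at all.
        return 0.0
    return float(1.0 - visible.sum() / amodal_area)


# Toy example: an 8-pixel object whose right half is occluded.
amodal = np.zeros((4, 4), dtype=bool)
amodal[1:3, 0:4] = True
visible = np.zeros((4, 4), dtype=bool)
visible[1:3, 0:2] = True
rate = occlusion_rate(amodal, visible)
```

In a multi-view setting, the same instance would typically have a different occlusion rate in each view, which is the property a multi-view method can exploit.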

Datasets Comparison

Tab. 1: Comparison with existing amodal instance segmentation datasets. # means the number of this item. Bold numbers denote the largest one in each column among image-level datasets.

Segmentation Results Comparison

Fig. 3: Visual comparison between BCNet and our method on MUVA, trained with one viewpoint (1V) and six viewpoints (6V). For masks, different colors denote different instances, and the same instance seen from different angles has the same color. Red circles mark regions that deserve attention. Zoom in for a better view. In the first and third columns, even when a bottle is severely occluded, our method predicts it accurately. Moreover, our method trained with six views performs better than when trained with a single view.

BibTeX


        @inproceedings{li2023muva,
            author={Li, Zhixuan and Ye, Weining and Terven, Juan and Bennett, Zachary and Zheng, Ying and Jiang, Tingting and Huang, Tiejun},
            title={{MUVA}: A New Large-Scale Benchmark for Multi-view Amodal Instance Segmentation in the Shopping Scenario},
            booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
            pages={23504--23513},
            year={2023}
        }