OAFormer: Learning Occlusion Distinguishable Feature for Amodal Instance Segmentation

1Advanced Institute of Information Technology, Peking University, Hangzhou, China,
2National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China

*Corresponding author.

ICASSP 2023

Typical cases of the occlusion-confusion problem. (a) An occluded bottle is regarded as unoccluded, resulting in a wrong prediction. (b) An unoccluded chocolate is regarded as occluded and also predicted incorrectly.

Abstract

The Amodal Instance Segmentation (AIS) task aims to infer the complete mask of each occluded instance. Under many circumstances, existing methods treat occluded objects as unoccluded ones, and vice versa, leading to inaccurate predictions. This is because existing AIS methods do not explicitly utilize the occlusion rate of each object as supervision, even though occlusion information is critical for recognizing whether the target objects are occluded. Hence, we believe it is vital for a method to distinguish the degree of occlusion of each instance. In this paper, a simple yet effective occlusion-aware transformer-based model, OAFormer, is proposed for accurate amodal instance segmentation. The goal of OAFormer is to learn occlusion-discriminative features, and novel components are proposed to make OAFormer occlusion distinguishable. We conduct extensive experiments on two challenging AIS datasets to evaluate the effectiveness of our method. OAFormer outperforms state-of-the-art methods by large margins.
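The page does not formalize what an occlusion rate is; a common definition, and the one assumed in the sketch below, is the fraction of an instance's amodal (complete) mask that is hidden by occluders. The helper is a hypothetical illustration of how such a supervision target could be derived from ground-truth masks, not the authors' code.

```python
import numpy as np

def occlusion_rate(amodal_mask: np.ndarray, visible_mask: np.ndarray) -> float:
    """Fraction of the amodal mask that is occluded (assumed definition).

    Both masks are boolean HxW arrays; the visible mask is assumed to be
    a subset of the amodal mask.
    """
    amodal_area = amodal_mask.sum()
    if amodal_area == 0:
        return 0.0
    occluded_area = np.logical_and(amodal_mask, np.logical_not(visible_mask)).sum()
    return float(occluded_area) / float(amodal_area)

# Example: an object whose right half is hidden has an occlusion rate of 0.5.
amodal = np.zeros((8, 8), dtype=bool); amodal[2:6, 2:6] = True
visible = amodal.copy(); visible[:, 4:] = False
print(occlusion_rate(amodal, visible))  # 0.5
```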

The Proposed OAFormer Approach

Overview of the proposed OAFormer. OAFormer takes an image as input. After features are extracted by the Encoder and the Cascaded Global Decoder, the Occlusion Distinguish Module predicts the occlusion rate of each target object and embeds the occlusion information into the attention masks. Finally, the Amodal Decoder takes the occlusion-aware attention masks and queries as input and outputs the predicted amodal masks.
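The layer definitions are not given on this page, so the skeleton below is only a minimal sketch of the data flow described above (Encoder → Cascaded Global Decoder → Occlusion Distinguish Module → Amodal Decoder); all module internals, names, and tensor shapes are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OAFormerSketch(nn.Module):
    """Illustrative data flow only; the four sub-modules are injected
    placeholders standing in for the components named in the overview."""

    def __init__(self, encoder, global_decoder, occlusion_module, amodal_decoder):
        super().__init__()
        self.encoder = encoder                     # backbone producing image features
        self.global_decoder = global_decoder       # Cascaded Global Decoder
        self.occlusion_module = occlusion_module   # Occlusion Distinguish Module
        self.amodal_decoder = amodal_decoder       # Amodal Decoder

    def forward(self, image: torch.Tensor):
        # 1. Extract image features.
        features = self.encoder(image)
        # 2. Refine per-object queries with global context.
        queries = self.global_decoder(features)
        # 3. Predict an occlusion rate per query and fold it into attention masks.
        occ_rates, occ_attn_masks = self.occlusion_module(queries, features)
        # 4. Decode amodal masks from the occlusion-aware attention masks and queries.
        amodal_masks = self.amodal_decoder(queries, features, occ_attn_masks)
        return amodal_masks, occ_rates
```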

Experimental Results

Comparison with state-of-the-art methods on the D2SA and COCOA-cls datasets. For supervision, “bbox” denotes amodal bounding boxes, “mask” denotes amodal masks, and “cls” denotes class labels. For each metric, the best result is shown in bold and the second-best is underlined.

BibTeX


      @inproceedings{li2023oaformer,
          title={{OAFormer}: Learning Occlusion Distinguishable Feature for Amodal Instance Segmentation},
          author={Li, Zhixuan and Shi, Ruohua and Huang, Tiejun and Jiang, Tingting},
          booktitle={International Conference on Acoustics, Speech, and Signal Processing},
          pages={1--5},
          year={2023}
      }