Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

1Nanyang Technological University, Singapore,
2Department of Electrical and Electronic Engineering, Yonsei University, Korea

*Corresponding author.

Preprint

We introduce AURA, a multi-modal approach designed to reason about amodal segmentation masks, covering both visible and occluded regions, based on the user's question. AURA can deduce the implicit intent behind the question and respond with a textual answer along with predicted amodal masks for various objects.

Abstract

Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset.
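
To make the proposed task concrete, the short Python snippet below sketches what a single amodal reasoning segmentation sample could look like: an image, an implicit question, a textual answer containing a [SEG] token, and the paired visible and amodal masks. The field names and file names are illustrative assumptions, not the released dataset schema.

    # Hypothetical layout of one amodal reasoning segmentation sample.
    # Field names and file names are illustrative, not the actual dataset schema.
    sample = {
        "image": "kitchen_001.jpg",
        "question": "I want to sit down for dinner; which object should I use, "
                    "even though it is partly hidden?",
        "answer": "You should use the chair behind the table, segmented as [SEG].",
        "visible_mask": "kitchen_001_chair_visible.png",  # visible region only
        "amodal_mask": "kitchen_001_chair_amodal.png",    # full shape, occlusion included
    }

    # A model for this task is expected to return the textual answer plus one
    # amodal mask (and a visible mask) for every [SEG] token in the answer.
    for field, value in sample.items():
        print(f"{field}: {value}")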

The Proposed AURA Approach


Overall architecture of the proposed AURA. (a) Given an input image and a question from the user, the Vision Backbone extracts visual features of the image, and the Multi-Modal LLM, equipped with LoRA, jointly understands the image and the textual question and responds with a textual answer containing [SEG] tokens that indicate segmentation masks. (b) For each [SEG] token, the Prompt Encoder takes its embedding from the Multi-Modal LLM and outputs a refined embedding for that token. (c) Finally, the Visible Mask Decoder predicts the visible mask from each [SEG] token's refined embedding, while the Amodal Decoder predicts the amodal mask from the occlusion-aware embedding produced by the Occlusion Condition Encoder. A Spatial Occlusion Encoder constrains the spatial occlusion information of the predicted visible and amodal segmentation masks to remain accurate.
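
The caption above describes the full data flow; the minimal PyTorch sketch below illustrates how such a pipeline could be wired together. All module names (AuraSketch, prompt_encoder, occlusion_condition_encoder, the two decoders), layer choices, and tensor shapes are simplifying assumptions for illustration, not the authors' implementation; in particular, the Vision Backbone and Multi-Modal LLM are replaced by stand-ins that only mimic their output shapes.

    import torch
    import torch.nn as nn


    class AuraSketch(nn.Module):
        """Toy stand-in for the AURA pipeline in the caption; shapes are illustrative."""

        def __init__(self, vis_dim=256, llm_dim=4096, emb_dim=256):
            super().__init__()
            # (a) Vision Backbone stand-in: a single patchify convolution.
            self.vision_backbone = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)
            # (b) Prompt Encoder: refines each [SEG] token embedding from the LLM.
            self.prompt_encoder = nn.Sequential(
                nn.Linear(llm_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim)
            )
            # Occlusion Condition Encoder: produces an occlusion-aware embedding
            # that conditions the amodal prediction.
            self.occlusion_condition_encoder = nn.Linear(emb_dim, emb_dim)
            # (c) Visible / Amodal decoders: project visual features, then take a
            # dot product with the refined / occlusion-aware embeddings.
            self.visible_decoder = nn.Conv2d(vis_dim, emb_dim, kernel_size=1)
            self.amodal_decoder = nn.Conv2d(vis_dim, emb_dim, kernel_size=1)

        def forward(self, image, seg_token_embeds):
            # image: (B, 3, H, W); seg_token_embeds: (B, N, llm_dim), one per [SEG].
            feats = self.vision_backbone(image)                     # (B, C, h, w)
            refined = self.prompt_encoder(seg_token_embeds)         # (B, N, E)
            occ_embeds = self.occlusion_condition_encoder(refined)  # (B, N, E)

            vis_feats = self.visible_decoder(feats)                 # (B, E, h, w)
            amo_feats = self.amodal_decoder(feats)                  # (B, E, h, w)

            # One visible and one amodal mask logit map per [SEG] token.
            visible_masks = torch.einsum("bne,behw->bnhw", refined, vis_feats)
            amodal_masks = torch.einsum("bne,behw->bnhw", occ_embeds, amo_feats)
            return visible_masks, amodal_masks


    if __name__ == "__main__":
        model = AuraSketch()
        image = torch.randn(1, 3, 224, 224)
        seg_tokens = torch.randn(1, 2, 4096)   # two [SEG] tokens from the LLM
        vis, amo = model(image, seg_tokens)
        print(vis.shape, amo.shape)            # torch.Size([1, 2, 14, 14]) each

In the actual model, the [SEG] embeddings come from the LoRA-tuned Multi-Modal LLM's output, and the Spatial Occlusion Encoder (omitted above) additionally constrains the predicted visible and amodal masks to respect the spatial occlusion relations.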

Experimental Results Visualization


Visualization results of AURA. Three cases are shown from top to bottom; GT denotes the ground truth. The first two cases show the input images, the ground-truth amodal and visible masks, and the masks predicted by AURA. In the last case, AURA's predicted amodal mask for the occludee (the occluded toilet) is shown in place of the ground-truth visible mask.

BibTeX


    @article{li2025aura,
        title={Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA},
        author={Li, Zhixuan and Yoon, Hyunse and Lee, Sanghoon and Lin, Weisi},
        journal={arXiv preprint arXiv:2503.10225},
        year={2025}
    }