Single Point, Full Mask: Velocity-Guided Level Set Evolution for End-to-End Amodal Segmentation

Zhixuan Li, Yujia Liu, Chen Hui, Weisi Lin

1Nanyang Technological University, Singapore
2School of Computer Science, Peking University, Beijing, China
3National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, China
4National Engineering Research Center of Visual Technology, Peking University, Beijing, China
5Nanjing University of Information Science and Technology, China

*Corresponding author.

Preprint

Abstract

Amodal segmentation aims to recover complete object shapes, including occluded regions with no visible appearance, whereas conventional segmentation focuses solely on visible areas. Existing methods typically rely on strong prompts, such as visible masks or bounding boxes, which are costly or impractical to obtain in real-world settings. While recent approaches such as the Segment Anything Model (SAM) support point-based prompts, they often perform direct mask regression without explicitly modeling shape evolution, which limits generalization in complex occlusion scenarios. Moreover, most existing methods behave as black boxes, lacking geometric interpretability and offering little insight into how occluded shapes are inferred. To address these limitations, we propose VELA, an end-to-end VElocity-driven Level-set Amodal segmentation method that performs explicit contour evolution from point-based prompts. VELA first constructs an initial level set function from image features and the point input, then progressively evolves it into the final amodal mask under the guidance of a shape-specific motion field predicted by a fully differentiable network. This network learns to generate the evolution dynamics at each step, enabling geometrically grounded and topologically flexible contour modeling. Extensive experiments on the COCOA-cls, D2SA, and KINS benchmarks demonstrate that VELA outperforms existing strongly prompted methods while requiring only a single-point prompt, validating the effectiveness of interpretable geometric modeling under weak guidance.
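For readers less familiar with level set methods, the shape evolution referenced above follows the classical formulation (a sketch in our own notation, not an equation quoted from the paper): the object contour is represented implicitly as the zero level set of a function $\phi$, which is advected by a scalar speed $V^n$ acting along the contour normal,

$$\frac{\partial \phi}{\partial t} = V^{n}\,\lvert\nabla\phi\rvert, \qquad \phi_{i+1} = \phi_{i} + \Delta t\, V^{n}_{i}\,\lvert\nabla\phi_{i}\rvert,$$

where the second expression is the explicit Euler discretization that a step-wise scheme such as VELA's would use; as described in the method section below, the paper's update additionally regularizes each $\phi_{i+1}$.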

Motivation

Comparison of amodal segmentation paradigms. (a) Conventional methods rely on strong prompts such as bounding boxes or visible masks. (b) Recent approaches support point prompts but directly regress masks without explicit shape reasoning, resulting in limited robustness under complex occlusions. (c) VELA introduces explicit shape evolution via level set modeling, effectively exploiting a single-point prompt for interpretable amodal segmentation.

The Proposed VELA Method

Overall architecture of the proposed VELA. (a) Level Set Function (LSF) Initialization: Given an input image $I$, the Vision Backbone extracts global image features, while the Prompt Encoder processes the point-level prompt $P$ to generate point embeddings $e_P$ that encode coarse localization of the target object. The Initial LSF Decoder then predicts the initial level set function $\phi_0$ using both the image features and point embeddings. (b) Velocity-Guided Shape Evolution: Starting from $\phi_0$, the shape evolves over $T$ steps. At each step $i$, the current level set function $\phi_i$ is fed into the Shape-Specific Velocity Generator to dynamically produce a velocity field $V^n_i$, which guides the contour’s deformation towards the final shape. The updated $\phi_{i+1}$ is then regularized and used for the next step. After $T$ steps, the final level set function $\phi_T$ is thresholded to obtain the predicted amodal mask $\hat{M}_a$.
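To make the evolution loop in (b) concrete, below is a minimal PyTorch sketch of velocity-guided level set evolution. All names (VelocityGenerator, evolve, dt) are hypothetical stand-ins: the tiny conv net is a placeholder for VELA's Shape-Specific Velocity Generator, and the average-pooling smoothing merely stands in for the paper's regularization of $\phi_{i+1}$. This illustrates the update rule, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VelocityGenerator(nn.Module):
    """Toy stand-in for VELA's Shape-Specific Velocity Generator (hypothetical)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, phi):
        # Predict a scalar normal velocity V^n_i at every pixel from the current LSF.
        return self.net(phi)

def evolve(phi0, vel_gen, steps=10, dt=0.1):
    """Run T explicit-Euler level set updates: phi <- phi + dt * V^n * |grad phi|."""
    phi = phi0
    for _ in range(steps):
        v = vel_gen(phi)                                 # shape-specific velocity field
        gy, gx = torch.gradient(phi, dim=(2, 3))         # spatial gradients of the LSF
        grad_mag = torch.sqrt(gx**2 + gy**2 + 1e-8)      # |grad phi|
        phi = phi + dt * v * grad_mag                    # advect along the contour normal
        phi = F.avg_pool2d(phi, 3, stride=1, padding=1)  # mild smoothing as a placeholder regularizer
    return phi

# phi0 would come from VELA's Initial LSF Decoder; random here for illustration.
phi0 = torch.randn(1, 1, 64, 64)
phi_T = evolve(phi0, VelocityGenerator())
mask = (phi_T > 0).float()  # threshold the final LSF (inside-positive convention assumed)

In the actual method, the velocity generator would also condition on the image features and the point embedding $e_P$, and the regularization applied to $\phi_{i+1}$ is presumably more principled than the smoothing used here.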

Qualitative Results

Visualization results of VELA. The green dot denotes the input point prompt. Best viewed in color.

BibTeX


          @article{li2025vela,
                title={Single Point, Full Mask: Velocity-Guided Level Set Evolution for End-to-End Amodal Segmentation},
                author={Li, Zhixuan and Liu, Yujia and Hui, Chen and Lin, Weisi},
                journal={arXiv preprint arXiv:2508.01661},
                year={2025}
          }