Abstract
Hunting for blocked objects is difficult in crowded scenes because occlusions are frequent. Building an effective occlusion detector remains challenging for two inherent reasons: (1) the limited feature extraction capacity of encoders and (2) the loss of highly overlapped objects in decoders. To address both issues, we propose OBhunter, a spectral-angular ensemble-based Transformer network. In OBhunter, an encoder with robust feature extraction is built on the ensemble spectral-angular self-attention (ESA) mechanism, which extends the original softmax-based attention to the spectral characteristic dimension. To tackle the second issue, we split the decoder into two branches using our crowded region generator (CRG). The two branches are processed differently by the ensemble spectral-angular region (ESR) loss, a multi-task training loss function, to prevent erroneous suppression of proposal boxes. Extensive experiments on the CrowdHuman, CityPersons, and Caltech-Pedestrians datasets demonstrate that OBhunter is effective for occlusion detection, achieving an MR⁻² of 33.52%. Additionally, we validate the robustness of OBhunter on a less crowded dataset, MS-COCO.
• The potential of Transformers for occlusion detection is demonstrated.
• The ESA mechanism extends attention to the spectral characteristic dimension.
• The CRG splits the decoder into two branches that handle crowded and sparse ROIs separately.
• The ESR loss drives the two branches separately so that they yield complementary outputs.
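The abstract reports its headline result as MR⁻² without defining it. In pedestrian detection benchmarks this usually denotes the log-average miss rate: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space over [10⁻², 10⁰], where lower is better. The sketch below illustrates that convention only. It is not the authors' evaluation code; the function name `log_average_miss_rate` and its inputs are hypothetical, and it assumes the miss-rate curve is already given as FPPI values sorted in ascending order.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    Assumption: `fppi` is sorted ascending and `miss_rate[i]` is the miss
    rate observed at `fppi[i]`. Both names are illustrative.
    """
    # Reference FPPI values evenly spaced in log space over [1e-2, 1e0].
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    samples = []
    for ref in refs:
        # Take the miss rate at the largest FPPI that does not exceed the
        # reference; if the curve never gets that low, count a miss rate of 1.
        mr_at_ref = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr_at_ref = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        samples.append(max(mr_at_ref, 1e-10))

    # Averaging in log space and exponentiating gives the geometric mean.
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# Toy curve (made-up numbers, purely for illustration).
curve_fppi = [0.001, 0.05, 0.5]
curve_mr = [0.9, 0.6, 0.3]
print(f"MR^-2 = {100 * log_average_miss_rate(curve_fppi, curve_mr):.2f}%")
```

Averaging in log space keeps the low-FPPI end of the curve from being swamped by the high-FPPI end, which is why a single percentage can summarise the whole operating range.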