Attention-Guided Disentangled Feature Aggregation for Video Object Detection
Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany; Mindgarage, Technical University of Kaiserslautern, Kaiserslautern, Germany. ORCID iD: 0000-0001-7942-4698
Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany; Mindgarage, Technical University of Kaiserslautern, Kaiserslautern, Germany; German Research Institute for Artificial Intelligence (DFKI), Kaiserslautern, Germany. ORCID iD: 0000-0003-0456-6493
German Research Institute for Artificial Intelligence (DFKI), Kaiserslautern, Germany.
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0003-4029-6574
2022 (English). In: Sensors, E-ISSN 1424-8220, Vol. 22, no 21, article id 8583. Article in journal (Refereed), Published
Abstract [en]

Object detection is a computer vision task that involves localising and classifying objects in an image. Video data introduces additional challenges, such as blur, occlusion and defocus, which make video object detection harder than still-image object detection performed on individual, independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale, spatial and task-aware attention and applies it to the features extracted by the backbone network across all frames. The aggregation head then incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames, comprising the features extracted by the disentanglement network along with the temporal features. We evaluate the proposed framework on the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 with ResNet-50 and ResNet-101 backbones, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach.
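The abstract only sketches the aggregation head at a high level. The Python (PyTorch) snippet below is a rough, hypothetical illustration of how temporal attention-guided aggregation of support-frame features into a target frame could be implemented; it is not the authors' code. The class name TemporalAggregationHead, the feature dimension of 256, the use of per-proposal feature vectors and the residual fusion are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAggregationHead(nn.Module):
    """Illustrative sketch (assumption, not the paper's implementation):
    target-frame proposal features attend over support-frame features,
    and the attended temporal context is fused back residually."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # queries from the target frame
        self.k_proj = nn.Linear(dim, dim)  # keys from the support frames
        self.v_proj = nn.Linear(dim, dim)  # values from the support frames
        self.scale = dim ** -0.5

    def forward(self, target_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # target_feats:  (N_t, dim) proposal features of the target frame
        # support_feats: (N_s, dim) proposal features gathered from support frames
        q = self.q_proj(target_feats)
        k = self.k_proj(support_feats)
        v = self.v_proj(support_feats)
        attn = F.softmax(q @ k.t() * self.scale, dim=-1)  # (N_t, N_s) temporal attention
        aggregated = attn @ v                              # (N_t, dim) aggregated context
        return target_feats + aggregated                   # residual fusion with the target features

# Usage example with random tensors standing in for backbone outputs.
head = TemporalAggregationHead(dim=256)
target = torch.randn(100, 256)    # 100 proposals in the target frame
support = torch.randn(300, 256)   # proposals pooled from several support frames
enhanced = head(target, support)  # (100, 256) temporally enriched features

The last four lines show a usage example; in a full detector such a head would sit after the region-of-interest feature extraction and feed the enhanced features to the classification and regression branches.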

Place, publisher, year, edition, pages
MDPI, 2022. Vol. 22, no 21, article id 8583
Keywords [en]
attention, computer vision, deep learning, object detection, video object detection
National Category
Computer graphics and computer vision; Computer Sciences
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-94179
DOI: 10.3390/s22218583
ISI: 000884031900001
PubMedID: 36366281
Scopus ID: 2-s2.0-85141621212
OAI: oai:DiVA.org:ltu-94179
DiVA, id: diva2:1712745
Funder
EU, Horizon 2020, 883293 INFINITY
Note

Validated; 2022; Level 2; 2022-11-22 (hanlid)

Available from: 2022-11-22. Created: 2022-11-22. Last updated: 2025-02-01. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text, PubMed, Scopus

Authority records

Liwicki, Marcus

Search in DiVA

By author/editor
Muralidhara, Shishir; Hashmi, Khurram Azeem; Liwicki, Marcus; Afzal, Muhammad Zeshan
By organisation
Embedded Internet Systems Lab
In the same journal
Sensors
Computer graphics and computer vision; Computer Sciences
