Enhancing CRNN HTR Architectures with Transformer Blocks
2024 (English). In: Document Analysis and Recognition, ICDAR 2024: 18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part IV / [ed] Elisa H. Barney Smith; Marcus Liwicki; Liangrui Peng, Springer Science and Business Media Deutschland GmbH, 2024, Vol. 4, p. 425-440.
Conference paper, Published paper (Refereed)
Abstract [en]
Handwritten Text Recognition (HTR) is a challenging problem that plays an essential role in digitizing and interpreting diverse handwritten documents. While traditional approaches primarily utilize CNN-RNN (CRNN) architectures, recent advancements based on Transformer architectures have demonstrated impressive results in HTR. However, these Transformer-based systems often involve high-parameter configurations and rely extensively on synthetic data. Moreover, they lack focus on efficiently integrating the ability of Transformer modules to grasp contextual relationships within the data. In this paper, we explore a lightweight integration of Transformer modules into existing CRNN frameworks to address the complexities of HTR, aiming to enhance the context of the sequential nature of the task. We present a hybrid CNN image encoder with intermediate MobileViT blocks that effectively combines the different components in a resource-efficient manner. Through extensive experiments and ablation studies, we refine the integration of these modules and demonstrate that our proposed model enhances HTR performance. Our results on the line-level IAM and RIMES datasets suggest that our proposed method achieves competitive performance with significantly fewer parameters and without integrating synthetic data compared to existing systems.
Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024. Vol. 4, p. 425-440
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349
Keywords [en]
Handwriting Text Recognition, Transformer Modules
National Category
Computer Sciences; Computer graphics and computer vision
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-110168
DOI: 10.1007/978-3-031-70546-5_25
ISI: 001336396200025
Scopus ID: 2-s2.0-85204589724
OAI: oai:DiVA.org:ltu-110168
DiVA, id: diva2:1901965
Conference
18th International Conference on Document Analysis and Recognition (ICDAR 2024), Athens, Greece, August 30–September 4, 2024
Note
ISBN for host publication: 978-3-031-70545-8, 978-3-031-70546-5
2024-09-30; 2024-09-30; 2025-10-21. Bibliographically approved