2024 (English). In: Document Analysis and Recognition, ICDAR 2024: 18th International Conference, Athens, Greece, August 30 – September 4, 2024, Proceedings, Part III / [ed] Elisa H. Barney Smith; Marcus Liwicki; Liangrui Peng, Springer Science and Business Media Deutschland GmbH, 2024, Vol. 3, p. 23-38. Conference paper, Published paper (Refereed)
Abstract [en]
This paper introduces a new OCR dataset for historical handwritten Ethiopic script, characterized by a unique syllabic writing system, low-resource availability, and complex orthographic diacritics. The dataset consists of roughly 80,000 annotated text-line images from 1700 pages of 18th to 20th century documents, including a training set with text-line images from the 19th to 20th century and two test sets. One is distributed similarly to the training set with nearly 6,000 text-line images, and the other contains only images from the 18th century manuscripts, with around 16,000 images. The former test set allows us to check baseline performance in the classical IID setting (Independently and Identically Distributed), while the latter addresses a more realistic setting in which the test set is drawn from a different distribution than the training set (Out-Of-Distribution or OOD). Multiple annotators labeled all text-line images for the HHD-Ethiopic dataset, and an expert supervisor double-checked them. We assessed human-level recognition performance and compared it with state-of-the-art (SOTA) OCR models using the Character Error Rate (CER) and Normalized Edit Distance (NED) metrics. Our results show that the model performed comparably to human-level recognition on the 18th century test set and outperformed humans on the IID test set. However, the unique challenges posed by the Ethiopic script, such as detecting complex diacritics, still present difficulties for the models. Our baseline evaluation and dataset will encourage further research on Ethiopic script recognition. The dataset and source code can be accessed at https://github.com/bdu-birhanu/HHD-Ethiopic.
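As an illustration of the two metrics named in the abstract, the following is a minimal sketch of how CER and NED are commonly computed from the Levenshtein edit distance. The abstract does not state the exact normalization used in the paper, so the definitions here are assumptions: CER as total edits divided by total reference characters, and NED as the per-line edit distance divided by the longer of the two strings, averaged over lines. The function names `levenshtein`, `cer`, and `ned` are hypothetical and are not taken from the HHD-Ethiopic repository.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed CER: total edits over total reference characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def ned(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed NED: mean of per-line edit distance over the longer length."""
    scores = []
    for r, h in zip(refs, hyps):
        longest = max(len(r), len(h))
        scores.append(levenshtein(r, h) / longest if longest else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because each Ethiopic syllable is encoded as a single Unicode code point, Python's `len` and per-character comparison operate at the syllable level here; a diacritic confusion between two syllables of the same consonant family counts as one substitution.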
Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14806
Keywords
Historical Ethiopic script, Human-level recognition performance, HHD-Ethiopic, Normalized edit distance, Text recognition
National Category
Computer Sciences; Computer Vision and Robotics (Autonomous Systems)
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-110171 (URN); 10.1007/978-3-031-70543-4_2 (DOI); 001336394400002 (); 2-s2.0-85204650159 (Scopus ID)
Conference
18th International Conference on Document Analysis and Recognition (ICDAR 2024), Athens, Greece, August 30–September 4, 2024
Funder
EU, Horizon 2020, 952215
Note
Funder: ANR Chair of Artificial Intelligence HUMANIA (ANR-19-CHIA-0022); ChaLearn; ICT4D Research Center of Bahir Dar Institute of Technology
ISBN for host publication: 978-3-031-70542-7, 978-3-031-70543-4
2024-10-02; 2024-10-02; 2024-12-12; Bibliographically approved