A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-Level Performance
LISN, Université Paris-Saclay, Gif-sur-Yvette, France. ORCID iD: 0009-0000-6709-7773
LISN, Université Paris-Saclay, Gif-sur-Yvette, France; Google Brain, Mountain View, USA; ChaLearn, Berkeley, USA. ORCID iD: 0000-0002-9266-1783
Bahir Dar University, Bahir Dar, Ethiopia. ORCID iD: 0009-0006-7739-2462
Bahir Dar University, Bahir Dar, Ethiopia. ORCID iD: 0009-0007-7360-3231
2024 (English). In: Document Analysis and Recognition, ICDAR 2024: 18th International Conference, Athens, Greece, August 30 – September 4, 2024, Proceedings, Part III / [ed] Elisa H. Barney Smith; Marcus Liwicki; Liangrui Peng, Springer Science and Business Media Deutschland GmbH, 2024, Vol. 3, p. 23-38. Conference paper, Published paper (Refereed)
Abstract [en]

This paper introduces a new OCR dataset for historical handwritten Ethiopic script, characterized by a unique syllabic writing system, low-resource availability, and complex orthographic diacritics. The dataset consists of roughly 80,000 annotated text-line images from 1700 pages of 18th to 20th century documents, including a training set with text-line images from the 19th to 20th century and two test sets. One is distributed similarly to the training set with nearly 6,000 text-line images, and the other contains only images from the 18th century manuscripts, with around 16,000 images. The former test set allows us to check baseline performance in the classical IID setting (Independently and Identically Distributed), while the latter addresses a more realistic setting in which the test set is drawn from a different distribution than the training set (Out-Of-Distribution or OOD). Multiple annotators labeled all text-line images for the HHD-Ethiopic dataset, and an expert supervisor double-checked them. We assessed human-level recognition performance and compared it with state-of-the-art (SOTA) OCR models using the Character Error Rate (CER) and Normalized Edit Distance (NED) metrics. Our results show that the model performed comparably to human-level recognition on the 18th century test set and outperformed humans on the IID test set. However, the unique challenges posed by the Ethiopic script, such as detecting complex diacritics, still present difficulties for the models. Our baseline evaluation and dataset will encourage further research on Ethiopic script recognition. The dataset and source code can be accessed at https://github.com/bdu-birhanu/HHD-Ethiopic.
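The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: CER as the Levenshtein edit distance divided by the reference length, and NED as the edit distance divided by the length of the longer string. The paper's exact normalization may differ, and the function names `levenshtein`, `cer`, and `ned` are hypothetical, not taken from the HHD-Ethiopic code. Comparing Python strings per code point is a reasonable approximation here because Ethiopic syllables are encoded as single Unicode characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def ned(reference: str, hypothesis: str) -> float:
    """Normalized Edit Distance: edit distance over the longer length (assumed definition)."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 0.0
    return levenshtein(reference, hypothesis) / longest


if __name__ == "__main__":
    # One substituted syllable in a three-character Ethiopic text line.
    print(cer("ሰላም", "ሰላት"))
    print(ned("ሰላም", "ሰላት"))
```

Under these definitions NED is bounded by 1, whereas CER can exceed 1 when the hypothesis is much longer than the reference.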

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024. Vol. 3, p. 23-38
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14806
Keywords [en]
Historical Ethiopic script, Human-level recognition performance, HHD-Ethiopic, Normalized edit distance, Text recognition
National Category
Computer Sciences; Computer graphics and computer vision
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-110171
DOI: 10.1007/978-3-031-70543-4_2
ISI: 001336394400002
Scopus ID: 2-s2.0-85204650159
OAI: oai:DiVA.org:ltu-110171
DiVA, id: diva2:1902655
Conference
18th International Conference on Document Analysis and Recognition (ICDAR 2024), Athens, Greece, August 30–September 4, 2024
Funder
EU, Horizon 2020, 952215
Note

Funder: ANR Chair of Artificial Intelligence HUMANIA (ANR-19-CHIA-0022); ChaLearn; ICT4D Research Center of Bahir Dar Institute of Technology

ISBN for host publication: 978-3-031-70542-7, 978-3-031-70543-4

Available from: 2024-10-02. Created: 2024-10-02. Last updated: 2025-02-01. Bibliographically approved.

Authors
Belay, Birhanu Hailu; Guyon, Isabelle; Mengiste, Tadele; Tilahun, Bezawork; Liwicki, Marcus; Tegegne, Tesfa; Egele, Romain
Organisation
Embedded Internet Systems Lab
