2026 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]
This licentiate thesis investigates machine learning under data-related challenges, mainly limited data, imbalanced data, and noisy data with label errors. These conditions are common in real-world applications and significantly affect both model performance and evaluation.
The thesis examines these challenges across two domains: Natural Language Processing (NLP) and Arabic Handwritten Text Recognition (HTR). It considers both data-centric approaches, such as data cleaning and the use of additional data, and model-centric approaches, such as model selection and the use of pretrained models. This difference in approach reflects the availability of resources in the two domains. NLP benefits from a large number of pretrained language models (PLMs), while HTR remains more constrained by data availability and data quality. Pretrained HTR models are also fewer and more script-specific than the broad, cross-lingual PLMs used in NLP.
In the HTR setting, particular emphasis is placed on data quality, where errors in annotations and content are shown to have a direct impact on recognition performance. A human-in-the-loop framework is proposed to detect and correct such errors, demonstrating that improving dataset quality leads to measurable performance gains.
In the NLP setting, the work investigates strategies for improving performance under imbalanced and low-resource conditions, including data augmentation, semi-supervised learning, and model selection. The results show that data augmentation is most effective in low-resource and highly imbalanced settings, and that its impact depends on both the amount of available data and the choice of model. In addition, experiments on low-resource African languages highlight the importance of pretraining data alignment, showing that models perform better when their pretraining data is closely related to the target language or domain.
Beyond data quality, the thesis explores the use of auxiliary data in HTR through cross-script transfer. Joint training with related Arabic-script languages is shown to improve performance under low-resource conditions, with gains concentrated on shared character structures. A controlled architectural study further indicates that sequence modeling plays an important role in enabling these transfer gains, suggesting that effective transfer depends not only on visual similarity but also on the ability to model contextual information.
Overall, the findings demonstrate that addressing data-related challenges—through both data-centric approaches and model-centric strategies—is essential for improving performance in low-resource settings across domains.
Place, publisher, year, edition, pages
Luleå University of Technology, 2026
Series
Licentiate thesis / Luleå University of Technology, ISSN 1402-1757
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-117195 (URN)
978-91-8142-054-8 (ISBN)
978-91-8142-055-5 (ISBN)
Presentation
2026-06-12, C305, Luleå University of Technology, Luleå, 09:00 (English)
Opponent
Supervisors
2026-04-20 2026-04-17 2026-05-05 Bibliographically approved