HaT5: Hate Language Identification using Text-to-Text Transfer Transformer
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0001-7924-4953
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-5582-2031
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-5922-7889
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-0546-116x
2022 (English). In: 2022 International Joint Conference on Neural Networks (IJCNN): Conference Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2022. Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the performance of a state-of-the-art (SoTA) architecture, T5 (available on the SuperGLUE), and compare it with 3 previous SoTA architectures across 5 different tasks from 2 relatively diverse datasets. The datasets differ in the number and types of tasks they contain. To improve performance, we augment the training data using a new autoregressive conversational AI model checkpoint. We achieve near-SoTA results on two of the tasks: macro F1 scores of 81.66% for task A of the OLID 2019 dataset and 82.54% for task A of the hate speech and offensive content (HASOC) 2021 dataset, where the SoTA scores are 82.9% and 83.05%, respectively. We perform error analysis and use a publicly available algorithm, Integrated Gradients (IG), to explain why one of the models (Bi-LSTM) makes the predictions it does, because explainable artificial intelligence (XAI) is essential for earning the trust of users. The main contributions of this work are the implementation method of T5, which is discussed; the data augmentation, which brought performance improvements; and the findings on the shortcomings of the HASOC 2021 dataset. These findings illustrate the difficulties caused by poor data annotation, using a small set of examples where the T5 model made the correct predictions even though the ground truth of the test set was incorrect (in our opinion). We also provide our model checkpoints on the HuggingFace hub: https://huggingface.co/sana-ngu/HaT5_augmentation and https://huggingface.co/sana-ngu/HaT5.
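
As a reading aid for the scores quoted in the abstract: macro F1 is the unweighted mean of the per-class F1 scores, so a minority class counts as much as the majority class. The sketch below is not the authors' evaluation code. The `macro_f1` helper, the `NOT`/`OFF` labels, and the predictions are made-up toy values chosen only to show how macro F1 can differ from accuracy on imbalanced data.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy, imbalanced example: 8 non-offensive and 2 offensive items (assumed data).
y_true = ["NOT"] * 8 + ["OFF"] * 2
y_pred = ["NOT"] * 8 + ["NOT", "OFF"]  # one offensive item is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

In this toy case a single missed minority-class item lowers macro F1 well below accuracy, which is one reason macro F1 is commonly reported for imbalanced classification tasks.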

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022.
Keywords [en]
Hate Speech, Data Augmentation, Transformer, T5
National Category
Natural Language Processing
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-93432; DOI: 10.1109/IJCNN55064.2022.9892696; ISI: 000867070906060; Scopus ID: 2-s2.0-85140754070; OAI: oai:DiVA.org:ltu-93432; DiVA, id: diva2:1701023
Conference
IEEE World Congress on Computational Intelligence (IEEE WCCI 2022), Padua, Italy, July 18-23, 2022
Note

ISBN for host publication: 978-1-7281-8671-9

Available from: 2022-10-04. Created: 2022-10-04. Last updated: 2026-04-20. Bibliographically approved.
In thesis
1. Learning under Data Scarcity: Evidence from NLP and Handwritten Text Recognition
2026 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

This licentiate thesis investigates machine learning under data-related challenges, mainly limited data, imbalanced data, and noisy data with label errors. These conditions are common in real-world applications and significantly affect both model performance and evaluation.

The thesis examines these challenges across two domains: Natural Language Processing (NLP) and Arabic Handwritten Text Recognition (HTR). It considers both data-centric approaches, such as data cleaning and the use of additional data, and model-centric approaches, such as model selection and the use of pretrained models. This difference in approach reflects the availability of resources across the two domains. NLP benefits from a large number of pretrained language models (PLMs), while HTR remains more constrained by data availability and data quality. Pretrained HTR models are also more limited and script-specific than the broad, cross-lingual PLMs used in NLP.

In the HTR setting, particular emphasis is placed on data quality, where errors in annotations and content are shown to have a direct impact on recognition performance. A human-in-the-loop framework is proposed to detect and correct such errors, demonstrating that improving dataset quality leads to measurable performance gains.

In the NLP setting, the work investigates strategies for improving performance under imbalanced and low-resource conditions, including data augmentation, semi-supervised learning, and model selection. The results show that data augmentation is most effective in low-resource and highly imbalanced settings, while its impact depends on both the amount of available data and the choice of model. In addition, experiments on low-resource African languages highlight the importance of pretraining data alignment, showing that models perform better when their pretraining data is closely related to the target language or domain.

Beyond data quality, the thesis explores the use of auxiliary data in HTR through cross-script transfer. Joint training with related Arabic-script languages is shown to improve performance under low-resource conditions, with gains concentrated on shared character structures. A controlled architectural study further indicates that sequence modeling plays an important role in enabling these transfer gains, suggesting that effective transfer depends not only on visual similarity but also on the ability to model contextual information.

Overall, the findings demonstrate that addressing data-related challenges—through both data-centric approaches and model-centric strategies—is essential for improving performance in low-resource settings across domains.

Place, publisher, year, edition, pages
Luleå University of Technology, 2026
Series
Licentiate thesis / Luleå University of Technology, ISSN 1402-1757
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-117195 (URN), 978-91-8142-054-8 (ISBN), 978-91-8142-055-5 (ISBN)
Presentation
2026-06-12, C305, Luleå University of Technology, Luleå, 09:00 (English)
Available from: 2026-04-20. Created: 2026-04-17. Last updated: 2026-05-05. Bibliographically approved.

Authority records

Sabry, Sana Sabah; Adewumi, Tosin; Abid, Nosheen; Kovács, György; Liwicki, Foteini; Liwicki, Marcus
