HaT5: Hate Language Identification using Text-to-Text Transfer Transformer
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0001-7924-4953
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-5582-2031
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-5922-7889
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-0546-116x
2022 (English). In: 2022 International Joint Conference on Neural Networks (IJCNN): Conference Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2022. Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the performance of the state-of-the-art (SoTA) T5 architecture (featured on the SuperGLUE leaderboard) and compare it with three previous SoTA architectures across five different tasks from two relatively diverse datasets, which differ in the number and types of tasks they contain. To improve performance, we augment the training data using a new autoregressive conversational AI model checkpoint. We achieve near-SoTA results on two of the tasks: macro F1 scores of 81.66% for task A of the OLID 2019 dataset and 82.54% for task A of the Hate Speech and Offensive Content (HASOC) 2021 dataset, where the SoTA results are 82.9% and 83.05%, respectively. Because explainable artificial intelligence (XAI) is essential for earning users' trust, we perform error analysis and explain why one of the models (a Bi-LSTM) makes the predictions it does, using the publicly available Integrated Gradients (IG) algorithm. The main contributions of this work are the implementation method of T5, which is discussed; the data augmentation, which brought performance improvements; and the revelation of shortcomings in the HASOC 2021 dataset. The latter demonstrates the difficulty of poor data annotation through a small set of examples for which the T5 model made the correct predictions even though, in our opinion, the ground truth of the test set was incorrect. We also provide our model checkpoints on the HuggingFace hub: https://huggingface.co/sana-ngu/HaT5_augmentation and https://huggingface.co/sana-ngu/HaT5.
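
The checkpoints linked above are T5 models hosted on the HuggingFace hub, so they can be loaded with the transformers library. Below is a minimal sketch; the input formatting and the exact label strings the model was trained to generate are not specified in this record and are assumptions for illustration only.

from transformers import AutoTokenizer, T5ForConditionalGeneration

# "sana-ngu/HaT5" is the checkpoint named in the abstract;
# "sana-ngu/HaT5_augmentation" is the variant trained with augmented data.
model_name = "sana-ngu/HaT5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 is text-to-text, so classification is framed as generating a short label string.
text = "An example social media post to classify."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))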
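The abstract also mentions explaining a Bi-LSTM's predictions with Integrated Gradients (IG). The sketch below shows one way to do that with the Captum library; the Bi-LSTM architecture, the all-zeros (padding) baseline, and the target class index are placeholders, since the record does not specify the authors' exact setup.

import torch
import torch.nn as nn
from captum.attr import LayerIntegratedGradients

class BiLSTMClassifier(nn.Module):
    # A stand-in binary classifier: embedding -> Bi-LSTM -> mean pooling -> linear head.
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)
        out, _ = self.lstm(emb)
        return self.fc(out.mean(dim=1))

model = BiLSTMClassifier()
model.eval()

token_ids = torch.randint(0, 10000, (1, 12))  # stand-in for a tokenised post
baseline = torch.zeros_like(token_ids)        # all-padding baseline (an assumption)

# Attribute the score of class 1 back to the input tokens through the embedding layer.
lig = LayerIntegratedGradients(model, model.embedding)
attributions = lig.attribute(token_ids, baselines=baseline, target=1)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one relevance score per token
print(token_scores)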

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022.
Keywords [en]
Hate Speech, Data Augmentation, Transformer, T5
National Category
Language Technology (Computational Linguistics)
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-93432
DOI: 10.1109/IJCNN55064.2022.9892696
ISI: 000867070906060
Scopus ID: 2-s2.0-85140754070
OAI: oai:DiVA.org:ltu-93432
DiVA, id: diva2:1701023
Conference
IEEE World Congress on Computational Intelligence (IEEE WCCI 2022), Padua, Italy, July 18-23, 2022
Note

ISBN for host publication: 978-1-7281-8671-9

Available from: 2022-10-04 Created: 2022-10-04 Last updated: 2023-09-05 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Sabry, Sana Sabah; Adewumi, Tosin; Abid, Nosheen; Kovács, György; Liwicki, Foteini; Liwicki, Marcus
