Challenges of Hate Speech Detection in Social Media: Data Scarcity, and Leveraging External Resources
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-0546-116X
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-6785-4356
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0001-8532-0895
2021 (English). In: SN Computer Science, ISSN 2662-995X, Vol. 2, no 2, article id 95
Article in journal (Refereed), Published
Abstract [en]

The detection of hate speech in social media is a crucial task. The uncontrolled spread of hate has the potential to gravely damage our society and severely harm marginalized people or groups. A major arena for spreading hate speech online is social media. This contributes significantly to the difficulty of automatic detection, as social media posts include paralinguistic signals (e.g. emoticons and hashtags), and their linguistic content contains plenty of poorly written text. Another difficulty is the context-dependent nature of the task and the lack of consensus on what constitutes hate speech, which makes the task difficult even for humans. This in turn makes creating large labeled corpora difficult and resource-consuming. The problem posed by ungrammatical text has been largely mitigated by the recent emergence of deep neural network (DNN) architectures that have the capacity to efficiently learn various features. For this reason, we proposed a deep natural language processing (NLP) model, combining convolutional and recurrent layers, for the automatic detection of hate speech in social media data. We applied our model to the HASOC2019 corpus and attained a macro F1 score of 0.63 in hate speech detection on the HASOC test set. The capacity of DNNs for efficient learning, however, also means an increased risk of overfitting, particularly when limited training data is available (as was the case for HASOC). For this reason, we investigated different methods for expanding the resources used. We explored various opportunities, such as leveraging unlabeled data and similarly labeled corpora, as well as the use of novel models. Our results showed that by doing so, it was possible to significantly increase the classification score attained.
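
The abstract reports results as a macro F1 score. As a reminder of what that metric measures, the following is a minimal sketch in plain Python: it computes F1 separately for each class and then takes the unweighted mean, so a rare class counts as much as a frequent one. The function name `macro_f1` and the toy labels are illustrative assumptions; they are not taken from the paper or from the HASOC data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one label")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only (not HASOC data).
y_true = ["hate", "none", "none", "hate", "none", "none"]
y_pred = ["hate", "none", "hate", "none", "none", "none"]
print(macro_f1(y_true, y_pred))  # 0.625: mean of F1(hate)=0.5 and F1(none)=0.75
```

Because each class contributes equally, a classifier that ignores the minority class is penalized more heavily here than it would be under plain accuracy.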

Place, publisher, year, edition, pages
Switzerland: Springer, 2021. Vol. 2, no 2, article id 95
Keywords [en]
Hate speech, Deep language processing, Transfer learning, BERT, Vocabulary augmentation
National Category
Computer Sciences
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-82964
DOI: 10.1007/s42979-021-00457-3
Scopus ID: 2-s2.0-85122607484
OAI: oai:DiVA.org:ltu-82964
DiVA, id: diva2:1528789
Projects
Language models for Swedish authorities
Funder
Vinnova, 2019-02996
Note

Validated; 2021; Level 1; 2021-02-18 (alebob)

Available from: 2021-02-16 Created: 2021-02-16 Last updated: 2023-09-05
Bibliographically approved

Open Access in DiVA

fulltext (1145 kB), 507 downloads
File information
File name: FULLTEXT01.pdf
File size: 1145 kB
Checksum (SHA-512): 5f4ef9df24c964aa4b8cfaea5fc993ed61be5c426ac74be3f52c06fb2f9e9f9ed3c512a491d19728d79886b84abf57270a46246aeb0aede5f7fd813a0b9b3195
Type: fulltext
Mimetype: application/pdf

Authority records

Kovács, Gyorgy; Alonso, Pedro; Saini, Rajkumar
