Publications (10 of 10)
Kovács, G., Alonso, P., Saini, R. & Liwicki, M. (2022). Leveraging external resources for offensive content detection in social media. AI Communications, 35(2), 87-109
Leveraging external resources for offensive content detection in social media
2022 (English). In: AI Communications, ISSN 0921-7126, E-ISSN 1875-8452, Vol. 35, no. 2, pp. 87-109. Journal article (peer-reviewed). Published
Abstract [en]

Hate speech is a burning issue of today’s society that cuts across numerous strategic areas, including human rights protection, refugee protection, and the fight against racism and discrimination. The gravity of the subject is further demonstrated by António Guterres, the United Nations Secretary-General, calling it “a menace to democratic values, social stability, and peace”. One central platform for the spread of hate speech is the Internet, and social media in particular. Thus, automatic detection of hateful and offensive content on these platforms is a crucial challenge that would strongly contribute to an equal and sustainable society when overcome. One significant difficulty in meeting this challenge is collecting sufficient labeled data. In our work, we examine how various resources can be leveraged to circumvent this difficulty. We carry out extensive experiments to exploit various data sources using different machine learning models, including state-of-the-art transformers. We have found that using our proposed methods, one can attain state-of-the-art performance in detecting hate speech on Twitter (outperforming the winner of both the HASOC 2019 and HASOC 2020 competitions). We observed that, in general, adding more data improves the performance or at least does not decrease it. Even when using good language models and knowledge transfer mechanisms, the best results were attained using data from one or two additional data sets.

Place, publisher, year, edition, pages
IOS Press, 2022
Keywords
Hateful and offensive language, deep language processing, transfer learning, vocabulary augmentation, RoBERTa
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-90607 (URN); 10.3233/aic-210138 (DOI); 000828016100004 (); 2-s2.0-85135231173 (Scopus ID)
Note

Validated; 2022; Level 2; 2022-07-20 (sofila)

Available from: 2022-05-11. Created: 2022-05-11. Last updated: 2023-09-05. Bibliographically checked.
Kovács, G., Alonso, P. & Saini, R. (2021). Challenges of Hate Speech Detection in Social Media: Data Scarcity, and Leveraging External Resources. SN Computer Science, 2(2), Article ID 95.
Challenges of Hate Speech Detection in Social Media: Data Scarcity, and Leveraging External Resources
2021 (English). In: SN Computer Science, ISSN 2662-995X, Vol. 2, no. 2, article id 95. Journal article (peer-reviewed). Published
Abstract [en]

The detection of hate speech in social media is a crucial task. The uncontrolled spread of hate has the potential to gravely damage our society, and severely harm marginalized people or groups. A major arena for spreading hate speech online is social media. This significantly contributes to the difficulty of automatic detection, as social media posts include paralinguistic signals (e.g. emoticons and hashtags), and their linguistic content contains plenty of poorly written text. Another difficulty is presented by the context-dependent nature of the task, and the lack of consensus on what constitutes hate speech, which makes the task difficult even for humans. This makes the task of creating large labeled corpora difficult and resource-consuming. The problem posed by ungrammatical text has been largely mitigated by the recent emergence of deep neural network (DNN) architectures that have the capacity to efficiently learn various features. For this reason, we proposed a deep natural language processing (NLP) model—combining convolutional and recurrent layers—for the automatic detection of hate speech in social media data. We have applied our model on the HASOC2019 corpus, and attained a macro F1 score of 0.63 in hate speech detection on the test set of HASOC. The capacity of DNNs for efficient learning, however, also means an increased risk of overfitting, particularly with limited training data available (as was the case for HASOC). For this reason, we investigated different methods for expanding the resources used. We have explored various opportunities, such as leveraging unlabeled data, similarly labeled corpora, as well as the use of novel models. Our results showed that by doing so, it was possible to significantly increase the classification score attained.

Place, publisher, year, edition, pages
Switzerland: Springer, 2021
Keywords
Hate speech, Deep language processing, Transfer learning, BERT, Vocabulary augmentation
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-82964 (URN); 10.1007/s42979-021-00457-3 (DOI); 2-s2.0-85122607484 (Scopus ID)
Projects
Language models for Swedish authorities
Research funder
Vinnova, 2019-02996
Note

Validated; 2021; Level 1; 2021-02-18 (alebob)

Available from: 2021-02-16. Created: 2021-02-16. Last updated: 2023-09-05. Bibliographically checked.
Alonso, P., Shridhar, K., Kleyko, D., Osipov, E. & Liwicki, M. (2021). HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing Enabled Embedding of n-gram Statistics. In: 2021 International Joint Conference on Neural Networks (IJCNN) Proceedings. Paper presented at The International Joint Conference on Neural Networks (IJCNN 2021), virtual, July 18-22, 2021. IEEE
HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing Enabled Embedding of n-gram Statistics
2021 (English). In: 2021 International Joint Conference on Neural Networks (IJCNN) Proceedings, IEEE, 2021. Conference paper, published paper (peer-reviewed)
Abstract [en]

Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks; however, the models become more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using hyperdimensional computing enabled embedding. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on-par F1 scores while decreasing the time and memory requirements by several times compared to the conventional n-gram statistics; e.g., for one of the classifiers on a small dataset, the memory reduction was 6.18 times, while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, memory reduction was ca. 100 times and train and test speed-ups were over 100 times. Importantly, the usage of distributed representations formed via hyperdimensional computing allows dissecting the strict dependency between the dimensionality of the representation and the n-gram size, thus opening room for tradeoffs.
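The embedding idea described in the abstract can be sketched as follows: each character gets a fixed random bipolar vector, an n-gram is encoded by binding the position-rotated character vectors, and a text is the sum of its n-gram codes. The function name, parameters, and exact binding scheme below are illustrative choices, not taken from the paper:

```python
import numpy as np

def hd_embed(text, n=3, dim=1000, seed=0):
    """Embed the character n-gram statistics of `text` into a single
    `dim`-dimensional vector via hyperdimensional computing."""
    def char_vec(c):
        # Deterministic per-character item memory, seeded by the codepoint.
        rng = np.random.default_rng([seed, ord(c)])
        return rng.choice([-1, 1], size=dim)

    emb = np.zeros(dim)
    for i in range(len(text) - n + 1):
        gram = np.ones(dim, dtype=np.int64)
        for j, c in enumerate(text[i:i + n]):
            # np.roll acts as the permutation encoding position j in the n-gram.
            gram *= np.roll(char_vec(c), j)
        emb += gram
    return emb
```

Any standard classifier can then consume `hd_embed(text)` as its feature vector; note that `dim` is chosen independently of the n-gram size `n`, which is exactly the decoupling the abstract highlights.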

Place, publisher, year, edition, pages
IEEE, 2021
Series
International Joint Conference on Neural Networks (IJCNN), E-ISSN 2161-4407
Keywords
hyperdimensional computing, n-gram statistics, intent classification, embedding
HSV category
Research programme
Machine Learning; Communication and Computation Systems
Identifiers
urn:nbn:se:ltu:diva-87288 (URN); 10.1109/IJCNN52387.2021.9534359 (DOI); 000722581708054 (); 2-s2.0-85108654382 (Scopus ID)
Conference
The International Joint Conference on Neural Networks (IJCNN 2021), virtual, July 18-22, 2021
Research funder
EU, Horizon 2020, 839179
Note

ISBN of host publication: 978-1-6654-3900-8

Funder: DARPA

Available from: 2021-09-30. Created: 2021-09-30. Last updated: 2022-01-28. Bibliographically checked.
Kovács, G., Saini, R., Faridghasemnia, M., Mokayed, H., Adewumi, T., Alonso, P., . . . Liwicki, M. (2021). Pedagogical Principles in the Online Teaching of NLP: A Retrospection. In: David Jurgens; Varada Kolhatkar; Lucy Li; Margot Mieskes; Ted Pedersen (Ed.), Teaching NLP: Proceedings of the Fifth Workshop. Paper presented at Annual Conference of the North American Chapter of the Association for Computational Linguistics, 5th Workshop on Teaching NLP, Online, 10-11 June, 2021 (pp. 1-12). Association for Computational Linguistics (ACL)
Pedagogical Principles in the Online Teaching of NLP: A Retrospection
2021 (English). In: Teaching NLP: Proceedings of the Fifth Workshop / [ed] David Jurgens; Varada Kolhatkar; Lucy Li; Margot Mieskes; Ted Pedersen, Association for Computational Linguistics (ACL), 2021, pp. 1-12. Conference paper, published paper (peer-reviewed)
Abstract [en]

The ongoing COVID-19 pandemic has brought online education to the forefront of pedagogical discussions. To make this increased interest sustainable in a post-pandemic era, online courses must be built on strong pedagogical foundations. Pedagogical research has a long history, and many principles, frameworks, and models are available to help teachers do so. These models cover different teaching perspectives, such as constructive alignment, feedback, and the learning environment. In this paper, we discuss how we designed and implemented our online Natural Language Processing (NLP) course following constructive alignment and adhering to the pedagogical principles of LTU. By examining our course and analyzing student evaluation forms, we show that we have met our goal and successfully delivered the course. Furthermore, we discuss the additional benefits resulting from the current mode of delivery, including the increased reusability of course content and increased potential for collaboration between universities. Lastly, we also discuss where we can and will further improve the current course design.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2021
Keywords
NLP, Constructive Alignment, LTU’s Pedagogical Principles, Student Activation, Online Pedagogy, COVID19, Canvas, Blackboard, Zoom
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-86548 (URN); 10.18653/v1/2021.teachingnlp-1.1 (DOI); 2-s2.0-85138700433 (Scopus ID)
Conference
Annual Conference of the North American Chapter of the Association for Computational Linguistics, 5th Workshop on Teaching NLP, Online, 10-11 June, 2021
Note

ISBN of host publication: 978-1-954085-36-7

Available from: 2021-08-11. Created: 2021-08-11. Last updated: 2023-09-05. Bibliographically checked.
Alonso, P. (2020). Faster and More Resource-Efficient Intent Classification. (Licentiate dissertation). Luleå, Sweden: Luleå University of Technology
Faster and More Resource-Efficient Intent Classification
2020 (English). Licentiate thesis, comprising papers (Other academic)
Abstract [en]

Intent classification is known to be a complex problem in Natural Language Processing (NLP) research. This problem represents one of the stepping stones towards machines that can understand our language. Several models have recently appeared to tackle the problem, and with deep learning models a solution has come within reach, although the goal has not been achieved yet. However, the energy and computational demands of these modern models (especially deep learning ones) are very high. Energy and computational resource usage should be kept to a minimum to deploy such models efficiently on resource-constrained devices. Furthermore, these resource savings help to minimize the environmental impact of NLP.

This thesis considers two main questions. First, which deep learning model is optimal for intent classification, and which model can most accurately infer the intent of a short written text (here, whether it constitutes hate speech)? Second, can intent classification models be made simpler and more resource-efficient than deep learning models?

Concerning the first question, the work here shows that intent classification in written language is still a complex problem for modern models, although deep learning has shown successful results in every area to which it has been applied; the thesis identifies the model that performed best on short texts. Concerning the second question, it shows that results similar to those of deep learning models can be achieved by more straightforward solutions, namely by combining classical machine learning models, pre-processing techniques, and a hyperdimensional computing approach.

This thesis presents research towards a more resource-efficient machine learning approach to intent classification. It first establishes a strong baseline on tweets containing hate speech, using one of the best deep learning models currently available (RoBERTa, as an example). It then describes the steps taken to arrive at the final model based on hyperdimensional computing, which minimizes the required resources. This model, called "HyperEmbed", demonstrates the capabilities of the hyperdimensional computing paradigm and can make intent classification faster and more resource-efficient by trading a few performance points for such resource savings. With resource-efficiency in mind, the proposed models were tested on intent classification of short texts: tweets (for hate speech, where the intent is to offend or not) and questions posed to chatbots.

In summary, the work proposed here covers two aspects. First, deep learning models have a performance advantage when sufficient data is available, but tend to fail when it is not; in contrast, the proposed models work well even on small datasets. Second, deep learning models require substantial resources to train and run, whereas the models proposed here aim to trade off the computational resources spent on obtaining and running the model against its classification performance.

Place, publisher, year, edition, pages
Luleå, Sweden: Luleå University of Technology, 2020. p. 86
Series
Licentiate thesis / Luleå University of Technology, ISSN 1402-1757
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-81178 (URN); 978-91-7790-689-6 (ISBN); 978-91-7790-690-2 (ISBN)
Presentation
2020-12-18, A3580, Luleå, 09:00 (English)
Opponent
Supervisor
Available from: 2020-10-19. Created: 2020-10-19. Last updated: 2020-11-27. Bibliographically checked.
Alonso, P., Saini, R. & Kovács, G. (2020). Hate Speech Detection using Transformer Ensembles on the HASOC Dataset. In: Alexey Karpov, Rodmonga Potapova (Ed.), Speech and Computer: 22nd International Conference, SPECOM 2020, St. Petersburg, Russia, October 7–9, 2020, Proceedings. Paper presented at 22nd International Conference on Speech and Computer (SPECOM 2020), 7-9 October, 2020, St. Petersburg, Russia (pp. 13-21). Springer
Hate Speech Detection using Transformer Ensembles on the HASOC Dataset
2020 (English). In: Speech and Computer: 22nd International Conference, SPECOM 2020, St. Petersburg, Russia, October 7–9, 2020, Proceedings / [ed] Alexey Karpov, Rodmonga Potapova, Springer, 2020, pp. 13-21. Conference paper, published paper (peer-reviewed)
Abstract [en]

With the ubiquity and anonymity of the Internet, the spread of hate speech has been a growing concern for many years now. The language used for the purpose of dehumanizing, defaming or threatening individuals and marginalized groups not only threatens the mental health of its targets, as well as their democratic access to the Internet, but also the fabric of our society. Because of this, much effort has been devoted to manual moderation. The amount of data generated each day, particularly on social media platforms such as Facebook and Twitter, however, makes this a Sisyphean task. This has led to an increased demand for automatic methods of hate speech detection.

Here, to contribute towards solving the task of hate speech detection, we worked with a simple ensemble of transformer models on a Twitter-based hate speech benchmark. Using this method, we attained a weighted F1-score of 0.8426, which we further improved by leveraging more training data, achieving a weighted F1-score of 0.8504 and thus markedly outperforming the best-performing system in the literature.
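The "simple ensemble" idea can be illustrated with a plain majority vote over the labels predicted by each model. The paper's exact combination rule may differ, and the label names used below are hypothetical:

```python
from collections import Counter

def ensemble_predict(model_preds):
    """Majority vote over the label predictions of several models.
    `model_preds` holds one list of labels per model, all aligned to
    the same examples; ties resolve to the first model's label."""
    ensembled = []
    for votes in zip(*model_preds):
        # Counter preserves first-insertion order on ties.
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled
```

For example, three fine-tuned transformers that disagree on individual tweets can still agree as a committee, which is what tends to lift the aggregate F1-score.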

Place, publisher, year, edition, pages
Springer, 2020
Series
Lecture Notes in Artificial Intelligence, ISSN 0302-9743, E-ISSN 1611-3349; 12335
Keywords
Natural Language Processing, Hate Speech Detection, Transformers, RoBERTa, Ensemble
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-80629 (URN); 10.1007/978-3-030-60276-5_2 (DOI); 2-s2.0-85092898876 (Scopus ID)
Conference
22nd International Conference on Speech and Computer (SPECOM 2020), 7-9 October, 2020, St. Petersburg, Russia
Research funder
Vinnova, 2019-02996
Note

ISBN of host publication: 978-3-030-60275-8, 978-3-030-60276-5

Available from: 2020-08-31. Created: 2020-08-31. Last updated: 2023-09-05. Bibliographically checked.
Alonso, P., Saini, R. & Kovács, G. (2020). TheNorth at SemEval-2020 Task 12: Hate Speech Detection using RoBERTa. In: The International Workshop on Semantic Evaluation: Proceedings of the Fourteenth Workshop. Paper presented at 14th International Workshop on Semantic Evaluation (SemEval-2020), Virtual, December 12-13, 2020 (pp. 2197-2202). International Committee for Computational Linguistics
TheNorth at SemEval-2020 Task 12: Hate Speech Detection using RoBERTa
2020 (English). In: The International Workshop on Semantic Evaluation: Proceedings of the Fourteenth Workshop, International Committee for Computational Linguistics, 2020, pp. 2197-2202. Conference paper, published paper (peer-reviewed)
Abstract [en]

Hate speech detection on social media platforms is crucial, as it helps to avoid severe harm to marginalized people and groups. The application of Natural Language Processing (NLP) and Deep Learning has garnered encouraging results in the task of hate speech detection. The expression of hate, however, is varied and ever-evolving; better detection systems thus need to adapt to this variance. Because of this, researchers keep collecting data and regularly organize hate speech detection competitions. In this paper, we discuss our entry to one such competition, namely the English version of sub-task A of the OffensEval competition. Our contribution can be seen in our results: first an F1-score of 0.9087, which, with the further refinements described here, climbed to 0.9166. This gives more support to our hypothesis that one of the variants of BERT, namely RoBERTa, can successfully differentiate between offensive and non-offensive tweets, given the proper preprocessing steps.

Place, publisher, year, edition, pages
International Committee for Computational Linguistics, 2020
Keywords
Natural Language Processing, RoBERTa, Hate speech, Deep Learning
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-80631 (URN); 2-s2.0-85119198242 (Scopus ID)
Conference
14th International Workshop on Semantic Evaluation (SemEval-2020), Virtual, December 12-13, 2020
Research funder
Vinnova, 2019-02996
Note

ISBN of host publication: 978-1-952148-31-6

Available from: 2020-08-31. Created: 2020-08-31. Last updated: 2023-09-05. Bibliographically checked.
Kovács, G., Balogh, V., Mehta, P., Shridhar, K., Alonso, P. & Liwicki, M. (2019). Author Profiling Using Semantic and Syntactic Features: Notebook for PAN at CLEF 2019. In: Linda Cappellato, Nicola Ferro, David E. Losada, Henning Müller (Ed.), CLEF 2019 Working Notes: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. Paper presented at 10th CLEF Conference and Labs of the Evaluation Forum (CLEF 2019), 9-12 September, 2019, Lugano, Switzerland. RWTH Aachen University, Article ID 244.
Author Profiling Using Semantic and Syntactic Features: Notebook for PAN at CLEF 2019
2019 (English). In: CLEF 2019 Working Notes: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum / [ed] Linda Cappellato, Nicola Ferro, David E. Losada, Henning Müller, RWTH Aachen University, 2019, article id 244. Conference paper, published paper (peer-reviewed)
Abstract [en]

In this paper we present an approach for the PAN 2019 Author Profiling challenge. The task is to detect Twitter bots and also to classify the gender of human Twitter users as male or female, based on a hundred selected tweets from their profile. Focusing on feature engineering, we explore the semantic categories present in tweets. We combine these semantic features with part-of-speech tags and other stylistic features (e.g. character floodings and the use of capital letters) for our eventual feature set. We experimented with different machine learning techniques, including ensemble techniques, and found AdaBoost to be the most successful (attaining an F1-score of 0.99 on the development set). Using this technique, we achieved an accuracy score of 89.17% for English-language tweets in the bot detection subtask.
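Two of the stylistic cues named above, character floodings and capital-letter usage, can be extracted in a few lines. The feature names and the exact definitions below are illustrative, not taken from the paper:

```python
import re

def stylistic_features(tweet):
    """Extract two stylistic cues from a tweet:
    character floodings (a character repeated 3+ times, e.g. 'sooo')
    and the share of letters written in upper case."""
    floodings = len(re.findall(r"(\w)\1{2,}", tweet))
    letters = [c for c in tweet if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return {"floodings": floodings, "caps_ratio": caps_ratio}
```

Features like these would then be concatenated with the semantic and part-of-speech features before being fed to a classifier such as AdaBoost.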

Place, publisher, year, edition, pages
RWTH Aachen University, 2019
Series
CEUR Workshop Proceedings, E-ISSN 1613-0073; 2380
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-76936 (URN); 2-s2.0-85070487977 (Scopus ID)
Conference
10th CLEF Conference and Labs of the Evaluation Forum (CLEF 2019), 9-12 September, 2019, Lugano, Switzerland
Available from: 2019-11-28. Created: 2019-11-28. Last updated: 2022-10-31. Bibliographically checked.
Shridhar, K., Dash, A., Sahu, A., Grund Pihlgren, G., Alonso, P., Pondenkandath, V., . . . Liwicki, M. (2019). Subword Semantic Hashing for Intent Classification on Small Datasets. In: 2019 International Joint Conference on Neural Networks (IJCNN). Paper presented at 2019 International Joint Conference on Neural Networks (IJCNN), 14-19 July, 2019, Budapest, Hungary. IEEE, Article ID N-19329.
Subword Semantic Hashing for Intent Classification on Small Datasets
2019 (English). In: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, article id N-19329. Conference paper, published paper (Other academic)
Abstract [en]

In this paper, we introduce the use of Semantic Hashing as embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word embedding based methods [11], [13], [14] are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise with the use of internet communication. First, such datasets miss a lot of terms in the vocabulary to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors and it is difficult to think about ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: Chatbot, Ask Ubuntu, and Web Applications [3]. Our benchmarks are available online.
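A minimal sketch of the subword-hashing idea, assuming the common scheme of wrapping each token in '#' and hashing its character trigrams into a fixed-size count vector. The bucket count and hash function are illustrative choices, not necessarily the paper's:

```python
import zlib

def semantic_hash(text, n=3, buckets=1024):
    """Subword semantic hashing: wrap each token in '#', take its
    character n-grams, and hash every n-gram into a fixed-size
    count vector usable by any standard classifier."""
    vec = [0] * buckets
    for token in text.lower().split():
        padded = f"#{token}#"
        for i in range(len(padded) - n + 1):
            # zlib.crc32 is stable across runs, unlike Python's salted hash().
            vec[zlib.crc32(padded[i:i + n].encode()) % buckets] += 1
    return vec
```

Because the features are subword trigrams rather than whole-word vocabulary entries, out-of-vocabulary words and minor spelling errors still share most of their trigrams with the intended word, which is how this scheme addresses the drawbacks described above.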

Place, publisher, year, edition, pages
IEEE, 2019
Series
International Joint Conference on Neural Networks (IJCNN), ISSN 2161-4407, E-ISSN 2161-4393
Keywords
Natural Language Processing, Intent Classification, Chatbots, Semantic Hashing, Machine Learning, State-of-the-art
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-76841 (URN); 10.1109/IJCNN.2019.8852420 (DOI); 2-s2.0-85073258046 (Scopus ID)
Conference
2019 International Joint Conference on Neural Networks (IJCNN), 14-19 July, 2019, Budapest, Hungary
Note

ISBN of host publication: 978-1-7281-1985-4, 978-1-7281-1986-1

Available from: 2019-11-25. Created: 2019-11-25. Last updated: 2022-10-31. Bibliographically checked.
Alonso, P., Saini, R. & Kovács, G. (2019). TheNorth at HASOC 2019: Hate Speech Detection in Social Media Data. In: Parth Mehta, Paolo Rosso, Prasenjit Majumder, Mandar Mitra (Ed.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation. Paper presented at 11th Forum for Information Retrieval Evaluation (FIRE 2019), Kolkata, India, December 12-15, 2019 (pp. 293-299). RWTH Aachen University
TheNorth at HASOC 2019: Hate Speech Detection in Social Media Data
2019 (English). In: Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation / [ed] Parth Mehta, Paolo Rosso, Prasenjit Majumder, Mandar Mitra, RWTH Aachen University, 2019, pp. 293-299. Conference paper, published paper (peer-reviewed)
Abstract [en]

The detection of hate speech in social media is a crucial task. The uncontrolled spread of hate speech can be detrimental to maintaining peace and harmony in society, particularly when it is spread with the intention to defame people, or to spoil the image of a person, a community, or a nation. A major ground for spreading hate speech is social media. This significantly contributes to the difficulty of the task, as social media posts not only include paralinguistic tools (e.g. emoticons and hashtags), but their linguistic content also contains plenty of poorly written text that does not adhere to grammar rules. With recent developments in Natural Language Processing (NLP), particularly with deep architectures, it is now possible to analyze unstructured, composite natural language text. For this reason, we propose a deep NLP model for the automatic detection of hate speech in social media data. We applied our model to the HASOC2019 hate speech corpus, and attained a macro F1 score of 0.63 in the detection of hate speech.

Place, publisher, year, edition, pages
RWTH Aachen University, 2019
Series
CEUR Workshop Proceedings, E-ISSN 1613-0073; 2517
HSV category
Research programme
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-77403 (URN); 2-s2.0-85076905278 (Scopus ID)
Conference
11th Forum for Information Retrieval Evaluation (FIRE 2019), Kolkata, India, December 12-15, 2019
Available from: 2020-01-14. Created: 2020-01-14. Last updated: 2023-09-05. Bibliographically checked.
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-6785-4356