Exploring Swedish & English fastText Embeddings for NER with the Transformer
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. (Machine Learning) ORCID iD: 0000-0002-5582-2031
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. (Machine Learning) ORCID iD: 0000-0002-6756-0147
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. (Machine Learning) ORCID iD: 0000-0003-4029-6574
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In this paper, our main contributions are twofold: we show that embeddings trained on relatively small corpora can outperform those trained on far larger corpora, and we present a new Swedish analogy test set. Good network performance in natural language processing (NLP) downstream tasks depends on several factors: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels, the latter by deploying them with the Transformer in a named entity recognition (NER) task, and we conduct significance tests. This is done for both Swedish and English. In both languages, we obtain better performance on the downstream task with far less training data than the recently released Common Crawl versions, and character n-grams appear useful for Swedish, a morphologically rich language.
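
The role of character n-grams mentioned in the abstract can be illustrated with a minimal sketch of how a fastText-style model breaks a word into subword units. This is an illustration under stated assumptions, not the authors' code: the function name `char_ngrams` is hypothetical, the n-gram range of 3 to 6 is the fastText library default rather than a setting confirmed by this record, and the hashing of n-grams into buckets that fastText performs is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal character sequences.
    min_n and max_n are assumed values (fastText defaults).
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Swedish "bil" (car) and its plural "bilar" share subword units,
# so an inflected form can reuse what was learned for the base form.
shared = set(char_ngrams("bil", 3, 4)) & set(char_ngrams("bilar", 3, 4))
print(sorted(shared))
```

In a morphologically rich language, many surface forms of a word are rare in the corpus; sharing n-grams across forms is one plausible reason subword information helps.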

Keywords [en]
Embeddings, Transformer, Analogy, Dataset, NER, Swedish
National Category
Language Technology (Computational Linguistics)
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-80622
OAI: oai:DiVA.org:ltu-80622
DiVA, id: diva2:1462641
Funder
Vinnova, 2019-02996
Available from: 2020-08-31 Created: 2020-08-31 Last updated: 2022-10-28
In thesis
1. Word Vector Representations using Shallow Neural Networks
2021 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

This work highlights some important factors for consideration when developing word vector representations and data-driven conversational systems. The neural network methods for creating word embeddings have gained more prominence than their older, count-based counterparts. However, there are still challenges, such as prolonged training time and the need for more data, especially with deep neural networks. Shallow neural networks appear to have the advantage of lower complexity; however, they also face challenges, such as sub-optimal combinations of hyper-parameters, which produce sub-optimal models. This work, therefore, investigates the following research questions: "How importantly do hyper-parameters influence word embeddings’ performance?" and "What factors are important for developing ethical and robust conversational systems?" In answering the questions, various experiments were conducted using different datasets in different studies. The first study investigates, empirically, various hyper-parameter combinations for creating word vectors and their impact on a few natural language processing (NLP) downstream tasks: named entity recognition (NER) and sentiment analysis (SA). The study shows that optimal performance of embeddings for downstream NLP tasks depends on the task at hand. It also shows that certain combinations give strong performance across the tasks chosen for the study. Furthermore, it shows that reasonably smaller corpora are sufficient, or even produce better models in some cases, and take less time to train and load. This is important, especially now that environmental considerations play a prominent role in ethical research. Subsequent studies build on the findings of the first and explore the hyper-parameter combinations for Swedish and English embeddings for the downstream NER task. The second study presents the new Swedish analogy test set for evaluation of Swedish embeddings. Furthermore, it shows that character n-grams are useful for Swedish, a morphologically rich language. The third study shows that broad coverage of topics in a corpus appears to be important for producing better embeddings and that noise may be helpful in certain instances, though it is generally harmful. Hence, a relatively smaller corpus can show better performance than a larger one, as demonstrated in the work with the smaller Swedish Wikipedia corpus against the Swedish Gigaword. The argument is made in the final study (in answering the second question), from the point of view of the philosophy of science, that the near-elimination of unwanted bias in training data and the use of fora such as peer review, conferences, and journals to provide the necessary avenues for criticism and feedback are instrumental for the development of ethical and robust conversational systems.
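
An analogy test set of the kind mentioned above is typically used for intrinsic evaluation: for a question "a is to b as c is to ?", the answer is taken to be the word whose vector is closest to b - a + c. The sketch below illustrates that common procedure only; it is not the thesis's evaluation code. The helper names and the two-dimensional `toy_vectors` are hypothetical, hand-made values chosen for illustration, not trained embeddings or entries from the Swedish test set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the question words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical toy vectors (not real embeddings).
toy_vectors = {
    "kung": [1.0, 1.0],        # king
    "drottning": [1.0, -1.0],  # queen
    "man": [0.5, 1.0],         # man
    "kvinna": [0.5, -1.0],     # woman
    "hund": [-1.0, 0.2],       # dog
}

# man : kvinna :: kung : ?
questions = [("man", "kvinna", "kung", "drottning")]
print(analogy_accuracy(toy_vectors, questions))
```

With real embeddings, the accuracy over the whole test set gives a single intrinsic score that can be compared across corpora and hyper-parameter settings.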

Place, publisher, year, edition, pages
Luleå: Luleå University of Technology, 2021. p. 93
Keywords
Word vectors, NLP, Neural networks, Embeddings
National Category
Language Technology (Computational Linguistics)
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-83578 (URN)
978-91-7790-810-4 (ISBN)
978-91-7790-811-1 (ISBN)
Presentation
2021-05-26, A109, LTU, Luleå, 09:00 (English)
Opponent
Supervisors
Available from: 2021-04-12 Created: 2021-04-10 Last updated: 2021-05-07
Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

https://arxiv.org/pdf/2007.16007.pdf

Authority records

Adewumi, Oluwatosin; Liwicki, Foteini; Liwicki, Marcus
