A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Machine Learning models are rapidly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a basic and essential component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework was motivated by the difficulty of finding inputs in languages other than English. With a focus on producing a high-quality resource, the construction phases were designed together with the considerations each one requires. The stages are: characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free (no-cost) libraries. Finally, a case study produced a corpus in Spanish with more than 170,000 documents within a specific domain, namely opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
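The stages named in the abstract (collection, cleaning, translation, storage, evaluation) can be pictured as a simple pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: every function name, the cleaning rules, the JSON Lines storage format, and the evaluation counts are hypothetical, and `toy_translate` is a stand-in for whichever automatic translator is actually used.

```python
import json
import re
from typing import Callable, Dict, Iterable, List


def clean(text: str) -> str:
    """Illustrative cleaning only: drop HTML-like tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for an automatic English-to-Spanish translator."""
    lookup = {"Nice shirt": "Bonita camisa"}
    return lookup.get(text, text)


def build_corpus(
    documents: Iterable[str],
    translate: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Clean each collected document, skip empties and duplicates, translate."""
    corpus: List[Dict[str, object]] = []
    seen = set()
    for raw in documents:
        cleaned = clean(raw)
        if not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        corpus.append(
            {"id": len(corpus), "source": cleaned, "target": translate(cleaned)}
        )
    return corpus


def store(corpus: List[Dict[str, object]], path: str) -> None:
    """Storage stage: one JSON record per line (an assumed format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in corpus:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def evaluate(corpus: List[Dict[str, object]]) -> Dict[str, int]:
    """Evaluation stage reduced to basic counts, purely for illustration."""
    empty = sum(1 for r in corpus if not str(r["target"]).strip())
    return {"documents": len(corpus), "empty_translations": empty}
```

Calling `build_corpus(collected_docs, toy_translate)` followed by `store(...)` and `evaluate(...)` runs the stages in order. Passing the translator in as a parameter keeps the sketch independent of any particular translation library.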

Original language: English
Title of host publication: Information and Communication Technologies - 9th Conference of Ecuador, TICEC 2021, Proceedings
Editors: Juan Pablo Salgado Guerrero, Janneth Chicaiza Espinosa, Mariela Cerrada Lozada, Santiago Berrezueta-Guzman
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 87-100
Number of pages: 14
ISBN (Print): 9783030899400
State: Published - 2021
Event: 9th Conference on Information and Communication Technologies of Ecuador, TICEC 2021 - Virtual, Online
Duration: 24 Nov 2021 to 26 Nov 2021

Publication series

Name: Communications in Computer and Information Science
Volume: 1456 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 9th Conference on Information and Communication Technologies of Ecuador, TICEC 2021
City: Virtual, Online
Period: 24/11/21 to 26/11/21

Keywords

  • Corpus construction
  • Corpus in Spanish
  • Large-scale corpus
  • Methodological framework
  • Supplies for NLP
