A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models

Research output: Chapter in book/report/conference proceeding › Conference contribution › Peer-reviewed

Abstract

Machine Learning models are increasingly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a basic and key component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework emerged from the difficulty of finding inputs in languages other than English. With a focus on producing a high-quality resource, the construction phases were designed together with the considerations that each one requires. The stages are: characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free (no-cost) libraries. Finally, a case study was developed, resulting in a Spanish corpus of more than 170,000 documents within a specific domain: opinions on textile products. The evaluations carried out show that the proposed framework can build a large-scale, high-quality corpus.
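The stages listed in the abstract (collection, cleaning, translation, storage, evaluation) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `min_words` filter, the duplicate check, the JSON Lines storage format, and the summary statistics are all hypothetical, and `translate` is a stand-in callable for whichever automatic translator is used.

```python
import html
import json
import re
from typing import Callable, Dict, Iterable, List


def clean(text: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(
    documents: Iterable[str],
    translate: Callable[[str], str],  # hypothetical stand-in for a translator
    min_words: int = 3,               # assumed quality threshold
) -> List[Dict[str, object]]:
    """Clean, filter, deduplicate, and translate the collected documents."""
    corpus: List[Dict[str, object]] = []
    seen = set()
    for raw in documents:
        source = clean(raw)
        # Skip documents that are too short or already in the corpus.
        if len(source.split()) < min_words or source in seen:
            continue
        seen.add(source)
        corpus.append(
            {"id": len(corpus), "source": source, "target": translate(source)}
        )
    return corpus


def store(corpus: List[Dict[str, object]], path: str) -> None:
    """Write one JSON record per line (JSON Lines), keeping non-ASCII text."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in corpus:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def evaluate(corpus: List[Dict[str, object]]) -> Dict[str, float]:
    """Report simple size statistics as a first sanity check."""
    n = len(corpus)
    words = sum(len(str(r["target"]).split()) for r in corpus)
    return {"documents": n, "mean_target_words": words / n if n else 0.0}
```

Passing the translator as a callable keeps the sketch independent of any particular translation library. A call such as `build_corpus(reviews, my_translator)` followed by `store(corpus, "corpus.jsonl")` would cover the stages from cleaning through storage, where `reviews` and `my_translator` are supplied by the user.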

Original language: English
Title of host publication: Information and Communication Technologies - 9th Conference of Ecuador, TICEC 2021, Proceedings
Editors: Juan Pablo Salgado Guerrero, Janneth Chicaiza Espinosa, Mariela Cerrada Lozada, Santiago Berrezueta-Guzman
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 87-100
Number of pages: 14
ISBN (print): 9783030899400
Status: Published - 2021
Event: 9th Conference on Information and Communication Technologies of Ecuador, TICEC 2021 - Virtual, Online
Duration: 24 Nov 2021 to 26 Nov 2021

Publication series

Name: Communications in Computer and Information Science
Volume: 1456 CCIS
ISSN (print): 1865-0929
ISSN (electronic): 1865-0937

Conference

Conference: 9th Conference on Information and Communication Technologies of Ecuador, TICEC 2021
City: Virtual, Online
Period: 24 Nov 2021 to 26 Nov 2021
