TY - GEN
T1 - A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models
AU - Santos, David
AU - Auquilla, Andrés
AU - Siguenza-Guzman, Lorena
AU - Peña, Mario
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - Machine Learning models are increasingly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a key, basic component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework arises from the difficulty of finding inputs in languages other than English. With a focus on producing a high-quality resource, the construction phases were designed together with the considerations that must be taken into account. The stages are characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free libraries. Finally, a case study was developed, resulting in a Spanish corpus of more than 170,000 documents within a specific domain, namely opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
KW - Corpus construction
KW - Corpus in Spanish
KW - Large-scale corpus
KW - Methodological framework
KW - Supplies for NLP
UR - https://www.scopus.com/pages/publications/85121636802
U2 - 10.1007/978-3-030-89941-7_7
DO - 10.1007/978-3-030-89941-7_7
M3 - Conference contribution
AN - SCOPUS:85121636802
SN - 9783030899400
T3 - Communications in Computer and Information Science
SP - 87
EP - 100
BT - Information and Communication Technologies - 9th Conference of Ecuador, TICEC 2021, Proceedings
A2 - Salgado Guerrero, Juan Pablo
A2 - Chicaiza Espinosa, Janneth
A2 - Cerrada Lozada, Mariela
A2 - Berrezueta-Guzman, Santiago
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th Conference on Information and Communication Technologies of Ecuador, TICEC 2021
Y2 - 24 November 2021 through 26 November 2021
ER -