Lexical databases for computational analyses: A linguistic perspective
- Robert Malouf (San Diego State University)
- Farrell Ackerman (University of California, San Diego)
- Arturs Semenuks (University of California, San Diego)
Abstract
Large typological databases have permitted new ways of studying cross-linguistic morphological variation. Recently, computational modelers with typological interests have begun to turn to broad multilingual text databases. In this paper, we focus on the UniMorph database, a collection of morphological paradigms gathered mostly automatically from the crowd-sourced multilingual dictionary Wiktionary. UniMorph was designed to make the large quantity of data contained in Wiktionary available to NLP researchers by standardizing the data and putting it into a form that is easy to access. Typological studies, however, need a linguistically informed view of morphological variation, and their requirements are quite different: they use a morphological database as a scientific instrument for both formulating and testing hypotheses about the nature and organization of language systems. The requirements are, accordingly, much higher. We survey some of the methodological challenges and pitfalls involved in using corpora for typological research, and we end with a proposal for best practices and directions for further research.
Keywords: Morphology, typology, corpora, UniMorph
How to Cite:
Malouf, R., Ackerman, F. & Semenuks, A., (2020) “Lexical databases for computational analyses: A linguistic perspective”, Society for Computation in Linguistics 3(1), 297-307. doi: https://doi.org/10.7275/scil.1228