Lexical databases for computational analyses: A linguistic perspective

Robert Malouf; Farrell Ackerman; Arturs Semenuks

doi:10.7275/scil.1228

Authors

Robert Malouf (San Diego State University)
Farrell Ackerman (University of California, San Diego)
Arturs Semenuks (University of California, San Diego)

Abstract

Large typological databases have permitted new ways of studying cross-linguistic morphological variation. Recently, computational modelers with typological interests have begun to turn to broad multilingual text databases. In this paper, we focus on the UniMorph database, a collection of morphological paradigms gathered mostly automatically from the crowd-sourced multilingual dictionary Wiktionary. UniMorph was designed to make the large quantity of data contained in Wiktionary available to NLP researchers by standardizing the data and putting it into an easily accessible form. Typological studies, however, place quite different requirements on such a resource: they use a morphological database as a scientific instrument for both formulating and testing hypotheses about the nature and organization of language systems, so the standards the data must meet are correspondingly higher. We survey some of the methodological challenges and pitfalls involved in using corpora for typological research, and we end with a proposal for best practices and directions for further research.

Keywords: Morphology, typology, corpora, UniMorph

How to Cite:

Malouf, R., Ackerman, F. & Semenuks, A. (2020) “Lexical databases for computational analyses: A linguistic perspective”, Society for Computation in Linguistics 3(1), 297-307. doi: https://doi.org/10.7275/scil.1228

Published on
2020-01-01

License

Creative Commons Attribution 4.0

Publication details

Pages: 297-307
Submitted on: 2019-10-16

File Checksums (MD5)

PDF: fbe515c696e8df450c1f9cebf6966f61
