The stability of segmental properties across genre and corpus types in low-resource languages
- Uriel Cohen Priva (Brown University)
- Shiying Yang (Brown University)
- Emily Strand (Brown University)
Abstract
Are written corpora useful for phonological research? Word frequency lists for low-resource languages have become ubiquitous in recent years (Scannell, 2007). For many languages there is a direct correspondence between their written forms and their alphabets, but it is not clear whether written corpora can adequately represent language use. We use 15 low-resource languages and compare several information-theoretic properties across three corpus types. We show that, despite differences in origin and genre, estimates in one corpus are highly correlated with estimates in the other corpora.
Keywords: corpus, phonology, low-resource, stability
How to Cite:
Cohen Priva, U., Yang, S. & Strand, E., (2020) “The stability of segmental properties across genre and corpus types in low-resource languages”, Society for Computation in Linguistics 3(1), 1-9. doi: https://doi.org/10.7275/fttf-fq95