Are All Languages Equally Hard to Language-Model?
- Ryan Cotterell (Johns Hopkins University)
- Sebastian J Mielke (Johns Hopkins University)
- Jason Eisner (Johns Hopkins University)
- Brian Roark (Google)
Abstract
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison between languages is tricky: not only do training corpora in different languages have different sizes and topics, some of which may be harder to predict than others, but standard metrics for language modeling depend on the orthography of a language. We argue for a fairer metric based on the bits per utterance using utterance-aligned multi-text. We conduct a study on 21 languages, training and testing both n-gram and LSTM language models on “the same” set of utterances in each language (modulo translation), demonstrating that in some languages, especially those with complex inflectional morphology, the textual expression of the information is harder to predict.
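The contrast between an orthography-dependent metric and bits per utterance can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the two unnamed toy "languages", the per-step model probabilities, and the character counts are all hypothetical. The sketch only assumes that a language model assigns a probability to each prediction step of an utterance, so that the utterance's cost in bits is the sum of the negative base-2 log probabilities.

```python
import math
from typing import List

def utterance_bits(step_probs: List[float]) -> float:
    """Total surprisal of one utterance in bits: -sum(log2 p) over its prediction steps."""
    return -sum(math.log2(p) for p in step_probs)

def bits_per_utterance(corpus: List[List[float]]) -> float:
    """Average bits per utterance; comparable across languages when utterances are aligned."""
    return sum(utterance_bits(u) for u in corpus) / len(corpus)

def bits_per_character(corpus: List[List[float]], char_counts: List[int]) -> float:
    """Total bits divided by total characters; depends on how the language is written."""
    return sum(utterance_bits(u) for u in corpus) / sum(char_counts)

# Hypothetical model probabilities for the same two aligned utterances
# in two toy languages (made-up numbers, not results from the study).
lang_a = [[0.5, 0.25], [0.5, 0.5]]
lang_b = [[0.25, 0.25, 0.5], [0.5, 0.25]]

# Hypothetical character lengths of those utterances in each orthography.
chars_a = [10, 10]
chars_b = [20, 20]

bpu_a = bits_per_utterance(lang_a)
bpu_b = bits_per_utterance(lang_b)
bpc_a = bits_per_character(lang_a, chars_a)
bpc_b = bits_per_character(lang_b, chars_b)

print(f"A: {bpu_a:.2f} bits/utterance, {bpc_a:.2f} bits/char")
print(f"B: {bpu_b:.2f} bits/utterance, {bpc_b:.2f} bits/char")
```

In this toy setup, language B looks easier per character only because its utterances are written with more characters, while it costs more bits to predict the same aligned content. Normalizing by utterance rather than by character or word is what removes that dependence on orthography.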
Keywords: NLP, language model, multilingual, comparison, RNN, complexity
How to Cite:
Cotterell, R., Mielke, S. J., Eisner, J. & Roark, B. (2019) “Are All Languages Equally Hard to Language-Model?”, Society for Computation in Linguistics 2(1), 361-362. doi: https://doi.org/10.7275/hptt-f406