Are All Languages Equally Hard to Language-Model?

Ryan Cotterell; Sebastian J Mielke; Jason Eisner; Brian Roark

doi:10.7275/hptt-f406

Authors

Ryan Cotterell (Johns Hopkins University)
Sebastian J Mielke (Johns Hopkins University)
Jason Eisner (Johns Hopkins University)
Brian Roark (Google)

Abstract

How cross-linguistically applicable are NLP models, specifically language models? A fair comparison between languages is tricky: not only do training corpora in different languages have different sizes and topics, some of which may be harder to predict than others, but standard metrics for language modeling depend on the orthography of a language. We argue for a fairer metric based on the bits per utterance using utterance-aligned multi-text. We conduct a study on 21 languages, training and testing both n-gram and LSTM language models on “the same” set of utterances in each language (modulo translation), demonstrating that in some languages, especially those with complex inflectional morphology, the textual expression of the information is harder to predict.
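As a rough illustration of why a per-utterance metric is less sensitive to orthography and tokenization than a per-token one, the sketch below computes both from token log-probabilities. It is a minimal sketch, not the authors' implementation: the function names, the input format (one list of natural-log token probabilities per utterance), and the toy numbers are all assumptions made for this example.

```python
import math

def total_bits(utterance_logprobs):
    """Total code length in bits, given natural-log token probabilities."""
    total_nats = -sum(sum(lps) for lps in utterance_logprobs)
    return total_nats / math.log(2)

def bits_per_utterance(utterance_logprobs):
    """Average bits needed per utterance (hypothetical helper)."""
    return total_bits(utterance_logprobs) / len(utterance_logprobs)

def bits_per_token(utterance_logprobs):
    """Average bits per token; depends on how each language is segmented."""
    n_tokens = sum(len(lps) for lps in utterance_logprobs)
    return total_bits(utterance_logprobs) / n_tokens

# Toy data (invented): the same two aligned utterances in two languages.
# Language A writes each utterance as two tokens with probability 0.5 each;
# language B writes each as a single token with probability 0.25.
lang_a = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]
lang_b = [[math.log(0.25)], [math.log(0.25)]]

print(bits_per_utterance(lang_a), bits_per_utterance(lang_b))
print(bits_per_token(lang_a), bits_per_token(lang_b))
```

In this toy case both languages cost the same number of bits per utterance, while the per-token figures differ only because the segmentation differs. Because the utterances are aligned translations, the per-utterance figure compares the cost of expressing "the same" content.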

Keywords: nlp, language, model, language-model, language model, multilingual, comparison, rnn, complexity

How to Cite:

Cotterell, R., Mielke, S. J., Eisner, J. & Roark, B. (2019) “Are All Languages Equally Hard to Language-Model?”, Society for Computation in Linguistics 2(1), 361-362. doi: https://doi.org/10.7275/hptt-f406

Published on
2019-01-01

License

Creative Commons Attribution 4.0

Publication details

Pages: 361-362
Submitted on: 2018-10-26

File Checksums (MD5)

PDF: f73cbe7d0b3f67bc33bf1017c2740e07
