Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data

Mark Granroth-Wilding; Hannu Toivonen

doi:10.7275/wx64-ea83

Authors

Mark Granroth-Wilding (University of Helsinki)
Hannu Toivonen (University of Helsinki)

Abstract

We present a new method for unsupervised learning of multilingual symbol (e.g. character) embeddings, without any parallel data or prior knowledge about correspondences between languages. It is able to exploit similarities across languages between the distributions over symbols' contexts of use within their language, even in the absence of any symbols in common to the two languages. In experiments with an artificially corrupted text corpus, we show that the method can retrieve character correspondences obscured by noise. We then present encouraging results of applying the method to real linguistic data, including for low-resourced languages. The learned representations open the possibility of fully unsupervised comparative studies of text or speech corpora in low-resourced languages with no prior knowledge regarding their symbol sets.

Keywords: linguistic typology, unsupervised learning, multilingual, embeddings, neural networks

How to Cite:

Granroth-Wilding, M. & Toivonen, H., (2019) “Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data”, Society for Computation in Linguistics 2(1), 19-28. doi: https://doi.org/10.7275/wx64-ea83

Published on
2019-01-01

License

Creative Commons Attribution 4.0

Publication details

Pages: 19-28
Submitted on: 2018-10-30

File Checksums (MD5)

PDF: 52723c3d5ac2a863242ffc3b06850255
