Meaning-Informed Low-Resource Segmentation of Agglutinative Morphology
Abstract
Morphological segmentation is both an interesting acquisition problem and an important task for natural language processing. Most current computational approaches either use supervised machine learning, which tends to lead to the best-performing models, or operate over bare surface forms of words. However, the empirical conditions of language acquisition seem to fall somewhere in between: children do not have access to pre-segmented input, yet their knowledge of morphological structure develops alongside semantic knowledge. Inspired by this, we suggest a simple computational model, which builds on experimental evidence that children can strip a suffix off of closely-related word forms. The model is unsupervised, but is able to exploit features to identify how differences between closely-related surface forms are marked. Trained on hundreds to a few thousand words from languages with agglutinative morphology, the resulting model outperforms an unsupervised model that does not exploit such features, and in some settings even outperforms a supervised model trained on both features and ground-truth segmentations.
Keywords: morphological segmentation, agglutinative morphology, low-resource learning
How to Cite:
Belth, C. (2024) “Meaning-Informed Low-Resource Segmentation of Agglutinative Morphology”, Society for Computation in Linguistics 7(1), 96–106. doi: https://doi.org/10.7275/scil.2134