Dependency Lengths in Speech and Writing: A Cross-Linguistic Comparison via YouDePP, a Pipeline for Scraping and Parsing YouTube Captions
- Alex Kramer (University of Michigan, Ann Arbor)
Abstract
Recording, transcribing, and annotating naturalistic spoken data is typically difficult and time-intensive. Online platforms, however, are a rich and relatively untapped source of naturalistic speech. Using corpora of seven languages gathered via YouDePP, a pipeline for scraping and dependency-parsing pre-transcribed speech from YouTube, I investigate how dependency length minimization (DLM) varies across the written and spoken modalities. I compare the dependency length growth rates in these corpora to those in Universal Dependencies 2.6 and find that dependency lengths in writing are not consistently longer than those in speech. Rather, dependency lengths in the more head-initial, SVO languages grew slightly faster in speech than in writing, while the reverse pattern held for the more head-final, SOV languages.
Keywords: typology, spoken corpora, dependency length minimization, dependency locality, YouTube
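For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of how dependency length and its growth with sentence length might be computed. It assumes each sentence is given as a list of 1-based head indices (as in the HEAD column of CoNLL-U, with 0 marking the root), and it uses a plain least-squares slope as a stand-in for a growth rate. The function names are hypothetical, and this is not necessarily the fitting procedure used in the paper.

```python
from statistics import mean


def dependency_lengths(heads):
    """Linear distance between each token and its head.

    heads[i] is the 1-based head index of token i + 1; 0 marks the root,
    which has no incoming dependency and is skipped.
    """
    return [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]


def total_dependency_length(heads):
    """Sum of dependency lengths over one sentence."""
    return sum(dependency_lengths(heads))


def growth_slope(sentences):
    """Least-squares slope of total dependency length against sentence length.

    Assumption: a simple linear fit, used here only for illustration.
    """
    xs = [len(s) for s in sentences]
    ys = [total_dependency_length(s) for s in sentences]
    x_bar, y_bar = mean(xs), mean(ys)
    den = sum((x - x_bar) ** 2 for x in xs)
    if den == 0:
        raise ValueError("need sentences of at least two different lengths")
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return num / den


# "The dog chased the cat": The->dog, dog->chased, chased=root, the->cat, cat->chased
example = [2, 3, 0, 5, 3]
print(total_dependency_length(example))
```

Under these assumptions, comparing the slope obtained from a spoken corpus with the slope from a written corpus of the same language gives a rough picture of the contrast the abstract describes.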
How to Cite:
Kramer, A. (2021) “Dependency Lengths in Speech and Writing: A Cross-Linguistic Comparison via YouDePP, a Pipeline for Scraping and Parsing YouTube Captions”, Society for Computation in Linguistics 4(1), 359–365. doi: https://doi.org/10.7275/pz9g-d780