Extended Abstract

Dependency Lengths in Speech and Writing: A Cross-Linguistic Comparison via YouDePP, a Pipeline for Scraping and Parsing YouTube Captions

Author
  • Alex Kramer (University of Michigan, Ann Arbor)

Abstract

Recording, transcribing, and annotating naturalistic spoken data is typically difficult and time-intensive. Online platforms, however, offer a rich and relatively untapped source of naturalistic speech. Using corpora of seven languages gathered via YouDePP, a pipeline for scraping and dependency-parsing pre-transcribed speech from YouTube, I investigate how dependency length minimization (DLM) varies across written and spoken modalities. I compare the dependency length growth rates in these corpora to those in Universal Dependencies 2.6 and find that dependency lengths in writing are not consistently longer than those in speech. Rather, the dependency lengths of more head-initial, SVO languages grew at a slightly faster rate in speech than in writing, while the reverse pattern held for more head-final, SOV languages.
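For readers unfamiliar with the measure, the following is a minimal sketch of how a dependency length is commonly computed from a parsed sentence: the absolute distance, in word positions, between each token and its head. It is an illustration only, not the YouDePP implementation; the function names and the toy parse are assumptions, and the abstract does not specify how growth rates are estimated.

```python
def dependency_lengths(heads):
    """Return the length of each dependency in one sentence.

    `heads` is a hypothetical input format: heads[i] holds the 1-based
    position of the head of token i+1, and 0 marks the root (as in the
    HEAD column of CoNLL-U files). The root has no incoming dependency,
    so it is skipped.
    """
    return [abs(head - position)
            for position, head in enumerate(heads, start=1)
            if head != 0]


def total_dependency_length(heads):
    """Sum of all dependency lengths in the sentence."""
    return sum(dependency_lengths(heads))


# Toy parse of "The dog chased the cat" (assumed for illustration):
# The -> dog, dog -> chased, chased = root, the -> cat, cat -> chased
example_heads = [2, 3, 0, 5, 3]
print(dependency_lengths(example_heads))       # [1, 1, 1, 2]
print(total_dependency_length(example_heads))  # 5
```

Under DLM, word orders that keep such totals small are expected to be preferred; comparing how the totals grow with sentence length is one way to contrast corpora.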

Keywords: typology, spoken corpora, dependency length minimization, dependency locality, YouTube

How to Cite:

Kramer, A. (2021) “Dependency Lengths in Speech and Writing: A Cross-Linguistic Comparison via YouDePP, a Pipeline for Scraping and Parsing YouTube Captions”, Society for Computation in Linguistics 4(1), 359-365. doi: https://doi.org/10.7275/pz9g-d780

Published on
01 Jan 2021