Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types
- Ryan Ka Yau Lai (University of California, Santa Barbara)
- Yujie Li (University of California, Santa Barbara)
- Shujie Zhang (University of California, Berkeley)
Abstract
Segmentation of texts into discourse and prosodic units is a ubiquitous problem in corpus linguistics and psycholinguistics, yet best practices for its evaluation – whether evaluating consistency between human segmenters or humanlikeness of machine segmenters – remain understudied. Building on segmentation edit distance (Fournier & Inkpen 2012; Fournier 2013), this paper introduces a new measure for evaluating similarity between two segmentations of the same text with multiple, mutually exclusive boundary types, accounting for varying identifiability of, and confusability between, these types. We implement a dynamic programming algorithm for computing the measure that is tailored to this type of segmentation problem, apply it to a case study measuring inter-annotator agreement in intonation unit segmentation, and make suggestions for interpreting the results.
Keywords: text segmentation, intonation units, prosodic annotation, inter-annotator agreement, dynamic programming, discourse annotation
How to Cite:
Lai, R., Li, Y. & Zhang, S. (2023) “Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types”, Society for Computation in Linguistics 6(1), 300-309. doi: https://doi.org/10.7275/fk79-fv58