Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types

Authors: R. Lai, Y. Li, S. Zhang

Abstract

Segmentation of texts into discourse and prosodic units is a ubiquitous problem in corpus linguistics and psycholinguistics, yet best practices for its evaluation – whether evaluating consistency between human segmenters or humanlikeness of machine segmenters – remain understudied. Building on segmentation edit distance (Fournier & Inkpen 2012, Fournier 2013), this paper introduces a new measure for evaluating similarity between two segmentations of the same text with multiple, mutually exclusive boundary types, accounting for varying identifiability and confusability between these types. We implement a dynamic programming algorithm for calculation specifically geared towards this type of segmentation problem, apply it to a case study of intonation unit segmentation measuring inter-annotator agreement, and make suggestions for interpreting results.

Keywords: text segmentation, intonation units, prosodic annotation, inter-annotator agreement, dynamic programming, discourse annotation

How to Cite: Lai, R., Li, Y. & Zhang, S. (2023) “Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types”, Society for Computation in Linguistics. 6(1). doi: https://doi.org/10.7275/fk79-fv58