Paper

Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types

Authors
  • Ryan Ka Yau Lai (University of California, Santa Barbara)
  • Yujie Li (University of California, Santa Barbara)
  • Shujie Zhang (University of California, Berkeley)

Abstract

Segmentation of texts into discourse and prosodic units is a ubiquitous problem in corpus linguistics and psycholinguistics, yet best practices for its evaluation – whether evaluating consistency between human segmenters or humanlikeness of machine segmenters – remain understudied. Building on segmentation edit distance (Fournier & Inkpen 2012, Fournier 2013), this paper introduces a new measure for evaluating similarity between two segmentations of the same text with multiple, mutually exclusive boundary types, accounting for varying identifiability and confusability between these types. We implement a dynamic programming algorithm for calculation specifically geared towards this type of segmentation problem, apply it to a case study of intonation unit segmentation measuring inter-annotator agreement, and make suggestions for interpreting results.

Keywords: text segmentation, intonation units, prosodic annotation, inter-annotator agreement, dynamic programming, discourse annotation

How to Cite:

Lai, R., Li, Y. & Zhang, S. (2023) “Text Segmentation Similarity Revisited: A Flexible Distance-based Approach for Multiple Boundary Types”, Society for Computation in Linguistics 6(1), 300-309. doi: https://doi.org/10.7275/fk79-fv58

Published on
01 Jun 2023