Parsing "Early English Books Online" for Linguistic Search

Seth Kulick; Neville Ryant; Beatrice Santorini

doi:10.7275/kr54-n102

Authors

Seth Kulick (University of Pennsylvania)
Neville Ryant (University of Pennsylvania)
Beatrice Santorini (University of Pennsylvania)

Abstract

This work addresses the question of how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for utility in linguistic research. Earlier work trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of six specific sentence types of interest. However, significant differences between EEBO and the manually annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, an overlap of source material in PPCEME and EEBO allows us to establish a token alignment between them and to score the POS-tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, thus allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, pointing the way to further improvements.
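
To illustrate the idea of scoring POS tags through a token alignment, the following is a minimal sketch, not the procedure used in the paper. It assumes exact-match alignment via Python's standard `difflib.SequenceMatcher`; the function names `align_tokens` and `pos_accuracy`, the toy tokens, and the tags are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def align_tokens(gold_tokens, target_tokens):
    """Return (gold_index, target_index) pairs for exactly matching tokens.

    Assumption: exact string matching is a stand-in for whatever
    alignment method is actually appropriate for the two corpora.
    """
    matcher = SequenceMatcher(a=gold_tokens, b=target_tokens, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block covers `size` consecutive matching tokens.
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


def pos_accuracy(gold_tags, predicted_tags, pairs):
    """Fraction of aligned tokens whose predicted tag equals the gold tag."""
    if not pairs:
        return 0.0
    correct = sum(gold_tags[i] == predicted_tags[j] for i, j in pairs)
    return correct / len(pairs)


# Toy data (invented): the second text contains one extra token.
gold_tokens = ["the", "kynge", "sayd", "vnto", "hym"]
gold_tags = ["D", "N", "VBD", "P", "PRO"]
eebo_tokens = ["the", "kynge", "sayd", "then", "vnto", "hym"]
eebo_tags = ["D", "N", "VBD", "ADV", "P", "N"]

pairs = align_tokens(gold_tokens, eebo_tokens)
accuracy = pos_accuracy(gold_tags, eebo_tags, pairs)
print(f"{len(pairs)} aligned tokens, accuracy {accuracy:.2f}")
```

Only aligned tokens are scored, so the unmatched token in the second text does not count for or against the tagger. Real corpora would also need handling of spelling and tokenization differences, which this sketch ignores.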

Keywords: parsing, part-of-speech tagging, diachronic syntax

How to Cite:

Kulick, S., Ryant, N. & Santorini, B. (2023) “Parsing "Early English Books Online" for Linguistic Search”, Society for Computation in Linguistics 6(1), 222-242. doi: https://doi.org/10.7275/kr54-n102

Published on
2023-06-01

License

Creative Commons Attribution 4.0

Publication details

Pages: 222-242
Submitted on: 2023-05-15

File Checksums (MD5)

PDF: e8870abb1f39fc7a2d9cefb812c9fceb
