Parsing "Early English Books Online" for Linguistic Search
- Seth Kulick (University of Pennsylvania)
- Neville Ryant (University of Pennsylvania)
- Beatrice Santorini (University of Pennsylvania)
Abstract
This work addresses how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for its utility in linguistic research. Earlier work trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of six specific sentence types of interest. However, significant differences between EEBO and the manually annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, the two corpora overlap in source material, which allows us to establish a token alignment between them and to score POS tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, which points the way to further improvements.
Keywords: parsing, part-of-speech tagging, diachronic syntax
How to Cite:
Kulick, S., Ryant, N. & Santorini, B. (2023) “Parsing ‘Early English Books Online’ for Linguistic Search”, Society for Computation in Linguistics 6(1), 222-242. doi: https://doi.org/10.7275/kr54-n102