Paper

Parsing Early Modern English for Linguistic Search

Authors
  • Seth Kulick (University of Pennsylvania)
  • Neville Ryant (University of Pennsylvania)
  • Beatrice Santorini (University of Pennsylvania)

Abstract

This work addresses the question of whether the output of a state-of-the-art parser is accurate enough to support research in theoretical linguistics. In order to build reliable models of syntactic change, we aim to eventually parse the 1.5-billion-word Early English Books Online (EEBO) corpus. But since EEBO is not yet parsed, we begin by constructing and testing a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). In order to obtain robust results, we define an 8-fold split on PPCEME. We then evaluate the parser with evalb and, more relevantly for us, with a task-specific metric, namely its accuracy in parsing 6 sentence types necessary to track the rise of auxiliary do (as in They did not come vs. its historical precursor They came not). Retrieving the relevant sentences from the gold and test versions with CorpusSearch queries, we find that the parser's accuracy promises to be sufficient for our purposes. A remaining concern is the variability of the output, which we plan to address with three pieces of future work sketched in the conclusion.

Keywords: parsing, syntax, historical linguistics

How to Cite:

Kulick, S., Ryant, N. & Santorini, B. (2022) “Parsing Early Modern English for Linguistic Search”, Society for Computation in Linguistics 5(1), 143-157. doi: https://doi.org/10.7275/twww-ef90

Published on
01 Feb 2022