Extended Abstract

Overview of AMALGUM – Large Silver Quality Annotations across English Genres

Authors: L. D. Gessler, S. Peng, Y. Liu, Y. Zhu, S. Behzad, A. Zeldes

Abstract

Corpus resources for linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood annotated datasets: corpora are either very large (100M-10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) with those of high-quality curated data, for more interpretable inferences with complex annotations.

Keywords: corpora, coreference, discourse, RST, treebank, annotation

How to Cite: Gessler, L. D., Peng, S., Liu, Y., Zhu, Y., Behzad, S. & Zeldes, A. (2021) “Overview of AMALGUM – Large Silver Quality Annotations across English Genres”, Society for Computation in Linguistics. 4(1). doi: https://doi.org/10.7275/ep47-3t54