Overview of AMALGUM – Large Silver Quality Annotations across English Genres

Luke D Gessler; Siyao Peng; Yang Liu; Yilun Zhu; Shabnam Behzad; Amir Zeldes

doi:10.7275/ep47-3t54

Extended Abstract

Authors: Luke D Gessler (Georgetown University), Siyao Peng (Georgetown University), Yang Liu, Yilun Zhu, Shabnam Behzad (Georgetown University), Amir Zeldes (Georgetown University)

Abstract

Corpus resources for Linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood, annotated datasets: corpora are either very large (100M-10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) and of high-quality curated data, for more interpretable inferences with complex annotations.

Keywords: corpora, coreference, discourse, RST, treebank, annotation

How to Cite: Gessler, L. D., Peng, S., Liu, Y., Zhu, Y., Behzad, S. & Zeldes, A. (2021) “Overview of AMALGUM – Large Silver Quality Annotations across English Genres”, Society for Computation in Linguistics. 4(1). doi: https://doi.org/10.7275/ep47-3t54