Extended Abstract

Overview of AMALGUM – Large Silver Quality Annotations across English Genres

Authors
  • Luke D Gessler (Georgetown University)
  • Siyao Peng (Georgetown University)
  • Yang Liu
  • Yilun Zhu
  • Shabnam Behzad (Georgetown University)
  • Amir Zeldes (Georgetown University)

Abstract

Corpus resources for linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood, annotated datasets: corpora are either very large (100M-10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) with high-quality curated data for more interpretable inferences with complex annotations.

Keywords: corpora, coreference, discourse, RST, treebank, annotation

How to Cite:

Gessler, L. D., Peng, S., Liu, Y., Zhu, Y., Behzad, S. & Zeldes, A. (2021) “Overview of AMALGUM – Large Silver Quality Annotations across English Genres”, Society for Computation in Linguistics 4(1), 434-437. doi: https://doi.org/10.7275/ep47-3t54

Published on
01 Jan 2021