Overview of AMALGUM – Large Silver Quality Annotations across English Genres
- Luke D Gessler (Georgetown University)
- Siyao Peng (Georgetown University)
- Yang Liu
- Yilun Zhu
- Shabnam Behzad (Georgetown University)
- Amir Zeldes (Georgetown University)
Abstract
Corpus resources for linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood, annotated datasets: corpora are either very large (100M-10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) and high-quality curated data for more interpretable inferences with complex annotations.
Keywords: corpora, coreference, discourse, RST, treebank, annotation
How to Cite:
Gessler, L. D., Peng, S., Liu, Y., Zhu, Y., Behzad, S. & Zeldes, A. (2021) “Overview of AMALGUM – Large Silver Quality Annotations across English Genres”, Society for Computation in Linguistics 4(1), 434-437. doi: https://doi.org/10.7275/ep47-3t54