Overview of AMALGUM – Large Silver Quality Annotations across English Genres

Luke D Gessler; Siyao Peng; Yang Liu; Yilun Zhu; Shabnam Behzad; Amir Zeldes

doi:10.7275/ep47-3t54

Extended Abstract

Authors

Luke D Gessler (Georgetown University)
Siyao Peng (Georgetown University)
Yang Liu
Yilun Zhu
Shabnam Behzad (Georgetown University)
Amir Zeldes (Georgetown University)

Abstract

Corpus resources for linguistics and NLP research on discourse phenomena, such as coreference and discourse trees, are limited by a lack of large-scale, well-understood annotated datasets: corpora are either very large (100M to 10G tokens) but shallowly annotated and of unknown composition, or richly annotated but smaller. Here, we present a resource that takes a middle path, combining some of the best features of scraped corpora (size, open licenses, lexical diversity) with high-quality curated data for more interpretable inferences with complex annotations.

Keywords: corpora, coreference, discourse, RST, treebank, annotation

How to Cite:

Gessler, L. D., Peng, S., Liu, Y., Zhu, Y., Behzad, S. & Zeldes, A. (2021) “Overview of AMALGUM – Large Silver Quality Annotations across English Genres”, Society for Computation in Linguistics 4(1), 434-437. doi: https://doi.org/10.7275/ep47-3t54

Published on
2021-01-01

License

Creative Commons Attribution 4.0

Publication details

Pages: 434-437
Submitted on: 2021-01-14

File Checksums (MD5)

PDF: b09ddefb531d4048eade9d7f1ef40e37
