MGEN: Millions of Naturally Occurring Generics in Context

Gustavo Cilleruelo; Emily Allaway; Barry Haddow; Alexandra Birch

doi:10.7275/scil.3147

Authors

Gustavo Cilleruelo (University of Edinburgh)
Emily Allaway
Barry Haddow
Alexandra Birch

Abstract

MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Each sentence is paired with a long context document, drawn from websites and academic papers, and the dataset covers 11 different quantifiers. We analyze the features of generic sentences at scale and find that generics can be long sentences (averaging over 16 words) and that speakers often use them to express generalisations about people.

MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at gustavocilleruelo.com/mgen.

Keywords: Generics, Dataset, Quantifier

How to Cite:

Cilleruelo, G., Allaway, E., Haddow, B. & Birch, A. (2025) “MGEN: Millions of Naturally Occurring Generics in Context”, Society for Computation in Linguistics 8(1): 11. doi: https://doi.org/10.7275/scil.3147

Published on
2025-06-13

Peer Reviewed

License

Creative Commons Attribution 4.0

Publication details

Article Number: 11
Submitted on: 2025-05-29
Accepted on: 2025-06-12

File Checksums (MD5)

PDF: 4534ab53eb0d5fa8626c1b10f416047d
