BLiMP: A Benchmark of Linguistic Minimal Pairs for English

Authors
  • Alex Warstadt (New York University)
  • Alicia Parrish (New York University)
  • Haokun Liu (New York University)
  • Anhad Mohananey (New York University)
  • Wei Peng (New York University)
  • Sheng-Fu Wang (New York University)
  • Samuel R. Bowman (New York University)

Abstract

We introduce BLiMP (The Benchmark of Linguistic Minimal Pairs), a human-solvable challenge set for evaluating language models (LMs) that covers a broad range of major grammatical phenomena in English. BLiMP consists of dozens of datasets, each containing 1,000 minimal pairs isolating a specific contrast in syntax, morphology, or semantics. Like GLUE (Wang et al., 2018), BLiMP makes it easy to compare models directly. Evaluating n-gram, LSTM, and Transformer LMs (GPT-2 and Transformer-XL), we find that the Transformer models perform best overall, reaching human or near-human performance on agreement and binding phenomena. However, phenomena such as wh-islands and NPI licensing remain challenging even for state-of-the-art LMs.
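The evaluation setup follows directly from the minimal-pair design: an LM is credited with a pair when it assigns higher probability to the acceptable sentence than to its minimally different unacceptable counterpart. The snippet below is a minimal sketch of that forced-choice scoring, assuming a Hugging Face GPT-2 checkpoint and an illustrative agreement pair; it is not the authors' released evaluation code.

```python
# Sketch: score a minimal pair by comparing full-sentence log-probabilities
# under a pretrained LM (illustrative; the example pair is not drawn from BLiMP).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Approximate total log-probability of a sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # The model's loss is the mean negative log-likelihood per predicted token.
        loss = model(ids, labels=ids).loss
    # The first token is not predicted, so n - 1 positions are scored.
    return -loss.item() * (ids.size(1) - 1)

good = "The cats annoy Tim."   # acceptable member of the pair
bad = "The cats annoys Tim."   # unacceptable member

# The LM gets credit if it prefers the acceptable sentence.
print(sentence_log_prob(good) > sentence_log_prob(bad))
```

Accuracy on a dataset is then simply the fraction of its 1,000 pairs for which this comparison comes out in favor of the acceptable sentence.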

Keywords: acceptability, language model, evaluation, transformer, n-gram

How to Cite:

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., & Bowman, S. R. (2020). "BLiMP: A Benchmark of Linguistic Minimal Pairs for English", Society for Computation in Linguistics 3(1), 437-438. doi: https://doi.org/10.7275/zejz-qs04

Published on
01 Jan 2020