Subject-verb Agreement with Seq2Seq Transformers: Bigger Is Better, but Still Not Best

  • Michael A Wilson (Yale University)
  • Zhenghao Zhou (Yale University)
  • Robert Frank (Yale University)


Past work (Linzen et al., 2016; Goldberg, 2019, among others) has used the performance of neural network language models on subject-verb agreement to argue that such models possess structure-sensitive grammatical knowledge. We investigate which properties of the model or of the training regimen are implicated in such success for sequence-to-sequence transformer models that use the T5 architecture (Raffel et al., 2019; Tay et al., 2021). We find that larger models exhibit improved performance, especially on sentences with singular subjects. We also find that larger pre-training datasets are generally associated with higher performance, though models trained on less complex language (e.g., CHILDES, Simple English Wikipedia) can make more errors when trained on larger datasets. Finally, we show that a model's ability to replicate psycholinguistic results does not correspondingly improve with more parameters or more training data: none of the models we study displays a fully convincing replication of the hierarchically informed pattern of agreement behavior observed in human experiments.

Keywords: subject-verb agreement, transformer language models, sequence to sequence models, agreement attraction

How to Cite:

Wilson, M. A., Zhou, Z. & Frank, R. (2023) “Subject-verb Agreement with Seq2Seq Transformers: Bigger Is Better, but Still Not Best”, Society for Computation in Linguistics 6(1), 278-288. doi:

Published on 01 Jun 2023