Automated phonetic transcription for varieties of English: wav2vec 2.0 fine-tuned on the Buckeye Corpus
Abstract
Reliable automated phonetic transcription would vastly increase the database for phonological analysis and theorizing, and advances in speech recognition technology stand to bring us to that goal. We present a wav2vec 2.0 model fine-tuned on the Buckeye corpus of conversational English. We experiment with different amounts of training data for fine-tuning, as well as different gender and age distributions in the training data. We find that good results are achieved with about two hours of training data, and that performance is generally robust to skews in the makeup of the training data. These findings are encouraging for the project of extending these methods to languages and varieties that are less well resourced. We also compare our model on the Buckeye test set to a group of universal models, as well as to models trained on the TIMIT corpus. These comparisons suggest that targeted fine-tuning is worthwhile where the data exist. As a first step in extending our model to other varieties, we also evaluated on the TIMIT corpus test set. Our Buckeye-tuned model continues to outperform the universal models on the TIMIT test set, but by a smaller margin. To make our models broadly accessible, we have released them publicly along with a web-based interface that supports input and output in Praat TextGrid format.
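The released interface exchanges time-aligned transcriptions in Praat TextGrid format. As a minimal sketch of that output step, the helper below serializes hypothetical time-aligned phone labels (such as might be derived from a fine-tuned wav2vec 2.0 model's frame-level predictions) into a single-tier long-format TextGrid; the function name and the example intervals are illustrative, not the paper's actual implementation.

```python
def to_textgrid(intervals, tier_name="phones"):
    """Serialize (start, end, label) tuples, in seconds, as a
    Praat long-format TextGrid string with one IntervalTier."""
    xmin = intervals[0][0]
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        f'xmin = {xmin}',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f'        xmin = {xmin}',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (start, end, label) in enumerate(intervals, 1):
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {start}',
            f'            xmax = {end}',
            f'            text = "{label}"',
        ]
    return "\n".join(lines) + "\n"

# Hypothetical model output: three phone intervals for a short utterance.
phones = [(0.00, 0.12, "dh"), (0.12, 0.25, "ah"), (0.25, 0.40, "k")]
print(to_textgrid(phones))
```

The resulting file opens directly in Praat alongside the audio, which is how TextGrid-based workflows typically inspect and hand-correct automatic transcriptions.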
Keywords: automated transcription, speech recognition, IPA transcription, wav2vec 2.0, Buckeye corpus
How to Cite:
Partridge, V., Pater, J., Bhangla, P., Nirheche, A. & Prickett, B., (2026) “Automated phonetic transcription for varieties of English: wav2vec 2.0 fine-tuned on the Buckeye Corpus”, Proceedings of the Annual Meetings on Phonology 2(1). doi: https://doi.org/10.7275/amphonology.3874