Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
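Random-projection quantization turns continuous speech features into discrete target labels using a frozen, randomly initialized projection and codebook, so the quantizer itself needs no training; the encoder is then pre-trained to predict these labels for masked frames. The snippet below is a minimal NumPy sketch of this idea, not the USM implementation; the dimensions (`feature_dim`, `proj_dim`, `codebook_size`) and the `quantize` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

feature_dim = 80      # e.g. log-mel features per frame (assumed)
proj_dim = 16         # dimension of the projected space (assumed)
codebook_size = 4096  # number of discrete target labels (assumed)

# Frozen random projection and frozen random codebook (never trained).
projection = rng.normal(size=(feature_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map speech frames (T, feature_dim) to discrete token ids (T,).

    Each frame is projected with the frozen random matrix, normalized, and
    assigned the index of the nearest codebook vector. The resulting ids can
    serve as prediction targets for BERT-style pre-training of the encoder.
    """
    projected = frames @ projection                   # (T, proj_dim)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    # Nearest neighbor by cosine similarity (all vectors unit-normalized).
    return np.argmax(projected @ codebook.T, axis=1)  # (T,)

# Example: 100 frames of 80-dim features -> 100 discrete target ids.
targets = quantize(rng.normal(size=(100, feature_dim)))
```

Because both the projection and the codebook stay fixed, this kind of quantizer adds no trainable parameters and keeps the self-supervised pre-training objective simple.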
Further reading
- Access the paper on arXiv.org