Meta Says Its AI Improves The Quality Of Speech Recognition By Lip Reading



People perceive speech both by listening to it and by watching the movements of the speaker’s lips. In fact, studies show that visual cues play a key role in language learning. In contrast, AI speech recognition systems are built primarily – or entirely – on audio. And they require a substantial amount of data to train, typically tens of thousands of hours of recordings.

To determine whether visuals – especially footage of mouth movement – can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best audiovisual speech recognition systems using the same number of transcriptions. Additionally, according to the company, AV-HuBERT outperforms the former best audiovisual speech recognition system while using one-tenth of the labeled data, making it potentially useful for languages with little audio data.

“In the future, AI frameworks like AV-HuBERT could be used to improve the performance of speech recognition technology in noisy everyday conditions – for example, interactions at a party or in a lively street market,” Meta AI researcher Abdelrahman Mohamed told VentureBeat in an interview. “And assistants in smartphones, augmented reality glasses, and smart speakers equipped with a camera – for example, the Alexa Echo Show – could also benefit from this technology.”


Meta is not the first to apply AI to the lip reading problem. In 2016, researchers at the University of Oxford created a system that was almost twice as accurate as experienced lip readers in some tests and could process video in near real time. And in 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV broadcasts to correctly transcribe about 50% of the words without errors on a test set, much better than a human expert’s 12.4%.

But the University of Oxford and DeepMind models, like many later lip-reading models, were limited in the range of vocabulary they could recognize. The models also required datasets paired with transcripts in order to train, and they could not process the audio from the speakers in the videos.

Somewhat uniquely, AV-HuBERT takes advantage of unsupervised, or self-supervised, learning. With supervised learning, algorithms like DeepMind’s are trained on labeled example data until they can detect the underlying relationships between the examples and particular outputs. For example, a system can be trained to write the word “dog” (the output) when shown a picture of a Corgi (the example). AV-HuBERT, by contrast, teaches itself to classify unlabeled data, processing the data to learn from its inherent structure.

AV-HuBERT is also multimodal in the sense that it learns to perceive language through a series of audio cues and lip movements. By combining cues such as lip and tooth movement during speech with auditory information, Meta says, AV-HuBERT can capture “nuanced associations” between the two types of data.

The initial AV-HuBERT model was trained on 30 hours of English-language TED Talk videos, significantly less than the 31,000 hours on which the previous top model was trained. But despite training on less data, AV-HuBERT’s word error rate (WER), a measure of speech recognition performance, was slightly better at 32.5% compared with the old model’s 33.6% in cases where a speaker could be seen but not heard. (WER is calculated by dividing the number of misrecognized words by the total number of words; 32.5% translates to about one error every three words.) Training on 433 hours of TED Talks further reduced AV-HuBERT’s WER to 28.6%.
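To make the WER figures above concrete, here is a minimal sketch of how the metric is commonly computed. It assumes the usual convention of counting word-level substitutions, deletions, and insertions with an edit distance and dividing by the number of words in the reference transcript. The function names `word_error_rate` and `words_per_error` are illustrative and do not come from Meta’s code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def words_per_error(wer: float) -> float:
    """Average number of reference words per error for a given WER (as a fraction)."""
    if wer <= 0:
        raise ValueError("WER must be positive")
    return 1.0 / wer


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # A WER of 32.5% works out to roughly one error every three words.
    print(words_per_error(0.325))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.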

Once AV-HuBERT had learned the structure of and correlations within the data well, the researchers were able to train it further on unlabeled data: 2,442 hours of English-language celebrity videos uploaded to YouTube. Not only did this bring the WER down to 26.9%, but Meta claims that it demonstrates that only a small amount of labeled data is needed to train the framework for a particular application (for example, when multiple people speak simultaneously) or a different language.

Indeed, Meta claims that AV-HuBERT is around 50% better than audio-only models at recognizing a person’s speech while loud music or noise is playing in the background. And when speech and background noise are equally loud, AV-HuBERT manages a WER of 3.2% against 25.5% for the previous best multimodal model.
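The article does not say how “better” is measured in these comparisons. One common way to compare two error rates is the relative reduction in WER, sketched below under that assumption; the helper name `relative_wer_reduction` is hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new model eliminates."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Figures quoted above for equally loud speech and background noise.
    print(f"{relative_wer_reduction(25.5, 3.2):.1%}")
    # Figures quoted earlier for the lip-reading-only comparison.
    print(f"{relative_wer_reduction(33.6, 32.5):.1%}")
```

On this reading, going from 25.5% to 3.2% removes roughly 87% of the errors, while going from 33.6% to 32.5% removes only about 3%.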

Potential shortcomings

In many ways, AV-HuBERT is emblematic of Meta’s growing investment in unsupervised multimodal technology for complex tasks. The company recently detailed a new multimodal system designed to tackle harmful content on its platforms, called Few-Shot Learner, and released models that can learn to recognize speech, segment images, copy text style, and recognize objects from unlabeled data. Compared with supervised systems, unsupervised systems can be considerably more flexible and less expensive to deploy; the labels in labeled datasets come from human annotators who must painstakingly add each one.

Because it requires less labeled data for training, Meta claims that AV-HuBERT could open up possibilities for developing conversational models for “low-resource” languages, such as Susu in the Niger-Congo family. AV-HuBERT could also be useful for creating speech recognition systems for people with speech impairments, the company suggests, as well as for detecting deepfakes and generating realistic lip movements for virtual reality avatars.

But Os Keyes, an AI ethicist at the University of Washington, expressed concern that AV-HuBERT has limitations around class and disability. “Does it work for people with distorted facial speech patterns due to a disability?” they told VentureBeat via email. “It seems pretty ironic to successfully create speech recognition software that relies on lip reading and is prone to inaccuracies when pointed at… deaf people.”

In a Microsoft and Carnegie Mellon paper proposing a research roadmap toward fairness in AI, the co-authors point out that aspects of facial analysis systems similar to AV-HuBERT may not work well for people with Down syndrome, achondroplasia (which alters bone growth), and “other conditions that cause facial features to differ.” Such systems could also fail for people who have had a stroke, the researchers note, or who have Parkinson’s disease, Bell’s palsy, autism, or Williams syndrome – people who may not use (or be able to use) the same facial expressions as neurotypical people.

In an email, Mohamed pointed out that AV-HuBERT only focuses on the lip area to capture lip movement, not the entire face. As with most AI models, AV-HuBERT’s performance will be “proportional to the number of representative samples from different populations in the training data,” he added.

“To evaluate our approach, we used the publicly available LRS3 dataset, which consists of TED Talk videos made publicly available in 2018 by researchers at the University of Oxford. Since this dataset does not represent disabled speakers, we do not have a specific percentage for the expected performance degradation,” Mohamed said. “[But this] newly proposed technology is not limited by the current distribution of speakers in the training dataset. We predict that different training datasets covering larger and more diverse populations would provide significant performance gains.”

Meta says it “will continue to benchmark and develop approaches that improve audiovisual speech recognition models in everyday scenarios where background noise and speaker overlap are common.” Beyond that, it plans to extend AV-HuBERT – which Meta does not intend to put into production – to multilingual benchmarks beyond English.

