About Automatic Speech Recognition

Audio-to-text transcription, or Automatic Speech Recognition (ASR), has improved dramatically in recent years with the advent of cloud computing.

Most Automatic Speech Recognition systems use a Recurrent Neural Network Transducer (RNN-T) model. An RNN-T has three components: an encoder, a prediction network, and a joiner network. The encoder takes an input sequence, such as a speech waveform, and extracts features useful for recognition. The prediction network uses previously emitted symbols to predict the next symbol. The joiner network combines the encoder output with the prediction network output to produce probabilities over output symbols, such as characters and spaces. A search algorithm then decodes these probabilities into multiple possible transcriptions, ranked by probability.
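The joiner step can be illustrated with a minimal sketch. The dimensions, random weights, and vocabulary size below are toy assumptions for illustration; a real RNN-T learns these parameters from training data.

```python
# Toy sketch of an RNN-T joiner step (illustrative values, not a real model).
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 30   # e.g. 26 letters, space, punctuation, blank symbol
hidden = 8        # toy feature size

# Encoder output for one audio frame (acoustic features).
enc = rng.standard_normal(hidden)
# Prediction-network output given the previous symbol (language context).
pred = rng.standard_normal(hidden)

# Joiner: combine both representations and project to the vocabulary.
W = rng.standard_normal((vocab_size, hidden))
logits = W @ np.tanh(enc + pred)

# Softmax gives a probability for each possible next symbol; a decoder
# would search over these per-frame distributions to rank transcriptions.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best = int(np.argmax(probs))
print(f"most likely symbol index: {best}, p = {probs[best]:.3f}")
```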

Speech recognition engines combine acoustic, pronunciation, and language models, resulting in search graphs that can be several gigabytes in size. For this reason, audio data is typically sent to the cloud and returned as text, reducing the processing power required on the end user's device.

Use cases for automatic speech transcription

The usefulness of automatic transcription depends on the context in which it is used.

A student with a hearing impairment

  • Use case: As a student with a hearing impairment, I need accurate transcriptions of teaching material in audio format, so that I can learn.
    • In this context, ASR will not be the best solution, because ASR error rates are highest on non-standard vocabulary such as technical terms. Human-curated captions would be a better solution for conveying meaning accurately.

A staff member with a hearing impairment

  • Use case 1: As a staff member with a hearing impairment, I need on demand transcriptions, so that I can join Zoom meetings that occur at short notice.
    • Whilst ASR is not the most accurate form of captioning, it can still be useful in situations where there isn't time to arrange captioning by a third party.
  • Use case 2: As a staff member with a hearing impairment, I need accurate transcriptions, so that I can understand what students are saying during online tutorials.
    • In this case, it is critical to both students and the staff member that information is conveyed accurately. ASR would not be adequate.

A person with English as a second language

  • Use case: As a person with English as a second language, I need transcriptions of audio content, so that I better understand it.
    • In this context, ASR can be very useful despite the fact that there are some errors.

A web publisher looking to add captions to video

  • Use case: As a content creator of publicly available web content, I need a transcript of audio content, so that I can save time and effort when creating video captions and transcripts.
    • ASR can cut down the time required to produce captions, but the results still need to be edited by a human.
  • Note: Privacy is not generally an issue here as the intention is to make the content public.

A researcher working with interviews

  • Use case: As a researcher, I need a rough transcript of interviews so that I can save time producing final transcripts.
    • ASR can save time, but the results still need to be edited by a human.

Automatic Speech Recognition makes audio content more accessible to users with English as a second language, to users in noisy or quiet environments, and to users passively following webinars. However, ASR is not sufficiently accurate to accommodate the needs of staff and students with hearing impairments.

Automatic Speech Recognition (ASR) providers

Most end-user products work via an Application Programming Interface (API). Audio data is sent to a cloud-based service, which converts it into text and uses AI models to correct errors.
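The round trip follows a common pattern: send audio bytes, receive text plus metadata. The sketch below uses a stand-in function rather than a real network call, and the response shape (`text`, `confidence`) is an illustrative assumption, not any specific vendor's API.

```python
# Hedged sketch of a typical ASR API round trip. transcribe() is a
# stand-in: a real client would POST audio_bytes to the provider's
# endpoint and parse the JSON response.

def transcribe(audio_bytes: bytes) -> dict:
    """Simulated cloud call returning an illustrative response shape."""
    return {"text": "hello world", "confidence": 0.92}

audio = b"\x00\x01"  # placeholder for real WAV/PCM data
result = transcribe(audio)
print(result["text"], result["confidence"])
```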

Popular APIs include:

  • Nuance: speech recognition for Siri and Dragon (recently acquired by Microsoft)
  • Microsoft Speech-to-text API in Azure: speech recognition for Microsoft Dictate, Teams, Cortana
  • Google speech API: used in Google Assistant
  • Amazon Alexa: speech recognition for Amazon Echo

Microsoft Azure Speech to Text

  • Microsoft Azure Speech to Text is a cloud-based service: developers send audio data to the API and receive a transcription in return.
  • The Azure API can also return a confidence level, from 0.0 (no confidence) to 1.0 (full confidence).
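A confidence score like this can be used to flag low-confidence lines for human review. The JSON below mirrors the shape of Azure Speech to Text's "detailed" output (an NBest list with Confidence and Display fields), but treat the exact field names as an assumption and check the current service documentation before relying on them.

```python
# Sketch: flag low-confidence hypotheses in an Azure-style "detailed"
# response for human review. Field names are assumed, not verified.
import json

response = json.loads("""
{
  "RecognitionStatus": "Success",
  "NBest": [
    {"Confidence": 0.94, "Display": "Turn on the lights."},
    {"Confidence": 0.61, "Display": "Turn on the light."}
  ]
}
""")

THRESHOLD = 0.8  # below this, send the line to a human editor

for hyp in response["NBest"]:
    flag = "" if hyp["Confidence"] >= THRESHOLD else "  [review]"
    print(f'{hyp["Confidence"]:.2f}  {hyp["Display"]}{flag}')
```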

Need web help?

All websites and applications which form part of the University web presence are expected to be compliant with the W3C's Web Content Accessibility Guidelines (WCAG) 2.2 AA.