Audio to Text Transcription

Audio to text transcription, also known as Automatic Speech Recognition (ASR), has improved dramatically in recent years.

Use Cases for automatic transcription

The required accuracy of automatic transcription depends on the context in which it is to be used.

Use cases for end users include:

  • As a student with a hearing impairment, I need accurate transcriptions of teaching material in audio format, so that I can learn.
  • As a person with English as a second language, I need transcriptions of audio content, so that I better understand it.

Use cases for web publishers include:

  • As a content creator of publicly available web content, I need a transcript of audio content, so that I can save time and effort when creating video captions and transcripts. Note: Privacy is not generally an issue here as the intention is to make the content public.

Use cases for researchers include:

  • As a researcher, I need a reasonably accurate transcript of interviews so that I can return to content in the future.
  • As a researcher, I need highly accurate transcripts of interviews so that I can include them in my research findings.

Automatic Speech Recognition (ASR) providers

There has been an explosion of speech to text providers in recent times and this is set to continue.

Most end user products connect to one of the following APIs:

  • Nuance: speech recognition for Siri and Dragon (recently bought by Microsoft)
  • Microsoft Speech-to-text API in Azure: speech recognition for Microsoft Dictate, Teams, Cortana
  • Google speech API: used in Google Assistant
  • Amazon Alexa: speech recognition for Amazon Echo

Privacy and Security

  • Because transcription occurs in the cloud, data which you upload is shared with Microsoft's AI services.
  • Microsoft says that it does not store audio or transcription results. But what about corrections that you make? It is unclear whether Microsoft uses these to improve its speech recognition engine and thus regards them as product improvement data.
  • All processing of data in the cloud carries some risk, but it is reasonable to think that more reputable vendors, who publish statements about how they use and store data, are lower risk than vendors who are vague about where and how data is being processed.


Microsoft Word in Office 365

  • Both the desktop version of Word and the online version of Word, Office 365, include a Dictate function, which will take audio from the computer's microphone and convert it into text.
  • The Office 365 version also allows you to transcribe audio by uploading an audio file in .mp3, .mp4, .m4a or .wav format.

Results of Microsoft Word transcription of a podcast

Microsoft Azure Speech to Text

  • Microsoft Azure Speech to Text is an API that allows developers to send audio data to a cloud based service for transcription.
  • The Azure API can also return a confidence level, from 0.0 (no confidence) to 1.0 (full confidence).
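
One way a confidence level could be put to use is to flag low-confidence passages for manual checking. The sketch below is illustrative only: the segment structure (`text`, `confidence`), the function name `flag_for_review` and the 0.8 threshold are assumptions made for this example, not the shape of the Azure response or a recommended cut-off.

```python
from typing import Dict, List, Union

# Hypothetical segment shape: {"text": str, "confidence": float between 0.0 and 1.0}
Segment = Dict[str, Union[str, float]]


def flag_for_review(segments: List[Segment], threshold: float = 0.8) -> List[Segment]:
    """Return the segments whose confidence falls below the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [s for s in segments if float(s["confidence"]) < threshold]


# Made-up example data, not real transcription output.
segments = [
    {"text": "Welcome to the lecture.", "confidence": 0.96},
    {"text": "Today we cover my toe sis.", "confidence": 0.41},
]

for segment in flag_for_review(segments):
    print(f"Check manually ({segment['confidence']:.2f}): {segment['text']}")
```

A suitable threshold would depend on the use case: a transcript quoted in research findings probably warrants a stricter cut-off than one kept for personal reference.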

Other tools

Other uses of Text to Speech Technology


There are various models for measuring the accuracy of automatic transcription.

Word Error Rate (WER) Model

The WER Model works by adding up the total number of errors in a transcription and dividing it by the total number of words in the source content, e.g. an audio file.

WER = (Insertions + Deletions + Substitutions) / Number of Words in Source Content

Errors include:

  • Insertions: words in the transcription that don't appear in the source content.
  • Deletions: words that don't appear in the transcription but are in the source content.
  • Substitutions: words in the transcription that are incorrectly transcribed from the source content.
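
The calculation can be sketched in a few lines of code. The example below counts the smallest number of insertions, deletions and substitutions needed to turn the source words into the transcribed words (a word-level edit distance), then divides by the number of source words. The function name `word_error_rate`, the lower-casing, and splitting on whitespace are assumptions made for this sketch; real evaluations usually also normalise punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (insertions + deletions + substitutions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the fewest edits needed to turn the reference words
    # processed so far into the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 source words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions are counted as errors, the WER can exceed 1.0 when the transcription contains many more words than the source.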

Strengths of the WER Model:

  • The WER Model is widely used.
  • A variety of test content has been tested against different ASR engines.


Weaknesses of the WER Model:

  • Every error is weighted the same, regardless of the impact on understanding.
  • It gives no credit for partial mistakes that the reader can recognise.

NER Model


Contact Us

For assistance or to report accessibility problems please contact:

Andrew Normand
Web Accessibility Lead
Phone: +61 3 9035 4867