Audio to Text Transcription
Audio-to-text transcription, or Automatic Speech Recognition (ASR), has improved dramatically in recent years.
Use Cases for automatic transcription
The required accuracy of automatic transcription depends on the context in which it is to be used.
Use cases for end users include:
- As a student with a hearing impairment, I need accurate transcriptions of teaching material in audio format, so that I can learn.
- As a person with English as a second language, I need transcriptions of audio content, so that I better understand it.
Use cases for web publishers include:
- As a content creator of publicly available web content, I need a transcript of audio content, so that I can save time and effort when creating video captions and transcripts. Note: Privacy is not generally an issue here as the intention is to make the content public.
Use cases for researchers include:
- As a researcher, I need a reasonably accurate transcript of interviews so that I can return to content in the future.
- As a researcher, I need highly accurate transcripts of interviews so that I can include them in my research findings.
Automatic Speech Recognition (ASR) providers
There has been an explosion of speech-to-text providers in recent years, and this growth is set to continue.
Most end-user products connect to one of the following APIs:
- Nuance: speech recognition for Siri, Dragon (recently acquired by Microsoft)
- Microsoft Speech-to-text API in Azure: speech recognition for Microsoft Dictate, Teams, Cortana
- Google speech API: used in Google Assistant
- Amazon Alexa: speech recognition for Amazon Echo
Privacy and Security
- Because transcription occurs in the cloud, data you upload is shared with Microsoft's AI services.
- Microsoft says that it does not store audio or transcription results. But what about corrections that you make? It is unclear whether Microsoft uses those corrections to improve its speech recognition engine and thus regards them as product-improvement data.
- All processing of data in the cloud contains some risk, but it is reasonable to think that more reputable vendors, who publish statements about how they use and store data, are of lower risk than vendors who are vague about where and how data is being processed.
Microsoft Word in Office 365
- Both the desktop version of Word and the online version of Word, Office 365, include a Dictate function, which will take audio from the computer's microphone and convert it into text.
- The Office 365 version also allows you to transcribe audio by uploading an audio file in .mp3, .mp4, .m4a or .wav format.
- Up to five hours of audio can be uploaded each month, and the file size limit is 200 MB.
- Instructions on how to transcribe recordings.
- If you want to try this out for yourself, try using an mp3 from the Harvard Podcast Archive.
Microsoft Azure Speech to Text
- Microsoft Azure Speech to Text is a cloud-based service that allows developers to send audio data for transcription via an API.
- The Azure API can also return a confidence level, from 0.0 (no confidence) to 1.0 (full confidence).
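As a sketch of how a caller might read that confidence value, the following parses a JSON response shaped like Azure's documented "detailed" recognition output (the field names RecognitionStatus, NBest, Confidence and Display follow Microsoft's documented format, but the payload itself is made up for illustration, not live output):

```python
import json

# Illustrative response in Azure's "detailed" output format; this payload
# is invented for the example, only the field names follow the documented shape.
response_body = """
{
  "RecognitionStatus": "Success",
  "NBest": [
    {"Confidence": 0.94, "Display": "Hello world."},
    {"Confidence": 0.61, "Display": "Hello curled."}
  ]
}
"""

result = json.loads(response_body)
if result["RecognitionStatus"] == "Success":
    # Pick the alternative the engine is most confident about.
    best = max(result["NBest"], key=lambda alt: alt["Confidence"])
    print(f'{best["Display"]} (confidence {best["Confidence"]:.2f})')
```

A real application might use the confidence score to flag low-confidence passages for human review rather than to discard them.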
- Google Cloud Speech-to-Text
- Google Live Transcribe for Android
- Mac Dictate
- Dragon Speech Recognition
- Other free tools
Other uses of Speech to Text Technology
- How Does Siri Work? The Science Behind Siri
- A Benchmarking of IBM, Google and Wit Automatic Speech Recognition Systems
There are various models for measuring the accuracy of automatic transcription.
Word Error Rate (WER) Model
The WER Model works by adding up the total number of errors in a transcription and dividing it by the total number of words in the source content, e.g. an audio file.
WER = (Insertions + Deletions + Substitutions) / Number of Words in Source Content
- Insertions: words in the transcription that don't appear in the source content.
- Deletions: words that don't appear in the transcription but are in the source content.
- Substitutions: words in the transcription that are incorrectly transcribed from the source content.
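The formula above can be computed with a word-level edit distance (Levenshtein alignment), which finds the minimum combination of insertions, deletions and substitutions needed to turn the source text into the transcription. A minimal sketch in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (insertions + deletions + substitutions)
    divided by the number of words in the source (reference) content."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that this treats words as plain whitespace-separated tokens; real evaluations usually normalise case and punctuation first.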
Strengths of the WER Model:
- The WER Model is widely used.
- A variety of test content has already been benchmarked against different ASR engines, making comparisons easier.
Weaknesses of the WER Model:
- Every error is weighted the same, regardless of its impact on understanding.
- It gives no credit for partial mistakes that the reader can still recognise.
NER Model
The NER Model aims to refine the definition of accuracy by weighting errors based on their impact on comprehension.
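For illustration, the NER score is commonly expressed as NER = (N − E − R) / N × 100, where N is the number of words, E the edition errors and R the recognition errors, with each error weighted by severity (commonly 0.25 for minor, 0.5 for standard and 1 for serious errors). A minimal sketch, assuming the errors have already been classified and weighted:

```python
def ner_score(n_words: int, edition_errors: float,
              recognition_errors: float) -> float:
    """NER accuracy: (N - E - R) / N * 100.

    edition_errors and recognition_errors are severity-weighted sums
    (commonly 0.25 per minor, 0.5 per standard, 1.0 per serious error).
    """
    return (n_words - edition_errors - recognition_errors) / n_words * 100

# A 200-word transcript with two standard recognition errors (0.5 each)
# and one minor edition error (0.25):
print(ner_score(200, 0.25, 1.0))  # 99.375
```

Unlike WER, the same number of errors can yield very different NER scores depending on how badly each error damages comprehension.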
- Accuracy Rate in Live Subtitling: The NER Model
- Evaluating an automatic speech recognition service
- Challenges in Measuring Automatic Transcription Accuracy
- Does Word Error Rate Matter?
- The trouble with WER
For assistance or to report accessibility problems please contact:
Web Accessibility Lead
Phone: +61 3 9035 4867