Accuracy

Accuracy of automatic transcriptions can be affected by multiple factors.

Technical issues include:

  • Internet connection
  • Microphone quality
  • Recording software
  • Recording hardware
  • Downstream data processing
    • Analogue-to-digital conversion
    • File format conversion
  • Limitations in model design
  • Limitations in model training

Intelligibility of speech can be affected by:

  • Recording environment noise
  • Distance to speaker
  • Speaker articulation
  • Speaker accent
  • Use of technical and industry-specific language

Measurement Models

There are various models for measuring the accuracy of automatic transcription.

Word Error Rate (WER) Model

The WER is the most common metric for speech recognition accuracy. It is recommended by the US National Institute of Standards and Technology (NIST) for evaluating the performance of automatic speech recognition (ASR) systems.

The WER Model works by adding up the total number of errors in a transcription and dividing it by the total number of words in the source content, e.g. an audio file. The lower the WER, the more accurate the transcription is.

WER = (Insertions + Deletions + Substitutions) / Number of Words in Source Content

Errors include:

  • Insertions: words in the transcription that don't appear in the source content.
  • Deletions: words that don't appear in the transcription but are in the source content.
  • Substitutions: words from the source content that are replaced with different words in the transcription.
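
To make the calculation concrete, here is a minimal Python sketch (an illustration, not taken from any particular toolkit) that computes the WER using the standard word-level edit-distance alignment, which finds the minimum number of insertions, deletions and substitutions between the source and the transcription:

  def wer(reference, hypothesis):
      """Word Error Rate via word-level Levenshtein alignment.

      reference  -- list of words in the source content
      hypothesis -- list of words in the automatic transcription
      """
      # d[i][j] = minimum edits to turn the first i reference words
      # into the first j hypothesis words.
      d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
      for i in range(len(reference) + 1):
          d[i][0] = i                      # i deletions
      for j in range(len(hypothesis) + 1):
          d[0][j] = j                      # j insertions
      for i in range(1, len(reference) + 1):
          for j in range(1, len(hypothesis) + 1):
              cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
              d[i][j] = min(
                  d[i - 1][j] + 1,         # deletion
                  d[i][j - 1] + 1,         # insertion
                  d[i - 1][j - 1] + cost,  # substitution (or match)
              )
      return d[len(reference)][len(hypothesis)] / len(reference)

  # One substitution ("quick" -> "kick") in a four-word source: WER = 0.25.
  print(wer("the quick brown fox".split(), "the kick brown fox".split()))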

Strengths of the WER Model:

  • The WER Model is widely used.
  • A variety of test content has already been evaluated against different ASR engines, so results can be compared.

Weaknesses:

  • Every error is weighted the same, regardless of its impact on understanding.
  • It gives no credit for near-miss errors that a reader can still recognise and understand.
  • WER is not a true percentage: it has no upper bound, and can exceed 100% when a transcription contains many insertions (for example, a 10-word source transcribed with 10 substitutions and 3 insertions gives a WER of 130%).

NER Model

The NER (Number, Edition errors, Recognition errors) Model is most often used to measure the accuracy of live captioning products, including as part of regulation. For example, the UK’s communications regulator, Ofcom, regards an NER score of 98% as the acceptable accuracy threshold for live subtitling of TV broadcasts.

The NER Model aims to refine the definition of accuracy by weighting errors based on their impact on comprehension. The higher the NER, the more accurate the transcription is.

NER = (Number of Words in Source Content – Edition Errors – Recognition Errors) / Number of Words in Source Content

Errors include:

  • Edition errors: words that appear in the source content but not in the transcription, for example because a re-speaker omitted the information. Edition errors are weighted based on their impact on the user’s comprehension.
  • Recognition errors: incorrect words that appear in the transcription, caused by mispronunciation or by the software failing to recognise a word. Recognition errors are weighted based on their impact on the user’s comprehension.
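
Because the weighting of each error is a human judgement, an implementation can only automate the final arithmetic. A minimal Python sketch, assuming the weighted error scores have already been assigned by an evaluator (in the NER Model, errors are commonly scored as serious = 1, standard = 0.5 or minor = 0.25):

  def ner(word_count, edition_errors, recognition_errors):
      """NER accuracy score.

      word_count         -- number of words in the source content
      edition_errors     -- sum of the weighted edition error scores
      recognition_errors -- sum of the weighted recognition error scores
      """
      return (word_count - edition_errors - recognition_errors) / word_count

  # 200 words, one standard edition error (0.5) and two minor
  # recognition errors (0.25 each): (200 - 0.5 - 0.5) / 200 = 0.995,
  # i.e. 99.5% accuracy, just above Ofcom's 98% threshold.
  print(ner(200, 0.5, 0.5))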

Strengths of the NER Model:

  • Not all errors are treated equally: each error is weighted based on its impact on the user’s comprehension.
  • Results of NER tests can be analysed to provide training and feedback to the captioners.

Weaknesses:

  • Requires an accurate, verbatim transcript to compare to.
  • Using the model can be labour-intensive; for example, evaluating the quality of a captioned programme can take 10 to 15 times the programme's duration.
