Accuracy
Accuracy of automatic transcriptions can be affected by multiple factors.
Technical issues include:
- Internet connection
- Microphone quality
- Recording software
- Recording hardware
- Downstream data processing
- Analogue to digital conversion
- File format conversion
- Limitations in model design
- Limitations in model training
Intelligibility of speech can be affected by:
- Recording environment noise
- Distance to speaker
- Speaker articulation
- Speaker accent
- Use of technical and industry-specific language
Measurement Models
There are various models for measuring the accuracy of automatic transcription.
Word Error Rate (WER) Model
The WER is the most common metric for speech recognition accuracy. It is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems.
The WER Model works by adding up the total number of errors in a transcription and dividing it by the total number of words in the source content, e.g. an audio file. The lower the WER, the more accurate the transcription is.
WER = (Insertions + Deletions + Substitutions) / Number of Words in Source Content
Errors include:
- Insertions: words in the transcription that don't appear in the source content.
- Deletions: words that don't appear in the transcription but are in the source content.
- Substitutions: words in the transcription that are incorrectly transcribed from the source content.
Strengths of the WER Model:
- The WER Model is widely used.
- A variety of test content has been evaluated against different ASR engines, so results can be compared.
Weaknesses:
- Every error is weighted the same, regardless of its impact on understanding.
- It gives no credit for partially correct words that a reader can still recognise.
- WER is not a true percentage: it has no upper bound, and a transcription with many insertions can score over 100%.
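The WER formula above can be sketched in a few lines of Python using the standard word-level Levenshtein alignment to count insertions, deletions and substitutions. This is a minimal illustration: the function name and whitespace tokenisation are assumptions, and production scoring tools also normalise punctuation and casing before comparing.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein distance over word tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words
print(wer("hi", "hello there everyone"))                 # 3.0, i.e. 300%
```

The second call shows the missing upper bound noted above: a one-word reference transcribed as three wrong words yields a WER of 300%.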
NER Model
NER is most often used to measure the accuracy of live captioning products, including as part of regulation. For example, the UK's communications regulator, Ofcom, regards 98% NER as the acceptable accuracy threshold for live TV subtitling.
The NER Model aims to refine the definition of accuracy by weighting errors based on their impact on comprehension. The higher the NER, the more accurate the transcription is.
NER = (Number of Words in Source Content - Edition Errors - Recognition Errors) / Number of Words in Source Content
Errors include:
- Edition errors: words that are in the source content but missing from the transcription, for example because a re-speaker omitted the information. Edition errors are weighted by their impact on the user's comprehension.
- Recognition errors: an incorrect word or words appear in the transcription, caused by mispronunciation or by the software failing to recognise the word. Recognition errors are weighted by their impact on the user's comprehension.
Strengths of the NER Model:
- Not all errors are treated equally; errors are weighted by their impact on the user's comprehension.
- Results of NER tests can be analysed to provide training and feedback to the captioners.
Weaknesses:
- Requires an accurate, verbatim transcript to compare to.
- Using the model may be labour intensive; for example, evaluating the quality of a captioned program can take 10 to 15 times the length of the program.
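Once a human assessor has classified and weighted the errors, the NER score itself is simple arithmetic. A minimal sketch, assuming error totals have already been severity-weighted (the function name and the example numbers are illustrative):

```python
def ner_score(n_words: int, edition_errors: float, recognition_errors: float) -> float:
    """NER accuracy. The error arguments are severity-weighted totals
    assigned by a human assessor, not raw error counts."""
    return (n_words - edition_errors - recognition_errors) / n_words

# Hypothetical assessment: 300 source words, weighted edition errors
# totalling 1.5 and weighted recognition errors totalling 2.0.
score = ner_score(300, 1.5, 2.0)
print(f"{score:.1%}")  # 98.8%, just above the 98% threshold cited above
```

Unlike WER, a higher NER score is better, and the score cannot exceed 100%.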
References
- Evaluating an automatic speech recognition service: Six steps for performing a transcription accuracy evaluation using the WER Model. Variations in calculating WER is also discussed.
- Challenges in Measuring Automatic Transcription Accuracy: The WER model and some of its issues.
- Does Word Error Rate Matter?
- The trouble with WER
- A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech: Comparison of automatic transcription services, factors affecting accuracy and relation to non-verbal behaviours.
- Automatic Speech Recognition Errors Detection and Correction: Factors affecting accuracy and WER strength and weaknesses.
- Accuracy Rate in Live Subtitling – the NER Model: Sample application of NER model on real-life subtitles.
- Measuring live subtitling quality: Publication by UK’s communications regulator, Ofcom on NER and live subtitling quality.
- Assessing the Quality of Live Captions – the NER Model: The NER model, automated captions and what’s next.
- Media Access Australia – Caption Quality: International approaches to standards and measurement of caption quality.