Accuracy

Accuracy of automatic transcriptions can be affected by multiple factors.

Technical issues include:

  • Internet connection
  • Microphone quality
  • Recording software
  • Recording hardware
  • Downstream data processing
    • Analogue-to-digital conversion
    • File format conversion
  • Limitations in model design
  • Limitations in model training

Intelligibility of speech can be affected by:

  • Recording environment noise
  • Distance to speaker
  • Speaker articulation
  • Speaker accent
  • Use of technical and industry-specific language

Measurement Models

There are various models for measuring the accuracy of automatic transcription.

Word Error Rate (WER) Model

The WER is the most common metric for speech recognition accuracy. It is recommended by the US National Institute of Standards and Technology (NIST) for evaluating the performance of automatic speech recognition (ASR) systems.

The WER Model works by adding up the total number of errors in a transcription and dividing it by the total number of words in the source content, e.g. an audio file. The lower the WER, the more accurate the transcription is.

WER = (Insertions + Deletions + Substitutions) / Number of Words in Source Content

Errors include:

  • Insertions: words in the transcription that don't appear in the source content.
  • Deletions: words that don't appear in the transcription but are in the source content.
  • Substitutions: words from the source content that are replaced with different words in the transcription.
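
To make the calculation concrete, here is a minimal Python sketch (an illustration, not taken from any particular toolkit) that computes the WER using the standard word-level edit-distance alignment, which finds the minimum number of insertions, deletions and substitutions between the source and the transcription:

  def wer(reference, hypothesis):
      """Word Error Rate via word-level Levenshtein alignment.

      reference  -- list of words in the source content
      hypothesis -- list of words in the automatic transcription
      """
      # d[i][j] = minimum edits to turn the first i reference words
      # into the first j hypothesis words.
      d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
      for i in range(len(reference) + 1):
          d[i][0] = i                      # i deletions
      for j in range(len(hypothesis) + 1):
          d[0][j] = j                      # j insertions
      for i in range(1, len(reference) + 1):
          for j in range(1, len(hypothesis) + 1):
              cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
              d[i][j] = min(
                  d[i - 1][j] + 1,         # deletion
                  d[i][j - 1] + 1,         # insertion
                  d[i - 1][j - 1] + cost,  # substitution (or match)
              )
      return d[len(reference)][len(hypothesis)] / len(reference)

  # One substitution ("quick" -> "kick") in a four-word source: WER = 0.25.
  print(wer("the quick brown fox".split(), "the kick brown fox".split()))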

Strengths of the WER Model:

  • The WER Model is widely used.
  • A variety of test content has already been evaluated against different ASR engines, so results can be compared.

Weaknesses:

  • Every error is weighted the same, regardless of its impact on understanding.
  • It gives no credit for near-miss errors that a reader can still recognise and understand.
  • WER is not a true percentage: it has no upper bound, and can exceed 100% when a transcription contains many insertions (for example, a 10-word source transcribed with 10 substitutions and 3 insertions gives a WER of 130%).

NER Model

The NER (Number, Edition errors, Recognition errors) Model is most often used to measure the accuracy of live captioning products, including as part of regulation. For example, the UK’s communications regulator, Ofcom, regards an NER score of 98% as the acceptable accuracy threshold for live subtitling of TV broadcasts.

The NER Model aims to refine the definition of accuracy by weighting errors based on their impact on comprehension. The higher the NER, the more accurate the transcription is.

NER = (Number of Words in Source Content – Edition Errors – Recognition Errors) / Number of Words in Source Content

Errors include:

  • Edition errors: words that appear in the source content but not in the transcription, for example because a re-speaker omitted the information. Edition errors are weighted based on their impact on the user’s comprehension.
  • Recognition errors: incorrect words that appear in the transcription, caused by mispronunciation or by the software failing to recognise a word. Recognition errors are weighted based on their impact on the user’s comprehension.
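
Because the weighting of each error is a human judgement, an implementation can only automate the final arithmetic. A minimal Python sketch, assuming the weighted error scores have already been assigned by an evaluator (in the NER Model, errors are commonly scored as serious = 1, standard = 0.5 or minor = 0.25):

  def ner(word_count, edition_errors, recognition_errors):
      """NER accuracy score.

      word_count         -- number of words in the source content
      edition_errors     -- sum of the weighted edition error scores
      recognition_errors -- sum of the weighted recognition error scores
      """
      return (word_count - edition_errors - recognition_errors) / word_count

  # 200 words, one standard edition error (0.5) and two minor
  # recognition errors (0.25 each): (200 - 0.5 - 0.5) / 200 = 0.995,
  # i.e. 99.5% accuracy, just above Ofcom's 98% threshold.
  print(ner(200, 0.5, 0.5))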

Strengths of the NER Model:

  • Not all errors are treated equally: each error is weighted based on its impact on the user’s comprehension.
  • Results of NER tests can be analysed to provide training and feedback to the captioners.

Weaknesses:

  • Requires an accurate, verbatim transcript to compare to.
  • Using the model can be labour-intensive; for example, evaluating the quality of a captioned programme can take 10 to 15 times the programme's duration.
