Introduction
With release 5.30, Transana upgraded to a new version of the Faster Whisper automated transcription tool. Transana currently provides 11 transcription model options for Faster Whisper, all of which claim to support about 100 languages. These models differ primarily in the data they were trained on, resulting in significant variation in both the accuracy of the transcripts produced and the processing time required.
To help Transana users determine which model might provide the best balance of accuracy and speed for their unique data sets, we decided to perform some tests that you can use for guidance.
We will discuss the results of our explorations in three blog posts. This post will focus on the accuracy of the transcription models. The next post will focus on the transcription speed of those models. The third post will look at speed and accuracy together to suggest a plan for approaching the automated transcription of real-life data using Faster Whisper in Transana 5.30 and later.
Methodology
We tested all Faster Whisper models using four media files. The speech in all the files used in formal testing was in English, although we looked less formally at some non-English data as well. We hope to publish a post with accuracy results using non-English media files soon.
Two of the four media files featured excellent audio quality and two distinct speakers, while the other two were videos recorded in a classroom environment with overlapping speech and greater distance between some of the participants and the microphone.
We tested these four files on five different computers, three Windows computers with different generations of NVidia graphics cards and two Macs with Apple processors.
Results – Accuracy of Faster Whisper models
To assess transcription accuracy, we created highly accurate and detailed “reference” transcripts for each data file using Transana and performed word-for-word comparisons. Accuracy results are reported as the percentage of words that match the reference transcript. A word could be identical in both transcripts, differ between the two transcripts, appear only in the source transcript, or appear only in the reference transcript.
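Transana's comparison tool handles this internally, but the idea behind word-for-word scoring can be sketched in a few lines. This is a minimal illustration, not Transana's actual implementation; the function name and the use of Python's `difflib` alignment are our assumptions.

```python
import difflib

def word_accuracy(reference: str, source: str) -> float:
    """Score a source transcript against a reference transcript.

    Each word falls into one of four cases: identical in both
    transcripts, differing between them, present only in the source,
    or present only in the reference. Accuracy is reported as the
    percentage of reference words matched exactly.
    """
    ref_words = reference.lower().split()
    src_words = source.lower().split()
    # Align the two word sequences and count exactly matching words.
    matcher = difflib.SequenceMatcher(None, ref_words, src_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref_words) if ref_words else 0.0

# One substituted word out of six: 5/6 of the reference words match.
print(round(word_accuracy("the cat sat on the mat",
                          "the cat sat on a mat"), 1))  # 83.3
```

A production comparison would also need to decide how to treat punctuation, numerals, and speaker labels before counting, which can shift scores by a few percentage points.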
Accuracy between files
Figure 1
Figure 1 shows the accuracy results for our four data files across the eleven Faster Whisper transcription models available in Transana. Accuracy results for a given model and file were identical across all five computers, regardless of CPU and GPU configuration.
Overall, as shown by the differences between the four lines, the better the audio quality, the better the automated transcription accuracy. We see that the Heather (blue) and Jeanine (red) files, which have excellent audio quality, produce better accuracy than the two classroom video files (magenta and green). Within a single model, the accuracy difference between files ranged from 17.8% to a whopping 80.2%.
The tiny model struggled with the two classroom files and the base model struggled with the longer classroom file. The large and large-v3 models struggled with the classroom files as well.
English-only vs. Multi-lingual Models
We note that all of our test data was in English. We also did a very quick test of a file in Spanish. I am not fluent in Spanish, so I could not create a reference file. The small model transcribed the 10-minute video as the word “¿Qué?” repeated 1031 times, which even I could tell was not correct. The distil-large-v3 model produced a transcript in English even though a Spanish transcript was requested. (I cannot speak to the accuracy of this translation.) The remaining transcripts were in Spanish, and a colleague who is a native speaker of Spanish said the transcripts looked reasonably accurate.
Accuracy across files
Table 1
| Model | N | Mean | Range | Minimum | Maximum |
|---|---|---|---|---|---|
| Distil-Large-v3 | 33 | 87.41 | 18.10 | 75.96 | 94.06 |
| Large-v2 | 33 | 87.40 | 21.45 | 75.58 | 97.03 |
| Large-v1 | 33 | 87.06 | 24.03 | 71.97 | 96.00 |
| Medium | 33 | 86.79 | 20.09 | 75.91 | 96.00 |
| Large-v3-Turbo | 33 | 86.45 | 19.39 | 73.75 | 93.14 |
| Turbo | 33 | 86.45 | 19.39 | 73.75 | 93.14 |
| Small | 33 | 86.02 | 17.75 | 75.39 | 93.14 |
| Base | 33 | 79.92 | 42.32 | 50.82 | 93.14 |
| Large | 33 | 73.86 | 46.43 | 45.80 | 92.23 |
| Large-v3 | 33 | 73.86 | 46.43 | 45.80 | 92.23 |
| Tiny | 33 | 60.08 | 80.18 | 13.94 | 94.12 |
Table 1 shows the average accuracy of all Faster Whisper models used in Transana across data files in decreasing order of average accuracy. The tiny model performed poorly on files with lower audio quality but actually worked quite well on the two files with the best audio quality. The base, large, and large-v3 models showed similar, if less extreme, tendencies.
Figure 2 presents the same data graphically, plotting accuracy scores for each model on each of our test files.
Figure 2
Conclusions
Transcription accuracy is the most important result for automated transcription models. At first glance, seven models performed almost identically on average. Four more models performed just as well when files had very good audio quality, but struggled to varying degrees with more challenging files.
Our quick test with Spanish data suggests that researchers working with non-English data will want to experiment a bit with different models to see what works best with their data.
It is important to note that transcription accuracy is only one factor that needs to be considered in evaluating the performance of Faster Whisper. The next blog post will look at transcription processing speed, which is more complicated and nuanced than the issue of accuracy.