Introduction
In the two previous blog posts, we looked at the accuracy of transcripts of four English-language media files using the eleven available Faster Whisper transcription models. Two of the four files featured excellent audio quality and two distinct speakers, while the other two were recorded in a classroom environment with overlapping speech and greater distance between some of the participants and the microphone. We also experimented with making a transcript of a Spanish-language audio file with each model. We then examined the relative speed with which each model produces a transcript, using multiple computer processors. This post examines how those two factors, accuracy and speed, intersect in the automated transcription process.
Results – Faster Whisper Accuracy and Speed
Figure 1
In Figure 1, the horizontal axis represents transcription processing time in seconds, with faster models on the left and slower models on the right. The vertical axis represents transcription accuracy, with higher accuracy at the top of the graph. This graph makes the relationship between transcription accuracy and transcription speed clear for our set of four files.
Let’s look at these models more closely.
Processing speeds for each file are averaged across multiple computers with different hardware configurations. As described in the previous blog post, even the slowest models run faster with a compatible, properly configured graphics processing unit (GPU) than most of the models run without one. If you have a Windows computer with a supported video card, it’s worth the time to get it set up correctly.
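Transana handles this device selection internally, but if you experiment with the underlying faster-whisper Python library directly, the GPU/CPU choice comes down to the device and compute_type arguments. Here is a minimal sketch, assuming the faster-whisper package is installed and using an illustrative file name:

```python
from faster_whisper import WhisperModel

# Try the GPU first (requires a supported NVIDIA card plus the CUDA/cuDNN
# runtime libraries); fall back to a quantized CPU model if loading fails.
try:
    model = WhisperModel("small", device="cuda", compute_type="float16")
except Exception:
    model = WhisperModel("small", device="cpu", compute_type="int8")

# "interview.mp4" is an illustrative file name; substitute your own media file.
segments, info = model.transcribe("interview.mp4")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```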
The tiny model is clearly the fastest, but it has problems with accuracy, especially with the two classroom data files, where its transcripts were less than 20% accurate. The base model is also fast but had accuracy problems with the classroom data.
To the right, the large-v3 and large models form a pair that is both very slow and not particularly accurate, especially with the classroom data. Above them, the large-v2 and large-v1 models form a cluster that might best be described as slow but accurate.
Finally, we see a cluster in the upper left corner of the scatterplot showing models that are both relatively fast and comparatively accurate. For English, the distil-large-v3 model shows the best combination of accuracy and transcription speed, followed closely by the small, turbo, and large-v3-turbo models. The medium model appears close to this group, but takes twice as long to process files as the others.
Other factors to consider
The Faster Whisper models do not all behave consistently, which affects the settings you may need for your data.
The distil-large-v3 model claims internally to be multilingual, but in practice it creates an English translation when presented with a non-English data file, regardless of settings. It is unable to produce a source-language transcript.
The turbo and large-v3-turbo models work well with non-English data, but are unable to create English translations of that data when translation is requested.
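For those scripting the underlying faster-whisper library directly, these two behaviors correspond to the language and task parameters of the transcribe() call. A minimal sketch, using illustrative model and file names:

```python
from faster_whisper import WhisperModel

# Illustrative model name; substitute the model you are testing.
model = WhisperModel("large-v3-turbo", device="auto")

# task="transcribe" requests a transcript in the source language.
segments, info = model.transcribe("entrevista.mp3", language="es", task="transcribe")

# task="translate" requests an English translation instead. As noted above,
# the turbo variants did not honor this request in my tests, while
# distil-large-v3 produced English regardless of these settings.
segments, info = model.transcribe("entrevista.mp3", language="es", task="translate")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```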
I am currently experimenting with non-English media files. My initial impression is that some models work better with some languages than others. I anticipate writing a blog post in the near future about Transana’s Faster Whisper support for non-English languages, which will provide guidance about which models work best with the different languages I’ve looked at.
Using Faster Whisper
Faster Whisper is a fast-moving automated transcription tool under active development. Evaluations such as this series of blog posts are just a snapshot in time of that tool’s performance. Since I last examined the intersection of accuracy and speed in automated transcription, less than 18 months ago, the landscape has changed significantly. I have also changed my own automated transcription habits, facilitated in no small part by the upgrades first released in Transana 5.30.
The conclusions drawn in this series of posts are a snapshot of performance at one moment in time, with one set of data files reflecting a particular set of data collection conditions. I would not recommend using these conclusions to settle on one particular model for your own work.
Instead, I would recommend an experimental approach to automated transcription. Identify a “typical” data file from your data set and create a roughly 5- to 10-minute sample from it. If you have several different types of data, create several sample files. Run each sample through all of the models Transana supports. If you have several computers you might use for transcription, run the test on each of them.
If you have a Windows computer with a supported GPU, use it; then you can focus your experimentation mostly on transcription quality rather than processing time. If you don’t have a GPU, pay attention to processing time, because it makes a practical difference, but remember that transcription accuracy also affects how efficiently you can get your media data ready for analysis.
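If you would rather automate this comparison outside of Transana, a short script against the faster-whisper library can loop over the models and time each one. A minimal sketch, assuming you have already cut a sample clip (for example with ffmpeg -i interview.mp4 -ss 0 -t 600 -c copy sample.mp4) and adjusting the model list to taste:

```python
import time

from faster_whisper import WhisperModel

# Adjust this list to the models you want to compare.
MODELS = ["tiny", "base", "small", "medium", "distil-large-v3",
          "large-v1", "large-v2", "large-v3"]

for name in MODELS:
    model = WhisperModel(name, device="auto")
    start = time.perf_counter()
    segments, info = model.transcribe("sample.mp4")
    # segments is a lazy generator: decoding happens as we iterate,
    # so we must consume it before stopping the clock.
    text = " ".join(segment.text for segment in segments)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f} seconds")
    with open(f"transcript_{name}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```

Comparing the saved transcripts against a short hand-corrected reference for the same sample gives you the accuracy side of the trade-off; the printed times give you the speed side.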
See what works best for you, with your data, with your computers. I hope you will send me a message through the Contact form letting me know what you find.