Introduction
Not all automated transcription options are created equal.
There are several important issues to consider when choosing an approach to automated transcription. These include, in rough order of importance, data security issues, language(s) spoken in the data, transcription quality, processing speed, and cost. I will describe each issue below, and follow each discussion with a mention of how the issue is addressed in Transana.
Data Security and Confidentiality
Researchers who want to use automated transcription need to think carefully about the issues of data security, data privacy, and participant confidentiality. Automated transcription tools generally follow one of two models.
Server-based automated transcription tools require researchers to upload media files to a server they do not control, which processes the files and returns transcripts. The companies and organizations running these servers generally have carefully considered privacy policies (which you should read in detail before submitting data). Still, Institutional Review Boards (IRBs) and Ethics Panels can be hesitant to allow the use of such services.
Embedded automated transcription tools process media files on the researcher’s own computer. The computer loads a “language model” into its memory and does the processing necessary to create the automated transcript without the data ever leaving the computer. No data is shared externally. From a confidentiality perspective, this process is considered much lower risk in its potential for data exposure.
The process selected for automating transcription often needs to be approved by the Institutional Review Board (IRB) or other ethics panel that reviews research at the researcher’s institution. Some data is more sensitive than other data. For example, audio of participants discussing their drug use or sexual histories, or video of children or medical patients being interviewed, will likely face more data privacy scrutiny than television shows, YouTube videos, or other publicly available media. IRBs are likely to prefer embedded automated transcription tools for sensitive data.
Implementation in Transana
Starting with release 5.10, Transana offers three options for automated transcription. Faster Whisper is an internal, embedded automated transcription tool, while Speechmatics and Deepgram are external, server-based tools. With Faster Whisper, no data leaves the computer running Transana, maximizing data security and confidentiality.
Language(s) Spoken
Different automated transcription tools support different languages. Obviously, researchers need to choose an automated transcription tool that supports the language spoken in the data file they want transcribed.
Mixed-Language Data
As of March, 2024, I have not found an automated transcription tool that can successfully process media files with mixed-language data.
A slight digression
Many automated transcription tools provide a rating of their confidence for the transcription of each word they generate. I theorized that these confidence ratings would be higher for a word transcribed in a correctly identified language than for a word transcribed in an incorrectly identified language. We could, I theorized, create a transcript of mixed-language data by comparing an automated transcript in each of the languages used and simply picking the word in a given position with the higher level of transcription confidence.
This experiment failed spectacularly. The automated transcription tools I tested suggested incorrect words for misidentified languages with a shockingly high degree of confidence. They were no more confident transcribing correct words in a correctly identified language than incorrect words in an incorrectly identified language.
It’s too bad. This could have been such an elegant solution to a challenging problem.
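For readers who want to see the idea concretely, here is a minimal Python sketch of the word-by-word selection described above. It is an illustration only, not the code used in my experiment: the `Word` structure, the function name, and the confidence values are all hypothetical, and the sketch assumes the two transcripts are already aligned word for word, which real transcripts in two languages rarely are.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # tool-reported confidence, assumed to lie in [0, 1]


def merge_by_confidence(transcript_a: list[Word], transcript_b: list[Word]) -> list[str]:
    """Pick, position by position, the word with the higher confidence.

    Assumes both transcripts are aligned word for word (a simplification).
    Ties go to the first transcript.
    """
    merged = []
    for word_a, word_b in zip(transcript_a, transcript_b):
        best = word_a if word_a.confidence >= word_b.confidence else word_b
        merged.append(best.text)
    return merged


# Made-up values for illustration; not output from any real tool.
# Imagine a speaker saying "hello gracias".
english_pass = [Word("hello", 0.95), Word("grassy", 0.91)]
spanish_pass = [Word("helo", 0.90), Word("gracias", 0.89)]

print(merge_by_confidence(english_pass, spanish_pass))
```

With these invented numbers, the wrong English guess for the Spanish word wins because its confidence is just as high as a correct transcription would be. That is the pattern that made the approach unworkable in my tests.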
Implementation in Transana
The three automated transcription tools implemented in Transana 5.10 support about 100 languages. You can see a list of the languages supported by these tools on the Automated Transcription – Overview page in the Transana Tutorial.
An automated solution for mixed-language data remains elusive. Creating separate automated transcripts in each language and using Transana’s support for multiple simultaneous transcripts to manually edit out incorrect interpretations of incorrect languages is the best we can offer at the moment.
Transcription Quality
Different transcription tools provide wildly varying accuracy. Each automated transcription tool I’ve looked at offers multiple models one can select, with different balances between speed and accuracy.
Audio quality and audio characteristics of the data also make a big difference in transcription accuracy. I tested four media files with each of 12 language models from across 3 tools, and here is a summary of what I found:
File Type | Transcript Accuracy Range | Comments |
---|---|---|
Television Ad | 91% to 95% | Excellent audio, no interference. |
BBC News Story | 89% to 93% | Excellent audio, two UK news presenters. |
Movie | 65% to 90% | Excellent audio. Twelve main speakers, some background music, some overlapping speech. |
Classroom Data Sample | 57% to 80% | Good audio quality for field data. Many speakers, including children. Significant overlapping speech. |
Please see the Blog post Automated Transcription – Comparing Models for details.
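If you want to score your own test files, one common convention is to compare the automated transcript to a hand-corrected reference and report one minus the word error rate. The Python sketch below shows that calculation. It is an assumption about how accuracy could be measured, not a description of how the figures in the table above were produced, and the function name is hypothetical.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, where WER = word-level edit distance / reference length.

    Comparison is case-insensitive and splits on whitespace only; punctuation
    handling is left out of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    errors = previous[-1]
    # Clamp at zero: many inserted words can push WER above 100%.
    return max(0.0, 1.0 - errors / len(ref))


score = word_accuracy("the cat sat on the mat", "the cat sat on a mat")
print(f"{score:.1%}")
```

In this example one word out of six is wrong, so the score is five sixths.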
Implementation in Transana
Transana offers three different automated transcription tools. Faster Whisper offers seven language models, Speechmatics offers two, and Deepgram offers three. This flexibility lets researchers find the right balance between transcription quality and processing speed for the type of data they bring.
Figure 1
My tests suggest there are three tiers of accuracy in the tools and models available in Transana.
Tier | Tool and Model | Average accuracy |
---|---|---|
Highest | Speechmatics – Enhanced | 89.5% |
Highest | Faster Whisper – Large-v2 | 88.9% |
Highest | Faster Whisper – Large | 88.1% |
Highest | Faster Whisper – Large-v1 | 88.0% |
Highest | Speechmatics – Standard | 87.3% |
Highest | Faster Whisper – Medium | 87.0% |
Middle | Faster Whisper – Small | 84.0% |
Middle | Deepgram – Nova | 83.2% |
Middle | Faster Whisper – Tiny | 83.1% |
Lowest | Faster Whisper – Base | 80.25% |
Lowest | Deepgram – Enhanced | 80.2% |
Lowest | Deepgram – Base | 78.4% |
Processing Speed
The amount of time it takes to process a media file and generate a written transcript varies widely between tools and between models within each tool. I can only speak here about tools I’ve used extensively, but other tools can be assessed in similar terms.
The Deepgram service is extremely fast. It can turn around a 90-minute movie file in under a minute. However, this impressive speed comes at the price of accuracy. Deepgram’s models ranked 8th, 11th, and 12th in accuracy among the 12 tool and model combinations I tested. This service particularly struggled with the classroom data sample, never breaking the 70% accuracy level. The Deepgram model that performed best, called Nova, only supports English and Spanish transcription.
The Speechmatics service was also reasonably fast, returning transcripts in less time than it would take to watch the media submitted. Processing of an hour-long data file might take between 20 and 45 minutes, with the Standard model running somewhat faster than the Enhanced model. Speechmatics’ Enhanced model provided the highest transcription quality across my four test files, while the Standard model came in a strong fifth of twelve.
The story is more complicated for the Faster Whisper system. The Faster Whisper system runs on the user’s computer, not on a server that is hardware-optimized for automated transcription. While this is optimal from a data security standpoint, it means that Faster Whisper is generally slower. However, that depends very heavily on the language model selected and the computer performing the transcription.
Implementation in Transana
Deepgram is far and away the fastest choice available in Transana. Faster Whisper’s Tiny, Base, and Small models also work pretty quickly. However, all of these options sacrifice quality in the name of speed.
Speechmatics and Faster Whisper’s Medium model offer moderate processing speeds. Quality is pretty good for these options.
Faster Whisper’s Large-v2, Large, and Large-v1 models take longer than other options, at least on my computers. This speed issue can be addressed with a new function (to be released in the second quarter of 2024) that allows Transana to process automated transcripts in the background while you work on other analyses in the foreground. While the automated transcription is not any faster, you can continue other tasks in Transana while other transcripts are processing.
Figure 2
This graph shows Transcription Accuracy by Transcription Time for the different Faster Whisper language models on four different computers. All tests were run on the same 25-minute media file containing challenging classroom data. In these tests:
- The Tiny, Base, and Small models tend to perform similarly across computers.
- The Medium model clusters too, except for being significantly slower on an 8-year-old Windows computer.
- The Large and Large-v2 models tended to perform similarly to each other on each computer, but processing speed varied a lot between computers.
- The Large-v1 model’s processing speed varied widely between computers, performing better on two older Windows computers than it did on two newer macOS computers. On the Mac, it was considerably slower than other models.
Cost
Cost is a significant consideration for some projects, particularly those with a lot of data. Some systems are accessible for free, while costs can vary greatly between paid tools.
Implementation in Transana
In Transana, the Faster Whisper tool is available for free. Speechmatics and Deepgram charge for the use of their automated transcription software. Transana integrates each of these services, but researchers need to create an account with these companies to use their tools. Transana does not receive any revenue from either company.
Final Thoughts
The range of options for automated transcription provides researchers the opportunity to choose an optimal combination of data privacy, speed, accuracy and cost to meet their needs.
If a project involves sensitive human-subjects data, the researchers will want to use an embedded system. Similarly, if a project does not have a sufficient budget for a paid automated transcription tool, they will want to use a free system.
From there, it is a matter of finding the right balance between speed and accuracy from the available choices.
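That two-step reasoning (apply the hard constraints, then weigh what remains) can be sketched in a few lines of Python. The accuracy figures below are the averages from the tier table in this post, and the embedded and free flags follow the descriptions above. The `Option` class and `shortlist` function are hypothetical names for illustration, and the sketch ranks by accuracy only, leaving speed out because it depends so heavily on the computer doing the work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    tool: str
    model: str
    accuracy: float  # average accuracy (%) from the tier table in this post
    embedded: bool   # True if processing stays on the researcher's computer
    free: bool


OPTIONS = [
    Option("Speechmatics", "Enhanced", 89.5, False, False),
    Option("Faster Whisper", "Large-v2", 88.9, True, True),
    Option("Faster Whisper", "Large", 88.1, True, True),
    Option("Faster Whisper", "Large-v1", 88.0, True, True),
    Option("Speechmatics", "Standard", 87.3, False, False),
    Option("Faster Whisper", "Medium", 87.0, True, True),
    Option("Faster Whisper", "Small", 84.0, True, True),
    Option("Deepgram", "Nova", 83.2, False, False),
    Option("Faster Whisper", "Tiny", 83.1, True, True),
    Option("Faster Whisper", "Base", 80.25, True, True),
    Option("Deepgram", "Enhanced", 80.2, False, False),
    Option("Deepgram", "Base", 78.4, False, False),
]


def shortlist(options, need_embedded=False, need_free=False):
    """Apply the hard constraints first, then rank what is left by accuracy."""
    allowed = [
        o for o in options
        if (o.embedded or not need_embedded) and (o.free or not need_free)
    ]
    return sorted(allowed, key=lambda o: o.accuracy, reverse=True)


# Example: a project with sensitive data that must stay on the local computer.
for option in shortlist(OPTIONS, need_embedded=True)[:3]:
    print(f"{option.tool} {option.model}: {option.accuracy}%")
```

A real decision would also weigh language support and processing time on your own hardware, which this sketch deliberately omits.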
If server-based, paid systems are an option, Figure 1 shows that, at least within Transana, Speechmatics offers good accuracy, although characteristics of the data file also affect accuracy. The results for Faster Whisper’s Small, Medium, and Large models show that embedded, free systems, while often somewhat slower, offer solid options too.
Projects that require an embedded system or a free option may wish to study Figure 2. This shows the interplay between language models and the computer hardware running the model in a way that informs the decision of how to proceed. Researchers who are not using Transana may wish to collect a little data for the options they have available to enable similar data-driven decision-making.