Introduction

As the author of transcription and analysis software, I’ve been evaluating tools for automating the transcription of video data for a long time. For two decades, it took me longer to format and correct automated transcripts for analysis than it would have taken to create those transcripts from scratch.

As of 2023, researchers can finally create AI-assisted automated transcripts and edit them into analysis-ready form faster than they could transcribe manually. In this blog post, I’ll compare the quality of the three automated transcription options I’ve implemented in Transana, along with the numerous models available within each option.

Data Files

For the purpose of quality assessment, I selected four media files to transcribe and compare.

  1. a 30-second commercial with a single speaker and good audio,
  2. a 3-minute BBC News piece with two speakers and good audio,
  3. a classic movie with twelve main speakers and a couple of minor characters, with good audio and a few instances of overlapping speech, and
  4. a half-hour sample of classroom data with multiple speakers including children, varying audio quality and strength, and some incidents of overlapping speech.

All samples are in English.

Options and Models

I started by transcribing each of these files by hand using Transana, creating high-quality transcripts with a high level of accuracy. These manual transcripts serve as the baseline reference to which all automated transcripts are compared.

Transana supports three options for Automated Transcription, and each of the automated transcription options offers several different models.

The Speechmatics option was introduced in Transana 5.00 in May, 2023.  It offers models called “standard” and “enhanced.” Speechmatics is a server-based, commercial option that supports transcription in 48 languages.

The Deepgram option was introduced in Transana 5.03 in September, 2023. It offers a “base” model, an “enhanced” model, and a model they call “nova.” Deepgram is a server-based, commercial option that supports transcription in 20 languages in the “base” model, 17 languages in the “enhanced” model, and 2 languages in the “nova” model, providing support for 21 languages overall.

Faster Whisper, which will be available in Transana 5.10 (to be released in January, 2024), offers models called “tiny,” “base,” “small,” “medium,” “large,” “large-v1,” and “large-v2.” Faster Whisper is an embedded, open-source option based on OpenAI’s Whisper model, supporting transcription in 98 languages and automated translation from 97 non-English languages into English. As a general rule, the models take longer and become more accurate as you move up the levels, with the exception that large-v2 is often slightly faster than large-v1 in my experience.
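For readers curious what this looks like in code, here is a minimal sketch of loading and running one of these models with the open-source faster-whisper Python package (which uses the same model names listed above). The file name and parameter values are purely illustrative, not Transana’s actual settings:

    # Minimal sketch using the open-source faster-whisper package.
    # File name, device, and compute_type are illustrative choices.
    from faster_whisper import WhisperModel

    # Model names: "tiny", "base", "small", "medium", "large",
    # "large-v1", "large-v2" -- larger is slower but more accurate.
    model = WhisperModel("large-v2", device="cpu", compute_type="int8")

    # task="translate" would translate non-English speech into English.
    segments, info = model.transcribe("interview.mp4", beam_size=5, task="transcribe")

    print(f"Detected language: {info.language} ({info.language_probability:.0%})")
    for seg in segments:  # a lazy generator; iterating performs the transcription
        print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")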

There are many important differences between these three options, the major ones being supported languages (as already described), cost, transcription speed, transcription quality, and data security. Faster Whisper transcription is available at no charge, while Speechmatics and Deepgram charge for their services, with Deepgram being the more affordable of the two. Deepgram is extremely fast, Speechmatics takes longer, and Faster Whisper varies considerably depending on which model is selected but generally takes longer than the other options to achieve comparable levels of quality.

Transcription quality, however, is critical. To assess transcript quality, I generated automated transcripts of the four test files using each model from each of the automated transcription options available.  I exported all of these transcripts to plain text files, and stripped out formatting and punctuation.  I then compared each of the automatically-generated files to the manually-transcribed version.  I determined the number of words that were identical, the number of words that were different, the number of words that were missing from the automated file, and the number of words that were added. From this, I calculated a percentage accuracy score for each automated transcript.
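For those who want to replicate this comparison, the counting can be done with a standard sequence alignment over the two word lists. Here is a simplified sketch in Python using the standard library’s difflib; the normalization and counting rules shown are illustrative rather than the exact procedure I used:

    import difflib
    import re

    def word_accuracy(reference: str, hypothesis: str) -> float:
        """Percent of manual-transcript words reproduced by the automated transcript."""
        # Strip punctuation and case before comparing, as described above.
        ref = re.sub(r"[^\w\s']", "", reference.lower()).split()
        hyp = re.sub(r"[^\w\s']", "", hypothesis.lower()).split()
        correct = different = missing = added = 0
        sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                correct += i2 - i1            # identical words
            elif tag == "replace":
                different += i2 - i1          # words transcribed differently
                added += max(0, (j2 - j1) - (i2 - i1))
            elif tag == "delete":
                missing += i2 - i1            # words the automated transcript dropped
            elif tag == "insert":
                added += j2 - j1              # words the automated transcript invented
        return 100.0 * correct / len(ref)

As a sanity check against the movie results reported below: 13,261 correct words against a 14,768-word manual transcript works out to 100 × 13,261 / 14,768 ≈ 89.8%.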

Figure 1. Transcription accuracy by option and model across the four test files.

Results

The results of this analysis are presented in Figure 1.  The automated transcription models in this graph are organized by average transcription accuracy across the four test files from lowest accuracy to highest.

This graph shows several things. First, Automated Transcription performs very differently for different types of data across all options and models. This is not particularly surprising, and the details are quite interesting.

The 30-second ad video was the best-case scenario for Automated Transcription. The data involved a single speaker with excellent audio quality. Accuracy ranged from a low of 91.5% of the words in the ad transcribed correctly to a high of 95.3%. Since the ad was short, at only 106 words, this represents a range of only 4 words from the worst to the best results. Eight of our twelve transcription models got either 100 or 101 words correct for this media file.

The BBC interview, at 3 minutes 20 seconds in length, was slightly more challenging for Automated Transcription programs. This audio file contained two speakers taking clear turns, again with excellent audio quality. Accuracy ranged from 89.2% up to 92.6% across our tests, a range of 21 words across the 618-word interview. Speechmatics – Standard and Speechmatics – Enhanced performed best, with Deepgram – Nova and Faster Whisper’s Large and Large-v2 models coming in close behind.

The movie selection, the 1957 classic 12 Angry Men, was 1 hour 31 minutes in length. It featured twelve primary speakers, most of them reasonably distinct and speaking standard American English. Having more speakers makes the automated transcription task more challenging. The audio quality was quite good and speakers took clear turns, but there were a couple of brief periods of overlapping speech, and this file introduced background music. Transcription quality ranged from 65.3% accuracy up to 89.8%, with seven of twelve transcription models breaking the 85% accuracy threshold. Some models attempted to transcribe background music as words, which hurt their scores. Faster Whisper – Large-v2 and Speechmatics – Enhanced performed best, with 13,261 and 13,148 correct words respectively out of the 14,768 words in the comparison manual transcript. With a transcript of this size, even minor differences in accuracy can lead to significant differences in the time it takes a transcriber to make corrections.

(Note: The Faster Whisper – Large-v1 model failed this test spectacularly on its first attempt, getting no words correct. I re-ran the test and it performed quite a bit better. I am still unsure what caused this glitch. My informal experience is that a couple of the Faster Whisper models can get “stuck” on a phrase on rare occasions, while the commercial tools have not shown this issue so far.)

While processing time is not a focus of this analysis, it is worth making a few observations on that topic here. The three models provided by Deepgram produced nearly instantaneous transcripts from this data file, taking less than 30 seconds each for the full submission, transcription, and return process on this 91-minute media file. These were also three of the five lowest-quality transcripts. The two models provided by Speechmatics took about 15 to 20 minutes to process this file. The models from Faster Whisper varied tremendously in processing time, with the three large models taking between 2.5 and 4 hours to complete the process. Please note that Faster Whisper processing time is affected significantly by the power of the computer running it, while Deepgram and Speechmatics run their transcription software on their own servers and thus are not affected by the power of the user’s processor.
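If you want to measure the local case yourself, timing a Faster Whisper run is straightforward. Here is a rough sketch, again with illustrative file and model names; note that faster-whisper returns a lazy generator, so the clock must include consuming it:

    import time
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cpu", compute_type="int8")

    start = time.perf_counter()
    segments, info = model.transcribe("movie.mp4")
    text = " ".join(seg.text for seg in segments)  # consuming the generator does the real work
    elapsed = time.perf_counter() - start

    # info.duration is the media length in seconds.
    print(f"{info.duration / 60:.0f} min of audio in {elapsed / 60:.1f} min "
          f"({info.duration / elapsed:.2f}x real time)")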

The 30-minute classroom data file was expected to be the most difficult for Automated Transcription. The data included a number of distinct voices, had significant sections with overlapping speech, and audio quality on the students was somewhat less consistent than the professional recordings of the other three files. Automated Transcription quality ranged from a low of 57.6% to a high of 79.8%, achieved by Speechmatics – Enhanced at 3391 of 4250 words correct, with Faster Whisper – Large and Large-v2 just behind at 3382 of 4250 words correct (79.6%) each. None of the tools handled overlapping speech well.

Looking just at the quality of transcription, the clear winners are, in order, Speechmatics – Enhanced, Faster Whisper – Large-v2, Faster Whisper – Large, and Faster Whisper – Large-v1. Speechmatics – Standard and Faster Whisper – Medium trailed by about 1%, which is reasonably trivial on files of 15 minutes or less but can be significant with longer files. Finally, the Faster Whisper – Tiny model did remarkably well on files with professional-quality audio, but could not handle the more challenging audio of the classroom data.

Factors in Choosing a Tool and Model

There are a number of factors that come into play when selecting a method for automated transcription, including language support, data privacy concerns, cost, transcription speed, and quality of the transcription. 

Different automated transcription options support transcription in different languages.  Each of Deepgram’s models supports a different list of languages, with the Enhanced and Nova options supporting only a few of the most common languages.  Speechmatics supports a broad list of languages, and Faster Whisper supports the largest list as well as offering a translation option.  Your options may be limited, depending on the support for the language you are transcribing.

Data privacy issues must be considered next. Some data is extremely sensitive. If your project involves sensitive human-subjects data, you either need to process your automated transcripts on your own computer (using Faster Whisper) or you need approval from your IRB or ethics board to submit the data to the Speechmatics or Deepgram servers.

Cost is the next consideration. Faster Whisper is free, but generally slow compared to the commercial systems. Deepgram is generally cheaper than Speechmatics, but often produces lower-quality transcripts. Different models from these companies have different costs, so check their websites before you start processing large amounts of data.

Transcription speed is another practical consideration when deciding which option and model to use. Deepgram is lightning fast, but sacrifices quality in the process. Speechmatics’ processing speed depends on file length, but it is generally considerably faster than the Faster Whisper models that produce similar levels of quality. Faster Whisper transcription speed varies considerably by model, with higher-quality transcription taking longer. For data with good audio quality, the Faster Whisper – Medium model can be a good balance of speed and quality, but this model struggled with the classroom data. The three large Faster Whisper models took considerably longer than real time to process the 91-minute movie file.

Conclusions

While that seems like a lot of factors to balance, I think it is actually reasonably manageable, and this is how I am likely to proceed:

If I have small files with good audio quality, I’ll probably use Faster Whisper – Large-v2 or Faster Whisper – Large.  The transcription quality is reasonably good and processing time on small files is manageable.

If I have a large number of long files, or if I have real-world data with less-than-great audio quality, the decision is a little more complicated. If I have IRB approval and non-sensitive data, and if I have a budget to pay for automated transcription, I’d choose Speechmatics – Enhanced because of the faster processing time that option offers. If the data is too sensitive or the budget too tight, I’ll stick with Faster Whisper – Large-v2 or Large and solve the processing-time issue by letting the program work in the background and while I’m away from my desk. I’ve also been known to dedicate multiple computers to a task like this, since I have those resources.

Finally, if I am involved in a project with very tight deadlines, I’d probably use Faster Whisper – Medium for sensitive data and I’d consider Deepgram – Nova if the data could be sent to an external server and I had an appropriate budget.
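To make that heuristic explicit, here is my rule of thumb written out as a small Python function. This is purely illustrative; the inputs and model names are just labels for the trade-offs discussed above, not anything Transana exposes:

    def pick_model(sensitive: bool, has_budget: bool,
                   long_or_many_files: bool, tight_deadline: bool) -> str:
        """My rough rule of thumb from this post -- illustrative only."""
        if tight_deadline:
            if sensitive or not has_budget:
                return "Faster Whisper - Medium"   # local and quick; fine on clean audio
            return "Deepgram - Nova"               # near-instant, server-based
        if long_or_many_files and has_budget and not sensitive:
            return "Speechmatics - Enhanced"       # top quality, much faster turnaround
        return "Faster Whisper - Large-v2"         # free and local; run it in the background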

Of course, it’s important to acknowledge that this is a snapshot of data from a very fast-moving field.  One of the advantages of using commercial servers like Speechmatics and Deepgram is that they can update their servers and provide an improved product at any time.  It is not clear to me yet what updating the Faster Whisper transcription models might entail. 

There are also many other providers that were not considered for this analysis.  Otter.ai, for example, was excluded because they do not offer an interface that allows integration with Transana, making the transcription process awkward and time coding much less accurate.  Only data in English was tested, although I’m open to exploring other languages as long as someone can provide me with a media file and manual “reference” transcript.  And I have limited access to real-world data collected in the field under less-than-ideal conditions where I have permission to submit the data for external transcription.  I’m sure I will be revisiting this topic in the future.