A Support Request regarding Automated Transcription

A new Transana user sent a support request saying she was having difficulty creating automated transcripts using the Faster Whisper tool and its Large model.  She reported that Transana was freezing and asked for help troubleshooting the issue.  She had indicated that she was using a test file, so I asked her to share the file with me.

I loaded her media file into my own copy of Transana.  The first thing I noticed was that the waveform diagram for her data file looked like this:

This indicated that her audio signal was extremely weak, with low volume, the electronic equivalent of a whisper.  I suspected that an automated transcription tool would have a very difficult time processing this file because of that.

So the first thing I did was load the file into an audio editor called Audacity and apply the “Amplify” effect.  I saved the new audio file and loaded that into Transana.  The new waveform diagram looked like this:
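For readers who want a feel for what amplification does to a weak signal, here is a minimal sketch in Python.  It is not Audacity's implementation; it simply measures the peak level of some 16-bit samples and applies a fixed gain.  The synthetic test tone, the -3 dB target, and the function names `peak_dbfs` and `amplify` are all assumptions made for illustration.

```python
import math

FULL_SCALE = 32768  # magnitude of the largest 16-bit PCM sample


def peak_dbfs(samples):
    """Return the peak level in dB relative to full scale (0 dB = maximum)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # pure digital silence
    return 20 * math.log10(peak / FULL_SCALE)


def amplify(samples, gain_db):
    """Apply a gain in dB, clipping to the valid 16-bit range."""
    factor = 10 ** (gain_db / 20)
    lo, hi = -FULL_SCALE, FULL_SCALE - 1
    return [max(lo, min(hi, round(s * factor))) for s in samples]


# Hypothetical stand-in for a very quiet recording:
# one second of a 440 Hz tone peaking at about 1% of full scale.
quiet = [round(328 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000)]

before = peak_dbfs(quiet)
gain = -3.0 - before          # gain needed to bring the peak up to -3 dB
louder = amplify(quiet, gain)
after = peak_dbfs(louder)

print(f"peak before: {before:.1f} dB, gain applied: {gain:.1f} dB, peak after: {after:.1f} dB")
```

Amplifying this way raises any background noise by the same amount as the speech, which is one reason a recording made at a healthy level is still preferable to one boosted afterwards.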

Was this ideal?  Absolutely not.  I could have enhanced the audio quite a bit more.  However, I suspected that I did not need to.

The next step I took was to ask Transana to produce automated transcripts of both files using Faster Whisper.  I requested a transcript from each of Faster Whisper’s 7 quality models for both the original audio file and the amplified copy.

Unlike the individual who had requested my help, I did not experience any freezes or crashes during the automated transcription process.  Transana produced 14 automated transcripts for me, as requested.

However, I noticed that the transcripts of the original media file took longer than I expected.  The original media file was just under 16 minutes long, and Large model transcription was taking somewhere between 30 and 40 minutes.  (I was using Transana 5.20, so the transcription processing was taking place in the background while I did other things, and I was not paying close enough attention to the speed of transcription to get exact times.)  My suspicion is that the automated transcription process was struggling with the weak audio signal and, as a result, took a long time to process the file.  The person who submitted the support request was using Transana 5.11, where automated transcription runs as the primary process rather than as a background process, so I would guess that Transana appeared frozen to her while processing was in fact chugging along slowly.
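To put those rough numbers in perspective, here is a small back-of-the-envelope calculation in Python.  It assumes a file length of about 15 minutes 45 seconds (consistent with "just under 16 minutes") and uses my approximate 30 and 40 minute observations, so the results are estimates, not measurements.  The function name `real_time_factor` is just a label I chose for the ratio.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Ratio of processing time to audio length; above 1.0 is slower than real time."""
    return processing_minutes / audio_minutes


audio_minutes = 15.75  # assumed length: about 15 minutes 45 seconds

for processing_minutes in (30, 40):
    rtf = real_time_factor(processing_minutes, audio_minutes)
    print(f"{processing_minutes} min of processing is about {rtf:.1f}x the length of the audio")
```

In other words, the Large model was taking roughly two to two and a half times the length of the recording.  If the program's window does not respond for that whole time, it is easy to see why it would look frozen.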

Then I started looking at the transcripts.  The transcript for the original audio file was frankly terrible.  As I’d expected, the audio signal was simply too weak for Faster Whisper to “hear” and interpret.  We even see evidence of “hallucinating,” where the automated transcription program tries its best to interpret the audio but fails badly.

The amplified version of the audio file shows much better results, with only one slightly incorrect word in the section shown, where the program heard “white -clawed” instead of “white claw.”  The rest of the transcript matches the media file well.

Note that both transcript examples show Faster Whisper’s “Large” model, as that is what the person who contacted me wanted.

Finally, the speech in this file essentially ends at 12 minutes, about 3:45 before the file ends. The rest of the file is just silence, typing, and background noise. Not surprisingly, this portion of the automated transcript needed editing.

So when it comes to automated transcription, the quality of the audio file really matters. That’s not terribly surprising, is it?