My first thought as I watched the first U.S. Presidential debate of 2020 was “Oh, those poor transcribers.”  With constant interruptions and so much overlapping speech, I knew that representing this contentious debate would strain a single transcript nearly to the breaking point.

I also knew how I would handle this data in my ongoing dataset of U.S. Presidential debates.  I would not even try to create a single transcript to represent what occurred on the debate stage.  Instead, I would give each candidate and the moderator their own transcript, and use time code placement to clearly highlight overlaps as the video plays.  This approach makes the act of transcription easier, and it makes the most difficult passages in the data far easier to follow.

The way I see it, the function of a transcript is to facilitate the understanding and analysis of video and audio data.  It’s not a substitute or replacement for your media data, and it’s not the data you should be analyzing.  It’s a tool for making sense of media data when it becomes overwhelming, as it does a number of times in this debate, and it serves as a map to the data, helping you find the passage or segment in the media you want to look at.  When watching the debate in Transana, I can see what each participant is saying at any moment by listening while following that participant’s transcript.  When I want to find where someone made a particular comment, I can do a text search without worrying about how the representation of overlapping speech might obscure what each participant said.
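To make that “map” idea concrete, here is a small illustrative sketch in Python.  It is not Transana’s internal format and the sample data is made up; it simply shows how keeping a separate time-coded transcript for each speaker keeps text search simple, because every hit comes back with a speaker and a time code you can use to jump to that moment in the video.

```python
# Illustrative sketch only: hypothetical data, not Transana internals.
# Each speaker gets their own transcript: a list of
# (start_seconds, end_seconds, text) segments.
transcripts = {
    "Wallace": [
        (0.0, 6.5, "Gentlemen, a lot of people have been waiting for this night."),
        (6.5, 12.0, "The first subject is the Supreme Court."),
    ],
    "Candidate A": [(10.2, 15.8, "Thank you very much, Chris.")],
    "Candidate B": [(11.0, 14.5, "Are we going to get started?")],
}

def search(transcripts, phrase):
    """Yield (speaker, start_time, text) for every segment containing the phrase."""
    phrase = phrase.lower()
    for speaker, segments in transcripts.items():
        for start, end, text in segments:
            if phrase in text.lower():
                yield speaker, start, text

for speaker, start, text in search(transcripts, "supreme court"):
    print(f"{speaker} at {start:.1f}s: {text}")
```

Because each speaker’s words live in their own transcript, the search never has to untangle two voices that were represented on the same line.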

Here is a brief demonstration video of the moderator, Chris Wallace, trying to ask a question early on in the debate.

A few tricks I found made things easier.

First, psychologists have long recognized the “cocktail party effect,” where people are able to hear their own name jump out in the murmur of overlapping conversations.  It turns out that people are often pretty good at focusing on one voice at a time during overlapping speech.  I took advantage of this by transcribing one individual’s contributions to a segment at a time.  I’d transcribe one person until there was a clear break, then do the same passage for the second person.  And so on.

Normally, I time-code at the beginning of each person’s turn, at every subject change within a turn, and every two or three sentences otherwise to keep the chunks manageable.  I normally do not time-code the end of a segment because the time code that starts the next segment is good enough.  But in this case, I wanted the overlaps and interruptions to jump out more prominently.  So I time-coded the start of each utterance, every few sentences within an utterance (you can’t really call them “turns” when both people are talking at the same time!), and the end of each utterance, where the person stopped talking for a noticeable period of time.  This produces exactly the visual effect I want.  Watching the dance of the transcripts, it’s really clear when only one person is talking and when both are.
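For readers who want the mechanics spelled out, here is a minimal sketch of why time-coding both the start and the end of each utterance matters: once every utterance has a start and an end time, finding the stretches where two speakers overlap is a simple interval comparison.  The data and code below are hypothetical illustrations; in practice, Transana does the highlighting for you as the video plays.

```python
# Illustrative sketch: hypothetical utterance lists as (start_seconds, end_seconds) pairs.
wallace = [(0.0, 6.5), (20.0, 26.0)]
candidate = [(5.0, 12.0), (24.5, 30.0)]

def overlaps(a, b):
    """Return the time spans where utterances from the two speakers overlap."""
    spans = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # the two utterances share this stretch of time
                spans.append((start, end))
    return spans

print(overlaps(wallace, candidate))  # [(5.0, 6.5), (24.5, 26.0)]
```

Without end-of-utterance time codes, there is no way to tell from the transcript alone whether a speaker was still talking when the other person started.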

I found that it was actually not too difficult to transcribe this way, partly because I didn’t have to worry about representing the overlaps in a single linear transcript, and partly because I could concentrate on one speaker at a time.  There were still some challenges, but the process went pretty smoothly and I’m pleased with the results overall.

Just to be clear, these techniques are useful when there is a LOT of overlapping speech.  They are overkill for the brief, rare periods of overlap we witness in most conversations.  But in extreme cases, like this debate, they can make all the difference in making the data accessible, easier to understand, and easier to analyze.

If you have questions about what I did or why, please use the Contact form to let me know.