Caption Mining at the Crossroads of Digital Humanities & Media Studies

30Nov12

Lately I’ve become more and more intrigued by Digital Humanities as a subfield/movement/trend/etc. within academia, in large part because the people who are actively driving much of DH are super engaging & welcoming via social networks like Twitter and various blogs. As I am committed to open access publishing, public-facing scholarship, and innovative modes of academic engagement, Digital Humanists feel like fellow travelers. But as someone who has been actively engaged with the study & use of digital media for over a decade, I’ve frequently wondered about the intersection between Digital Humanities, which tends to cluster in the fields of History and English, and Media Studies, where digital tools & objects of study have been commonplace but understood quite differently. This is actually the topic of a workshop that Miriam Posner & I put together for the Society for Cinema & Media Studies conference in March (the call for the workshop is here on Miriam’s blog, and the lineup of participants looks great), so I’ll leave these larger issues for then.

But for now, I’ve often wondered what some of the tools of Digital Humanities might look like applied to media objects rather than the literary texts or historical artifacts that they’ve tended to focus on. One such tool is the word cloud, which visualizes word frequencies within a text to reveal patterns of heavily used words. Films and television programs feature words as well, and thus we might imagine treating dialogue as a dataset to be analyzed and reconfigured using a tool like Wordle. Of course, the methods of scanning and digitizing books don’t work for moving images, but the other day it occurred to me that most DVDs already include digitized text of the dialogue, in the form of the subtitles and captions.
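
Under the hood, a word cloud is just frequency counting, which is easy to script yourself. Here’s a minimal sketch of that counting step in Python, assuming the dialogue has already been saved to a plain-text file (the filename is hypothetical):

    # Minimal sketch of the word-frequency counting behind a word cloud.
    # Assumes the dialogue is saved as plain text; the filename is hypothetical.
    from collections import Counter
    import re

    with open("the_wire_s01e01.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # Keep runs of letters and apostrophes; everything else is a separator.
    words = re.findall(r"[a-z']+", text)

    # Print the twenty most frequent words with their counts.
    for word, count in Counter(words).most_common(20):
        print(f"{count:4d}  {word}")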

So I was happy to realize that there is already a tool available for extracting captions and turning them into a text file: ccextractor. Alas, this open-source application works best on Windows & I’m a diehard Mac user, so I had my colleague Ethan Murphy install it on a departmental PC and figure out how best to get it working. (The Mac version is command line, so you need to know what you’re doing more than I do to use it effectively.) The results are pretty impressive; this page details the process of decrypting a DVD (technically illegal, although I think this is clearly fair use & wouldn’t be an enforceable violation, as it fits with the spirit of the DMCA exemptions that have been established for educational use) and outputting the captions into a text file. This process took around 10 minutes for one DVD.
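
If you are comfortable with the command line, though, the extraction step can be scripted. Here’s a sketch in Python that shells out to ccextractor, assuming a DVD already decrypted and ripped to an MPEG file, and assuming the -out=txt flag for plain transcripts (flags vary across ccextractor versions, so check your build’s help output):

    # Sketch of driving ccextractor from Python, assuming the DVD has
    # already been decrypted/ripped to an MPEG file. The -out=txt flag
    # (plain transcript rather than .srt) is an assumption; check
    # `ccextractor --help` on your build, as flags vary by version.
    import subprocess

    subprocess.run(
        ["ccextractor", "episode.mpg", "-out=txt", "-o", "episode.txt"],
        check=True,  # raise an error if extraction fails
    )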

I test drove this process using the first episode of The Wire. Here’s what the famous first scene looks like extracted:

( police sirens wailing )
( police radio chattering )
( McNulty )
SO, YOUR BOY'S NAME IS WHAT ?
( man )
SNOT.
YOU CALLED THE GUY SNOT ?
( man )
SNOTBOOGIE, YEAH.
"SNOTBOOGIE."
HE LIKE THE NAME ?
WHAT ?
SNOTBOOGIE.
THIS KID WHOSE MAMA
WENT TO THE TROUBLE
OF CHRISTENING HIM
OMAR ISAIAH BETTS ?
YOU KNOW,
HE FORGETS HIS JACKET,
SO HIS NOSE STARTS RUNNING,
AND SOME ASSHOLE
INSTEAD OF
GIVING HIM A KLEENEX,
HE CALLS HIM "SNOT."
SO, HE'S "SNOT" FOREVER.
DOESN'T SEEM FAIR.
LIFE JUST BE
THAT WAY, I GUESS.
SO...
WHO SHOT SNOT ?
I AIN'T GOING TO NO COURT.
( dog barking )
MOTHERFUCKER, AIN'T HAVE
TO PUT NO CAP IN HIM THOUGH.
DEFINITELY NOT.
HE COULD'VE JUST
WHIPPED HIS ASS,
LIKE WE ALWAYS WHIP HIS ASS.
I AGREE WITH YOU.
HE GONNA KILL SNOT.
SNOT BEEN DOING THE SAME SHIT
SINCE I DON'T KNOW HOW LONG.
KILL A MAN OVER
SOME BULLSHIT.
I'M SAYING,
EVERY FRIDAY NIGHT
IN THE ALLEY
BEHIND THE CUT-RATE,
WE ROLLING BONES, YOU KNOW ?
I MEAN, ALL THE BOYS
FROM AROUND THE WAY,
WE ROLL TILL LATE.
ALLEY CRAP GAME, RIGHT ?
AND LIKE EVERY TIME,
SNOT, HE'D FADE
A FEW SHOOTERS.
PLAY IT OUT TILL
THE POT'S DEEP.
THEN HE'D SNATCH AND RUN.
EVERY TIME ?
COULDN'T HELP HISSELF.
LET ME UNDERSTAND YOU.
EVERY FRIDAY NIGHT,
YOU AND YOUR BOYS
WOULD SHOOT CRAP, RIGHT ?
AND EVERY FRIDAY NIGHT,
YOUR PAL SNOTBOOGIE,
HE'D WAIT TILL THERE
WAS CASH ON THE GROUND,
THEN GRAB THE MONEY
AND RUN AWAY ?
YOU LET HIM DO THAT ?
WE CATCH HIM
AND BEAT HIS ASS.
BUT AIN'T NOBODY
EVER GO PAST THAT.
I GOTTA ASK YOU.
IF EVERY TIME SNOTBOOGIE
WOULD GRAB THE MONEY
AND RUN AWAY,
WHY'D YOU EVEN LET
HIM IN THE GAME ?
WHAT ?
IF SNOTBOOGIE
ALWAYS STOLE THE MONEY,
WHY'D YOU LET HIM PLAY ?
GOT TO.
THIS AMERICA, MAN.
( man chattering )

And here’s what the whole episode looks like when turned into a Wordle, graphically representing the program’s unique brand of profanity:

Wordle of dialogue for THE WIRE, “The Target”

Now, there are some key tweaks needed to tabulate words within the dialogue accurately. The captions include some sonic cues in parentheses — “( police sirens wailing )” — that shouldn’t be counted as dialogue, and Wordle treats the all-caps dialogue differently from these lowercase cues, so both “MAN” and “man” appear as separate words. Additionally, the character names in parentheses indicate when a character is speaking off-screen, so these are misleading cues as well. ccextractor can be set to change case and perhaps to filter out cues, depending on how a given DVD encodes them, so a bit of customization is needed. And it’s essential to remember that this is a transcript, not a screenplay: not only are character names absent, but the screenplay form includes a blueprint for performance and visuals, and a sense of rhythm, that this raw transcript neglects. (You can compare this scene with an early version of the pilot screenplay downloadable here.)
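
That cleanup pass is also easy to script. A minimal sketch, assuming the sound cues and off-screen speaker labels are always wrapped in “( … )” as in the excerpt above:

    # Sketch of the cleanup described above: strip parenthetical sound
    # cues and speaker labels, then normalize case so "MAN" and "man"
    # tally as one word. Assumes cues are always wrapped in "( ... )".
    import re

    def clean_captions(raw: str) -> str:
        # Remove "( police sirens wailing )", "( McNulty )", etc.
        no_cues = re.sub(r"\(\s*[^)]*\)", "", raw)
        # Lowercase everything so case differences don't split counts.
        return no_cues.lower()

    with open("episode.txt", encoding="utf-8") as f:
        print(clean_captions(f.read()))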

In surveying work in Digital Humanities, it may seem that the point of the field is developing and playing with such tools, but as with any method or model, the techniques only work when paired with a research question that suits the approach. So what questions might such “caption mining” help answer? I had some ideas, and I also asked people on Twitter and Facebook for theirs. Concordances and other quantitative measures can be useful for getting a sense of the dialogue quirks and tendencies that constitute a given film or TV program’s verbal style. Such analyses are most productive comparatively, whether looking across a given writer’s work, comparing examples within a genre or across eras, or charting differences throughout the ongoing run of a series. Daniel Chamberlain, another scholar at the nexus of DH & Media Studies, offered the following suggestions: “There are probably some low-level arguments to be made by comparing this with literacy metrics (some shows use big words, some are aimed at less educated audiences), or using simple tools like voyant (Amy Sherman-Palladino packs more words into an episode than anyone else). You might be able to frame questions about the long run of a series (do the scripts “repeat” or get “stale” or do they continue to develop). You might be able to generate evidence making claims about what happens as showrunners or writers come and go. You could even look to make Zeitgeist arguments by comparing batches of shows from different years or eras. These are mostly about gathering familiar (if more robust) forms of textual evidence.” (And Miriam mentioned that the Zeitgeist question evokes Ben Schmidt’s work with TV anachronisms.)
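
The simplest of these comparative measures is sheer verbal density, the words-per-episode count behind the Sherman-Palladino quip. A sketch, assuming a folder of cleaned transcripts (the directory layout is hypothetical):

    # Sketch of the simplest comparative measure: words per episode,
    # tallied across a folder of cleaned transcripts. The directory
    # layout is hypothetical.
    from pathlib import Path

    for path in sorted(Path("transcripts").glob("*.txt")):
        n_words = len(path.read_text(encoding="utf-8").split())
        print(f"{path.stem}: {n_words} words")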

This approach can also target specific key words—for instance, on Twitter a colleague mentioned she’d be interested in looking at how often the word “torture” is used within various series she is analyzing, to supplement her study of narrative representations of torture. If we had a particularly large corpus of series, we could chart the shifting use of profanity or other culturally charged terms surrounding identity or politics. For such a project to work, we’d probably need to develop a huge database of transcripts along the lines of the massive databases of scanned books behind Google’s Ngrams, an endeavor complicated by copyright issues (as I assume HBO would balk at an open database of the entire Wire dialogue!) and high labor costs. If we could overcome the copyright issues, perhaps we could agree on standard formats and upload self-extracted transcripts to a shared site, much as Cinemetrics crowdsources editing data for films and television.
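
The per-series tally itself is straightforward. A sketch of the keyword count, again assuming a hypothetical folder of transcripts:

    # Sketch of the keyword-tracking idea: count how often a charged
    # term like "torture" appears in each transcript. The folder of
    # transcripts is hypothetical.
    import re
    from pathlib import Path

    pattern = re.compile(r"\btorture\b", re.IGNORECASE)

    for path in sorted(Path("transcripts").glob("*.txt")):
        hits = len(pattern.findall(path.read_text(encoding="utf-8")))
        print(f"{path.stem}: {hits}")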

Another potential use for these transcripts is as a guide for navigating a video, especially across the vast body of a serial. When working on a program, I’ve often struggled to remember precisely where a scene falls in a series: video is impractical to search, but a full transcript would make that process much simpler for teaching and analysis (at least if the scene’s memorable feature is tied to dialogue rather than visuals). ccextractor allows the transcript to include timecode, making this navigation quite easy—especially if you’re working on a video essay or remix (which I see as fertile ground for connecting DH and Media Studies), where a transcript can facilitate creating a useful editing log.
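
A sketch of that navigation trick, assuming ccextractor’s standard .srt output (numbered blocks with a “start --> end” timecode line followed by caption text):

    # Sketch of using a timecoded .srt file as a navigation aid: print
    # the start timecode of every caption block matching a phrase.
    # Assumes standard .srt structure (index line, timecode line, text).
    import re

    def find_in_srt(srt_path: str, phrase: str) -> None:
        blocks = open(srt_path, encoding="utf-8").read().split("\n\n")
        for block in blocks:
            lines = block.strip().splitlines()
            if len(lines) < 3:
                continue  # skip malformed or empty blocks
            timecode, text = lines[1], " ".join(lines[2:])
            if re.search(phrase, text, re.IGNORECASE):
                # Print just the start time, e.g. "00:01:23,456".
                print(timecode.split(" --> ")[0], text)

    find_in_srt("episode.srt", "this america")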

There are lots of possibilities for making discoveries about the language of a film or television text, but this tool raises one large caution flag: we must not reduce a moving image work to its dialogue. There is a long tradition of scholars trained in the study of language & literature treating film texts just as they treat printed work, focusing on narrative structures, verbal style, metaphors, etc., but paying scant attention to visual style, music, performance, temporal systems, or other formal elements that make film essentially different from literature. With that caution in mind, though, we shouldn’t ignore a moving image text’s dialogue and verbal systems, and I hope that ccextractor offers a useful tool for gaining new access to these elements.

So I end this brainstorming post with a question: what would you use this tool to discover about a film or television program?



4 Responses to “Caption Mining at the Crossroads of Digital Humanities & Media Studies”

  1. This is a fascinatingly suggestive post, Jason. There are crossovers here with the work that BBC R&D has been doing with semantic subtitle analysis, and especially the Channelography project, developed with the digital agency Rattle.
    See here for more: http://www.rattlecentral.com/channelography

  2. I’ve been experimenting recently with topic modeling, and that’s immediately what I thought of when reading this post. You need a larger corpus than even all the seasons of the Wire (I think) for topic modeling to be a useful tool, but it occurs to me that there might already be a large script depository of television shows online somewhere.

    The next thing I thought of would be how predictable and amusing a Deadwood word-cloud would be.


  1. Editors’ Choice: A Text-Mining and Visualization Roundup : Digital Humanities Now
  2. Text Analysis
