LaTeX in open captions for MP4 videos

With accessibility legislation requiring videos to be captioned by September 23, 2020, and with our teaching being online, I've been thinking about getting LaTeX into captions for videos.

What follows is a bit of a hack, but it seems to work! It only works for open captions, which are burned into the video, and I would be very interested to hear of solutions for closed captions.

Overall workflow

Here is the workflow:

  1. Record the video. I've been using OBS Studio for this, following the excellent videos of James Sumner (if you watch these, you will see his influence elsewhere on this page too - this whole workflow is an extended version of his, to allow LaTeX in open captions).

  2. At this stage, we have a file video.mp4, say. Now we upload it to YouTube. Once uploaded, we can go into YouTube Studio, and play the video. The video has its own weblink with an identifier. Copy it.

  3. Go to DownSub, paste in the weblink to your video, and click on “Download”; soon you will get the option of downloading an .srt file. Do this: it is the automated caption file made by Google, in SubRip format (there are other formats on offer, but this is the one we'll need later).

  4. Unfortunately, it isn't very good, to say the least. But you can edit this in a text editor. Feel free to insert LaTeX commands here, but ensure that every opening mathematics delimiter (dollar sign, etc.) has its matching closing delimiter on the same line. This is the time-consuming part! Allow about 4 times as long as the length of the video for editing the captions.

  5. Download the following files, and put them into a directory:

      • gawk64.exe (987KB), a Windows implementation of AWK (AWK is a standard tool on Linux, so it is certainly easily available there, and I imagine that there are Mac implementations too);

      • opencaps.awk (12KB), an AWK script for the open captions;

      • opencaps.bat (102 bytes), a batch file for running the script from the command line;

      • white.jpg (4KB), which will be used as a background file, although the opacity will be so great that you won't see this anyway; probably it can be omitted.

(It would be easy to make a version which didn't need navigating to the transcript file, and which simply worked by editing a line and double-clicking on the batch file.)
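Since unmatched mathematics delimiters are easy to introduce while editing, a quick check before running the script may save time. The following is only a minimal sketch, not part of opencaps.awk: the function name unbalanced_lines is hypothetical, and it assumes that mathematics is delimited by single dollar signs and that a literal dollar is written as \$.

```python
import re

# An unescaped dollar sign: "$" not preceded by a backslash.
UNESCAPED_DOLLAR = re.compile(r"(?<!\\)\$")


def unbalanced_lines(srt_text):
    """Return (line_number, text) for caption lines with an odd number of "$"."""
    problems = []
    for number, line in enumerate(srt_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, caption counters and timing lines.
        if not stripped or stripped.isdigit() or "-->" in stripped:
            continue
        if len(UNESCAPED_DOLLAR.findall(stripped)) % 2 == 1:
            problems.append((number, stripped))
    return problems


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_dollars.py video.srt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, text in unbalanced_lines(handle.read()):
            print("line {}: {}".format(number, text))
```

Any line it reports has an opening dollar without a closing one (or the reverse) and is worth a second look before generating the captions.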

  6. Navigate to the directory, and run from the command line: “opencaps video”, and you will get some outputs:

      • videotrans.tex, a LaTeX transcript file (feel free to edit opencaps.bat to make it run LaTeX too, to get a PDF directly, if you wish, but you may want to edit the .tex file first to put paragraphs in);

      • videocc.srt, a (not completely great) attempt to convert LaTeX into a closed caption file: there aren't any closed captioning systems which allow LaTeX commands, so this is a fudge - LaTeX mathematics is converted into yellow italics. (SubRip allows only boldface, italics and font color as HTML tags, as far as I can see.) It doesn't look very good, and lots of LaTeX commands don't work at all. I don't think I'm going to do anything about this, though - anyone who really wants the captions can watch the open caption version.

      • video.html, an HTML file. This HTML file displays the captions with the right timings (approximately); each caption has its own div, and some JavaScript displays the right one with the right time delay. The LaTeX displays well, as MathJax is imported in the header; the text is white on a grey background rectangle (with some opacity), but the rectangle is actually on a transparent background.

  7. Now return to OBS Studio, and make a scene with a “Media Source” source playing the video video.mp4 and a “Browser” source playing the subtitle HTML file video.html. Both of these should be set to refresh when the scene is started; I made an extra scene to start from, and when I'm ready to record, I switch scene to the video plus subtitle scene, pressing “Start Recording” immediately. (Note that you should turn off the computer microphone if possible, or be very quiet - an early attempt recorded my exercise bicycle as well as the video....) You will need to manually stop the recording at the end (or edit the video to cut off extra bits).

  8. In this way, you get a video which you can save as video_oc.mp4, a version of the file with open captions (i.e., captions burned in).

  9. I found that the file size was rather large. In any case, you should add your closed caption file to the original (uncaptioned) video, which you can do with HandBrake, to get video_cc.mp4: the original video video.mp4 with the closed captions (not LaTeX) from videocc.srt. This generally reduces the file size, even compared with the original video, so one can do the same thing for the open caption file, and we find a huge reduction in file size (by a factor of about 8, in my experiments so far): replace video_oc.mp4 by this version with the closed captions added (they will be turned off by default, and the open captions still play properly).

And that's it! By putting things into a batch script file, most of the steps are automated, and it's not too annoying.
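To make the idea behind video.html concrete, here is a rough sketch in Python of how a SubRip file could be turned into a page with one div per caption, shown and hidden by JavaScript timers. It is not the opencaps.awk script: the function names, the CSS values, the MathJax version 3 CDN address and the inlineMath setting are all assumptions chosen for illustration.

```python
import html
import json
import re

TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)

# Page template; doubled braces are literal braces for str.format.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script>
MathJax = {{ tex: {{ inlineMath: [["$", "$"]] }} }};
</script>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
body {{ background: transparent; }}
.cap {{ visibility: hidden; position: absolute; bottom: 5%; left: 10%;
        width: 80%; text-align: center; color: white;
        background: rgba(64, 64, 64, 0.8); }}
</style>
</head>
<body>
{divs}
<script>
const cues = {cues};
cues.forEach(([start, end], i) => {{
  const div = document.getElementById("cap" + i);
  setTimeout(() => {{ div.style.visibility = "visible"; }}, start * 1000);
  setTimeout(() => {{ div.style.visibility = "hidden"; }}, end * 1000);
}});
</script>
</body>
</html>
"""


def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) captions."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
                start = int(h1) * 3600 + int(m1) * 60 + int(s1) + int(ms1) / 1000
                end = int(h2) * 3600 + int(m2) * 60 + int(s2) + int(ms2) / 1000
                # Everything after the timing line is caption text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((start, end, text))
                break
    return cues


def captions_html(srt_text):
    """Build a page with one hidden div per caption, revealed on a timer."""
    cues = parse_srt(srt_text)
    divs = "\n".join(
        '<div class="cap" id="cap{}">{}</div>'.format(
            i, html.escape(text, quote=False)
        )
        for i, (_, _, text) in enumerate(cues)
    )
    timings = json.dumps([[start, end] for start, end, _ in cues])
    return PAGE.format(divs=divs, cues=timings)
```

The timers start when the page loads, which is why the Browser source needs to refresh at the moment the scene starts. The divs are hidden with visibility rather than display so that MathJax can still lay out the mathematics before each caption appears.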

An example

Here is a transcript file for a video (on classification problems in statistics): ML7-1.srt (6KB). This is a (heavily) edited version of what was downloaded from DownSub; note that some lines have been merged to make longer lines (so the caption numbering is all messed up), and there is some LaTeX too.
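If the gaps in the caption numbering ever cause trouble with another tool, the counters can be regenerated. This is a small sketch with a hypothetical function name, renumber_srt; it assumes captions are separated by blank lines, and nothing here says that opencaps.awk or HandBrake actually require consecutive numbers.

```python
import re


def renumber_srt(srt_text):
    """Return the SubRip text with caption counters renumbered from 1."""
    blocks = [b for b in re.split(r"\n\s*\n", srt_text.strip()) if b.strip()]
    renumbered = []
    for index, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        # Drop the old counter if the first line is purely a number.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        renumbered.append("\n".join([str(index)] + lines))
    return "\n\n".join(renumbered) + "\n"
```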

On running “opencaps ML7-1”, we get three files:

  • ML7-1.html (13KB), an HTML file with the captions;

  • ML7-1cc.srt (7KB), a new SubRip file which is a moderate attempt at closed captions, and which will be added to the video with HandBrake;

  • ML7-1trans.tex (4KB), a LaTeX transcript file.
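The conversion of LaTeX mathematics into yellow italics for the closed caption file can be illustrated roughly as follows. This is a sketch under assumptions, not the logic of opencaps.awk: the function names and the small SYMBOLS table are made up for the example, and only mathematics between single dollar signs is handled.

```python
import re

# Mathematics between unescaped single dollar signs.
MATH = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")

# A LaTeX command such as \leq or \alpha.
COMMAND = re.compile(r"\\([A-Za-z]+)")

# A few illustrative replacements; unknown commands are left untouched.
SYMBOLS = {"alpha": "α", "beta": "β", "leq": "≤", "geq": "≥", "times": "×"}


def math_to_markup(match):
    """Replace known commands, then wrap the mathematics in SubRip tags."""
    body = COMMAND.sub(
        lambda m: SYMBOLS.get(m.group(1), m.group(0)), match.group(1)
    )
    return '<font color="yellow"><i>{}</i></font>'.format(body)


def latex_to_closed_caption(line):
    """Convert each $...$ span of a caption line into yellow italics."""
    return MATH.sub(math_to_markup, line)
```

For instance, latex_to_closed_caption(r"so $x \leq y$ holds") wraps "x ≤ y" in the font and italic tags, while a command outside the table, such as \overline, passes through unchanged, which is one way in which such captions end up looking poor.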

Issues so far

This is just an experimental thing that I made for my videos - it seems to work OK generally, but everything is hacked together from the work of people who really understand what they are doing. There are lots of LaTeX commands that I haven't tried: I haven't tried matrices, or arrays, or even displayed equations, etc., which might require a larger baseline.

Please let me know of any issues, but I don't promise to do anything about them! Email me at a.f.jarvis@sheffield.ac.uk if you have fixes, suggestions, etc.; I'm keen to read them!

I'll maintain a list of issues found by people using this method.

  • Sometimes bars are missing in the output: $\overline{x}$ is output in the same way as $x$. (Perhaps the LaTeX file was wrong?)