Mark Pilgrim – A Gentle Introduction to Video Encoding: Captioning

by Simon.

This article was first published on 7th January 2009, on Mark Pilgrim’s website. That website no longer exists so this article serves as an historical record. I have preserved all emphasis and links as per the original article.

The first thing you need to know about captions and subtitles is that captions and subtitles are different. The second thing you need to know about captions and subtitles is that you can safely ignore the differences unless you’re creating your own from scratch. I’m going to use the terms interchangeably throughout this article, which will probably drive you crazy if you happen to know and care about the difference.

Historically, captioning has been driven by the needs of deaf and hearing impaired consumers, and captioning technology has been designed around the technical quirks of broadcast television. In the United States, so-called “closed captions” are embedded into a part of the NTSC video source (“Line 21”) that is normally outside the viewing area on televisions. In Europe, they use a completely different system that is embeddable in the PAL video source. Over time, each new medium (VHS, DVD, and now online digital video) has dealt a blow to the accessibility gains of the previous medium. For example:

  • PAL VHS tapes did not have enough bandwidth to store closed captions at all.
  • DVDs have the technical capability, but producers often manage to screw it up anyway; e.g. DVDs of low-budget television shows are often released without the closed captions that accompanied the original broadcast.
  • HDMI cables drop “Line 21” closed captions altogether. If you play an NTSC DVD on an HDTV over HDMI, you’ll never see the closed captions, even if the DVD has them.

And accessible online video is just fucking hopeless. (And no, it won’t change unless new regulation forces it to change. When it comes to captioning, Joe Clark has been right longer than many of you have been alive.)

So even in broadcast television, captioning technology was fractured by different broadcast technologies in different countries. Digital video had the capability of unifying the technologies and learning from their mistakes. Of course, exactly the opposite happened. Early caption formats split along company lines; each major video software platform (RealPlayer, QuickTime, Windows Media, Adobe Flash) implemented captioning in their own way, with levels of adoption ranging from nil to zilch. At the same time, an entire subculture developed around “fan-subbing,” i.e. using captioning technology to provide translations of foreign language videos. For example, non-Japanese-speaking consumers wanted to watch Japanese anime films, so amateur translators stepped up to publish their own English captions that could be overlaid onto the original film. In the 1980s, fansubbers would actually take VHS tapes and overlay the English captions onto a new tape, which they would then (illegally) distribute. Nowadays, translators can simply publish their work on the Internet as a standalone file. English-speaking consumers can have their DVDs shipped directly from Japan, and they use software players that can overlay standalone English caption files while playing their Japanese-only DVDs. The legality of distributing these unofficial translations (even separately, in the form of standalone caption files) has been disputed in recent years, but the fansubbing community persists.

Technically, there is a lot of variation in captioning formats. At their core, captions are a combination of text to display, start and end times to display it, information about where to position the text on a screen, fonts, styling, alignment, and so on. Some captions roll up from the bottom of the screen, others simply appear and disappear at the appropriate time. Some caption formats mandate where each caption should be placed and how it should be styled; others merely suggest position and styling; others leave all display attributes entirely up to the player. Almost every conceivable combination of these variables has been tried. Some forms of media try multiple combinations at once. DVDs, for example, can have two entirely distinct forms of captioning — closed captioning (as used in NTSC broadcast television) embedded in the video stream, and one or more subtitle tracks. DVD subtitle tracks are used for many different things, including subtitles (just the words being spoken, in the same language as the audio), captions for the hearing impaired (which include extra notations of background noises and such), translations into other languages, and director’s commentary. Oh, and they’re stored on the DVD as images, not text, so the end user has no control over fonts or font size.

Beyond DVDs, most caption formats store the captions as text, which inevitably raises the issue of character encoding. Some caption formats explicitly specify the character encoding, others only allow UTF-8, others don’t specify any encoding at all. On the player side, most players respect the character encoding if present (but may only support specific encodings); in its absence, some players assume UTF-8, some guess the encoding, and some allow the user to override the encoding. Obviously standalone caption files can be in any format, but if you want to embed your captions as a track within a video container, your choices are limited to the caption formats that the video container supports.
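To make the guessing game concrete, here is one common player heuristic sketched in Python (the function name is my own): try UTF-8 first, and fall back to Windows-1252 if that fails. Note that Windows-1252 decoding never fails, so the fallback always "succeeds" — correctly or not.

```python
def read_caption_text(path):
    """Read a caption file, trying UTF-8 first.

    Falls back to Windows-1252 (what Windows calls "ANSI"), which many
    players assume for legacy caption files. Since every byte sequence
    is valid Windows-1252, the fallback never raises -- it just
    silently mis-decodes genuinely broken input.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252")
```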

And remember when I said that there were a metric fuck-ton of audio codecs? Forget that. There are an imperial fuck-ton of caption formats (i.e. multiply by 9/5 and add 32). Here is a partial list of caption formats, taken from the list of formats supported by Subtitle Workshop, which I used to caption my short-lived video podcast series:

Adobe Encore DVD, Advanced SubStation Alpha, AQTitle, Captions 32, Captions DAT, Captions DAT Text, Captions Inc., Cheetah, CPC-600, DKS Subtitle Format, DVD Junior, DVD Studio Pro, DVD Subtitle System, DVDSubtitle, FAB Subtitler, IAuthor Script, Inscriber CG, JACOSub 2.7+, Karaoke Lyrics LRC, Karaoke Lyrics VKT, KoalaPlayer, MacSUB, MicroDVD, MPlayer, MPlayer2, MPSub, OVR Script, Panimator, Philips SVCD Designer, Phoenix Japanimation Society, Pinnacle Impression, PowerDivX, PowerPixel, QuickTime Text, RealTime, SAMI Captioning, Sasami Script, SBT, Sofni, Softitler RTF, SonicDVD Creator, Sonic Scenarist, Spruce DVDMaestro, Spruce Subtitle File, Stream SubText Player, Stream SubText Script, SubCreator 1.x, SubRip, SubSonic, SubStation Alpha, SubViewer 1.0, SubViewer 2.0, TMPlayer, Turbo Titler, Ulead DVD Workshop 2.0, ViPlay Subtitle File, ZeroG.

Which of these formats are important? The answer will depend on whom you ask, and more specifically, how you’re planning to distribute your video. This series is primarily focused on videos delivered as files to be played on PCs or other computing devices, so my choices here will reflect that. These are some of the most well-supported caption formats:

  • SubRip
  • SubStation Alpha
  • MPEG-4 Timed Text
  • SAMI
  • SMIL


SubRip

SubRip is the AVI of caption formats, in the sense that its basic functionality is supported everywhere but various people have tried to extend it in mostly incompatible ways and the result is a huge mess. As a standalone file, SubRip captions are most commonly seen with a .srt extension. SubRip is a text-based format which can include font, size, and position information, as well as a limited set of HTML formatting tags, although most of these features are poorly supported. Its “official” specification is a doom9 forum post from 2004. Most players assume that .srt files are encoded in Windows-1252 (what Windows programs frequently call “ANSI”), although some can detect and switch to UTF-8 encoding automatically.
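For reference, a bare-bones .srt file looks like this: a numeric cue index, a start and end timestamp separated by ` --> ` (with a comma as the millisecond separator), one or more lines of caption text, and a blank line between cues.

```
1
00:00:01,000 --> 00:00:04,200
Hello, world.

2
00:00:05,000 --> 00:00:07,500
<i>Italics are one of the few
HTML tags with wide player support.</i>
```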

Because .srt files are so often published separately from the video files they describe, the most common use case is to put your .srt file in the same directory as your video file and give them the same name (up to the file extensions). But it is also possible to embed SubRip captions directly into AVI files with AVI-Mux GUI, into MKV files with mkvmerge, and into MP4 files with MP4Box.

You can play SubRip captions in Windows Media Player or other DirectShow-based video players after installing VSFilter; in QuickTime after installing Perian; on Linux, both mplayer and VLC support it natively.

SubStation Alpha

SubStation Alpha and its successor, Advanced SubStation Alpha, are the preferred caption formats of the fansubbing community. As standalone files, they are commonly seen with .ssa or .ass extensions. They have a spec longer than three paragraphs. They are actually miniature scripting languages. A .ass file contains a series of commands to control position, scrolling, animation, font, size, scaling, letter spacing, borders, text outline, text shadow, alignment, and so on; and a series of time-coded events for displaying text given the current styling parameters. It has support for multiple character encodings.
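One small but concrete difference between the two formats: SRT timestamps use millisecond precision with a comma separator, while ASS event times use centiseconds with a period and no leading zero on the hour. A sketch of the conversion in Python (the function name is my own):

```python
import re

def srt_time_to_ass(timestamp):
    """Convert an SRT timestamp (HH:MM:SS,mmm) to ASS format (H:MM:SS.cc).

    ASS stores only centiseconds, so the final millisecond digit
    of the SRT timestamp is truncated.
    """
    h, m, s, ms = map(int, re.match(r"(\d+):(\d+):(\d+),(\d+)", timestamp).groups())
    return f"{h}:{m:02d}:{s:02d}.{ms // 10:02d}"
```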

The playing requirements for SubStation Alpha captions are almost identical to SubRip. The same plugins are required for Windows and Mac OS X. On Linux, mplayer prides itself on having the most complete SSA/ASS implementation.

MPEG-4 Timed Text

a.k.a. “MPEG-4 Part 17,” a.k.a. ISO 14496-17, MPEG-4 Timed Text (hereafter “MP4TT”) is the one and only caption format for the MP4 container. It is not a file format; it is only defined in terms of a track within an MP4 container. As such, it can not be embedded in any other video container, and it can not exist as a separate file. (Note: the last sentence was a lie; the MPEG-4 Timed Text format is really the 3GPP Timed Text format, and it can very much be embedded in a 3GPP container. What I meant to say is that the format can not be embedded in any of the other popular video container formats like AVI, MKV, or OGG. I could go on about the subtle differences between MPEG-4 Timed Text in an MP4 container and 3GPP Timed Text in a 3GPP container, but it would just make you cry, and besides, technical accuracy is for pussies.)

MP4TT defines detailed information on text positioning, fonts, styles, scrolling, and text justification. These details are encoded into the track at authoring time, and can not be changed by the end user’s video player. The most readable description of its features is actually the documentation for GPAC, an open source implementation of much of the MPEG-4 specification (including MP4TT). Since MP4TT doesn’t define a text-based serialization, GPAC invented one for their own use; since their format is designed to capture all the possible information in an MP4TT track, it turns out to be an easy way to read about all of MP4TT’s features.

MP4Box, part of the GPAC project, can take an .srt file and convert it into a MPEG-4 Timed Text track and embed it in an existing MP4 file. It can also reverse the process — extract a Timed Text track from an MP4 file and output a .srt file.
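The round trip looks roughly like this on the command line (filenames and the track number are placeholders; run `MP4Box -info` on your file to find the actual track ID of the text track):

```shell
# Convert captions.srt to a Timed Text track and add it to video.mp4
MP4Box -add captions.srt video.mp4

# Extract a Timed Text track (here, track 3) back out as an .srt file
MP4Box -srt 3 video.mp4
```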

On Mac OS X, QuickTime supports MP4TT tracks within an MP4 container, but only if you rename the file from .mp4 to .3gp or .m4v. I shit you not. (On the plus side, changing the file extension will allow you to sync compatible video to an iPod or iPhone, which will actually display the captions. Still not kidding.) On Windows, any DirectShow-based video player (such as Windows Media Player or Media Player Classic) supports MP4TT tracks once you install Haali Media Splitter. On Linux, VLC has supported MP4TT tracks for several years.


SAMI

SAMI was Microsoft’s first attempt to create a captioning format for PC video files (as opposed to broadcast television or DVDs). As such, it is natively supported by Microsoft video players, including Windows Media Player, without the need for third-party plugins. It has a specification on MSDN. It is a text-based format that supports a large subset of HTML formatting tags. SAMI captions are almost always embedded in an ASF container, along with Windows Media video and Windows Media audio.
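A minimal SAMI file is essentially HTML with `<SYNC>` timing tags; something like the following (the class name and styling are illustrative, following the common ENUSCC convention for U.S. English):

```html
<SAMI>
<HEAD>
  <TITLE>Example captions</TITLE>
  <STYLE TYPE="text/css"><!--
    P { font-family: sans-serif; }
    .ENUSCC { Name: English; lang: en-US; }
  --></STYLE>
</HEAD>
<BODY>
  <SYNC Start=1000><P Class=ENUSCC>This caption appears at one second.</P></SYNC>
  <SYNC Start=4000><P Class=ENUSCC>&nbsp;</P></SYNC>
</BODY>
</SAMI>
```

The empty `&nbsp;` cue is the conventional way to clear the previous caption from the screen.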

Don’t use SAMI for new projects; it has been superseded by SMIL. For historical purposes, you may enjoy reading about creating SAMI captions and embedding them in an ASF container, as long as you promise to never, ever try it at home.


SMIL

SMIL (Synchronized Multimedia Integration Language) is not actually a captioning format. It is “an XML-based language that allows authors to write interactive multimedia presentations.” It also happens to have a timing and synchronization module that can, in theory, be used to display text on a series of moving pictures. That is to say, if you think of SMIL as a way to provide captions for a video, you’re doing it wrong. You need to invert your thinking — your video and your captions are each merely components of a SMIL presentation. SMIL captions are not embedded into a video container; the video and its captions are referenced from a SMIL document.
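To make that inversion concrete, here is a skeletal SMIL presentation of the kind QuickTime accepts (filenames are placeholders): the video and the caption text are sibling components played in parallel.

```xml
<smil>
  <body>
    <par>
      <!-- video and captions are peers in the presentation -->
      <video src="movie.mov"/>
      <textstream src="captions.txt"/>
    </par>
  </body>
</smil>
```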

SMIL is a W3C standard; the most recent revision, SMIL 3.0, was just published in December 2008. If you printed out the SMIL 3.0 specification on US-Letter-sized paper, it would weigh in at 395 pages. So don’t do that.

QuickTime supports a subset of SMIL 1.0. WebAIM provides a nice tutorial on using SMIL to add captions to a QuickTime movie.

