Mark Pilgrim – A Gentle Introduction to Video Encoding: Lossy Audio Codecs

by Simon. Average Reading Time: about 10 minutes.

This article was first published on 30th December 2008, on Mark Pilgrim’s website. That website no longer exists so this article serves as an historical record. I have preserved all emphasis and links as per the original article.

Unless you’re going to stick to films made before 1927 or so, you’re going to want an audio track. A future article will talk about how to pick the audio codec that’s right for you, but for now I just want to introduce the concept and describe the playing field. (This information is likely to go out of date quickly; future readers, be aware that this was written in December 2008.)

Like video codecs, audio codecs are algorithms by which an audio stream is encoded. Like video codecs, there are lossy and lossless audio codecs. Today’s article will only deal with lossy audio codecs. Actually, it’s even narrower than that, because there are different categories of lossy audio codecs. Audio is used in many places where video is not (telephony, for example), and there is an entire category of audio codecs optimized for encoding speech. You wouldn’t rip a music CD with these codecs, because the result would sound like a 4-year-old singing into a speakerphone. But you would use them in an Asterisk PBX, because bandwidth is precious, and these codecs can compress human speech into a fraction of the size of general-purpose codecs.

And that’s all I have to say about speech-optimized audio codecs. Onward…

As I mentioned in part 2: lossy video codecs, when you “watch a video,” your player software is doing several things at once:

  1. Interpreting the container format
  2. Decoding the video stream
  3. Decoding the audio stream and sending the sound to your speakers
  4. Possibly decoding the subtitle stream as well. (Tomorrow’s article will be all about subtitle formats! I can hardly wait!)

The audio codec specifies how to do #3 — decoding the audio stream and turning it into digital waveforms that your speakers then turn into sound. As with video codecs, there are all sorts of tricks to minimize the amount of information stored in the audio stream. And since we’re talking about lossy audio codecs, information is being lost during the recording → encoding → decoding → listening lifecycle. Different audio codecs throw away different things, but they all have the same purpose: to trick your ears into not noticing the parts that are missing.

One concept that audio has that video does not is channels. We’re sending sound to your speakers, right? Well, how many speakers do you have? If you’re sitting at your computer, you may only have two: one on the left and one on the right. My desktop has three: left, right, and one more on the floor. So-called “surround sound” systems can have six or more speakers, strategically placed around the room. Each speaker is fed a particular channel of the original recording. The theory is that you can sit in the middle of the six speakers, literally surrounded by six separate channels of sound, and your brain synthesizes them and feels like you’re in the middle of the action. Does it work? A multi-billion-dollar industry seems to think so.

Most general-purpose audio codecs can handle two channels of sound. During recording, the sound is split into left and right channels; during encoding, both channels are stored in the same audio stream; during decoding, both channels are decoded and each is sent to the appropriate speaker. Some audio codecs can handle more than two channels, and they keep track of which channel is which and so your player can send the right sound to the right speaker.

There are lots of audio codecs. Did I say there were lots of video codecs? Forget that. There are a metric fuck-ton of audio codecs. These are the ones you need to know about:

  • MPEG-1 Audio Layer 3
  • Advanced Audio Coding
  • Windows Media Audio
  • Vorbis
  • Dolby Digital
  • Digital Theater System

MPEG-1 Audio Layer 3

MPEG-1 Audio Layer 3…colloquially known as “MP3.” If you haven’t heard of MP3s, I don’t know what to do with you. Walmart sells portable music players and calls them “MP3 players.” Walmart. Anyway…

MP3s can contain up to 2 channels of sound. They can be encoded at different bitrates: 64 kbps, 128 kbps, 192 kbps, and a variety of others from 32 to 320. Higher bitrates mean larger file sizes and better quality audio, although the ratio of audio quality to bitrate is not linear. (128 kbs sounds more than twice as good as 64 kbs, but 256 kbs doesn’t sound twice as good as 128 kbs.) Furthermore, the MP3 format allows for variable bitrate encoding, which means that some parts of the encoded stream are compressed more than others. For example, silence between notes can be encoded at a very low bitrate, then the bitrate can spike up a moment later when multiple instruments start playing a complex chord. MP3s can also be encoded with a constant bitrate, which, unsurprisingly, is called constant bitrate encoding.

The MP3 standard doesn’t define exactly how to encode MP3s (although it does define exactly how to decode them); different encoders use different psychoacoustic models that produce wildly different results, but are all decodable by the same players. The open source LAME project is the best free encoder, and arguably the best encoder period for all but the lowest bitrates.

The MP3 format was standardized in 1991 and is patent-encumbered, which explains why Linux can’t play MP3 files out of the box. Pretty much every portable music player supports standalone MP3 files, and MP3 audio streams can be embedded in any video container. Adobe Flash can play both standalone MP3 files and MP3 audio streams within an MP4 video container.

Advanced Audio Coding

Advanced Audio Coding…affectionately known as “AAC.” Standardized in 1997, it lurched into prominence when Apple chose it as their default format for the iTunes Store. Originally, all AAC files “bought” from the iTunes Store were encrypted with Apple’s proprietary DRM scheme, called FairPlay. Many songs in the iTunes Store are now available as unprotected AAC files, which Apple calls “iTunes Plus” because it sounds so much better than calling everything else “iTunes Minus.” The AAC format is patent-encumbered; licensing rates are available online.

AAC was designed to provide better sound quality than MP3 at the same bitrate, and it can encode audio at any bitrate. (MP3 is limited to a fixed number of bitrates, with an upper bound of 320 kbs.) AAC can encode up to 48 channels of sound, although in practice no one does that. The AAC format also differs from MP3 in defining multiple profiles, in much the same way as H.264, and for the same reasons. The “low-complexity” profile is designed to be playable in real-time on devices with limited CPU power, while higher profiles offer better sound quality at the same bitrate at the expense of slower encoding and decoding.

All current Apple products, including iPods, AppleTV, and QuickTime support certain profiles of AAC in standalone audio files and in audio streams in an MP4 video container. Adobe Flash supports all profiles of AAC in MP4, as do the open source mplayer and VLC video players. For encoding, the FAAC library is the open source option; support for it is a compile-time option in mencoder and ffmpeg. (I’ll dive into all the different encoding tools in a future article.)

Windows Media Audio

Windows Media Audio…a.k.a. “WMA.” As you might guess from the name, Windows Media Audio was developed by Microsoft. The acronym “WMA” has historically referred to many different things: a lossless audio codec (“WMA Lossless”), a speech-optimized codec (“WMA Voice”), and several different lossy audio codecs (“WMA 1,” “WMA 2,” “WMA 7,” “WMA 8,” “WMA 9,” and “WMA Pro”). It is also (incorrectly) used to refer to the Advanced Systems Format, because WMA-encoded audio streams are usually embedded in an ASF container. Roughly speaking, the lossy audio codecs (WMA 1-9) compete with MP3 and low-complexity AAC; WMA Lossless competes with Apple Lossless and FLAC; WMA Pro competes with high-complexity AAC, Vorbis, AC-3, and DTS.

All the different codecs under the “WMA” brand are playable with Windows Media Player, which comes pre-installed on desktops and laptops running Microsoft Windows XP and Vista. Portable devices like the Zune and the ironically named “PlaysForSure” devices can play WMA 1-9; stores that allow you to “purchase” WMA files generally encrypt them with a Microsoft-proprietary DRM scheme. The open source ffmpeg project can play WMA 1-9, and Flip4Mac offers a commercial QuickTime component to encode and decode WMA audio on Mac OS X.

WMA 1-9 support up to 2 channels of sound; WMA Pro supports up to 8 channels of sound. All WMA formats are patent-encumbered; licensing information is available from Microsoft.


Vorbis…known to many as “Ogg Vorbis,” although for some reason that pisses off both Ogg and Vorbis advocates. (Technically, “Ogg” is a container format, and Vorbis audio streams can be embedded in other containers.) Vorbis is not encumbered by any known patents and is therefore supported out-of-the-box by all major Linux distributions and by portable devices running the open source Rockbox firmware. Mozilla Firefox 3.1 will support Vorbis audio files in an Ogg container, or Ogg videos with a Vorbis audio track. Android mobile phones can also play standalone Vorbis audio files. Vorbis audio streams are usually embedded in an Ogg container, but they can also be embedded in an MP4 or MKV container (or, with some hacking, in AVI).

There are open source Vorbis encoders and decoders, including OggConvert (encoder), ffmpeg (decoder), aoTuV (encoder), and libvorbis (decoder). There are also QuickTime components for Mac OS X and DirectShow filters for Windows.

Vorbis supports an arbitrary number of sound channels.

Dolby Digital

Dolby Digital…a.k.a. “AC-3.” AC-3 was developed by Dolby Laboratories. AC-3 is most well-known for being a mandatory format in the DVD standard; all DVD players must be able to decode AC-3 audio streams. It is also mandatory for Blu-Ray players, and many digital TV broadcasts send AC-3 audio streams as well. AC-3 supports up to 6 channels of sound and bitrates of up to 640 kbps, although its most popular application — audio on DVDs — is officially limited to 448 kbps. (Blu-Ray discs may use the maximum 640 kbps.)

There are open source encoders and decoders for AC-3, including liba52 (decoding), AC3Filter (decoding), and Aften (encoding). ffmpeg has a compile-time option to include liba52, which will allow all ffmpeg-based players and plugin chains (like GStreamer) to play AC-3 audio streams. However, the AC-3 format is patent-encumbered; licensing is brokered by Dolby Laboratories.

AC-3 is rarely seen in standalone audio files; it is designed to be embedded in a video container. Other than DVDs and Blu-Ray discs (which use a video container format I haven’t talked about yet), you can embed AC-3 audio streams in MKV, AVI, and — just standardized earlier this year — in MP4 files (discussion). Apple’s AppleTV set-top box is the only hardware device I know of that supports AC-3 in MP4; you can encode AppleTV-compatible AC3-in-MP4 videos with HandBrake, or manually insert AC-3 audio into existing MP4 files with this Windows-only fork of mp4creator.

Digital Theater System

Digital Theater System…a.k.a. “DTS.” As you might guess from the name, DTS is designed for real-life movie theaters. Like WMA, “DTS” is a brand name for a family of different audio formats. The “core” DTS format supports up to six channels; later extensions like DTS-HD support up to eight channels. There is also DTS-HD Master Audio, a lossless variant by the same company. Core DTS is designed for high bitrates (up to 1536 kbps, which is virtually indistinguishable from being there in the first place). DTS-HD Master Audio bitrates can go even higher, although at some point even audiophiles will wonder why they should bother.

Core DTS was not originally part of the DVD specification, so early DVD players did not support it. Most recent DVD players support natively decoding core DTS audio or passing the audio stream through to an external speaker system which decodes it, but relatively few DVDs include a DTS stream due to size constraints. Core DTS is a mandatory part of the Blu-Ray specification, and many Blu-Ray discs include a DTS audio track — sometimes the exact same stream that was originally played in the movie theater. (DTS-HD Master Audio is an optional part of the Blu-Ray specification, but few Blu-Ray discs include it due to — you guessed it — size constraints.)

DTS is patent-encumbered; licensing is brokered by DTS, Inc.

And so forth and so on…
As with everything else in this series, this article barely scratches the surface. (Really!) If you like, you can read about other audio codecs: ATRAC, Musepack, MP2, RealAudio, AMR, ADPCM, and so forth and so on. Wikipedia has a comparison of common audio codecs, HydrogenAudio has lots of technical details, and wiki.multimedia.cx is always your friend too.

This article has been tagged

, , , , , , , , , , , , , , , , , , , , , , , , ,

Other articles I recommend

Mark Pilgrim – A Gentle Introduction to Video Encoding: Container Formats

You may think of video files as “AVI files” or “MP4 files.” In reality, “AVI” and “MP4″ are just container formats. Just like a ZIP file can contain any sort of file within it, video container formats only define how to store things within them, not what kinds of data are stored. (It’s a little more complicated than that, because not all video streams are compatible with all container formats, but never mind that for now.) A video file usually contains multiple tracks — a video track (without audio), one or more audio tracks (without video), one or more subtitle/caption tracks, and so forth. Tracks are usually interrelated; an audio track contains markers within it to help synchronize the audio with the video, and a subtitle track contains time codes marking when each phrase should be displayed. Individual tracks can have metadata, such as the aspect ratio of a video track, or the language of an audio or subtitle track. Containers can also have metadata, such as the title of the video itself, cover art for the video, episode numbers (for television shows), and so on.

Mark Pilgrim – A Gentle Introduction to Video Encoding: Lossy Video Codecs

The most important consideration in video encoding is choosing a video codec. A future article will talk about how to pick the one that’s right for you, but for now I just want to introduce the concept and describe the playing field. (This information is likely to go out of date quickly; future readers, be aware that this was written in December 2008.)

Mark Pilgrim – A Gentle Introduction to Video Encoding: Constraints

I had lunch with my father the other day, and I explained this series as well as I could to someone who didn’t start programming when he was 11. His immediate reaction was, “Why are there so many different formats? Why can’t everybody just agree on a single format? It is political, or technical, or both?” The short answer is, it’s both. The history of video in any medium — and especially since the explosion of amateur digital video — has been marred by a string of companies who wanted to use container formats and video codecs as tools to lock content producers and content consumers into their little fiefdoms. Own the format, own the future. And when I say “history” — well, it’s still going on.