Mark Pilgrim – A Gentle Introduction to Video Encoding: Constraints

by Simon.

This article was first published on 8th January 2009, on Mark Pilgrim’s website. That website no longer exists so this article serves as an historical record. I have preserved all emphasis and links as per the original article.

I had lunch with my father the other day, and I explained this series as well as I could to someone who didn’t start programming when he was 11. His immediate reaction was, “Why are there so many different formats? Why can’t everybody just agree on a single format? Is it political, or technical, or both?” The short answer is, it’s both. The history of video in any medium — and especially since the explosion of amateur digital video — has been marred by a string of companies who wanted to use container formats and video codecs as tools to lock content producers and content consumers into their little fiefdoms. Own the format, own the future. And when I say “history” — well, it’s still going on. Tried to play a Windows Media Video on Mac OS X lately? The codec and container support is out there, but it’s not baked in. Want to watch movie trailers on Apple.com? Please install QuickTime. And so forth and so on. The only thing that was pre-installed on both platforms was Flash, so when a few startups dipped their toes into the Internet video waters, the ones that used Flash Video won despite it being an objectively inferior codec. (Some revision of Flash 9 added support for H.264 video, AAC audio, and the MP4 container, which is what YouTube HD uses.)

So that’s the politics. But there are also technical barriers. As with all engineering, video encoding is primarily about constraints. I can think of 10 just off the top of my head:

  1. CPU capacity for decoding and playing in real time. This is one of the most important constraints, since video is meant to be watched in real time. That sounds simple, but it’s incredibly complex. Every video you’ve ever watched in your entire life had to be decoded and played in real time. Otherwise it stutters and the viewing experience sucks. And we’re talking about video here; if the viewing experience sucks, there’s nothing left. Some codecs are just more complex than others, and that translates into higher system requirements to decode videos in real time. As I’ve mentioned before, some codecs are now decoded by specialized hardware. iPhones have a little chip inside them that understands H.264 Baseline Profile; without that, the iPhone would need a Core 2 Duo processor to play movies, and it would have a battery life of 10 minutes.
  2. Codec compatibility. Normal people won’t download codecs or plug-ins just to watch a dog on a skateboard, or even to watch a trailer for a $100 million blockbuster. (Sadly, they will download plug-ins for porn, but those are invariably trojan horses. Or so I’ve read. Moving on…) The phone in your pocket can probably play AMR ringtones, maybe MP3 ringtones, but probably not Vorbis ringtones (unless you have an Android phone) — and you probably couldn’t download new codecs even if you wanted to (which, I must reiterate, nobody wants to). Apple and Real Networks tried for years to corner the web video market, but 99% of schmucks with a browser have Flash, so Flash video won on the web. Meanwhile, Firefox 3.1 will ship with support for the <video> element but will only support Theora and Vorbis in an Ogg container — even if your underlying operating system ships with other codecs.
  3. CPU capacity for encoding. Encoding takes a long time. Taking my home movie from iMovie to a DVD used to take 8 hours on a Powerbook G4 laptop. These days you can rip a DVD movie with Xvid in 30 minutes, or you can rip it with a more complex codec with all optional features turned on, and maybe it’ll still take 8 hours. It’ll look better, but will it look 16 times better? If you’re only doing it once, maybe you don’t care. If you’re running YouTube and people are uploading 13 hours of video every minute, maybe you do. CPU cycles aren’t free; at that scale, they’re not even cheap. (That’s a real statistic, by the way; I got it from the page on the Google intranet entitled “What can we tell non-Googlers?” and it’s accurate as of September 2008.)
  4. Acceptable delay between recording and delivery. In my own experience, videos I’ve uploaded on YouTube are available within minutes, which is just mind-boggling when you consider the volume. If you’re re-encoding a live stream, even a few minutes delay is probably unacceptable. That means you’ll need a faster encoder, a less complex codec, or lower quality settings.
  5. Audience size. It’s not a big secret that lots of video on the Internet looks like crap. Partly that’s because the video uploader uploaded crappy video, but it’s also because most Internet videos are only watched by a few people, and it’s just not a worthwhile tradeoff to spend 8 hours re-encoding it. On the other hand, if you’re mastering a DVD that’ll get sold to 10 million people, you’ll probably use higher quality settings.
  6. Screen dimensions. DVDs can’t store high-def 1920 x 1080 video because the standard doesn’t allow for it, which makes perfect sense because it was designed around the screen resolution of standard-def TVs. Blu-Ray ups the limit, but there’s still a limit. Screen sizes vary more for PC video, but there will always be practical upper limits depending on your audience.
  7. My bandwidth. If you’re streaming or downloading video, some percentage of your audience is probably living in a third-world country like the United States, with limited broadband access, slow speeds, and monthly bandwidth caps. Larger file size = longer wait to play = fewer videos watched overall.
  8. Your bandwidth. Obviously every bit I download is a bit that you upload, and bandwidth ain’t free either. “When I get a little money I buy bandwidth; and if any is left I buy food and clothes.” Or something like that.
  9. Hard limits on storage size. As I mentioned before, physical media has upper limits on total size. Commercial DVDs can hold upwards of 9 GB, which seems like a lot but really isn’t. Blu-Ray maxes out at 50 GB, which seems like a lot but really isn’t.
  10. Patents / licensing costs. Did I mention that most popular video codecs are patent-encumbered? This is why Wikimedia uses Theora exclusively, and why Firefox can ship a native Theora decoder but won’t ever ship H.264.

…and that’s the short list.
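
Some of the numbers in constraints 3 and 9 are easier to feel once you run the arithmetic. What follows is only a rough sketch in Python. The function names, the 2-hour running time, and the use of decimal gigabytes (1 GB = 10^9 bytes) are assumptions made for illustration, and the disc figures ignore any bitrate ceilings imposed by the disc standards themselves.

```python
# Back-of-the-envelope arithmetic for constraints 3 (encoding CPU) and 9 (storage).
# Function names and the 2-hour runtime are illustrative assumptions.

def realtime_encoders_needed(hours_uploaded_per_minute, encode_speed=1.0):
    """Encoders needed to keep up with uploads, where each encoder
    processes video at `encode_speed` times real time."""
    minutes_of_video_per_minute = hours_uploaded_per_minute * 60
    return minutes_of_video_per_minute / encode_speed


def average_bitrate_mbps(capacity_gb, runtime_minutes):
    """Average total bitrate (video plus audio) that exactly fills a disc,
    in megabits per second. Assumes decimal gigabytes (10**9 bytes)."""
    bits = capacity_gb * 10**9 * 8
    seconds = runtime_minutes * 60
    return bits / seconds / 10**6


# 13 hours uploaded every minute, encoders running at exactly real time.
print(realtime_encoders_needed(13))             # 780.0

# An 8-hour encode versus a 30-minute encode.
print((8 * 60) / 30)                            # 16.0

# A 2-hour movie on a 9 GB DVD and on a 50 GB Blu-Ray.
print(round(average_bitrate_mbps(9, 120), 1))   # 10.0
print(round(average_bitrate_mbps(50, 120), 1))  # 55.6
```

In other words, keeping pace with 13 hours of uploads per minute takes the equivalent of 780 encoders that each run at real-time speed, and every step up in codec complexity multiplies that number. The disc figures are the whole budget for picture and sound together, which is why 9 GB "seems like a lot but really isn’t."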

All of which leads me to the Zen of video encoding, which is this:

There is no right or wrong. There is only what works and what doesn’t.

If you can find even one combination of tools, delivery devices, and target platforms that satisfies your constraints and still accomplishes your goals, congratulations. You’re ahead of 99% of the people who’ve tried.
