Wednesday, 17 June 2009

Theora: it's not totally useless...

I've been taking a brief look at Theora, since it seems to be winning its battle to be one of the standard HTML5 video-tag codecs.

It's not entirely stupid: it's obviously been designed for PCs, rather than by anyone involved in TV or video conferencing, and it's in much the same state as WMV9 Main Profile (which is nearly VC-1 main profile, but not quite ..).

Theora turns out to be a fairly standard I- and P- only block structured codec. We have no B-frames, but we can predict either from the preceding frame or the preceding I-frame (Theora calls them INTRA or INTER rather than I and P). Theora is progressive-only and nominally fixed frame-rate only, though I suspect variable frame rate by PTS will become quite common.

Theora's blocks are the right size (8x8). It has two block groupings - macroblocks of 2x2 blocks and superblocks of 4x4 blocks - and the blocks are ordered along a Hilbert curve rather than in raster order, which is what MPEG-2, H.264 and VC-1 use.
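To make the Hilbert ordering concrete, here is a minimal sketch that walks a generic Hilbert curve over the 4x4 blocks of one superblock. The function name hilbert_d2xy is made up for illustration, and (0,0) is taken to be the bottom-left block; whether this orientation is exactly the one Theora uses should be checked against the spec.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. This is the textbook iterative conversion,
    not code taken from any Theora implementation.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so the sub-curves join up.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Coding order of the 16 blocks in one 4x4 superblock (assumed orientation).
order = [hilbert_d2xy(4, d) for d in range(16)]

if __name__ == "__main__":
    for index, (bx, by) in enumerate(order):
        print(index, bx, by)
```

The useful property is that consecutive blocks in coding order are always spatial neighbours, which plain raster order doesn't give you at the end of each row.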

Raster order for Theora is bottom-to-top left-to-right, so (0,0) is bottom left rather than top left. It's unclear why this happened.

There's a fairly conventional three-plane colour structure, two supported colour spaces (NTSC-M and PAL), you can code in 4:2:0, 4:2:2 or 4:4:4, and chroma sits between luma samples in both X and Y - there's no variable luma positioning.
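As a quick illustration of what the three sampling modes mean for plane sizes, here is a small sketch. The names SUBSAMPLING and chroma_plane_size are mine, not from the spec, and it assumes luma dimensions that divide evenly (which holds for a frame made of whole macroblocks).

```python
# Horizontal and vertical chroma decimation factors per sampling mode.
SUBSAMPLING = {
    "4:4:4": (1, 1),  # chroma at full resolution
    "4:2:2": (2, 1),  # chroma halved horizontally only
    "4:2:0": (2, 2),  # chroma halved in both directions
}


def chroma_plane_size(luma_w, luma_h, mode):
    """Return (width, height) of each chroma plane for a given luma size."""
    sx, sy = SUBSAMPLING[mode]
    return luma_w // sx, luma_h // sy


if __name__ == "__main__":
    for mode in SUBSAMPLING:
        print(mode, chroma_plane_size(320, 240, mode))
```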

The decoded region is in whole macroblocks, but the visible frame can be any window on it so we can have arbitrary amounts of invisible picture. It appears that superblocks are the unit of coding but macroblocks the unit of motion compensation.
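A rough sketch of the bookkeeping this implies: round the visible picture up to whole macroblocks to get the smallest decoded region that can hold it, and flip row indices when handing bottom-up rows to something that expects top-down. The helper names are hypothetical, and the sketch ignores where the visible window is placed inside the decoded region.

```python
MB = 16  # macroblock edge in luma samples: 2x2 blocks of 8x8


def coded_size(pic_w, pic_h):
    """Smallest whole-macroblock frame that can hold a pic_w x pic_h picture."""
    def round_up(v):
        return (v + MB - 1) // MB * MB
    return round_up(pic_w), round_up(pic_h)


def to_top_down_row(y_bottom_up, height):
    """Convert a row index counted from the bottom to one counted from the top."""
    return height - 1 - y_bottom_up


if __name__ == "__main__":
    w, h = coded_size(1920, 1080)
    print(w, h)
    print(to_top_down_row(0, h))
```

So a 1920x1080 picture needs at least 1920x1088 decoded, with 8 rows of invisible picture somewhere.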

There's a fairly conventional MV/residual/deblock filter structure with only one, fairly simple in-loop deblock. You'll probably want an out-of-loop dering and deblock for low bitrate.

The transform is a DCT with a very particular implementation - the butterflies and cos approximation values are laid down in the spec. It's effectively yet another explicit integer-only frequency transform and a quick read suggests that it's exact.

Motion vector derivation and motion compensation are pretty standard; we get motion vectors down to quarter-pel and the filter is a round-and-average beast rather than anything FIR-like.
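For anyone unsure what "round-and-average rather than FIR" means in practice, here is a one-dimensional sketch: a fractional displacement is handled by averaging the two nearest whole samples instead of running a multi-tap filter. It works in half-sample units (the first comment below says that is the real precision). The function name, the round-half-up rule and the edge clamping are all assumptions for illustration; the spec defines the real behaviour.

```python
def predict_row(ref_row, mv_half):
    """Motion-compensate one row of samples.

    ref_row: list of integer samples from the reference frame.
    mv_half: displacement in half-sample units (may be negative).
    """
    whole = mv_half >> 1  # whole-sample part (floors for negatives)
    frac = mv_half & 1    # 1 if there is a half-sample part
    last = len(ref_row) - 1

    def clamp(i):
        # Assumed edge handling: repeat the border sample.
        return max(0, min(last, i))

    out = []
    for i in range(len(ref_row)):
        a = ref_row[clamp(i + whole)]
        if frac:
            b = ref_row[clamp(i + whole + 1)]
            out.append((a + b + 1) >> 1)  # assumed rounding: half up
        else:
            out.append(a)
    return out


if __name__ == "__main__":
    print(predict_row([10, 20, 30, 40], 1))
```

The point is only that each output sample depends on at most two input samples, which is far cheaper than the six-tap filters H.264 uses.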

The bitstream is run-length Huffman coded and packets are bit-counted, so we have to rely on out-of-ES framing to recover from synchronisation errors. There's obviously no emulation prevention. Presumably for coding efficiency reasons Theora groups bits by role rather than by macroblock, so we get all the coding markers, then all the MVs, then all the coefficients - a bit like some of the bitplane coding in VC-1.

This means we need more memory (and more memory I/O) than necessary, but it's not entirely fatal and at least we get the coefficients last.

Theora has almost entirely dynamic quant and coding tables, stored in the decoder initialisation headers, which may be quite big - the standard suggests 16kish. This means that for effective MPEG-TS/PS/PES use we're going to need some kind of out-of-ES SPS framing, which effectively means that Theora has no ES. This is a right pain, both because Theora ES streams don't exist and because we can't optimise quant tests in zigzag decode.

The zigzag table, oddly, is fixed, so we can optimise that. Go figure.
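Because the scan is fixed it can be built once and baked in. Here is a sketch that generates the conventional JPEG-style 8x8 zigzag as a list of raster indices; I'm assuming, not asserting, that Theora's table is this same pattern, and zigzag_order is just an illustrative name.

```python
def zigzag_order(n=8):
    """Return raster indices (row * n + col) of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col, one anti-diagonal at a time
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run bottom-left to top-right
        for r in rows:
            order.append(r * n + (s - r))
    return order


ZIGZAG = zigzag_order()

if __name__ == "__main__":
    print(ZIGZAG[:10])
```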

The Ogg framing format is quite odd - its plethora of structures is reminiscent of ASF - but should be dealable with. It's clearly where Matroska got its odd thread ideas from.

Next up: Dirac ..


  1. "motion vectors down to quarter-pel"

    I'm afraid not. The motion-vectors are half-pel (and arguably fake half-pel at that; only two points are ever averaged, never four). Even if this would be qpel in the chroma planes, it's effectively rounded to hpel there too.

    Macroblocks are the unit of motion compensation... unless the macroblock is coded in 4MV mode, in which case every luma block gets its own MV, and they're averaged to produce the chroma MV.

    Also, I'm very curious as to what you mean by "the zigzag table is fixed" and "we can't optimise quant tests in zigzag decode".

    Anyway, if you have any questions about Theora or the Ogg embedding, please join us in #theora on, or join one of the mailing lists. It's a very friendly place.

  2. Oh, I also don't know what you mean by VFR using PTS. The Ogg mapping for Theora does not allow for variable framerates, so presumably you have something else in mind.

    Also, in Theora, duplicated frames are extremely cheap (just a few bytes) so the recommended way of, say, switching between 24 and 30 fps, is to make a 120 fps stream with lots of duplicated frames. It only costs a few kbps, similar to the cost of adding timestamps to every frame. It's not true VFR, but it's pretty good.

    Of course, this requires players smart enough not to recopy their image buffers after a duplicate frame, and unfortunately most are still pretty dumb... but that logic is way easier than handling frames with arbitrary timestamps.

  3. Ben, the standard Ogg mapping is fixed frame rate. There's no reason someone couldn't define a new mapping that specifies a VFR. It would only require altering the granpos to encode a PTS instead of frame number (and probably an alternate way of hinting where keyframes are). It wouldn't be compatible with the standard mapping, but it would work just fine.