Base RGB colorspace
Images in YCbCr are based on some underlying RGB colorspace. An RGB colorspace is specified by the chromaticities of its three primaries (red, green, blue) and of its white point, plus a transfer characteristic. The transfer characteristic is what is loosely called the "gamma" value; the actual curve is often close to a pure gamma (power-law) curve, but not exactly one. Computer displays generally use sRGB, which is nearly identical to the HD television colorspace, and visually very similar to the various SD television colorspaces. An RGB colorspace can be specified at various levels of pedantry.
For now, it's sufficient to ignore much of the pedantry and assume sRGB/HD/SD.
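To illustrate "close to a gamma curve, but not exactly", here is a small sketch comparing the sRGB decoding curve with a pure 2.2 power law. The function names are made up for this example; the constants are the usual published sRGB ones.

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded value in [0, 1] to linear light."""
    if c <= 0.04045:
        # Short linear segment near black.
        return c / 12.92
    # Offset power-law segment for the rest of the range.
    return ((c + 0.055) / 1.055) ** 2.4


def gamma22_to_linear(c):
    """Decode with a pure power law (gamma 2.2), for comparison."""
    return c ** 2.2


if __name__ == "__main__":
    for c in (0.0, 0.02, 0.25, 0.5, 0.75, 1.0):
        print(c, srgb_to_linear(c), gamma22_to_linear(c))
```

The two curves stay close over most of the range and differ mainly near black, where sRGB uses a linear segment.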
Color Matrixing
YCbCr values are RGB values multiplied by a matrix. The matrix converts the colors into values that roughly represent grey (Y), and Cb and Cr, which represent differences from grey along the blue and red axes. There are two common matrices in use, corresponding to HD television and SD television. There is also a third related matrix used for JPEG images.
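As a rough sketch of the matrixing step, the following converts normalized RGB to normalized YCbCr using the luma coefficients commonly quoted for the SD (BT.601) and HD (BT.709) matrices. The function and constant names are invented for illustration, and the output is left unscaled (Y in 0..1, Cb/Cr in -0.5..0.5).

```python
# Luma coefficients (Kr, Kb); Kg follows as 1 - Kr - Kb.
BT601 = (0.299, 0.114)    # SD television
BT709 = (0.2126, 0.0722)  # HD television


def rgb_to_ycbcr(r, g, b, coeffs=BT601):
    """Convert gamma-encoded RGB in [0, 1] to (Y, Cb, Cr).

    Y is in [0, 1]; Cb and Cr are in [-0.5, 0.5].
    """
    kr, kb = coeffs
    kg = 1.0 - kr - kb
    y = kr * r + kg * g + kb * b          # weighted grey value
    cb = (b - y) / (2.0 * (1.0 - kb))     # scaled blue difference
    cr = (r - y) / (2.0 * (1.0 - kr))     # scaled red difference
    return y, cb, cr


if __name__ == "__main__":
    print(rgb_to_ycbcr(1.0, 1.0, 1.0))          # white: Cb and Cr near 0
    print(rgb_to_ycbcr(0.0, 0.0, 1.0))          # blue, SD matrix
    print(rgb_to_ycbcr(0.0, 0.0, 1.0, BT709))   # blue, HD matrix
```

Written out as a 3x3 matrix, each output is just a fixed linear combination of R, G and B; the form above only makes the "difference from grey" structure visible.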
For normal 8-bit video, Y runs from 16 (black) to 235 (white), and Cb/Cr from 16 to 240. This means that YCbCr values outside these ranges cannot be produced from RGB values between 0 and 255 (they are "outside the RGB gamut"). Such values do not matter for playback, but are of minor importance for video editing. JPEG images use the full range from 0 to 255.
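The ranges above can be sketched as a quantization step applied to normalized values (Y in 0..1, Cb/Cr in -0.5..0.5). The helper names are hypothetical, and the clamping and rounding choices are assumptions of this sketch rather than something mandated here.

```python
def clamp8(v):
    """Clamp an integer to the 8-bit range 0..255."""
    return max(0, min(255, v))


def quantize_video(y, cb, cr):
    """Video range: Y in 16..235, Cb/Cr in 16..240, centered on 128."""
    return (clamp8(round(16 + 219 * y)),
            clamp8(round(128 + 224 * cb)),
            clamp8(round(128 + 224 * cr)))


def quantize_full(y, cb, cr):
    """Full range as used by JPEG: all components span 0..255."""
    return (clamp8(round(255 * y)),
            clamp8(round(128 + 255 * cb)),
            clamp8(round(128 + 255 * cr)))


if __name__ == "__main__":
    print(quantize_video(0.0, 0.0, 0.0))    # black
    print(quantize_video(1.0, 0.5, -0.5))   # extremes of each component
    print(quantize_full(1.0, 0.5, -0.5))
```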
Subsampling
The chroma components are usually subsampled horizontally, vertically, or both. The most common subsamplings are called 4:4:4 (no subsampling, like RGB), 4:2:2 (subsampling only in the horizontal direction), and 4:2:0 (most common, subsampling in both directions). However, other formats are by no means rare, such as 4:1:1 (4x horizontal) in NTSC DV, and "3:1:1" (3x horizontal) in Sony HDCAM.
If components are subsampled, the subsampled values are derived from the non-subsampled values either by taking every other value (co-sited subsampling), or by interpolating between values (half-sited subsampling). The most common case is to have co-sited subsampling horizontally, and half-sited subsampling vertically. This is what is done in MPEG-2 video on DVDs and in broadcast applications. JPEG, MPEG-1, and Theora use half-sited subsampling in both directions when running in 4:2:0. In PAL DV (which is 4:2:0), chroma is co-sited in both directions, but Cr and Cb are separated by one pixel vertically.
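The difference between the two sitings can be shown on a single row of chroma values. This is a minimal sketch with made-up function names; it does 2x subsampling only and leaves out any low-pass pre-filtering.

```python
def subsample_cosited(row):
    """Keep every other sample: output i sits at input position 2*i."""
    return row[0::2]


def subsample_half_sited(row):
    """Average each pair: output i sits halfway between inputs 2*i and 2*i+1."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]


if __name__ == "__main__":
    row = [10, 20, 30, 40]
    print(subsample_cosited(row))
    print(subsample_half_sited(row))
```

For 4:2:0 the same operation would be applied along columns as well, possibly with a different siting than the one used along rows.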
Generally, non-subsampled values are low pass filtered before subsampling, in order to decrease aliasing. Ideally, one uses matching filters to get the best results for a downsampling/upsampling round trip. Unfortunately, it's generally unknown what filter was originally used when downsampling, and recommended practice is only approximately specified by standards.
It is possible to create matching filters that upsample and then downsample losslessly, although the upsampled values have more aliasing than those produced by a more optimal upsampling filter.
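One simple pair of matching filters, assuming co-sited 2x subsampling, is linear-interpolation upsampling paired with plain decimation. The sketch below uses invented names and repeats the last sample at the edge, which is an arbitrary choice; it only shows that the upsample-then-downsample round trip returns the original values.

```python
def upsample_cosited(chroma):
    """Double the sample count; original samples stay at even positions."""
    out = []
    for i, value in enumerate(chroma):
        # At the right edge there is no next sample, so repeat the last one.
        nxt = chroma[i + 1] if i + 1 < len(chroma) else value
        out.append(value)              # original sample, kept unchanged
        out.append((value + nxt) / 2)  # interpolated in-between sample
    return out


def downsample_cosited(full):
    """Matching downsampler: keep the even positions only."""
    return full[0::2]


if __name__ == "__main__":
    chroma = [10, 30, 20]
    up = upsample_cosited(chroma)
    print(up)
    print(downsample_cosited(up) == chroma)
```

Note that the round trip is only lossless in this direction; decimating a full-resolution signal and then interpolating it back generally loses information.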
Hard transitions, such as rendering of text, do not work well when subsampled. It's a good idea to render text to be overlaid onto video with some fuzziness or low-pass filtering.
Interlacing
Interlacing is vertical subsampling for all components. Each picture at 50/60 Hz is downsampled vertically: for half of the pictures you take lines 0, 2, 4, and so on; for the other half you take lines 1, 3, 5, and so on. The downsampled pictures are called fields. Temporally adjacent fields are often combined into frames, so you get a picture where every other line was measured at a different time. (Actually, cameras create fields directly, only reading half the lines off the sensors.)
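The line selection described above can be sketched with a frame represented as a list of lines. The names split_fields and weave are made up for this example.

```python
def split_fields(frame):
    """Split a frame (list of lines) into the even-line and odd-line fields."""
    top = frame[0::2]      # lines 0, 2, 4, ...
    bottom = frame[1::2]   # lines 1, 3, 5, ...
    return top, bottom


def weave(top, bottom):
    """Interleave two fields back into a frame, starting with the top field."""
    frame = []
    for i in range(len(top) + len(bottom)):
        source = top if i % 2 == 0 else bottom
        frame.append(source[i // 2])
    return frame


if __name__ == "__main__":
    frame = ["line0", "line1", "line2", "line3", "line4"]
    top, bottom = split_fields(frame)
    print(top, bottom)
    print(weave(top, bottom) == frame)
```

In real interlaced video the two fields passed to weave come from different instants in time, which is why moving objects show comb-like edges in the woven frame.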
Vertical chroma subsampling and interlacing interact: Lines in a chroma component come from alternating fields. One needs to upsample the chroma component vertically before deinterlacing.
Pictures are generally not low-pass filtered in the vertical direction before the decimation process that creates fields. This allows static or low-motion video to be reconstructed with no visual blurring in the vertical direction, but viewing a single field will often show vertical aliasing. This makes directly scaling interlaced video impossible (unless you are only scaling horizontally).
Deinterlacing can be done in several different ways, varying in complexity and quality. The ideal method takes a field and its nearest neighbors and, for each pixel to be reconstructed, compares neighboring pixels in the same field to pixels in the neighboring fields. If they are similar and/or indicate low motion, the pixel is interpolated temporally. Otherwise, the pixel is interpolated spatially. Faster, lower quality methods can use one or no neighboring fields. Slower methods can use more complex motion estimation and reconstruction.
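A very reduced sketch of the per-pixel decision might look like the following. It is not any particular deinterlacer: the motion test (difference between the previous and next field at the missing position), the threshold value, and the function name are all assumptions chosen to keep the example short.

```python
def reconstruct_pixel(above, below, prev_field, next_field, threshold=10):
    """Reconstruct one missing pixel of the current field.

    above, below: pixels directly above and below in the current field.
    prev_field, next_field: pixels at the missing position in the
    previous and next fields, which carry the opposite set of lines.
    """
    if abs(prev_field - next_field) <= threshold:
        # Neighboring fields agree: treat as low motion, interpolate in time.
        return (prev_field + next_field) / 2
    # Neighboring fields disagree: assume motion, interpolate in space.
    return (above + below) / 2


if __name__ == "__main__":
    print(reconstruct_pixel(100, 110, 50, 52))    # static area
    print(reconstruct_pixel(100, 110, 50, 200))   # moving area
```

A real implementation would look at more than one pixel per field and blend between the two results rather than switching abruptly.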
Memory Layout
Memory layout is either packed or planar. Planar formats have the Y plane as a 2-d array of bytes followed by the U plane, followed by the V plane. Sometimes the U and V are swapped.
Packed formats for 4:2:2 are usually UYVYUYVY... or YUYVYUYV... for each line. Packed 4:4:4 is usually xYUVxYUVxYUV..., where the x is either arbitrary padding or an alpha value. Packed 4:2:0 doesn't exist in practice.
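To make the UYVY byte order concrete, here is a small sketch that splits one packed line into separate Y, U and V lists. The function name is invented, and the line is assumed to have an even width with no handling of row padding.

```python
def unpack_uyvy(line, width):
    """Split one packed UYVY line into (Y, U, V) sample lists.

    Each 4-byte group U Y0 V Y1 covers two pixels that share one U and one V.
    """
    if width % 2:
        raise ValueError("this sketch only handles even widths")
    ys, us, vs = [], [], []
    for i in range(0, width * 2, 4):
        u, y0, v, y1 = line[i:i + 4]
        us.append(u)
        ys.extend((y0, y1))
        vs.append(v)
    return ys, us, vs


if __name__ == "__main__":
    packed = bytes([128, 16, 129, 17, 130, 18, 131, 19])  # 4 pixels
    print(unpack_uyvy(packed, 4))
```

For YUYV the only change would be the order in which the four bytes of each group are read.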
The distance in memory from a pixel on one line to the corresponding pixel on the next line is called the row stride. Row strides are almost always multiples of 4, in order to keep the first pixel of each line aligned to a 4-byte boundary. This means that most odd-sized video needs to be padded by a few bytes at the end of each line.
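The padding arithmetic is short enough to show directly. The helper names are hypothetical, and the sketch assumes a whole number of bytes per pixel.

```python
def round_up_4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def row_stride(width, bytes_per_pixel=1):
    """Stride in bytes for a line of the given width, padded to a multiple of 4."""
    return round_up_4(width * bytes_per_pixel)


def pixel_offset(x, y, stride, bytes_per_pixel=1):
    """Byte offset of pixel (x, y) within a plane that uses the given stride."""
    return y * stride + x * bytes_per_pixel


if __name__ == "__main__":
    stride = row_stride(321)           # 321 one-byte samples per line
    print(stride, stride - 321)        # stride and padding bytes per line
    print(pixel_offset(5, 2, stride))  # offset uses the stride, not the width
```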
There are several de facto standardized formats, each identified by a FOURCC, which is a 4-byte code that is usually four letters. Examples are I420 (planar 4:2:0), UYVY (packed 4:2:2) and AYUV (packed 4:4:4). There are complex unwritten rules for how to calculate the row stride for each standard format, as well as how to calculate the size of the chroma components. These rules are roughly based on what video card manufacturers and DirectShow use, which also ends up being used by the Xv extension.
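Since the rules are unwritten, the following is only one plausible convention for I420, assumed here for illustration: strides rounded up to a multiple of 4, and odd widths and heights rounded up to even before halving for the chroma planes. Other software may compute these differently, and the function and key names are made up.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return (n + multiple - 1) // multiple * multiple


def i420_layout(width, height):
    """Strides, plane offsets and total size for a planar 4:2:0 buffer.

    Assumed convention, not a specification: Y plane first, then U, then V.
    """
    y_stride = round_up(width, 4)
    c_stride = round_up(round_up(width, 2) // 2, 4)
    y_rows = round_up(height, 2)
    c_rows = y_rows // 2
    u_offset = y_stride * y_rows
    v_offset = u_offset + c_stride * c_rows
    return {
        "y_stride": y_stride,
        "c_stride": c_stride,
        "u_offset": u_offset,
        "v_offset": v_offset,
        "size": v_offset + c_stride * c_rows,
    }


if __name__ == "__main__":
    print(i420_layout(320, 240))
    print(i420_layout(321, 241))
```

For even, 4-aligned dimensions this reduces to the familiar width * height * 3 / 2 bytes; the rounding only matters for odd or unaligned sizes.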
Various usage patterns are made easier by allowing non-standard row strides and specifying the address of each plane separately. GStreamer does not allow this currently, but this is a goal for 1.0.

