Wednesday night
Most of the attendees met informally and chatted over tapas.
Thursday morning
Introductions
First, everybody introduced themselves and described their interests.
- Benjamin Otte
  - swfdec, GStreamer hacking, trying to fix video
- Josep Torra
  - With Fluendo, GStreamer
- Jan Schmidt
  - GStreamer release manager, DVD/Blu-ray playback
- Edward Hervey
  - GStreamer, Pitivi---wants to keep the solution simple
- David Schleef
  - swfdec, liboil, orc, GStreamer hacking for 10 years
- Eric Anholt
  - Make Linux graphics not suck (on Intel)
- Carl Worth
  - Cairo graphics library---wants to see accelerated cairo
- Søren Sandmann Pedersen
  - Pixman maintainer, formerly GTK+ hacking
- René Stadler
  - Working for Nokia---Maemo, GStreamer
- Felipe Contreras
  - Also with Nokia, (but working at lower levels)
- Wim Taymans
  - GStreamer maintainer
- Chris Wilson (still in transit to the conference)
  - Cairo hacker extraordinaire, performance measurement wizard
Background on projects and problem spaces
The next idea was to give a high-level overview of the various problems with graphics and video, and the state of the various projects intended to address them.
What's hard about GPU acceleration (Eric)
The graphics card supports writes at 1.5GB/second, but reads at only 20MB/second. With all existing applications/application stacks, there's at least *some* operation that requires a software fallback. So all of the current slowness is when we have to stop and read things back.
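To put those two bandwidth figures in perspective, here is a small back-of-the-envelope sketch (not something shown at the session). The 1920x1080, 4-bytes-per-pixel frame used in the usage note is an assumed example size, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth figures quoted in the notes, in bytes per second. */
#define GPU_WRITE_BYTES_PER_SEC 1.5e9
#define GPU_READ_BYTES_PER_SEC  20.0e6

/* Size in bytes of a width x height image (hypothetical helper). */
static size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return width * height * bytes_per_pixel;
}

/* Milliseconds needed to move `bytes` at `bytes_per_sec`. */
static double transfer_ms(size_t bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * (double)bytes / bytes_per_sec;
}
```

For the assumed 1920x1080 frame (8,294,400 bytes), transfer_ms() gives roughly 5.5 ms for the upload and roughly 415 ms for the readback, a factor of 75. A single fallback that reads back a full frame costs about as much as 75 uploads.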
Possible solutions:
- Fix things so that every single operation is accelerated (eliminate all fallbacks)
- Stop trying to do things on the GPU. Some applications have gotten a lot faster by just doing everything on the CPU and pushing the result.
Another thing to note here is that many of the operations we want to do in X are extremely tiny, (draw a few glyphs, for example). So the overhead in setting up the GPU, (which you won't notice if you're streaming thousands of triangles), becomes dominant.
An introduction to pixman (Søren)
Old "core" X graphics primitives were not suitable for modern applications. Keith Packard added the Render extension to X to provide what applications really want, (with an initial software implementation inside the X server). Render took off quite well initially due to finally providing high-quality font rendering, (sub-pixel, antialiased text rendered on the client side).
The cairo graphics library was implemented to make it easier to draw things with the Render extension, (which only had a single primitive for geometry---trapezoids). Cairo originally included a copy of the software implementation of Render from the X server. Meanwhile, the X server forked a bunch, (leading to more copies).
Søren merged all of the derived versions of this software-rendering code into a single library, which is pixman. Today, both cairo and the X server link against this stand-alone pixman library.
Pixman has been through 4 major revisions:
- 0.10 Initial release
- 0.12 Add SSE2 acceleration
- 0.14 Addition of ARM fast paths
- 0.16 Addition of extended PDF blend modes, much better ARM support.
As of the 0.16 release, there was a big rewrite and re-organization of the pixman code base. It now has a consistent coding style, and the code is much easier to approach and to extend with new fast paths. There's also a better test suite.
The current focus of development is on performance. Recent work includes much better image scaling, faster filtering, and much better and more ARM acceleration.
Pixman has a reputation for being really slow. That used to be true, but currently pixman is actually competitive with anything out there, (for example, Google's skia).
One thing that is still slow is trapezoid rendering. The fix here is probably to get rid of trapezoids as input to pixman. One approach is to have cairo pass "spans" (a run-length encoded mask) after rasterizing, or cairo could pass a polygon (prior to rasterizing).
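As a rough illustration of what "spans" means here, the sketch below expands a list of runs for one scanline into an 8-bit mask row. The span_t layout and the function name are hypothetical; this is not pixman's or cairo's actual interface, just one plausible shape for a run-length encoded mask.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical span: a horizontal run of pixels sharing one coverage value. */
typedef struct {
    int     x;        /* first pixel of the run */
    int     len;      /* number of pixels in the run */
    uint8_t coverage; /* 0 = uncovered, 255 = fully covered */
} span_t;

/* Expand the spans of one scanline into a mask row of `width` pixels.
 * Pixels not touched by any span stay 0; spans are clipped to the row. */
static void spans_to_mask_row(const span_t *spans, int n_spans,
                              uint8_t *row, int width)
{
    assert(width >= 0);
    memset(row, 0, (size_t)width);
    for (int i = 0; i < n_spans; i++) {
        int start = spans[i].x < 0 ? 0 : spans[i].x;
        int end = spans[i].x + spans[i].len;
        if (end > width)
            end = width;
        for (int x = start; x < end; x++)
            row[x] = spans[i].coverage;
    }
}
```

The appeal of this form is that the rasterizer (cairo) has already done the geometry work, so the compositing side only has to walk runs rather than re-rasterize trapezoids.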
Other things to look at in the future include orc, color spaces, etc.
An introduction to cairo (Carl)
[More details on history of X, the Render extension, cairo and pixman. But not captured well here since I was talking so not taking notes.]
Here's a diagram illustrating cairo's rendering model, (accepting a source pattern, a mask pattern, and combining those to a destination surface).
http://cworth.org/~cworth/papers/cairo_ddc2005/html/cairo-003.html
Surfaces are something that you can render to, and are associated with cairo backends, (so we have xlib surfaces, image surfaces, PDF surfaces, Quartz surfaces, etc.)
Patterns are only read from (not written to) and also have various types. One type is a surface pattern which reads from any surface. Other pattern types are parametric, such as gradient patterns which are effectively just a small equation that can be evaluated to generate a color (and alpha) for any coordinate.
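To make the "small equation" idea concrete, here is a minimal sketch of a two-stop linear gradient evaluated at an arbitrary coordinate. The types and function names are invented for illustration and are not cairo's API; clamping outside the end points is an assumption about how the gradient is extended.

```c
#include <assert.h>

typedef struct { double r, g, b, a; } color_t;

/* Hypothetical two-stop linear gradient. */
typedef struct {
    double  x0, y0;   /* start point of the gradient axis */
    double  x1, y1;   /* end point of the gradient axis */
    color_t c0, c1;   /* colors at the start and end points */
} linear_gradient_t;

static double mix(double from, double to, double t)
{
    return from + (to - from) * t;
}

/* Project (x, y) onto the gradient axis, clamp the position to [0, 1],
 * and interpolate between the two stop colors. */
static color_t gradient_eval(const linear_gradient_t *g, double x, double y)
{
    double dx = g->x1 - g->x0;
    double dy = g->y1 - g->y0;
    double len2 = dx * dx + dy * dy;
    double t;
    color_t c;

    assert(len2 > 0.0);   /* the two end points must differ */
    t = ((x - g->x0) * dx + (y - g->y0) * dy) / len2;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    c.r = mix(g->c0.r, g->c1.r, t);
    c.g = mix(g->c0.g, g->c1.g, t);
    c.b = mix(g->c0.b, g->c1.b, t);
    c.a = mix(g->c0.a, g->c1.a, t);
    return c;
}
```

No pixel storage is involved: the pattern is fully described by a few numbers, which is what distinguishes it from a surface pattern.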
It sounds clear that we're going to want some support for "video" in cairo. Is that just a video pattern? Or does cairo need some support to target a video surface?
[Here there was a side discussion about whether converting YUV to RGB, doing rendering there, and then converting back to YUV would be adequate. If we can do this with 10 or more bits per channel, or perhaps as a floating point surface, then is that sufficiently lossless?]
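One way to get a feel for the floating-point variant of that question is to measure the round trip directly. The sketch below is only an illustration under stated assumptions: it uses the BT.601 luma coefficients (the notes do not say which matrix was meant), normalized values rather than any particular integer encoding, and made-up type and function names.

```c
#include <assert.h>

/* BT.601 luma coefficients (an assumption, see the text above). */
#define KR 0.299
#define KB 0.114
#define KG (1.0 - KR - KB)

typedef struct { double y, cb, cr; } ycbcr_t; /* y in [0,1], cb/cr in [-0.5,0.5] */
typedef struct { double r, g, b; } rgb_t;

static rgb_t ycbcr_to_rgb(ycbcr_t in)
{
    rgb_t out;
    assert(in.y >= 0.0 && in.y <= 1.0);
    out.r = in.y + 2.0 * (1.0 - KR) * in.cr;
    out.b = in.y + 2.0 * (1.0 - KB) * in.cb;
    out.g = (in.y - KR * out.r - KB * out.b) / KG;
    return out;
}

static ycbcr_t rgb_to_ycbcr(rgb_t in)
{
    ycbcr_t out;
    out.y  = KR * in.r + KG * in.g + KB * in.b;
    out.cb = (in.b - out.y) / (2.0 * (1.0 - KB));
    out.cr = (in.r - out.y) / (2.0 * (1.0 - KR));
    return out;
}

static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Largest per-channel difference after YCbCr -> RGB -> YCbCr. */
static double round_trip_error(ycbcr_t in)
{
    ycbcr_t back = rgb_to_ycbcr(ycbcr_to_rgb(in));
    double err = abs_diff(back.y, in.y);
    if (abs_diff(back.cb, in.cb) > err) err = abs_diff(back.cb, in.cb);
    if (abs_diff(back.cr, in.cr) > err) err = abs_diff(back.cr, in.cr);
    return err;
}
```

With unclamped floating-point intermediates the two conversions are exact inverses up to rounding error. The losses people worry about come from quantizing or clamping the intermediate RGB values, which this sketch deliberately leaves out.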
A closer look at pixman interfaces (Benjamin/Søren)
Here's the current (internal) interface of pixman:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-private.h
Note that there are functions for fetching a scanline and storing a scanline. And then functions for doing the "composite" operation that Carl described for cairo, (blend source and mask onto destination).
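For readers who have not seen a composite operation spelled out, here is a stand-alone sketch of "(source IN mask) OVER destination" on one scanline. It assumes premultiplied a8r8g8b8 pixels and an 8-bit mask, and the function names are illustrative; it is not pixman's actual code, which is organized around fetch/combine/store steps and many format-specific paths.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scale all four channels of a packed ARGB pixel by factor/255. */
static uint32_t scale_argb(uint32_t pixel, uint8_t factor)
{
    uint32_t a = mul_un8((uint8_t)(pixel >> 24), factor);
    uint32_t r = mul_un8((uint8_t)(pixel >> 16), factor);
    uint32_t g = mul_un8((uint8_t)(pixel >> 8), factor);
    uint32_t b = mul_un8((uint8_t)pixel, factor);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* dest = (src IN mask) OVER dest, assuming premultiplied pixels so that
 * the per-channel sum below cannot overflow. */
static void composite_over_scanline(uint32_t *dest, const uint32_t *src,
                                    const uint8_t *mask, int width)
{
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        uint32_t s = scale_argb(src[i], mask[i]);       /* source IN mask */
        uint8_t inv_alpha = (uint8_t)(255u - (s >> 24));
        dest[i] = s + scale_argb(dest[i], inv_alpha);   /* ... OVER dest */
    }
}
```

A fast path is essentially a version of this loop specialized for one combination of operator and formats.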
Søren points out that one thing that he'd like to change in pixman is to make it tile-based rather than scanline-based, (which would probably help with non-scanline-based formats like YUV).
Benjamin has proposed changes to pixman to add support for various YUV formats. Pointers to these can be found here:
http://lists.cairographics.org/archives/cairo/2009-September/018221.html
He has also proposed exposing existing support in cairo to create a cairo surface with any format supported by pixman.
Søren shows how fast paths are added to pixman by illustrating with the sse2 fast paths here:
http://cgit.freedesktop.org/~sandmann/pixman/tree/pixman/pixman-sse2.c?h=flags
Table-lookup of fast paths is currently showing up on profiles. The plan is to add a simple cache to eliminate this overhead.
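The notes don't record what form the cache would take. As one possible reading, the sketch below puts a one-entry "last hit" cache in front of a linear table scan. All types, fields, and the placeholder fast paths are hypothetical and only loosely modeled on the idea of a fast-path table keyed by operator and formats.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*composite_func_t)(void);

/* Hypothetical fast-path table entry. */
typedef struct {
    int op, src_format, mask_format, dest_format;
    composite_func_t func;
} fast_path_t;

typedef struct {
    const fast_path_t *table;
    size_t             n_entries;
    const fast_path_t *last;   /* one-entry cache: most recent hit */
    unsigned long      scans;  /* full table scans, counted for illustration */
} fast_path_lookup_t;

/* Placeholder fast paths so the sketch is self-contained. */
static void fast_path_a(void) {}
static void fast_path_b(void) {}

static int entry_matches(const fast_path_t *e, int op, int src, int mask, int dest)
{
    return e->op == op && e->src_format == src &&
           e->mask_format == mask && e->dest_format == dest;
}

/* Return the fast path for the request, or NULL if the caller has to
 * fall back to the general path. */
static composite_func_t lookup_fast_path(fast_path_lookup_t *l,
                                         int op, int src, int mask, int dest)
{
    assert(l != NULL);
    if (l->last && entry_matches(l->last, op, src, mask, dest))
        return l->last->func;              /* cache hit: no table scan */
    l->scans++;
    for (size_t i = 0; i < l->n_entries; i++) {
        if (entry_matches(&l->table[i], op, src, mask, dest)) {
            l->last = &l->table[i];
            return l->table[i].func;
        }
    }
    return NULL;
}
```

This pays off when applications issue long runs of identical operations, which is the common case for things like glyph rendering.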
Thursday afternoon
Overview of GStreamer (Wim)
GStreamer started at a time when there were not many X applications for video, (xanim was state of the art), and there weren't APIs either (not Xc).
The fundamental object in GStreamer is an "element". Example elements could be a "file source" element, an "MPEG decode" element, and an "ALSA sink" element. Elements can be combined into a dataflow graph (a DAG). Elements have "pads" which are the potential connection points between them where data flows. At each connection there is a process of negotiating the capabilities ("caps"), such as what format or resolution should be used for data communication. The dataflow graph exists within a "pipeline" that has the graph and also manages clock synchronization, etc. "Buffer" objects are reference-counted pointers to a chunk of memory, (or really any data structure that can support reading and writing somehow---such as an opaque data type of some library). Buffers are readable always, but writable only when there is a single reference.
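The "writable only when there is a single reference" rule is easy to show in isolation. The following is a simplified sketch with invented names, not GStreamer's implementation. In particular the reference count here is a plain int, whereas real code shared between threads would need atomic operations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted buffer. */
typedef struct {
    int            refcount;
    size_t         size;
    unsigned char *data;
} buffer_t;

static buffer_t *buffer_new(size_t size)
{
    buffer_t *b;
    assert(size > 0);
    b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = calloc(size, 1);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->refcount = 1;
    b->size = size;
    return b;
}

static buffer_t *buffer_ref(buffer_t *b)
{
    b->refcount++;
    return b;
}

static void buffer_unref(buffer_t *b)
{
    if (--b->refcount == 0) {
        free(b->data);
        free(b);
    }
}

/* Reading is always allowed; writing only with exactly one reference. */
static int buffer_is_writable(const buffer_t *b)
{
    return b->refcount == 1;
}

/* Return a buffer the caller may write to. If `b` is shared, the caller's
 * reference is exchanged for a private copy. */
static buffer_t *buffer_make_writable(buffer_t *b)
{
    buffer_t *copy;
    if (buffer_is_writable(b))
        return b;
    copy = buffer_new(b->size);
    if (copy == NULL)
        return NULL;   /* caller still owns its reference to b */
    memcpy(copy->data, b->data, b->size);
    buffer_unref(b);
    return copy;
}
```

The rule means an element can modify data in place when nobody else is looking at it, and only pays for a copy when the buffer is actually shared.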
The pipeline can do things such as ensuring that separate audio and video sink elements remain synchronized. There are optimizations such as to allow a decode element to request a chunk of memory, (such as a shared-memory segment from the X server), so that the decoder can decode directly to the target buffer and avoid a memcpy.
Pipelines also provide some sideband information such as reporting errors, and can accept queries for things such as latency.
Elements can be run in separate threads, so care needs to be taken to ensure that any APIs manipulating objects shared by separate elements are thread-safe.
Adding cairo-based plugin to GStreamer: gst-plugins-cairo (Benjamin)
Code is here:
http://cgit.freedesktop.org/~company/gst-plugins-cairo/tree/gst-libs/gst/cairo
Originally conceived of when trying to deal with video inside the swfdec flash player. Cairo was convenient for rendering, but couldn't talk video. GStreamer had some limited rendering capabilities (such as for sub-title rendering) that only support a very limited number of formats.
So why not just teach GStreamer to pass around cairo surfaces inside GStreamer buffer objects? And that's what's in the gst-plugins-cairo code.
One problem with this is the threaded nature of GStreamer, (and the lack of good thread support in X). Current GStreamer applications avoid this problem by having only one element talk to X, (the final video sink). But if each internal element handles cairo surfaces, any one of these might trigger a fallback and need to, for example, read back from the X server. And this can cause problems between different threaded elements.
All of the cairo-based GStreamer elements are made backward-compatible. If one is connected to another element that cannot accept a cairo-backed buffer, then it will get the image data from cairo and pass it as a conventional GStreamer buffer.
[Side discussion on how to share objects across different elements (such as a cairo_t or cairo_surface_t), and how this is done in gst-plugins-gl for example.]
Hardware Accelerated Video in GStreamer (Josep)
Modern GPUs have dedicated video decoding units, which we want to get at via GStreamer. Lots of APIs (XvMC, VAAPI, VDPAU, XvBA, etc.) with varying capabilities. VDPAU has some post-processing that VAAPI does not. But VAAPI includes encoding as well.
Some example code for compositing and rendering:
VAAPI:
    /* Attach the subpicture to the video surface. */
    vaAssociateSubpicture (vaDisplay, subpic_id, &vaSurface, 1, 0, 0,
                           subpic_x, subpic_y, image.width, image.height, 0);
    /* Wait for pending operations on the surface to finish. */
    vaSyncSurface (vaDisplay, vaContext, vaSurface);
    /* Draw the surface into the X window. */
    vaPutSurface (vaDisplay, vaSurface, win, left, top, width, height, 0,
                  0, win_width, win_height, NULL, 0, deinterlace_flag);
VDPAU:
    /* Composite the video surface and the subpicture layer. */
    VdpVideoMixerRender (mixer, background_surface,
                         background_source_rect, current_picture_structure, 0, NULL,
                         video_surface, 0, NULL, video_source_rect, output_surface,
                         destination_rect, destination_video_rect, 1,
                         subpicture_layer);
    /* Queue the output surface for display. */
    VdpPresentationQueueDisplay (pqueue, output_surface, clip_width,
                                 clip_height, 0);
Neither API has a direct way to get data into OpenGL. But with either API one can render to a pixmap and then use GLX_TEXTURE_FROM_PIXMAP.
Example code for rendering to a pixmap:
VAAPI:
    /* Attach the subpicture to the video surface. */
    vaAssociateSubpicture (vaDisplay, subpic_id, &vaSurface, 1, 0, 0,
                           subpic_x, subpic_y, image.width, image.height, 0);
    /* Wait for pending operations on the surface to finish. */
    vaSyncSurface (vaDisplay, vaContext, vaSurface);
    /* Draw the surface into the X pixmap instead of a window. */
    vaPutSurface (vaDisplay, vaSurface, xpixmap, left, top, width,
                  height, 0, 0, win_width, win_height, NULL, 0, deinterlace_flag);
VDPAU:
    /* Create a presentation queue that targets the X pixmap. */
    VdpPresentationQueueTargetCreateX11 (device, xpixmap, pqueue_target);
    VdpPresentationQueueCreate (device, pqueue_target, pqueue);
    /* Composite the video surface and the subpicture layer. */
    VdpVideoMixerRender (mixer, background_surface,
                         background_source_rect, current_picture_structure, 0, NULL,
                         video_surface, 0, NULL, video_source_rect, output_surface,
                         destination_rect, destination_video_rect, 1,
                         subpicture_layer);
    /* Queue the output surface for display. */
    VdpPresentationQueueDisplay (pqueue, output_surface, clip_width,
                                 clip_height, 0);
From here, Josep showed his proposal for getting at these things from GStreamer. The proposal includes a libgstva.so that is an abstraction on top of VAAPI, VDPAU, XvBA etc. and is extensible by adding further backends. It adds a single new vapostprocessing element and a single vasink element to GStreamer (which use libgstva.so to get at whatever video acceleration is available). There are also new links from the existing vdpaumpegdec, fluvadec, ffmpegdec and glsink elements into libgstva.so.
See Josep's presentation for more details (including diagrams):
http://gstreamer.freedesktop.org/wiki/VideoHackfest?action=AttachFile&do=get&target=VAinGST.pdf
Wim asked whether one could mix and match, (use a VDPAU decode and then connect it to a VAAPI element), and Josep said no.
Benjamin pointed out that it ends up looking a lot like his gst-plugins-cairo approach, but passing around VAAPI/VDPAU surfaces rather than cairo surfaces.
[More debate and discussion that I didn't capture well in the notes.]
Introduction to gst-dsp (Felipe)
See source here:
http://github.com/felipec/gst-dsp
In an ARM system, you'll often have an ARM CPU, a DSP, and a display controller all reading/writing from a single block of system memory, (with cache flushing/management needed when switching between CPU and DSP).
[Apologies for not capturing more of this...]
Friday morning
Introduction to Mesa (Eric)
Mesa is our implementation of OpenGL.
It takes care of input, state tracking, and pushing things out to drivers. Drivers sit on top of the DRM kernel modules. The DRM module manages the command execution ring. A driver takes a chunk of state from Mesa, fills in a bunch of GPU structures in video memory, and then inserts commands into the ring.
Mesa is a reasonable implementation of GL, but there's definitely room for improvement---particularly in CPU usage. That said, some applications are very well accelerated by Mesa.
Mesa is a very large code base, with a relatively small number of core contributors. So portions of the code base definitely don't get the love they really need.
Mesa is at OpenGL 2.1 now and needs to get to 3.x soon.
The OpenGL specification allows for extensions. Sometimes these come from the OpenGL ARB (the board owning the specification). Sometimes hardware vendors get together and make an extension when the ARB isn't moving fast enough. And individuals can just write extensions as well. Within Mesa, adding a new extension consists of adding a chunk of XML and then filling in the relevant function bodies.
So Mesa is easy to change---we can really do whatever we need/want to do.
Brief introduction to Kernel Memory Management (Eric)
Benjamin asks: Some people have been saying that it would make sense to pass around DRI handles in GStreamer, (as opposed to cairo surfaces or whatever else).
Eric:
Kernel memory manager manages graphics objects. You get one by calling an ioctl. Then you can map it into your address space, put content into one, etc. All of our drivers sit on top of this interface and talk about these DRI handles.
So, for example, a Pixmap in the X server has one of these handles. And these DRI names allow different pieces of the system, (X, OpenGL, whatever), to communicate about shared objects by simply passing DRI names around.
One question is whether it makes sense to use the X server as the thing that names our objects. For example, the VAAPI stuff already gives you a Pixmap in the end. Is that enough to pass around, or do we need to extract the underlying DRI object from it?
Brief introduction to De-interlacing (Edward and David)
If we talk about a 60i Hz stream this is really 30 frames per second presented as 60 fields per second, where each field contains only half the scanlines of a frame. But the fields are offset in time.
So a naive combination of the two fields will be objectionable when things are moving. So more sophisticated deinterlacing will perform motion compensation to estimate the missing information. This might consist of examining an 8x8 region of the next and previous fields to estimate the missing content.
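To make the difference tangible, here is a small sketch of the naive combination ("weave") next to the simplest alternative, which rebuilds the missing lines of one field by averaging its neighbors. Neither is motion compensation; the second function is just the most basic interpolation, included for contrast. It assumes 8-bit single-channel fields, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Weave: interleave the top field (even rows) with the bottom field
 * (odd rows). Each field holds height/2 rows of `width` bytes. */
static void weave_fields(const uint8_t *top, const uint8_t *bottom,
                         uint8_t *frame, int width, int height)
{
    assert(width > 0 && height > 0 && height % 2 == 0);
    for (int row = 0; row < height; row++) {
        const uint8_t *field = (row % 2 == 0) ? top : bottom;
        memcpy(frame + (size_t)row * width,
               field + (size_t)(row / 2) * width, (size_t)width);
    }
}

/* Build a full frame from the top field only: even rows are copied, odd
 * rows are the rounded average of the field lines above and below. */
static void interpolate_top_field(const uint8_t *top, uint8_t *frame,
                                  int width, int height)
{
    int field_rows = height / 2;
    assert(width > 0 && height > 0 && height % 2 == 0);
    for (int row = 0; row < height; row++) {
        uint8_t *out = frame + (size_t)row * width;
        const uint8_t *above = top + (size_t)(row / 2) * width;
        const uint8_t *below;
        if (row % 2 == 0) {
            memcpy(out, above, (size_t)width);
            continue;
        }
        /* The last odd row has no line below it, so reuse the one above. */
        below = (row / 2 + 1 < field_rows) ? above + width : above;
        for (int x = 0; x < width; x++)
            out[x] = (uint8_t)((above[x] + below[x] + 1) / 2);
    }
}
```

Weaving keeps full vertical detail but shows combing on motion, because the two fields were captured at different times. Single-field interpolation avoids combing at the cost of vertical resolution, which is the gap the more sophisticated methods try to close.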
Introduction to gst-plugins-gl (David)
Separate module of plugins for GStreamer, (separate from GStreamer to keep GStreamer base from depending on OpenGL). Original purpose is to support an OpenGL sink for GStreamer.
Meanwhile, there's a desire to decode video using OpenGL. And once you use OpenGL to decode, you don't want to pull the data out only to push it back to OpenGL for display.
The idea at the time (when gst-plugins-gl was first written) was to push framebuffer objects around in the GStreamer pipeline. And this was hard at the time. The main difficulty was pushing around a GL context throughout the pipeline. An initial approach was for each element to either ask a neighboring element for a context or to open one itself. So some random element in the pipeline would end up opening the display. This sort of worked, but it wouldn't allow for any non-GL elements in the pipeline.
Things have gotten better since.
And people are now using this plugin to tie directly into Clutter. So gst-plugins-gl fits in well with those kinds of frameworks.
Another difficulty encountered is due to the fact that the context can be used in multiple elements, can be used in callbacks, and is used in display. So you have to add a lock for different threads wanting to use the same context. And at the time, Mesa wouldn't be happy with different threads using the same context, even when everything was properly serialized. So what gst-plugins-gl does (which David didn't and doesn't recommend) is to proxy all OpenGL communication through a single thread associated with the context.
Eric: One thing I'd like is if two neighboring elements could determine whether they are running in the same thread, (so that they could be more efficient if so---such as not rebinding the context and triggering a full flush of all hardware state, etc.).
Friday afternoon
Hacking ideas
We've got most of the introductions out of the way, so now it's time to start the actual hacking. Here's a list of tasks we'd like to work on over the next few days:
- Merge yuv support into pixman/cairo
- GLX Multi-thread extension (one context with multiple threads)
- gst-plugins-cairo merging
- gst-plugins-cairo buffer abstraction
- XRenderPutImage/XV to Pixmap
- VAAPI in GStreamer
- Interaction between backends
- Integrating Orc into pixman
- cairo_shader_t
- gst-plugins-cairo on the n900
Introduction to Orc (David)
Here's the code:
git://code.entropywave.com/git/orc.git
Here's an example of a program in Orc:
http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/ext/cog/cog.orc
So Orc does a couple of things:
- Parses the language above.
- Takes the intermediate representation of the language and can then output several different things (SSE, Arm/Neon, C, assembly, etc.)
What David usually does is write Orc code (like the above) and then run Orc to generate C code like this:
http://www.schleef.org/~ds/cogorc.c.txt
Each function in the Orc program becomes 3 functions in the C program:
- A disabled function that is pure C (doesn't link against orc library)
- A backup function used when Orc can't compile a program for one reason or another.
- A function that calls into the Orc library to create, compile, and execute a program at run-time.
Here's an example of the generated SSE source code:
http://www.schleef.org/~ds/cogorc.s
So you could use this file instead of the previous C code.
So the options you get are C code that doesn't use Orc, C code that does use Orc, or SSE code that doesn't use Orc. Code that uses Orc is the most interesting in some sense, because it will benefit from future improvements in Orc, and also opens up the opportunity for run-time optimizations.
Another nice feature of the compiler is that it generates test code. It gives you a C function that when called, generates random data, and calls both the C implementation and the SSE implementation of your function and verifies that both implementations always match.
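The shape of such a test is simple enough to sketch by hand. The code below is not what Orc generates; it is a hand-written illustration of the same idea, with a reference saturating byte add, a branch-free variant standing in for the "optimized" implementation, and a deliberately wrong variant to show a mismatch being caught. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TEST_LEN 64

typedef void (*add_u8_func_t)(uint8_t *d, const uint8_t *s1,
                              const uint8_t *s2, int n);

/* Reference implementation: saturating add of unsigned bytes. */
static void sat_add_ref(uint8_t *d, const uint8_t *s1, const uint8_t *s2, int n)
{
    for (int i = 0; i < n; i++) {
        int sum = s1[i] + s2[i];
        d[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

/* Stand-in for an optimized implementation: the same result, branch-free. */
static void sat_add_alt(uint8_t *d, const uint8_t *s1, const uint8_t *s2, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned sum = (unsigned)s1[i] + s2[i];
        d[i] = (uint8_t)(sum | (0u - (sum >> 8)));
    }
}

/* Deliberately wrong: wraps around instead of saturating. */
static void wrap_add(uint8_t *d, const uint8_t *s1, const uint8_t *s2, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = (uint8_t)(s1[i] + s2[i]);
}

/* Feed both implementations the same random data for `rounds` rounds.
 * Returns 1 if the outputs always match, 0 on the first mismatch. */
static int implementations_match(add_u8_func_t a, add_u8_func_t b,
                                 unsigned seed, int rounds)
{
    uint8_t s1[TEST_LEN], s2[TEST_LEN], da[TEST_LEN], db[TEST_LEN];
    assert(rounds > 0);
    srand(seed);
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < TEST_LEN; i++) {
            s1[i] = (uint8_t)(rand() & 0xff);
            s2[i] = (uint8_t)(rand() & 0xff);
        }
        a(da, s1, s2, TEST_LEN);
        b(db, s1, s2, TEST_LEN);
        if (memcmp(da, db, TEST_LEN) != 0)
            return 0;
    }
    return 1;
}
```

Random inputs won't prove equivalence, but they catch the common failure modes (wrong saturation, off-by-one at the array tail) cheaply.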
Orc also has a backup for the case of a backup function not being present. That is, if the runtime code generator fails for some reason, (say you've got an older SSE that doesn't implement all Orc opcodes), and if there's no backup function---in this case Orc will emulate a backup function by stepping through the intermediate form and interpreting. Obviously, this is even slower than using the C-code backup function, but it does mean that Orc will always be able to do *something*.
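The fallback order described here (compiled code, then the C backup function, then emulation) can be sketched as a three-way dispatch. This is an illustration only: the types, the two-opcode "intermediate form", and the function names are invented and bear no relation to Orc's real data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy intermediate form: each instruction updates a running value. */
typedef enum { OP_ADD, OP_MUL } opcode_t;
typedef struct { opcode_t op; int operand; } insn_t;

typedef void (*kernel_func_t)(int *dest, const int *src, int n);
typedef enum { RAN_COMPILED, RAN_BACKUP, RAN_EMULATED } run_path_t;

typedef struct {
    const insn_t *insns;     /* intermediate form, always available */
    int           n_insns;
    kernel_func_t compiled;  /* NULL if run-time code generation failed */
    kernel_func_t backup;    /* NULL if no C backup function exists */
} program_t;

/* Slowest path: interpret the intermediate form element by element. */
static void emulate(const program_t *p, int *dest, const int *src, int n)
{
    for (int i = 0; i < n; i++) {
        int v = src[i];
        for (int k = 0; k < p->n_insns; k++) {
            if (p->insns[k].op == OP_ADD)
                v += p->insns[k].operand;
            else
                v *= p->insns[k].operand;
        }
        dest[i] = v;
    }
}

/* Run the program by the best available path and report which one ran. */
static run_path_t program_run(const program_t *p, int *dest,
                              const int *src, int n)
{
    assert(p != NULL && n >= 0);
    if (p->compiled) {
        p->compiled(dest, src, n);
        return RAN_COMPILED;
    }
    if (p->backup) {
        p->backup(dest, src, n);
        return RAN_BACKUP;
    }
    emulate(p, dest, src, n);
    return RAN_EMULATED;
}

/* Hypothetical hand-written C backup for the program "multiply by 2". */
static void double_backup(int *dest, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = src[i] * 2;
}
```

Because the intermediate form is always present, the last branch can never fail, which is the "will always be able to do something" property.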
The OrcExecutor structure is a sort of unified stack frame, which the user fills out in order to execute an Orc program and to pass it arguments. It turns out that this structure was poorly designed and will probably justify an ABI break in Orc to fix.

