On compositing, APIs, performance, and simplicity

I’ve been thinking about the whole software structure involved in sharing video hardware among processes doing accelerated video operations (which I mean in a general sense, from video decoding to 3D). The development of compositing has certainly changed the paradigm of how this is done, and I believe it’s a great improvement. However, it came as a hack that the designers of what it was built on had never considered, much like AJAX, and as a result there are some serious design issues that need to be addressed before it will truly work well. The problem that jumps out at me is that the critical path of video content rendering, between its origin in a userspace process and its display on the screen, has to cross the CPU/GPU barrier three times, which is just silly. Allow me to illustrate:

  1. The program (CPU) gives the GPU some data and directions for how to render it into a buffer. (Critical path goes from CPU to GPU)
  2. The GPU renders into the buffer, which then sits around waiting for the compositor (CPU) to get a scheduling slot. (Critical path goes from GPU to CPU)
  3. The compositor (CPU) tells the GPU to take the buffer and render it as part of another 3D scene. (Critical path goes from CPU to GPU)
  4. During output of the next frame, the compositor’s output buffer is put on the screen.
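
To make those three crossings concrete, here’s a schematic sketch of the current flow written as straight-line C. Every name in it (gpu_buffer, submit_render_commands, notify_compositor, composite_scene, page_flip) is a made-up placeholder for whatever the real client, compositor, and driver interfaces actually provide; it isn’t meant to match any existing API.

    /* Schematic sketch of today's composited path; all names are
     * hypothetical placeholders, not real driver or toolkit calls. */
    #include <stdio.h>

    typedef struct { int id; } gpu_buffer;

    static void submit_render_commands(gpu_buffer *b) {   /* crossing 1: CPU -> GPU */
        printf("client: GPU renders into buffer %d\n", b->id);
    }

    static void notify_compositor(gpu_buffer *b) {        /* crossing 2: GPU -> CPU */
        printf("compositor: scheduled, sees buffer %d is ready\n", b->id);
    }

    static void composite_scene(gpu_buffer *b) {          /* crossing 3: CPU -> GPU */
        printf("compositor: GPU folds buffer %d into the final scene\n", b->id);
    }

    static void page_flip(void) {
        printf("display: compositor's output buffer scanned out\n");
    }

    int main(void) {
        gpu_buffer win = { .id = 1 };

        submit_render_commands(&win);  /* 1. program hands data and commands to the GPU */
        notify_compositor(&win);       /* 2. finished buffer waits for the compositor process */
        composite_scene(&win);         /* 3. compositor re-submits it as part of its own scene */
        page_flip();                   /* 4. next frame, the compositor's output hits the screen */
        return 0;
    }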

Obviously, the compositor process shouldn’t be in the critical path here. Consider instead a path that looked like this:

  1. The compositor (CPU) responds to some event (such as input), calculates a description (as a function of time) of how to assemble various buffers into a 3D scene, and sends it to the GPU. (Not part of the critical path)
  2. The program (CPU) gives the GPU some data and directions for how to render it into a buffer. (Critical path goes from CPU to GPU)
  3. The GPU uses the stored descriptions to automatically composite buffers in preparation for the next frame, thereby getting the program’s data to the screen. (Critical path stays on the GPU)
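
To be a bit more concrete about step 3: the stored description is a function of time, and something on the GPU/display side evaluates it at every refresh with no compositor process in the loop. The sketch below only models that idea as a plain C loop; the layer_desc type, the linear interpolation, and all the names are assumptions I’ve made up for illustration.

    /* Illustrative model of per-refresh evaluation of a stored,
     * time-parameterized compositing description. The evaluation would
     * really live in the GPU/display engine; here it is just a CPU loop. */
    #include <stdio.h>

    typedef struct {
        int   buffer_id;            /* which client buffer to place        */
        float x0, y0, x1, y1;       /* position at t = 0 and at t = 1      */
        float alpha;                /* blend factor                        */
    } layer_desc;

    /* evaluate a layer's position at time t by linear interpolation */
    static void place_layer(const layer_desc *l, float t) {
        float x = l->x0 + (l->x1 - l->x0) * t;
        float y = l->y0 + (l->y1 - l->y0) * t;
        printf("t=%.2f: blit buffer %d at (%.1f, %.1f), alpha %.2f\n",
               t, l->buffer_id, x, y, l->alpha);
    }

    int main(void) {
        /* description handed over once by the compositor, off the critical path */
        layer_desc scene[] = {
            { .buffer_id = 1, .x0 =   0, .y0 =   0, .x1 =   0, .y1 =   0, .alpha = 1.0f },
            { .buffer_id = 2, .x0 = 100, .y0 = 100, .x1 = 300, .y1 = 100, .alpha = 0.9f },
        };

        /* the per-refresh loop the display engine would run on its own:
         * evaluate the stored description at the current time, then scan out */
        for (int frame = 0; frame <= 4; frame++) {
            float t = frame / 4.0f;
            for (size_t i = 0; i < sizeof scene / sizeof scene[0]; i++)
                place_layer(&scene[i], t);
        }
        return 0;
    }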

This would not only increase the performance of composited desktops, but also greatly simplify problems like avoiding tearing while maintaining low latency from generation to display, since the problem would no longer be spread across multiple domains but would sit entirely in the hands of the graphics driver writers. Furthermore, compositors would no longer use constant CPU time, could be stacked (which has its uses, from a software point of view), and even esoteric things like direct graphics hardware access from inside virtual machines through I/O virtualization wouldn’t require anything particularly unusual at this layer, which I think is one of the marks of a good solution. To implement this, though, aside from whatever might need to be done at the hardware and driver levels (about which I know very little), one would need a different kind of graphics API that lets compositors send these compositing descriptions to the GPU; simply extending an existing API wouldn’t cut it.
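
For what it’s worth, here’s one conceivable shape such an API could take: the compositor submits a description once, the driver holds onto it, and it only gets replaced when the compositor actually has something new to say. Everything in the sketch (layer_desc, gpu_submit_description, gpu_activate_description) is invented purely to illustrate the idea; no real driver interface looks like this.

    /* Hypothetical compositor-facing "compositing description" API; all
     * types and entry points are invented for illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t desc_handle;

    typedef struct {
        uint32_t buffer_id;     /* which client buffer to place             */
        float    x, y;          /* starting position                        */
        float    dx_per_sec;    /* simple animation term the GPU evaluates  */
        float    alpha;         /* blend factor                             */
    } layer_desc;

    /* stand-ins for driver entry points */
    static desc_handle gpu_submit_description(const layer_desc *layers, int n) {
        (void)layers;
        printf("driver: stored a description with %d layers\n", n);
        return 1;
    }

    static void gpu_activate_description(desc_handle h) {
        printf("driver: description %u is now composited at every refresh\n",
               (unsigned)h);
    }

    int main(void) {
        /* The compositor reacts to an event (say, a window starting to
         * slide) and rebuilds its description of the scene... */
        layer_desc layers[] = {
            { .buffer_id = 7, .x =   0.0f, .y =  0.0f, .dx_per_sec =   0.0f, .alpha = 1.0f },
            { .buffer_id = 9, .x = 120.0f, .y = 80.0f, .dx_per_sec = 200.0f, .alpha = 0.9f },
        };

        /* ...then hands it to the GPU once. From here on, programs rendering
         * into buffers 7 and 9 reach the screen without the compositor ever
         * re-entering the critical path; it only speaks up again when the
         * scene itself changes. */
        desc_handle h = gpu_submit_description(layers, 2);
        gpu_activate_description(h);
        return 0;
    }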

I don’t pretend to know just how difficult this would be, but I do think it’s the right solution.