## Calculating hierarchical animation transformation matrices

This is a short post with my thoughts on what’s the best way to design a transformation hierarchy for animation, and how the “ideal design” changes over time.

### Bottom-up lazy evaluation magic

For a long time, I was a big fan of backwards (bottom-up) lazy evaluation of transformation hierarchies. Essentially having an `XFormNode` class for each node in the hierarchy with a `get_matrix(long msec)` function which calculates the current node’s transformation matrix for the current time, then calls `parent->get_matrix(msec)` and returns the concatenated matrix.

``` Matrix4x4 XFormNode::get_matrix(long msec) const { Matrix4x4 xform; calc_matrix(&xform, msec); if(parent) { xform = parent->get_xform(msec) * xform; } return xform; } ```

Of course, such a scheme would be wasteful if these matrices where calculated every time `get_matrix` functions are called. For instance if a node is part of a hierarchy, then its `get_matrix` will be called when we need to draw the object corresponding to this node, and also every time the `get_matrix` of any one of its descendants is called, due to the recursive bottom-up evaluation of matrices outlined previously. If one considers the posibility of drawing an object multiple times for various multi-pass algorithms the problem gets worse, with the limiting worse case scenario being if we’re doing ray-tracing which would require these functions to be called at the very least once per ray cast.

It follows then, that such a design goes hand in hand with lazy evaulation and caching of calculated node matrices. The XFormNode class would hold the last requested time and corresponding matrix, and when `get_matrix` is called, if the requested time matches the last one, we just return the cached matrix instead of recalculating it.

``` const Matrix4x4 &XFormNode::get_matrix(long msec) const { if(msec == cached_msec) { return cached_matrix; } calc_matrix(&cached_matrix, msec); cached_msec = msec; if(parent) { cached_matrix = parent->get_xform(msec) * cached_matrix; } return cached_matrix; } ```

This worked nicely for a long time, and it worked like magic. At any point I could just ask for the transform at time X and would get the matrix automatically, including any effects of hierarchy, keyframe interpolations, etc. It’s all good… until suddenly processors stopped getting faster any more, moore’s law went belly up, and after the shock passed we all sooner or later realised that single-threaded graphics programs are a thing of the past.

### Multithreading pissed on my rug

In the brave new multithreaded world, lazy evaluation becomes a pain in the ass. The first knee-jerk reaction is to add a mutex in XFormNode, and lock the hell out of the cached matrices. And while that might be ok for an OpenGL program which won’t have more than a couple of threads working with the scene database at any point (since rendering itself can only be done safely from a single thread), it throws out of the window a lot of concurrency that can take place in a raytracer where at any point there could be 8 or more threads asking for the matrix of any arbitrary node.

A second way to deal with this issue is to have each thread keep its own copy of the matrix cache, keeping it in thread-specific memory. I’m shamed to admit I never got around to doing any actual performance comparisons on this, though I’ve used it for quite some time in my programs. In theory it avoids having to wait for any other thread to access the cache, so it should be faster in theory, but it needs a `pthread_getspecific` call in every `get_matrix` invocation which comes with its own overhead.

``` const Matrix4x4 &XFormNode::get_matrix(long msec) const { MatrixCache *cache = pthread_getspecific(cache_key); // cache_key created in the constructor for each node if(!cache) { // first time we need a cache for this thread we'll have to create it cache = new MatrixCache; pthread_mutex_lock(&cache_list_lock); cache_list.push_back(cache); pthread_mutex_unlock(&cache_list_lock); pthread_setspecific(cache_key, cache); } if(msec == cache->msec) { return cache->matrix; } calc_matrix(&cache->matrix, msec); cache->msec = msec; if(parent) { cache->matrix = parent->get_xform(msec) * cache->matrix; } return cache->matrix; } ```

This works fine, and although we managed to avoid blocking concurrent use of `get_matrix`, we had to add some amount of overhead for thread-specific storage calls, and the code became much more complex all over the place: invalidations must also access this thread-specific storage, we need cleanup for the per-thread MatrixCache objects, etc.

### Return of the top-down evaluation

So nowadays I’m starting to lean more towards the simpler, less automagic design of top-down evaluation. It boils down to just going through the hierarchy once to calculate all the matrices recursively, then at any point we can just grab the previously calculated matrix of any node and use it.

``` void XFormNode::eval(long msec) { calc_matrix(&matrix, msec); if(parent) { matrix = parent->matrix * matrix; } for(size_t i=0; i<children.size(); i++) { children[i]->eval(msec); } } ```

The simplicity of this two-pass approach is hard to overlook, however it’s just not as good for some things as my original ideal method. It works fine for OpenGL programs where it suffices to calculate transformations once per frame, it even works fine for simple raytracers where we have again a single time value for any given frame. However it breaks down for ray-tracers doing distribution ray tracing for motion blur.

### No rest for the wicked

The best way to add motion blur to a ray tracer is through a monte-carlo method invented by Cook, Porter and Carpenter, called “distribution ray tracing“. In short, when spawining primary rays we have to choose a random time in the interval centered around the frame time and extending to the past and future, as far as dictated by the shutter speed of the camera. This time is then used both to calculate the position and direction of the ray, which thus might differ between sub-pixels if the camera is moving, and to calculate the positions of the objects we’re testing for intersections against. Then if we cast many rays per pixel and average the results, we’ll get motion blur on anything that moves significantly while the virtual shutter is open (example from my old s-ray renderer).

It’s obvious that calculating matrices once per frame won’t cut it with advanced ray-tracers, so there’s no getting rid of the complexity of the lazy bottom-up scheme in that case. Admittedly, however, caching won’t do much for us either because every sub-pixel will request the matrix at a different time anyway, so we might as well just calculate matrices from scratch all the time, and skip the thread-specific access overhead. The jury is still out on that one.

Do you have a favourite design for hierarchical animation code? Feel free to share it by leaving a comment below!

Color grading is an easily overlooked, but extremely powerful way to add character to a game. Subtle color changes make day-night cycling much more atmospheric. Different areas can have their own signature “feel” based on how saturated the colors are. Dark games can shift the unlit areas of an environment to cool bluish tint that can remain visible but still feel like darkness. The possibilities are endless.

I haven’t given much thought to color grading before, until a friend (Samurai), told me of an extremely simple and powerful way to add color grading to a game. So simple in fact, that I had to try it as soon as possible!

The idea has two parts. First the obvious bit: Use a 3D texture as a look-up table, to map the RGB colors produced by the renderer to a different set of RGB colors which is the color-graded output. That translates to pretty much the following GLSL post-processing fragment shader:

``` uniform sampler2D framebuf; uniform sampler3D gradelut; void main() { vec3 col = texture2D(framebuf, gl_TexCoord[0].st).xyz; gl_FragColor = vec4(texture3D(lut, col).xyz, 1.0); } ```

And now the brilliant bit: write a bit of code to save a screenshot of the game with the “identity” 3D lookup-table serialized in the last few scanlines. Give that screenshot to an artist, and let him work his magic, color-grading it in photoshop or whatever… Did you get that? In the process of color-grading that screenshot, the artist automatically produces the look-up table which can be used to reproduce the same color-grading in-game, as part of the last few scanlines of the image! Feed that palette back into the game and it’s automatically color-graded!

So I used the dungeon crawler I’ve been writing recently to try out this algorithm. The output of my dungeon crawler as it stands, is not the best material to try color-grading on as it’s already very dark and highly tinted, leaving too little space for tweaking without banding everything to oblivion, but nevertheless I wrote the code, gave the screenshot to my friend Rawnoise to play with it in photoshop for a couple of minutes, fed it back into the game, and the result can be seen below.

Lower-left part of the screenshot produced by the game, with the identity palette attached:

Screenshot of the game before and after color grading:

Of course you can always opt for a completely bizarre effect just as easily. This is the result of me moving color curves in gimp randomly, and then feeding the resulting palette into the game:

Obviously this algorithm opens up all sorts of interesting possibilities, such as having two palettes and interpolating between them during sunset, or when a player crosses the boundary between two areas, etc. Simple, yet effective.

## VSync-driven shutter glasses

All my previous stereoscopic attempts are fun and cool, but what I really wanted was to manage to connect my cheap-o shutter glasses to my computer, and use them for stereoscopic rendering. The main barrier is that consumer nvidia cards do not include a stereo port (unlike expensive quadros), and their drivers don’t support stereoscopic OpenGL visuals.

I had already side-stepped the second problem by writing stereowrap, an LD_PRELOAD-based tool that fakes OpenGL stereo contexts for GLX apps and presents the stereo pair in a number of ways, such as various anaglyphs, side-by-side, etc.

So at some point I decided to attack the first problem. Turns out there’s a simple way to drive shutter glasses. It’s a brilliant idea, and I didn’t come up with it, but it boils down to making a simple circuit that toggles the shutter glasses when it detects a pulse on the montior vsync wire!

I immediately designed a circuit based on this design, but modified to work with the signals expected by my ASUS VR-100 shutter glasses. Then I wired it up on a perfboard, and it worked like a charm! Finally I added a sequential stereo presentation method to stereowrap, synchronized with vsync, and suddenly I can view all my stereoscopic programs in awesome full-color stereo glory.

The downside to this simple contraption is that it doesn’t really know whether the left or the right image is presented at any given time, it only knows when to switch between them. That’s why the switch is included in the circuit: if the image appears wrong, and you can really tell by your brain attempting to blow up while looking at it, the switch can be used to flip the glasses around instantly. If however the application can’t catch up with the refresh rate of the monitor and misses a vsync interval the images will flip again.

I plan to build a more intelligent, microcontroller-based, driver circuit at some point. But for now, the simple vsync driver works well enough.

## OpenGL video editing hack

Hello, just a quick hack report cause I really liked how useful it turned out to be.

It all started when I located a small program I wrote, for an interesting coursework assignment, back when I did my graphics MSc at the University of Hull. I wanted to upload a video capture to youtube to show 4rknova who’s going through the same MSc course right now.

So I did capture the video, and saved it as an image sequence for further editing, because I wanted to add titles at the bottom describing what was demonstrated at each part of the video (the program was basically a sequence of arbitrary shader effects).

But how was I supposed to add the captions? The thought of wrestling with one of those fucking GUI video editing programs made me cringe. They are all slughish, heavy and unweildy, and I always have to fight for a few hours to do even simple things. I was more inclined to use ffmpeg from the command line, but then adding transitions to the captions, like having them fade and slide in from below would be a complete pain in the ass. Btw take a look at the final video to understand what I was trying to achieve with the caption transitions. Should be really simple right?

But then it dawned … hey, I could easily write that transition in a couple lines of C/OpenGL code instead of fighting with all those video editing programs! I started by writing a simple program that iterates over an input image sequence, opening each one in turn and then feeding them one by one to a dlopened plugin which could do whatever it likes with the frame. When that plugin processing function returns, I just dump the image back to the disk. It’s that simple!

Then the plugin was almost trivial as well. I just drop the frame into the OpenGL framebuffer and use my new text rendering library to draw the captions at the appropriate times with the appropriate alpha and position. The timing was derived by a simple event script file containing the frame numbers where each part starts and ends. I fed the script into my new event sequencing (demosystem) library which gives me back a nice linearly increasing [0, 1] value for each event (part) during the time when it is active according to the script. That makes it a piece of cake to fiddle with trig and some factors here and there to transition my captions just the way I wanted.

Here’s the code in case you want to play around with it:

I’m impressed, it’s fucking awesome and powerful to edit videos like that, and I’m definitely going to use it again.

## Stereoscopic fun on iOS

[Edit: this app is now available on the appstore, and has a dedicated web page]

The fun never stops with stereoscopic rendering. I posted previously about my earlier attempts with anaglyphs and shutter glasses, and all that was really fun, but not without drawbacks. Shutter glasses are awesome, but the only computer I have with a stereo output is an old SGI workstation, which isn’t up to the task to render modern 3D graphics, and doesn’t event give me stereo OpenGL visuals and a depth buffer at the same time. Anaglyph glasses are cheap and work everywhere, but they mess up the colors and they have a serious problem with ghosting, ruining the stereoscopic effect.

So, it was with great enthousiasm that I learned there’s a cheap and simple stereoscopic viewing contraption for the iphone produced by hasbro. It’s sortof like a viewmaster, only instead of cardboard reel with stereoscopic pictures, it has a place to attach an iphone or ipod touch on the back of it, using it as the source of the stereoscopic image presented to the user. What needs to be done iphone-side is simple enough. Just make it display a stereo pair side by side in a split-screen. The only drawback of this approach, is that since the iphone display is split in half, the achievable aspect ratio is slightly less than 1 which has an impact on immersion, making the perception more like looking through a squarish window into the 3D world rather than being surrounded by it. Still very impressive for a 28 dollar plastic widget.

Buying this apparatus gave me the final push I needed to get onto iOS programming. I find Objective-C unspeakably ugly and the Apple APIs needlessly convoluted, which was why I kept pushing this back, but I really wanted to see my code in glorious stereoscopic … glory, so I bit the bullet and ported over the stereoscopic tunnel program I’ve written originally for the SGI when I bought the shutter glasses.

The result is awesome; full stereo 3d without color degradation on modern programmable graphics hardware. Unfortunately one has to use the crippled version of OpenGL that’s become so popular on mobile devices lately: OpenGL ES 2.0 (see webgl post for my rant on that issue), but it was easy enough to make a wrapper that brings back immediate mode and the matrix stack.

In case you’d like to play around with the code, here’s a tarball. Feel free to use it under the terms of the GPLv3. It includes an Xcode project that compiles it for the iphone and a makefile for normal systems. If you run the program on your iphone tap anywhere on the screen to go to the options GUI to enable stereo rendering or change between the simple and the normal-mapped tunnel (keys s and b on the PC version).

## WebGL hacks

That’s it, I finally found something fun in web development! I never thought I’d live to see the day when I would feel the motivation to learn javascript, but here we are. WebGL is fun, because you can do all the things you could with regular OpenGL, but now you can send URLs to all your friends to show off. I can’t say I liked javascript really, but I guess it’s passable as long as you can avoid the horrible conventions people have established for pretending to write object-oriented code with it.

So what I did, after experimenting to see how WebGL and javascript programming works, is a port of a GPU-raytracer for 4D quaternion Julia fractals, and a simple 360-panorama viewer. You can find those on my webgl hacks page I put up yesterday.

About WebGL itself now, I’m really disappointed they chose to base it on OpenGL ES 2.0, which is the bastard child of a slashed down OpenGL subset initially spec’ed for fixed point embedded devices, and Khronos’ OpenGL >= 3 d3d10-buttlicking madness. I understand why they chose that, because they intend to have WebGL easily implementable on mobile phones and tablets, but I’m still disappointed.

For those of you not well versed in the differences between the various OpenGL versions that suddenly crept up when Khronos group took control of OpenGL and apparently surrendered it over to inmates of the nearest insane asylum, I’ll give you a short overview of what sucks in OpenGL >= 3.x, OpenGL ES 2.0, and WebGL:

• No fixed function pipeline. Yeah I know shaders are awesome, I love them too. But it’s convenient to be able to put a goddamn texture on a quad without having to write a bloody shader for it. OpenGL is not just used for video games you know.
• No immediate mode (glBegin). Again, yes immediate mode is slow if you use it to draw multimillion vertex meshes, but having to make a vertex buffer for a quad representing a button in a GUI or a simple overlay, is insanity.
• No matrix stack. Obviously when I’m writing a full 3D engine, with hierarchical keyframe animation, I have to ignore the matrix stack and write my own quaternion/matrix code. But for everything else, the OpenGL matrix functions are unbelievably useful.

So anyway, while I was playing around with it these past few days, I had to bring back a little bit of sanity to WebGL. For that reason I wrote SaneGL, a small piece of code that implements immediate mode drawing, and the OpenGL matrix stack on top of WebGL. I bundled that along with a small matrix math library and some helper functions for WebGL programs in a project called webgl-tools, which you can find in my mercurial repository: http://nuclear.mutantstargoat.com/hg/webgl-tools.

Oh by the way, if you’re one of those misguided sods that keep using windows, and you try to run any webgl apps right now you will probably be disappointed. In an unprecedented inspiration of pure stupidity, both firefox4 and chrome chose to implement WebGL over Direct3D by default on windows, using a project called ANGLE. The reason for that, they say, is that most graphics card vendors provide buggy OpenGL implementations on windows, so apparently it makes sense to write an even more buggy OpenGL->D3D translator and use that.

Initially I thought that ANGLE fails to translate huge shaders such as the one on my fractal raytracer, but in fact it seems to fail on pretty much everything, complex or trivial. The only way for windows users to use WebGL at the moment until mozilla and google comes to their senses and make ANGLE a fallback for known buggy OpenGL implementations instead of the default choice, is to go and force the browsers to use OpenGL instead. On firefox you can do that by setting the about:config variable “webgl.prefer-native-gl” to true, while chrome requires the command-line argument: –use-gl=desktop.

On GNU/Linux, as long as you have an nvidia card everything should be peachy from the get-go. Other cards are apparently blacklisted by firefox, so you’ll have to set webgl.force-enable to true, and pray to Odin.

## Escaping glutMainLoop

Let’s say you’re writing a distinctly glut-like window-system abstraction library for OpenGL context creation, event handling, etc. For those not familliar with the way one uses OpenGL to draw graphics, what happens is you talk to the native window system (X11, Win32 API, etc) to create a window and process events, then you create an OpenGL context and you bind it to that window using again platform-specific calls (GLX, WGL, etc).

So let’s say you’re writing that code, but you decided your library will allow the user to keep control of the main loop, so you provide a funcion called something like `process_events` to run a single iteration of your event processing, so that the user may call it in a loop. How do you implement that on top of glut, which has a single glutMainLoop function that doesn’t ever return?

By the way, for those curious on why would you do that in the first place, the reason to write a glut backend for this library, would be as a catch-all fallback to be able to run on platforms for which no native backend is yet written.

On GNU/Linux systems generally we don’t have the original GLUT, but rather FreeGLUT, which is nice enough to provide a `glutMainLoopEvent` function which runs a single iteration of the event loop, so we just call that from process_events and we’re done. But I have actually written an X11/GLX backend for my library, so I don’t need GLUT there, I need it on other systems. So how to break the chains of `glutMainLoop` and return after each iteration of the event handling loop?

The solution is obvious, use setjmp/longjmp. In `process_events` we call setjmp which obviously returns 0 the first time around, in which case glutMainLoop is called. Now glut enters its infinite loop and waits for events from the window system. As soon as all pending events are processed, or if there are no events to be processed, it calls our idle callback, then when that returns it loops back to the top again, and again, and again.

Of course we set up an idle callback that doesn’t actually return. It calls the user’s idle callback if there’s one registered, and then calls longjmp which unwinds the stack until we end up back into `process_events` at which point setjmp returns non-zero and we return execution to the user.

One minor issue that needs to be addressed is that since we set an idle function, if the user didn’t set one with our library, we’re wasting cpu cycles busy-looping because GLUT will never block waiting for events when there’s an idle callback. This again is easily remedied. If the user didn’t register an idle function with us, we don’t actually set our idle function to GLUT a-priori, instead we wait for one of the other event callbacks to trigger, and at the end of those callbacks we set the idle callback, and remove it again when it gets called, before longjmping back to the user.

Here are a few snippets of the actual code demonstrating the above:
Read the rest of this entry »