## Calculating hierarchical animation transformation matrices

This is a short post with my thoughts on what’s the best way to design a transformation hierarchy for animation, and how the “ideal design” changes over time.

### Bottom-up lazy evaluation magic

For a long time, I was a big fan of backwards (bottom-up) lazy evaluation of transformation hierarchies. Essentially, each node in the hierarchy is an XFormNode with a get_matrix(long msec) function, which calculates the node’s own transformation matrix for the requested time, then calls parent->get_matrix(msec) and returns the concatenated matrix.

 

Matrix4x4 XFormNode::get_matrix(long msec) const
{
Matrix4x4 xform;
calc_matrix(&xform, msec);

if(parent) {
xform = parent->get_matrix(msec) * xform;
}
return xform;
}

 

Of course, such a scheme would be wasteful if these matrices were calculated every time get_matrix is called. For instance, if a node is part of a hierarchy, its get_matrix will be called when we need to draw the object corresponding to this node, and also every time the get_matrix of any one of its descendants is called, due to the recursive bottom-up evaluation of matrices outlined previously. If one considers the possibility of drawing an object multiple times for various multi-pass algorithms, the problem gets worse, with the limiting worst-case scenario being ray tracing, which would require these functions to be called at the very least once per ray cast.

It follows, then, that such a design goes hand in hand with lazy evaluation and caching of calculated node matrices. The XFormNode class would hold the last requested time and corresponding matrix, and when get_matrix is called, if the requested time matches the last one, we just return the cached matrix instead of recalculating it.

 

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
if(msec == cached_msec) {
return cached_matrix;
}

calc_matrix(&cached_matrix, msec);
cached_msec = msec;

if(parent) {
cached_matrix = parent->get_matrix(msec) * cached_matrix;
}
return cached_matrix;
}

 

This worked nicely for a long time, and it worked like magic. At any point I could just ask for the transform at time X and would get the matrix automatically, including any effects of hierarchy, keyframe interpolations, etc. It’s all good… until suddenly processors stopped getting faster, Moore’s law went belly up, and after the shock passed we all sooner or later realised that single-threaded graphics programs are a thing of the past.

### Multithreading pissed on my rug

In the brave new multithreaded world, lazy evaluation becomes a pain in the ass. The first knee-jerk reaction is to add a mutex in XFormNode, and lock the hell out of the cached matrices. And while that might be ok for an OpenGL program which won’t have more than a couple of threads working with the scene database at any point (since rendering itself can only be done safely from a single thread), it throws out of the window a lot of concurrency that can take place in a raytracer where at any point there could be 8 or more threads asking for the matrix of any arbitrary node.
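As a rough illustration of that knee-jerk mutex approach, here is a minimal sketch. It assumes C++11 `std::mutex`, and it swaps the real matrix class for a stand-in `Matrix` that only carries a single offset; the `speed` member is likewise a made-up animation parameter, there only to make the example self-contained.

```cpp
#include <mutex>

// Stand-in for a real 4x4 matrix (assumption for illustration): a single
// translation offset is enough to show the caching and locking logic.
struct Matrix { float offs = 0.0f; };

inline Matrix operator *(const Matrix &a, const Matrix &b)
{
	Matrix res;
	res.offs = a.offs + b.offs;
	return res;
}

class XFormNode {
public:
	XFormNode *parent = nullptr;
	float speed = 0.0f;	// hypothetical animation: offset = speed * seconds

	// Returns by value, so callers never touch the cache outside the lock.
	Matrix get_matrix(long msec) const
	{
		std::lock_guard<std::mutex> lock(cache_mutex);

		if(!cache_valid || msec != cached_msec) {
			Matrix xform;
			xform.offs = speed * msec / 1000.0f;

			if(parent) {
				// locks are always taken child first, then parent
				xform = parent->get_matrix(msec) * xform;
			}
			cached_matrix = xform;
			cached_msec = msec;
			cache_valid = true;
		}
		return cached_matrix;
	}

private:
	mutable std::mutex cache_mutex;
	mutable Matrix cached_matrix;
	mutable long cached_msec = 0;
	mutable bool cache_valid = false;
};
```

Every thread asking for the same node serializes on that node’s mutex, and threads asking for different times keep evicting each other’s cached result, which is exactly the contention that hurts in a raytracer.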

A second way to deal with this issue is to have each thread keep its own copy of the matrix cache, in thread-specific memory. I’m ashamed to admit I never got around to doing any actual performance comparisons on this, though I’ve used it for quite some time in my programs. In theory it avoids having to wait for any other thread to access the cache, so it should be faster, but it needs a pthread_getspecific call in every get_matrix invocation, which comes with its own overhead.

 

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
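MatrixCache *cache = (MatrixCache*)pthread_getspecific(cache_key); // cache_key created in the constructor for each node
if(!cache) {
// first time we need a cache for this thread we'll have to create it
cache = new MatrixCache;
// associate it with this thread, otherwise the next call won't find it
pthread_setspecific(cache_key, cache);
// remember it for cleanup (cache_list is shared between threads, so it needs locking)
cache_list.push_back(cache);
}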

if(msec == cache->msec) {
return cache->matrix;
}

calc_matrix(&cache->matrix, msec);
cache->msec = msec;

if(parent) {
cache->matrix = parent->get_matrix(msec) * cache->matrix;
}
return cache->matrix;
}

 

This works fine, and although we managed to avoid blocking concurrent use of get_matrix, we had to add some amount of overhead for thread-specific storage calls, and the code became much more complex all over the place: invalidations must also access this thread-specific storage, we need cleanup for the per-thread MatrixCache objects, etc.

### Return of the top-down evaluation

So nowadays I’m starting to lean more towards the simpler, less automagic design of top-down evaluation. It boils down to just going through the hierarchy once to calculate all the matrices recursively, then at any point we can just grab the previously calculated matrix of any node and use it.

 

void XFormNode::eval(long msec)
{
calc_matrix(&matrix, msec);

if(parent) {
matrix = parent->matrix * matrix;
}

for(size_t i=0; i<children.size(); i++) {
children[i]->eval(msec);
}
}

 

The simplicity of this two-pass approach is hard to overlook. However, it’s just not as good for some things as my original ideal method. It works fine for OpenGL programs where it suffices to calculate transformations once per frame, and it even works fine for simple raytracers where we again have a single time value for any given frame. It breaks down, though, for ray-tracers doing distribution ray tracing for motion blur.

### No rest for the wicked

The best way to add motion blur to a ray tracer is through a Monte Carlo method invented by Cook, Porter and Carpenter, called “distribution ray tracing”. In short, when spawning primary rays we have to choose a random time in the interval centered around the frame time and extending to the past and future, as far as dictated by the shutter speed of the camera. This time is then used both to calculate the position and direction of the ray, which thus might differ between sub-pixels if the camera is moving, and to calculate the positions of the objects we’re testing for intersections against. Then if we cast many rays per pixel and average the results, we’ll get motion blur on anything that moves significantly while the virtual shutter is open (example from my old s-ray renderer).
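The time selection step could look something like the following sketch. The function name and the `shutter_msec` parameter (the total time the virtual shutter stays open) are made up for illustration, and I’m using the C++11 `<random>` generator rather than whatever a real renderer would use.

```cpp
#include <cmath>
#include <random>

// Pick a random sample time (in milliseconds) inside the shutter interval,
// which is centered on the frame time and extends shutter_msec / 2 into the
// past and into the future.
long sample_ray_time(long frame_msec, long shutter_msec, std::mt19937 &rng)
{
	std::uniform_real_distribution<double> dist(-0.5, 0.5);
	return frame_msec + std::lround(dist(rng) * shutter_msec);
}
```

Each primary ray would call this once, and then use the resulting time both for the camera transformation and for every node matrix requested while intersecting that ray.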

It’s obvious that calculating matrices once per frame won’t cut it with advanced ray-tracers, so there’s no getting rid of the complexity of the lazy bottom-up scheme in that case. Admittedly, however, caching won’t do much for us either because every sub-pixel will request the matrix at a different time anyway, so we might as well just calculate matrices from scratch all the time, and skip the thread-specific access overhead. The jury is still out on that one.

Do you have a favourite design for hierarchical animation code? Feel free to share it by leaving a comment below!

## Generating multiple sample positions per pixel

Anti-aliasing in ray tracing requires casting multiple rays per pixel, to sample the whole solid angle subtended by the imaginary surface of each pixel, if we consider it to be a small rectangular part of the view plane (see diagram).

It’s obvious that to be able to generate multiple primary rays for each pixel, we need an algorithm that, given the sample number, calculates a sample position within the area of a pixel. Since it’s trivial to map points in the unit square onto the actual area of an arbitrary pixel, it makes sense to write this sample position generation function so that it calculates points in the unit square.
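To show just how trivial that mapping is, here is a small sketch. It assumes the unit square is centered on the origin, so positions range from -0.5 to 0.5 on each axis, and that pixel (0, 0) covers the image-space area from (0, 0) to (1, 1). The function name is hypothetical.

```cpp
// Map a sample position in the centered unit square [-0.5, 0.5]^2 onto the
// area of pixel (px, py), producing continuous image-space coordinates.
void sample_to_image(int px, int py, const float *pos, float *img)
{
	img[0] = (float)px + 0.5f + pos[0];
	img[1] = (float)py + 0.5f + pos[1];
}
```

With the mapping out of the way, the interesting part is the function that generates the positions in the first place.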

The easiest way to write such a function would be to generate random points in the unit square, like this:
 

void get_sample_pos(float *pos)
{
pos[0] = (float)rand() / (float)RAND_MAX - 0.5;
pos[1] = (float)rand() / (float)RAND_MAX - 0.5;
}

 

The problem with this approach is that, even if our random number generator really has a perfectly uniform distribution, any finite number of sample positions generated, especially if that number is in the low tens, will probably not cover the area of the pixel in anything resembling a uniform sampling. Clusters are very likely to occur, leaving large areas of the pixel space unexplored by our rays.
As the number of samples gets larger and larger, this problem is somewhat mitigated, but especially if we’re not writing a path tracer, we’re usually dealing with anything between 4 and 20 rays per pixel, no more.

The following animation shows random sample positions generated by the code above. Even at about 40 samples, the left part of the pixel is inadequately sampled.

Another approach is to avoid randomness. The following function gets the sample number as input, and calculates its position by recursively subdividing the pixel area, taking care to spread the samples of each recursion level around as much as possible instead of focusing on one quadrant at a time.
 

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos);

void get_sample_pos(int sidx, float *pos)
{
pos[0] = pos[1] = 0.0f;
get_sample_pos_rec(sidx, 1.0f, 1.0f, pos);
}

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos)
{
static const float subpt[4][2] = {
{-0.25, -0.25}, {0.25, -0.25}, {-0.25, 0.25}, {0.25, 0.25}
};

/* base case: sample 0 is always in the middle, do nothing */
if(!sidx) return;

/* determine which quadrant to recurse into */
int quadrant = ((sidx - 1) % 4);
pos[0] += subpt[quadrant][0] * xsz;
pos[1] += subpt[quadrant][1] * ysz;

get_sample_pos_rec((sidx - 1) / 4, xsz / 2, ysz / 2, pos);
}

 

And here’s the animation showing that code in action (colors denote the recursion depth):

This sampling is perfectly uniform, but it’s still not ideal. The problem is that whenever we’re sampling in a regular grid, no matter how fine that grid is, we will introduce aliasing. By breaking up each pixel into multiple subpixels like this we effectively increase the cutoff frequency after which aliasing occurs, but we do not eliminate it.

The best solution is to combine both techniques. We need randomness to convert aliasing into noise, which is much less perceptible by human brains trained by evolution to recognize patterns. But we also need uniform sampling to properly explore the whole area of each pixel.

So, we’ll employ a technique known as jittering: first we uniformly subdivide the pixel into subpixels, and then we randomly perturb the sample position of each subpixel inside the area of that subpixel. The following code implements this algorithm:
 

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos);

void get_sample_pos(int sidx, float *pos)
{
pos[0] = pos[1] = 0.0f;
get_sample_pos_rec(sidx, 1.0f, 1.0f, pos);
}

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos)
{
static const float subpt[4][2] = {
{-0.25, -0.25}, {0.25, -0.25}, {-0.25, 0.25}, {0.25, 0.25}
};

if(!sidx) {
/* we're done, just add appropriate jitter */
pos[0] += (float)rand() / (float)RAND_MAX * xsz - xsz / 2.0;
pos[1] += (float)rand() / (float)RAND_MAX * ysz - ysz / 2.0;
return;
}

/* determine which quadrant to recurse into */
int quadrant = ((sidx - 1) % 4);
pos[0] += subpt[quadrant][0] * xsz;
pos[1] += subpt[quadrant][1] * ysz;

get_sample_pos_rec((sidx - 1) / 4, xsz / 2, ysz / 2, pos);
}

 

And here’s the animation showing the jittered sample position generator in action:

## Color space linearity and gamma correction

Ok, so it’s a well-known fact among graphics practitioners that pretty much every game does rendering incorrectly. Since performance, and not correctness, is always the prime consideration in game graphics, we usually tend to turn a blind eye towards such considerations. However, with today’s ultra-high performance programmable shading processors and hardware LUT support for gamma correction, excuses for why we continue doing it the wrong way become progressively more and more lame. :)

The gist of the problem with traditional real-time rendering is that we’re trying to do linear operations in non-linear color spaces.

Let’s take lighting calculations for example. When light hits a plane at a 60 degree incidence angle from the normal vector of the plane, Lambert’s cosine law states that the intensity of the diffusely reflected light off the plane (radiant exitance) is exactly half of the intensity of the incident light (irradiance) from that light source. However the monitor, responsible for taking all those pixel values and sending them rushing into our retinas, does not play along with our assumptions. That half intensity grey light we expect from that surface becomes much darker due to the non-linear (power-law) response curve of the electron gun.

Simply put, when half the voltage of the full input range is applied to the electron gun, much less than half the possible electrons hit the phosphor in the glass, making it emit lower than half-intensity light to the user. That’s not a defect of the CRT monitors; all kinds of monitors, tv screens, projectors, or other display devices work the same way.
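To put a rough number on “much less than half”, here is a small sketch. It assumes the display response is a pure power law with an exponent of 2.2, which is only the usual ballpark figure and not a measurement of any particular monitor. Both function names are made up for this example.

```cpp
#include <cmath>

// Lambert's cosine law: fraction of the incident light that is diffusely
// reflected, for a given incidence angle measured from the surface normal.
double lambert(double incidence_deg)
{
	const double pi = 3.14159265358979323846;
	return std::cos(incidence_deg * pi / 180.0);
}

// Approximate display response: normalized input value in, normalized
// emitted intensity out (gamma = 2.2 is an assumed typical value).
double display_response(double v, double gamma = 2.2)
{
	return std::pow(v, gamma);
}
```

For the 60 degree case, lambert(60.0) gives 0.5, and display_response(0.5) comes out at roughly 0.22, so the surface ends up emitting less than a quarter of full intensity instead of the half we calculated.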

So how do we correct that? We need to use the inverse of the monitor response curve, to correct our output colors, before they are fed to the monitor, so that we can be sure that our linear color space where we do our calculations, does not get bent out of shape before it reaches our eyes. Since the monitor response curve is approximately a function of the form: $x^\gamma$ where $\gamma = 2.2$ usually, it mostly suffices to do the following calculation before we write the color value to the framebuffer: $x^\frac{1}{\gamma}$. Or in a pixel shader:
 

gl_FragColor.rgb = pow(color.rgb, vec3(1.0 / 2.2));
 

That’s not entirely correct, because if we are doing any blending, it happens after the pixel shader writes the color value, which means it would operate after this gamma correction, in a non-linear color space. It would be fine if this shader is a final post-processing shader which writes the whole framebuffer without any blending operations, but there is a better and more efficient way. If we just tell OpenGL that we want to output a gamma-corrected framebuffer, or more precisely a framebuffer in the sRGB color space, it can do this calculation using hardware lookup tables, after any blending takes place, which is efficient and correct. This functionality is exposed by the ARB_framebuffer_sRGB extension, and should be available on all modern graphics cards. To use it we need to request an sRGB-capable framebuffer during context creation (GLX_FRAMEBUFFER_SRGB_CAPABLE_ARB / WGL_FRAMEBUFFER_SRGB_CAPABLE_ARB), and enable it with glEnable(GL_FRAMEBUFFER_SRGB).

Now if we do just that, we’re probably going to see the following ghastly result:

The problem is that our textures are already gamma-corrected with a similar process, which makes them now completely washed out when we apply gamma correction in the end a second time. The solution is to make color values looked up from textures linear before using them, by raising them to the power of 2.2. This can either be done in the shader simply by: pow(texture2D(tex, tcoord).rgb, vec3(2.2)), or by using the GL_SRGB_EXT internal texture format instead of GL_RGB (EXT_texture_sRGB extension), to let OpenGL know that our textures aren’t linear and need conversion on lookups.

The result is correct rendering output, with all operations in a linear color space:

A final pitfall we may encounter is that if we use intermediate render targets during rendering, with 8 bits per color channel resolution, we will observe noticeable banding in the darker areas. That is because our 8bit/channel textures are now raised to a power and the result is again placed in an 8bit/channel render target, which obviously wastes color resolution and loses details that cannot be recovered later on when we gamma correct the values again. The bottom line is that we need higher precision intermediate render targets if we are going to work in a truly linear color space. The following screenshots show a dark area of the game when using a regular GL_RGBA intermediate render target (top), and when using a half-float GL_RGBA16F render target (bottom):

Color artifacts are clearly visible in the first image, around the dark unlit area.
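The loss of resolution is easy to demonstrate on the CPU. The following sketch counts how many distinct 8-bit values are left after linearizing every possible 8-bit input and storing the result in 8 bits again. It assumes a plain power curve with an exponent of 2.2 instead of the exact piecewise sRGB curve, and the function name is made up.

```cpp
#include <cmath>
#include <set>

// Count how many distinct 8-bit values survive when every possible 8-bit
// gamma-encoded value is linearized (raised to the power gamma) and stored
// back into an 8-bit channel.
int count_linear_levels(double gamma = 2.2)
{
	std::set<int> levels;
	for(int i = 0; i < 256; i++) {
		double linear = std::pow(i / 255.0, gamma);
		levels.insert((int)std::lround(linear * 255.0));
	}
	return (int)levels.size();
}
```

The result is noticeably less than 256, and the losses are concentrated at the dark end, where more than ten of the lowest input values all collapse to zero. Those are exactly the values that get stretched apart again by the final gamma correction, which is where the banding comes from.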

## Simple color-grading for games

Color grading is an easily overlooked, but extremely powerful way to add character to a game. Subtle color changes make day-night cycling much more atmospheric. Different areas can have their own signature “feel” based on how saturated the colors are. Dark games can shift the unlit areas of an environment to cool bluish tint that can remain visible but still feel like darkness. The possibilities are endless.

I hadn’t given much thought to color grading until a friend (Samurai) told me of an extremely simple and powerful way to add color grading to a game. So simple in fact, that I had to try it as soon as possible!

The idea has two parts. First the obvious bit: Use a 3D texture as a look-up table, to map the RGB colors produced by the renderer to a different set of RGB colors which is the color-graded output. That translates to pretty much the following GLSL post-processing fragment shader:

uniform sampler2D framebuf;
uniform sampler3D gradelut;

void main()
{
vec3 col = texture2D(framebuf, gl_TexCoord[0].st).xyz;
gl_FragColor = vec4(texture3D(gradelut, col).xyz, 1.0);
}

And now the brilliant bit: write a bit of code to save a screenshot of the game with the “identity” 3D lookup-table serialized in the last few scanlines. Give that screenshot to an artist, and let him work his magic, color-grading it in photoshop or whatever… Did you get that? In the process of color-grading that screenshot, the artist automatically produces the look-up table which can be used to reproduce the same color-grading in-game, as part of the last few scanlines of the image! Feed that palette back into the game and it’s automatically color-graded!
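Generating the identity table itself could look something like this sketch. The function name, the flat RGB byte layout (red varying fastest, then green, then blue) and the choice of table size are all assumptions for illustration; writing the pixels into the last scanlines of the screenshot and reading them back is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build an "identity" color grading LUT with `size` entries per axis
// (size must be at least 2), as a flat array of 8-bit RGB triplets.
// Every entry maps a color to itself.
std::vector<uint8_t> make_identity_lut(int size)
{
	std::vector<uint8_t> lut;
	lut.reserve((size_t)size * size * size * 3);

	for(int b = 0; b < size; b++) {
		for(int g = 0; g < size; g++) {
			for(int r = 0; r < size; r++) {
				lut.push_back((uint8_t)(r * 255 / (size - 1)));
				lut.push_back((uint8_t)(g * 255 / (size - 1)));
				lut.push_back((uint8_t)(b * 255 / (size - 1)));
			}
		}
	}
	return lut;
}
```

A 16x16x16 table, for example, takes up 4096 pixels, which fits in a handful of scanlines at any reasonable screenshot resolution.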

So I used the dungeon crawler I’ve been writing recently to try out this algorithm. The output of my dungeon crawler as it stands, is not the best material to try color-grading on as it’s already very dark and highly tinted, leaving too little space for tweaking without banding everything to oblivion, but nevertheless I wrote the code, gave the screenshot to my friend Rawnoise to play with it in photoshop for a couple of minutes, fed it back into the game, and the result can be seen below.

Lower-left part of the screenshot produced by the game, with the identity palette attached:

Screenshot of the game before and after color grading:

Of course you can always opt for a completely bizarre effect just as easily. This is the result of me moving color curves in gimp randomly, and then feeding the resulting palette into the game:

Obviously this algorithm opens up all sorts of interesting possibilities, such as having two palettes and interpolating between them during sunset, or when a player crosses the boundary between two areas, etc. Simple, yet effective.
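Interpolating between two palettes can be sketched on the CPU like this, assuming both tables are flat arrays of 8-bit values with identical size and layout. The function name is made up. In practice it would probably be cheaper to bind both 3D textures and blend the two lookups in the shader with mix().

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Blend two color grading LUTs of identical size and layout.
// t = 0 gives `a`, t = 1 gives `b`; t is expected to be in [0, 1].
std::vector<uint8_t> blend_luts(const std::vector<uint8_t> &a,
		const std::vector<uint8_t> &b, float t)
{
	std::vector<uint8_t> res(a.size());
	for(size_t i = 0; i < a.size(); i++) {
		float val = a[i] + (b[i] - a[i]) * t;
		res[i] = (uint8_t)(val + 0.5f);	// round to nearest
	}
	return res;
}
```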

## Dungeon crawler game prototype

I started writing a first person dungeon crawler game recently. Nothing ground-breaking, but I intend to fill up the void of the simplistic gameplay with over the top eye-candy. My main inspiration comes from Eye of the Beholder-esque dungeon crawlers with awesome graphics (for their respective times) such as Stonekeep and Legend of Grimrock, without necessarily intending to stay true to the retro 90 degree grid-based movement.

Before starting, I wanted to try out a couple of things to see how they feel in practice, so I decided to make a prototype. The main thing I wanted to try out was a suggestion of a friend of mine, for keeping the level creation as simple as possible: use a regular grid tile-based system with a couple of enhancements. Namely:

• Allowing multiple detail tiles on a single grid cell, which makes it easy to lay down the whole level’s corridors and then add details such as torches on the walls, furniture or whatever here and there.
• Allowing arbitrary geometry for each tile, not necessarily contained in the volume of the grid cell it occupies. This would allow, for instance, elaborate prefab rooms to be attached at various places of the dungeon.

It turns out I don’t like the extended grid-based idea that much, and for the actual game I will revert to a more powerful level organization I came up with some time ago. More on that when I actually implement it.

The rendering is done with “deferred shading”, a neat technique I implemented once before in the Theseis engine, which makes it possible to have hundreds of actual dynamic light sources active at the same time. This is the cornerstone of my “lots of eye-candy” idea, because it enables each and every torch, spell effect, flame, or magical glow to illuminate the dungeons and its denizens dynamically.

Finally I implemented a nice positional audio and music playback system, on top of OpenAL. It keeps static audio sources in a kd-tree for efficient selection of the nearest ones within a certain radius around the player and enables/disables the appropriate ones automatically.

In case you are curious to see it, I just uploaded a video on youtube. The actual tileset is obviously placeholder, since I made it myself in blender, to be replaced by proper artwork later on. The music and sound effects are made by George Savinidis, who will be in charge of all the audio production for this game.

## PCB Etching

I’m going through an electronics phase at the moment, and I did a few circuits on stripboards (a kind of perfboard), which are ok but it’s always a pain in the ass to wire them up correctly. Btw here’s a relatively complex one I did a few months ago. So, I thought it would be awesome to create my own PCBs instead of using messy error-prone perfboards all the time, plus I always wanted to try the laser-printer method for homemade PCB creation back from when I didn’t actually own a laser printer.

I didn’t want to start with a huge complex circuit, so I decided to make a PCB version of my vsync shutter glasses driver.

First step was to draw a schematic…

## VSync-driven shutter glasses

All my previous stereoscopic attempts are fun and cool, but what I really wanted was to manage to connect my cheap-o shutter glasses to my computer, and use them for stereoscopic rendering. The main barrier is that consumer nvidia cards do not include a stereo port (unlike expensive quadros), and their drivers don’t support stereoscopic OpenGL visuals.

I had already side-stepped the second problem by writing stereowrap, an LD_PRELOAD-based tool that fakes OpenGL stereo contexts for GLX apps and presents the stereo pair in a number of ways, such as various anaglyphs, side-by-side, etc.

So at some point I decided to attack the first problem. Turns out there’s a simple way to drive shutter glasses. It’s a brilliant idea, and I didn’t come up with it, but it boils down to making a simple circuit that toggles the shutter glasses when it detects a pulse on the monitor vsync wire!

I immediately designed a circuit based on this design, but modified to work with the signals expected by my ASUS VR-100 shutter glasses. Then I wired it up on a perfboard, and it worked like a charm! Finally I added a sequential stereo presentation method to stereowrap, synchronized with vsync, and suddenly I can view all my stereoscopic programs in awesome full-color stereo glory.

The downside to this simple contraption is that it doesn’t really know whether the left or the right image is presented at any given time; it only knows when to switch between them. That’s why the switch is included in the circuit: if the image appears wrong, and you can really tell by your brain attempting to blow up while looking at it, the switch can be used to flip the glasses around instantly. If, however, the application can’t catch up with the refresh rate of the monitor and misses a vsync interval, the images will flip again.

I plan to build a more intelligent, microcontroller-based, driver circuit at some point. But for now, the simple vsync driver works well enough.