Tutorial: Practical makefiles, by example

I know a lot of programmers, who are afraid of makefiles. Most of them come from a windows background, and when they migrate to GNU/Linux or MacOS X they naturally gravitate towards clunky IDEs to manage their build process. Others, fearing that makefiles are going to be too complex to manage manually, try to use even clunkier makefile (or project file) generators like cmake.

While none of the above solutions is without some merit in particular cases, I strongly believe that there’s no faster and simpler way to build your project, than writing a small makefile. And in order to dispel the fear, I decided to write a tutorial about how to write simple, practical makefiles, for your programs.

The article is too big and too … structured, for my idea of what a blog post should be, so I’m hosting it on my web site along with some of my previous articles instead.

So anyway here it is, let me know if you find it useful: Practical makefiles, by example.
Feel free to use the comments section in this post for any feedback, or send me an email if you prefer.

Btw, I used reStructuredText, and the docutils translators for this. So there’s also a PDF version, produced through the rst2latex translator. It’s not as good as it could have been if I wrote it directly in LaTeX, but I suppose it’s serviceable if you prefer a printable off-line version.

How to use standard output streams for logging in android apps

android_dev

Android applications are strange beasts. Even though under the hood it’s basically a UNIX system using Linux as its kernel, there are other layers between (native) android apps and the bare system, that some things are bound to fall through the cracks.

When developing android apps, one generally doesn’t login to the actual device to compile and execute the program; instead a cross-compiler is provided by google, along with a series of tools to package, install, and execute the app. The degrees of separation from the actual controlling terminal of the application, make it harder to just print debugging messages to the stdout and stderr streams and watch the output like one could do while hacking a regular program. For this reason, the android NDK provides a set of logging functions, which append the messages to a global log buffer. The log buffer can be viewed and followed remotely, from the development machine, over USB, by using the “adb logcat” tool.

This is all well and good, but if you’re porting a program which prints a lot of messages to stdout/stderr, or if you just like the convenience of using the standard printf/cout/etc functions, instead of funny looking stuff like __android_log_print(), you might wonder if there is a way to just use the standard I/O streams instead, right? Well so did I, and guess what… this is still UNIX so the answer is of course you can!

Read the rest of this entry »

OculusVR SDK and simple oculus rift DK2 / OpenGL test program

Edit: I have since ported this test program to use LibOVR 0.4.4, and works fine on both GNU/Linux and Windows.

I’ve been hacking with my new Oculus Rift DK2 on and off for the past couple of weeks now. I won’t go into how awesome it is, or how VR is going to change the world, and save the universe or whatever; everybody who cares, knows everything about it by now. I’ll just share my experiences so far in programming the damn thing, and post a very simple OpenGL test program I wrote last week to try it out, that might serve as a baseline.

OculusDK2 OpenGL test program

Read the rest of this entry »

Fixing old bugs, without the source

Once upon a time, I made a special kind of demoscene production: a wedtro. Which is a kind of small demo, made as a present to some other member of the demoscene, who is getting married. This wedtro, turned out to be the buggiest piece of shit I’ve ever released, and it’s been bugging me for the past decade. Until today I decided to fix it.

Read the rest of this entry »

Plugin reloading for 3dsmax exporters

I revisited recently a dormant project of mine, for which I unfortunately need to write a 3dsmax exporter plugin.

Now, I’m always pissed off from the start when I have to write code on windows and visual studio, but having to deal with 3dsmax on top of that, really just adds insult to injury. It’s not just that maxsdk is a convoluted mess. Or that it needs a very specific version of visual studio to write plugins for it (which is really Microsoft’s fault, to be fair). No, my biggest issue so far is that 3dsmax takes about 3 years to start up, and there is no way to unload, or reload a plugin without restarting it.

Whenever I fix a tiny thing in the exporter plugin I’m writting, and I want to try it out and see if it does the buissiness, I have to shut down 3dsmax, start it up again (which takes forever), load my test scene, then try to export again and see what happens. This is obviously unacceptable, so I really had to do something about it.

Read the rest of this entry »

Calculating hierarchical animation transformation matrices

This is a short post with my thoughts on what’s the best way to design a transformation hierarchy for animation, and how the “ideal design” changes over time.

Bottom-up lazy evaluation magic

For a long time, I was a big fan of backwards (bottom-up) lazy evaluation of transformation hierarchies. Essentially having an XFormNode class for each node in the hierarchy with a get_matrix(long msec) function which calculates the current node’s transformation matrix for the current time, then calls parent->get_matrix(msec) and returns the concatenated matrix.

Matrix4x4 XFormNode::get_matrix(long msec) const
{
    Matrix4x4 xform;
    calc_matrix(&xform, msec);

    if(parent) {
        xform = parent->get_xform(msec) * xform;
    }
    return xform;
}

Of course, such a scheme would be wasteful if these matrices where calculated every time get_matrix functions are called. For instance if a node is part of a hierarchy, then its get_matrix will be called when we need to draw the object corresponding to this node, and also every time the get_matrix of any one of its descendants is called, due to the recursive bottom-up evaluation of matrices outlined previously. If one considers the posibility of drawing an object multiple times for various multi-pass algorithms the problem gets worse, with the limiting worse case scenario being if we’re doing ray-tracing which would require these functions to be called at the very least once per ray cast.

It follows then, that such a design goes hand in hand with lazy evaulation and caching of calculated node matrices. The XFormNode class would hold the last requested time and corresponding matrix, and when get_matrix is called, if the requested time matches the last one, we just return the cached matrix instead of recalculating it.

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
    if(msec == cached_msec) {
        return cached_matrix;
    }

    calc_matrix(&cached_matrix, msec);
    cached_msec = msec;

    if(parent) {
        cached_matrix = parent->get_xform(msec) * cached_matrix;
    }
    return cached_matrix;
}

This worked nicely for a long time, and it worked like magic. At any point I could just ask for the transform at time X and would get the matrix automatically, including any effects of hierarchy, keyframe interpolations, etc. It’s all good… until suddenly processors stopped getting faster any more, moore’s law went belly up, and after the shock passed we all sooner or later realised that single-threaded graphics programs are a thing of the past.

Multithreading pissed on my rug

In the brave new multithreaded world, lazy evaluation becomes a pain in the ass. The first knee-jerk reaction is to add a mutex in XFormNode, and lock the hell out of the cached matrices. And while that might be ok for an OpenGL program which won’t have more than a couple of threads working with the scene database at any point (since rendering itself can only be done safely from a single thread), it throws out of the window a lot of concurrency that can take place in a raytracer where at any point there could be 8 or more threads asking for the matrix of any arbitrary node.

A second way to deal with this issue is to have each thread keep its own copy of the matrix cache, keeping it in thread-specific memory. I’m shamed to admit I never got around to doing any actual performance comparisons on this, though I’ve used it for quite some time in my programs. In theory it avoids having to wait for any other thread to access the cache, so it should be faster in theory, but it needs a pthread_getspecific call in every get_matrix invocation which comes with its own overhead.

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
    MatrixCache *cache = pthread_getspecific(cache_key); // cache_key created in the constructor for each node
    if(!cache) {
        // first time we need a cache for this thread we'll have to create it
        cache = new MatrixCache;
        pthread_mutex_lock(&cache_list_lock);
        cache_list.push_back(cache);
        pthread_mutex_unlock(&cache_list_lock);
        pthread_setspecific(cache_key, cache);
    }

    if(msec == cache->msec) {
        return cache->matrix;
    }

    calc_matrix(&cache->matrix, msec);
    cache->msec = msec;

    if(parent) {
        cache->matrix = parent->get_xform(msec) * cache->matrix;
    }
    return cache->matrix;
}

This works fine, and although we managed to avoid blocking concurrent use of get_matrix, we had to add some amount of overhead for thread-specific storage calls, and the code became much more complex all over the place: invalidations must also access this thread-specific storage, we need cleanup for the per-thread MatrixCache objects, etc.

Return of the top-down evaluation

So nowadays I’m starting to lean more towards the simpler, less automagic design of top-down evaluation. It boils down to just going through the hierarchy once to calculate all the matrices recursively, then at any point we can just grab the previously calculated matrix of any node and use it.

void XFormNode::eval(long msec)
{
    calc_matrix(&matrix, msec);

    if(parent) {
        matrix = parent->matrix * matrix;
    }

    for(size_t i=0; i<children.size(); i++) {
        children[i]->eval(msec);
    }
}

The simplicity of this two-pass approach is hard to overlook, however it’s just not as good for some things as my original ideal method. It works fine for OpenGL programs where it suffices to calculate transformations once per frame, it even works fine for simple raytracers where we have again a single time value for any given frame. However it breaks down for ray-tracers doing distribution ray tracing for motion blur.

No rest for the wicked

The best way to add motion blur to a ray tracer is through a monte-carlo method invented by Cook, Porter and Carpenter, called “distribution ray tracing“. In short, when spawining primary rays we have to choose a random time in the interval centered around the frame time and extending to the past and future, as far as dictated by the shutter speed of the camera. This time is then used both to calculate the position and direction of the ray, which thus might differ between sub-pixels if the camera is moving, and to calculate the positions of the objects we’re testing for intersections against. Then if we cast many rays per pixel and average the results, we’ll get motion blur on anything that moves significantly while the virtual shutter is open (example from my old s-ray renderer).

It’s obvious that calculating matrices once per frame won’t cut it with advanced ray-tracers, so there’s no getting rid of the complexity of the lazy bottom-up scheme in that case. Admittedly, however, caching won’t do much for us either because every sub-pixel will request the matrix at a different time anyway, so we might as well just calculate matrices from scratch all the time, and skip the thread-specific access overhead. The jury is still out on that one.

Do you have a favourite design for hierarchical animation code? Feel free to share it by leaving a comment below!

Generating multiple sample positions per pixel

Anti-aliasing in ray tracing, requires casting multiple rays per pixel, to sample the whole solid angle subtended by the imaginary surface of each pixel, if we consider it to be a small rectangular part of the view plane (see diagram).

framebuffer pixel area

It’s obvious that to be able to generate multiple primary rays for each pixel, we need to have an algorithm that given the sample number, calculates a sample position within the area of the a pixel. Since it’s trivial to map points in the unit square, onto the actual area of an arbitrary pixel, it makes sense to write this sample position generation function, so that it calculates points in the unit square.

The easiest way to write such a function would be to generate random points in the unit square, like this:

void get_sample_pos(float *pos)
{
    pos[0] = (float)rand() / (float)RAND_MAX - 0.5;
    pos[1] = (float)rand() / (float)RAND_MAX - 0.5;
}

The problem with this approach is that, even if our random number generator really has a perfectly uniform distribution, any finite number of sample positions generated, especially if that number is in the low tens, will probably not cover the area of the pixel in anything resembling a uniform sampling. Clusters are very likely to occur, leaving large areas of the pixel space unexplored by our rays.
As the number of samples gets larger and larger, this problem is somewhat mitigated, but especially if we’re not writting a path tracer, we’re usually dealing with anything between 4 to 20 rays per pixel, no more.

The following animation shows random sample positions generated by the code above. Even at about 40 samples, the left part of the pixel is inadequately sampled.
random sample generation

Another approach is to avoid randomness. The following function gets the sample number as input, and calculates its position by recursively subdividing the pixel area, taking care to spread the samples of each recursion level around as much as possible instead of focusing on one quadrant at a time.

void get_sample_pos(int sidx, float *pos)
{
    pos[0] = pos[1] = 0.0f;
    get_sample_pos_rec(sidx, 1.0f, 1.0f, pos);
}

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos)
{
    int quadrant;
    static const float subpt[4][2] = {
        {-0.25, -0.25}, {0.25, -0.25}, {-0.25, 0.25}, {0.25, 0.25}
    };

    /* base case: sample 0 is always in the middle, do nothing */
    if(!sidx) return;

    /* determine which quadrant to recurse into */
    quadrant = ((sidx - 1) % 4);
    pos[0] += subpt[quadrant][0] * xsz;
    pos[1] += subpt[quadrant][1] * ysz;

    get_sample_pos_rec((sidx - 1) / 4, xsz / 2, ysz / 2, pos);
}

And here’s the animation showing that code in action (colors denote the recursion depth):
regular subdivision sampling

This sampling is perfectly uniform, but it’s still not ideal. The problem is that whenever we’re sampling in a regular grid, no matter how fine that grid is, we will introduce aliasing. By breaking up each pixel into multiple subpixels like this we effectively increase the cutoff frequency after which aliasing occurs, but we do not eliminate it.

The best solution is to combine both techniques. We need randomness to convert aliasing into noise, which is much less perceptible by human brains trained by evolution to recognize patterns. But we also need uniform sampling to properly explore the whole area of each pixel.

So, we’ll employ a technique known as jittering: first we uniformly subdivide the pixel into subpixels, and then we randomly perturb the sample position of each subpixel inside the area of that subpixel. The following code implements this algorithm:

void get_sample_pos(int sidx, float *pos)
{
    pos[0] = pos[1] = 0.0f;
    get_sample_pos_rec(sidx, 1.0f, 1.0f, pos);
}

static void get_sample_pos_rec(int sidx, float xsz, float ysz, float *pos)
{
    int quadrant;
    static const float subpt[4][2] = {
        {-0.25, -0.25}, {0.25, -0.25}, {-0.25, 0.25}, {0.25, 0.25}
    };

    if(!sidx) {
        /* we're done, just add appropriate jitter */
        pos[0] += (float)rand() / (float)RAND_MAX * xsz - xsz / 2.0;
        pos[1] += (float)rand() / (float)RAND_MAX * ysz - ysz / 2.0;
        return;
    }

    /* determine which quadrant to recurse into */
    quadrant = ((sidx - 1) % 4);
    pos[0] += subpt[quadrant][0] * xsz;
    pos[1] += subpt[quadrant][1] * ysz;

    get_sample_pos_rec((sidx - 1) / 4, xsz / 2, ysz / 2, pos);
}

And here’s the animation showing the jittered sample position generator in action:
Jittered sample generator

Follow

Get every new post delivered to your Inbox.