Catching up: homebrew retro computer projects (Z80 and 68010)

Another catchup post, about my homebrew retro-computer projects. This started in 2016 when I made an extremely rudimentary, completely lacking in I/O capability, but extremely educational Z80 computer. I uploaded a video about it on youtube, while it was still on the breadboard stage, which demonstrated very clearly how the Z80 computer fetches and executes instructions by single-stepping the clock and inspecting the state of the data and address busses, as well as some important processor control signals at each step. Later I uploaded a short followup video to let it execute the program with a free-running clock. I also did a talk with the same demonstration at fosscomm 2016 with the final PCB version of this computer, but unfortunately there is no video of the event.

rudimentary Z80 8-bit computer

The next step was to make an improved homebrew computer, with proper I/O capabilities, which would be actually usable as a computer instead of just an educational demonstration. Initially I started designing an improved 8-bit computer based again on the Z80 processor, but soon I scrapped those plans and designed a 16-bit computer based on the Motorola 68000 processor instead.

The first video I uploaded about the 68k computer project was actually an attempt to familiarize myself with the processor by performing a similar single-stepping experiment again. This time I thought it would be fun to use a switches and LEDs panel similar to the old PDP or Altair interfaces to feed the processor with opcodes during the bus cycles. The video is somewhat long-winded, but I think if you skip the boring introduction, it’s also very educational, showing how to assemble a test program by hand, and how to coerce the 68k to single-step bus cycles while still keeping a continuously running clock, which is required for this processor to maintain its internal state.

I also uploaded a short followup progress report video shortly thereafter, with the computer constructed on a proper PCB, which shows the serial I/O interface and a bug in my initial design.

The computer works in this simple state, and I’m able to upload cross-compiled C programs through the serial port, and run them. It was a lot of fun reaching this stage, but then I got too lazy and didn’t even upload a proper demonstration video for that, or continue improving it for some time now.

homebrew 16bit motorola 68010 computer.

To make it more interesting, I made a test program for my 68k computer which calculates a koch snowflake fractal, and sends graphics commands to the serial terminal to draw it. Here’s a screenshot of xterm, which supports the vt330/vt340 ReGIS graphics commands.

My 68010 computer drew its first fractal

Edit: video of the computer running two test programs.

Catching up: SEGA Mega Drive (Genesis) PAL/NTSC mod

Continuing with the 2017 catchup, another retro venture was acquiring a SEGA Mega Drive for the first time to hack, play some retro games, and watch mega drive demos. None of my software hacks are interesting enough to post at this point, but I did like a hardware mod I did, to add a switch to the back of the unit, that flips the megadrive between PAL (europe) and NTSC (US) modes. This was necessary, because some games and demos only work in one of the two modes, while some NTSC games work on PAL consoles like mine, but run way too slow.

If you want to see the details, head over to my regular website, where I uploaded a complete writeup of the PAL/NTSC hack.

SEGA MegaDrive PAL/NTSC hack

Catching up: Amiga hardware hacks of 2017

So since I’m not really posting anything on this blog regularly. I decided to do a few catching up posts from time to time, with any cool stuff I did since last I posted here. First up, my amiga hardware hacks.

I bought an Amiga 500 in 2017, something that I meant to do for a long time now, to play with it, hack it, and see what makes it tick. So I ended up doing a couple of hardware hacks for it, to improve what I thought needed improving to make it more enjoyable to use and hack on.

First thing I did was make a “trapdoor” RAM expansion using 30pin SIMMs. 512k RAM expansions (to bring the total RAM to 1MB) was so ubiquitous back in the A500’s hayday that not having one would be crippling.

RAM expansion board

Next up, I wanted to set up a convenient hacking environment on the amiga itself, which is extremely annoying with my single floppy drive. Most people on the A500 used to use an external floppy for this purpose, but I decided to skip that and make an ATA hard disk interface for the amiga 500 instead.

A500 hard disk interface

Finally, it was really inconvenient having to move the whole amiga, with all the dangling cables behind it in front of my monitor to use it, which is necessary since the keyboard and the computer is one unit. Having to clear up space and manage the tangling mess of wires is not fun, so I decided to make a PS/2 keyboard controller for the A500, to allow me to connect an external standard PC keyboard, and leave the amiga tucked away on the corner of my desk out of my way.

PS/2 keyboard controller

How to use standard output streams for logging in android apps

android_dev

Android applications are strange beasts. Even though under the hood it’s basically a UNIX system using Linux as its kernel, there are other layers between (native) android apps and the bare system, that some things are bound to fall through the cracks.

When developing android apps, one generally doesn’t login to the actual device to compile and execute the program; instead a cross-compiler is provided by google, along with a series of tools to package, install, and execute the app. The degrees of separation from the actual controlling terminal of the application, make it harder to just print debugging messages to the stdout and stderr streams and watch the output like one could do while hacking a regular program. For this reason, the android NDK provides a set of logging functions, which append the messages to a global log buffer. The log buffer can be viewed and followed remotely, from the development machine, over USB, by using the “adb logcat” tool.

This is all well and good, but if you’re porting a program which prints a lot of messages to stdout/stderr, or if you just like the convenience of using the standard printf/cout/etc functions, instead of funny looking stuff like __android_log_print(), you might wonder if there is a way to just use the standard I/O streams instead, right? Well so did I, and guess what… this is still UNIX so the answer is of course you can!

Read the rest of this entry »

OculusVR SDK and simple oculus rift DK2 / OpenGL test program

Edit: I have since ported this test program to use LibOVR 0.4.4, and works fine on both GNU/Linux and Windows.

Edit2: Updated the code to work with LibOVR 0.5.0.1, but unfortunately they removed the handy function which I was using to disable the obnoxious “health and safety warning”. See the oculus developer guide on how to disable it system-wide instead.

I’ve been hacking with my new Oculus Rift DK2 on and off for the past couple of weeks now. I won’t go into how awesome it is, or how VR is going to change the world, and save the universe or whatever; everybody who cares, knows everything about it by now. I’ll just share my experiences so far in programming the damn thing, and post a very simple OpenGL test program I wrote last week to try it out, that might serve as a baseline.

OculusDK2 OpenGL test program

Read the rest of this entry »

Fixing old bugs, without the source

Once upon a time, I made a special kind of demoscene production: a wedtro. Which is a kind of small demo, made as a present to some other member of the demoscene, who is getting married. This wedtro, turned out to be the buggiest piece of shit I’ve ever released, and it’s been bugging me for the past decade. Until today I decided to fix it.

Read the rest of this entry »

Plugin reloading for 3dsmax exporters

I revisited recently a dormant project of mine, for which I unfortunately need to write a 3dsmax exporter plugin.

Now, I’m always pissed off from the start when I have to write code on windows and visual studio, but having to deal with 3dsmax on top of that, really just adds insult to injury. It’s not just that maxsdk is a convoluted mess. Or that it needs a very specific version of visual studio to write plugins for it (which is really Microsoft’s fault, to be fair). No, my biggest issue so far is that 3dsmax takes about 3 years to start up, and there is no way to unload, or reload a plugin without restarting it.

Whenever I fix a tiny thing in the exporter plugin I’m writting, and I want to try it out and see if it does the buissiness, I have to shut down 3dsmax, start it up again (which takes forever), load my test scene, then try to export again and see what happens. This is obviously unacceptable, so I really had to do something about it.

Read the rest of this entry »

Calculating hierarchical animation transformation matrices

This is a short post with my thoughts on what’s the best way to design a transformation hierarchy for animation, and how the “ideal design” changes over time.

Bottom-up lazy evaluation magic

For a long time, I was a big fan of backwards (bottom-up) lazy evaluation of transformation hierarchies. Essentially having an XFormNode class for each node in the hierarchy with a get_matrix(long msec) function which calculates the current node’s transformation matrix for the current time, then calls parent->get_matrix(msec) and returns the concatenated matrix.

Matrix4x4 XFormNode::get_matrix(long msec) const
{
    Matrix4x4 xform;
    calc_matrix(&xform, msec);

    if(parent) {
        xform = parent->get_xform(msec) * xform;
    }
    return xform;
}

Of course, such a scheme would be wasteful if these matrices where calculated every time get_matrix functions are called. For instance if a node is part of a hierarchy, then its get_matrix will be called when we need to draw the object corresponding to this node, and also every time the get_matrix of any one of its descendants is called, due to the recursive bottom-up evaluation of matrices outlined previously. If one considers the posibility of drawing an object multiple times for various multi-pass algorithms the problem gets worse, with the limiting worse case scenario being if we’re doing ray-tracing which would require these functions to be called at the very least once per ray cast.

It follows then, that such a design goes hand in hand with lazy evaulation and caching of calculated node matrices. The XFormNode class would hold the last requested time and corresponding matrix, and when get_matrix is called, if the requested time matches the last one, we just return the cached matrix instead of recalculating it.

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
    if(msec == cached_msec) {
        return cached_matrix;
    }

    calc_matrix(&cached_matrix, msec);
    cached_msec = msec;

    if(parent) {
        cached_matrix = parent->get_xform(msec) * cached_matrix;
    }
    return cached_matrix;
}

This worked nicely for a long time, and it worked like magic. At any point I could just ask for the transform at time X and would get the matrix automatically, including any effects of hierarchy, keyframe interpolations, etc. It’s all good… until suddenly processors stopped getting faster any more, moore’s law went belly up, and after the shock passed we all sooner or later realised that single-threaded graphics programs are a thing of the past.

Multithreading pissed on my rug

In the brave new multithreaded world, lazy evaluation becomes a pain in the ass. The first knee-jerk reaction is to add a mutex in XFormNode, and lock the hell out of the cached matrices. And while that might be ok for an OpenGL program which won’t have more than a couple of threads working with the scene database at any point (since rendering itself can only be done safely from a single thread), it throws out of the window a lot of concurrency that can take place in a raytracer where at any point there could be 8 or more threads asking for the matrix of any arbitrary node.

A second way to deal with this issue is to have each thread keep its own copy of the matrix cache, keeping it in thread-specific memory. I’m shamed to admit I never got around to doing any actual performance comparisons on this, though I’ve used it for quite some time in my programs. In theory it avoids having to wait for any other thread to access the cache, so it should be faster in theory, but it needs a pthread_getspecific call in every get_matrix invocation which comes with its own overhead.

const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
    MatrixCache *cache = pthread_getspecific(cache_key); // cache_key created in the constructor for each node
    if(!cache) {
        // first time we need a cache for this thread we'll have to create it
        cache = new MatrixCache;
        pthread_mutex_lock(&cache_list_lock);
        cache_list.push_back(cache);
        pthread_mutex_unlock(&cache_list_lock);
        pthread_setspecific(cache_key, cache);
    }

    if(msec == cache->msec) {
        return cache->matrix;
    }

    calc_matrix(&cache->matrix, msec);
    cache->msec = msec;

    if(parent) {
        cache->matrix = parent->get_xform(msec) * cache->matrix;
    }
    return cache->matrix;
}

This works fine, and although we managed to avoid blocking concurrent use of get_matrix, we had to add some amount of overhead for thread-specific storage calls, and the code became much more complex all over the place: invalidations must also access this thread-specific storage, we need cleanup for the per-thread MatrixCache objects, etc.

Return of the top-down evaluation

So nowadays I’m starting to lean more towards the simpler, less automagic design of top-down evaluation. It boils down to just going through the hierarchy once to calculate all the matrices recursively, then at any point we can just grab the previously calculated matrix of any node and use it.

void XFormNode::eval(long msec)
{
    calc_matrix(&matrix, msec);

    if(parent) {
        matrix = parent->matrix * matrix;
    }

    for(size_t i=0; i<children.size(); i++) {
        children[i]->eval(msec);
    }
}

The simplicity of this two-pass approach is hard to overlook, however it’s just not as good for some things as my original ideal method. It works fine for OpenGL programs where it suffices to calculate transformations once per frame, it even works fine for simple raytracers where we have again a single time value for any given frame. However it breaks down for ray-tracers doing distribution ray tracing for motion blur.

No rest for the wicked

The best way to add motion blur to a ray tracer is through a monte-carlo method invented by Cook, Porter and Carpenter, called “distribution ray tracing“. In short, when spawining primary rays we have to choose a random time in the interval centered around the frame time and extending to the past and future, as far as dictated by the shutter speed of the camera. This time is then used both to calculate the position and direction of the ray, which thus might differ between sub-pixels if the camera is moving, and to calculate the positions of the objects we’re testing for intersections against. Then if we cast many rays per pixel and average the results, we’ll get motion blur on anything that moves significantly while the virtual shutter is open (example from my old s-ray renderer).

It’s obvious that calculating matrices once per frame won’t cut it with advanced ray-tracers, so there’s no getting rid of the complexity of the lazy bottom-up scheme in that case. Admittedly, however, caching won’t do much for us either because every sub-pixel will request the matrix at a different time anyway, so we might as well just calculate matrices from scratch all the time, and skip the thread-specific access overhead. The jury is still out on that one.

Do you have a favourite design for hierarchical animation code? Feel free to share it by leaving a comment below!

Simple color-grading for games

Color grading is an easily overlooked, but extremely powerful way to add character to a game. Subtle color changes make day-night cycling much more atmospheric. Different areas can have their own signature “feel” based on how saturated the colors are. Dark games can shift the unlit areas of an environment to cool bluish tint that can remain visible but still feel like darkness. The possibilities are endless.

I haven’t given much thought to color grading before, until a friend (Samurai), told me of an extremely simple and powerful way to add color grading to a game. So simple in fact, that I had to try it as soon as possible!

The idea has two parts. First the obvious bit: Use a 3D texture as a look-up table, to map the RGB colors produced by the renderer to a different set of RGB colors which is the color-graded output. That translates to pretty much the following GLSL post-processing fragment shader:

uniform sampler2D framebuf;
uniform sampler3D gradelut;

void main()
{
    vec3 col = texture2D(framebuf, gl_TexCoord[0].st).xyz;
    gl_FragColor = vec4(texture3D(lut, col).xyz, 1.0);
}

And now the brilliant bit: write a bit of code to save a screenshot of the game with the “identity” 3D lookup-table serialized in the last few scanlines. Give that screenshot to an artist, and let him work his magic, color-grading it in photoshop or whatever… Did you get that? In the process of color-grading that screenshot, the artist automatically produces the look-up table which can be used to reproduce the same color-grading in-game, as part of the last few scanlines of the image! Feed that palette back into the game and it’s automatically color-graded!

So I used the dungeon crawler I’ve been writing recently to try out this algorithm. The output of my dungeon crawler as it stands, is not the best material to try color-grading on as it’s already very dark and highly tinted, leaving too little space for tweaking without banding everything to oblivion, but nevertheless I wrote the code, gave the screenshot to my friend Rawnoise to play with it in photoshop for a couple of minutes, fed it back into the game, and the result can be seen below.

Lower-left part of the screenshot produced by the game, with the identity palette attached:
grade-palette

Screenshot of the game before and after color grading:

Original dungeon crawler output Dungeon crawler output after color grading test

Of course you can always opt for a completely bizarre effect just as easily. This is the result of me moving color curves in gimp randomly, and then feeding the resulting palette into the game:
bizarr-o-dungeon

Obviously this algorithm opens up all sorts of interesting possibilities, such as having two palettes and interpolating between them during sunset, or when a player crosses the boundary between two areas, etc. Simple, yet effective.

VSync-driven shutter glasses

All my previous stereoscopic attempts are fun and cool, but what I really wanted was to manage to connect my cheap-o shutter glasses to my computer, and use them for stereoscopic rendering. The main barrier is that consumer nvidia cards do not include a stereo port (unlike expensive quadros), and their drivers don’t support stereoscopic OpenGL visuals.

I had already side-stepped the second problem by writing stereowrap, an LD_PRELOAD-based tool that fakes OpenGL stereo contexts for GLX apps and presents the stereo pair in a number of ways, such as various anaglyphs, side-by-side, etc.

So at some point I decided to attack the first problem. Turns out there’s a simple way to drive shutter glasses. It’s a brilliant idea, and I didn’t come up with it, but it boils down to making a simple circuit that toggles the shutter glasses when it detects a pulse on the montior vsync wire!

Original vsync shutter glasses driver schematic.
vsync stereo driver prefboard prototype
I immediately designed a circuit based on this design, but modified to work with the signals expected by my ASUS VR-100 shutter glasses. Then I wired it up on a perfboard, and it worked like a charm! Finally I added a sequential stereo presentation method to stereowrap, synchronized with vsync, and suddenly I can view all my stereoscopic programs in awesome full-color stereo glory.

The downside to this simple contraption is that it doesn’t really know whether the left or the right image is presented at any given time, it only knows when to switch between them. That’s why the switch is included in the circuit: if the image appears wrong, and you can really tell by your brain attempting to blow up while looking at it, the switch can be used to flip the glasses around instantly. If however the application can’t catch up with the refresh rate of the monitor and misses a vsync interval the images will flip again.

I plan to build a more intelligent, microcontroller-based, driver circuit at some point. But for now, the simple vsync driver works well enough.