there are so many cores


Monthly Archives: January 2012

Heterogeneous computing is hard for compilers

I was overly optimistic with the PeakStream conjugate gradient sample code. It only works for some OpenCL compute devices and depends on the vendor SDK used. It works on a Core 2 Duo with the AMD SDK. It works on a Core i7 920 with the Intel SDK. It doesn’t work properly on any other combination I’ve tried. Specifically, it’s not working on the Radeon HD 5870 and GTX 480. I have a few more discrete GPUs to test with too.

This makes me suspicious of my code.

This last week was more cleanup. The build is much better. The configure script is quite a bit more robust. Logging in the virtual machine is done through an object (instead of print statements scattered everywhere) and can be added or removed as a preprocessor option. I need this to follow what’s going on.
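As a rough sketch of what a preprocessor-switched logging object can look like (the class and macro names here are invented for illustration, not Chai's actual ones):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Toggle at build time, e.g. -DVM_LOGGING=0 strips logging entirely.
#ifndef VM_LOGGING
#define VM_LOGGING 1
#endif

class VMLogger {
public:
    // Format one entry as "[component] message".
    static std::string format(const std::string& where, const std::string& msg) {
        std::ostringstream os;
        os << "[" << where << "] " << msg;
        return os.str();
    }

    // Emit to stderr only when logging is compiled in.
    static void log(const std::string& where, const std::string& msg) {
#if VM_LOGGING
        std::cerr << format(where, msg) << std::endl;
#else
        (void)where; (void)msg;  // calls compile away to nothing
#endif
    }
};
```

The point is that every log site goes through one object, so the whole facility can be compiled in or out with a single switch instead of hunting down print statements.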

Now that I’ve started regularly testing with the big three vendors (AMD, Intel, NVIDIA), I begin to suspect that GPGPU compilers cheat. I know I did. There are certainly happy paths through my JIT. It’s easy to write code that will cause the compiler to fail or generate something wrong.

Programming with OpenCL on a compute device with one vendor SDK gives a different impression than developing for half a dozen compute devices with three vendor SDKs. When you see a single failure, you just think it’s an isolated bug. When you see a pattern of failures, then there is something systematic going on.

It is difficult to write an OpenCL compiler that generates fast and correct code in all cases. Every vendor has the same problem. It’s hard for everyone.

Changing the subject…

Yesterday, I had a moment of inspiration. I saw the next evolutionary step for this project (besides making it work right).

The virtual machine and JIT must themselves be configurable and generated. This doesn’t necessarily mean metaprogramming. It may be something more like Buildroot. It certainly does not mean metacircularity.

This is necessary to support the market trend towards CPU/GPU hybrids (e.g. AMD Fusion APUs). GPUs are starting to become standard in low-end portable and mobile devices. Intel processors incorporate graphics now. Power and memory efficiency are important. That’s why customized virtual machines and JITs that strip out everything an application doesn’t need make sense (if nothing else, this reduces the memory footprint).

The functional guys would say this is nothing new. Everything is language, including the software platform itself. I’m not trying to go that far.

Why didn’t Java do this? At my first job out of grad school, there was a top secret project for one of the big telecoms companies on a Java phone. Remember, this was back in 1997! So the prototype was a mobile handset tethered to a workstation under a desk. People made jokes about this. (Ten years later, that workstation really did fit inside the handset!)

There was a lot of interest in embedded Java back then. Actually, that was where Java started, with Oak. It was originally intended to be an embedded platform.

The issue with Java, then as now, is that the footprint is large. Back in 1997, it was too fat to fit in a phone. So there were cut-down versions of Java available for embedded applications. They were so reduced from the full language that it made more sense just to write in C/C++ (which is probably what everyone did).

What they should have done is create a “Buildjava” that allowed configuring a runtime platform tailor made to a user’s embedded application. Maybe that was just too hard. Also, it works against the “write once, run anywhere” model. Everyone would speak their own dialect of Java.

So people just waited a few years until the hardware became strong enough to support their software. Still, ten years is a long time to wait.


Fixed the build for a good out-of-box experience

I regularly copy source code tarballs off-site and to several external mass storage drives. In case of disaster, no more than a day’s work will be lost.

Yes… perhaps I should be pushing to GitHub instead of checkpointing tarballs. That would serve as a revision controlled off-site backup service. One thing though: even GitHub is not completely safe. I have another project hosted there which lost history from a big outage in 2010. Not complaining at all, only love for git and GitHub.

The real risk is development downtime. I do almost all work on an X-series ThinkPad that is several years old. There should be a backup if that laptop fails.

So I wiped an old Pentium M laptop, installed Ubuntu 10.04 LTS, and set up a duplicate development environment on it. This was an excellent exercise. I found that the Chai build hardcoded x86_64 and lib64 in places. Even worse, it relied on having both the AMD/ATI and NVIDIA OpenCL SDKs installed in order to work!

This is all fixed now. The build follows the standard “./configure; make” idiom. It’s also robust: it should handle any combination of AMD/ATI and NVIDIA SDKs, always picking the latest release versions from both vendors (yet is very easy to change manually), and both 32- and 64-bit processor architectures. So much nicer.

Even with good software, when an automatic configure/build breaks, I am annoyed. It’s a bad out-of-box experience. Fail.

I wanted to try Intel’s OpenCL, which can auto-vectorize (Chai does too!). My experience is limited to AMD/ATI and NVIDIA. Unfortunately, even my x86_64 ThinkPad is too old; its CPU is unsupported.

The Core i7 920 in my GPU compute server is supported. So I’m going to try Chai with a “shader compiler” designed for a modern (hyper-threaded) multi-core CPU. I’ll see what happens tonight.

Another PeakStream clone in Sweden

Gpu Systems is a Swedish company with a managed platform for GPGPU. Here’s their conjugate gradient example. It looks a lot like PeakStream.

I’m not alone.

Parallel evolution – this is interesting. We both saw the same thing in the market.

Our design biases were somewhat different. Libra SDK chose syntactic elegance and foreign platform integration. Chai chose robust extensibility and multi-thread/device scheduling.

Libra SDK is modeled after commercial SDK products.

Chai is modeled after open source languages like Perl and Ruby.

Different visions from the same origins:

  1. GPGPU research at Stanford under Hanrahan produces Brook
  2. Silicon Valley startup PeakStream is the next generation vision after Brook
  3. Google acquires PeakStream in 2007 and discontinues the products
  4. OpenCL appears as a vendor portable GPGPU language

So the vision was there five years ago. But the development costs were too high for the return on investment. To build a universal managed platform for GPGPU, you had to write your own shader compiler back-ends. That’s why PeakStream worked for two years and had grown to 35 people by the end.

Today, the technology stack is more mature. It’s a lot cheaper to do this.

This actually scares me a little. At the bleeding edge, technology is moving very quickly. Ideas I think of as science fiction are much more likely to appear just a few years from now.

One conclusion I’ve reached is that hand crafting a virtual machine and JIT compiler in C++ is crazy. That’s what I’ve done.

Writing complex systems in C++ is like building a house without power tools. The labor is enormous. This doesn’t mean the system shouldn’t be built on C++, just that it should not be written in C++.

I have a deeper appreciation for PyPy and why it is interesting. End users think of compatibility and performance. Virtual machine and compiler engineers look at development cost. Everyone has a different utility function. Something that is good but costs too much to make and maintain is not sustainable. That’s why writing a virtual machine and JIT in a managed language makes sense (not necessarily metacircular, as with PyPy and RPython).

However, my first priority is useful tools. After the first release (soon; there will be something by the time I talk at PP12), I should refactor the code. There are multiple .cpp files that are thousands of lines long. Now you see what I’m talking about. I know… this is bad. I agree. But it’s different when you are fighting to build something you have never seen before rather than maintaining legacy IT applications in the enterprise.

Finally working: PeakStream conjugate gradient demo code

I’ve been working on the memory management and JIT interactions this last week. The PeakStream conjugate gradient demo code works correctly now (compiled by the JIT to OpenCL and not only interpreted as before). So do the rest of my (quite meager) set of samples. It’s all working.

// Sample code from a presentation given by Matthew Papakipos at Stanford in
// 2007 (page 13 of "Public Google PeakStream.ppt")
int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
{
    int iter;
    Arrayf32 x = Arrayf32::zeros(N);
    {
        Arrayf32 A = Arrayf32::make2(N, N, cpuA);
        Arrayf32 b = Arrayf32::make1(N, cpub);
        Arrayf32 residuals = b - matmul(A, x);
        Arrayf32 p = residuals;
        Arrayf32 newRR = dot_product(residuals, residuals);

        for (iter = 0; iter < N; iter++) {
            Arrayf32 oldRR = newRR;
            Arrayf32 newX, newP, newResiduals;
            Arrayf32 Ap = matmul(A, p);
            Arrayf32 dp = dot_product(p, Ap);
            newX = x + p * oldRR / dp;
            newResiduals = residuals - Ap * oldRR / dp;
            newRR = dot_product(newResiduals, newResiduals);
            newP = newResiduals + p * newRR / oldRR;

            p = newP;
            residuals = newResiduals;

            float oldRRcpu = oldRR.read_scalar();
            if (oldRRcpu <= TOLERANCE)
                break;

            x = newX;
        }
    }
    x.read1(cpux, N * sizeof(float));

    return iter;
}
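For reference, here is the same algorithm as a plain-CPU sketch, using std::vector in place of Arrayf32. This is for sanity-checking results, not the PeakStream API; the function name and the TOLERANCE value are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

static const float TOLERANCE = 1e-6f;  // assumed convergence threshold

// Conjugate gradient on the CPU with the same update structure as above.
int Conj_Grad_CPU(int N, const float *A, float *x, const float *b)
{
    std::vector<float> r(b, b + N), p(r), Ap(N);
    for (int i = 0; i < N; i++) x[i] = 0.0f;   // x = zeros(N), so r = b

    float newRR = 0.0f;
    for (int i = 0; i < N; i++) newRR += r[i] * r[i];

    int iter;
    for (iter = 0; iter < N; iter++) {
        const float oldRR = newRR;
        if (oldRR <= TOLERANCE)
            break;

        // Ap = matmul(A, p); dp = dot_product(p, Ap)
        float dp = 0.0f;
        for (int i = 0; i < N; i++) {
            Ap[i] = 0.0f;
            for (int j = 0; j < N; j++) Ap[i] += A[i * N + j] * p[j];
            dp += p[i] * Ap[i];
        }

        const float alpha = oldRR / dp;
        newRR = 0.0f;
        for (int i = 0; i < N; i++) {
            x[i] += alpha * p[i];              // newX
            r[i] -= alpha * Ap[i];             // newResiduals
            newRR += r[i] * r[i];
        }
        const float beta = newRR / oldRR;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    }
    return iter;
}
```

For a small symmetric positive definite system this converges in at most N iterations, which makes it a cheap cross-check against the GPU path.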

This example is relatively complicated and caused me some grief. I’ve learned lessons about code generation for GPUs, though they are still not entirely clear to me. So I won’t go into details. I can say that the solutions I ended up with are unexpected.

So maybe that is one lesson I can relate: have an open mind about solutions. What is unorthodox may simply not appear sensible at first. If a solution is cheap to try, it is worth investigating.

Back to solving puzzles and having more fun

I submitted the final presentation slide deck to SIAM for the conference. There’s no NDA or other restriction on posting them elsewhere. However, it seems a little rude to do that before the conference.

I joined and RSVP’ed to the Mountain View HPC-GPU-Supercomputing Meetup. The next meeting is January 30th at Carnegie Mellon University Silicon Valley – that seems weird to me. Whenever I hear CMU, I think of back East.

With the presentation ready for the conference, I can get back to doing technical stuff.

The way I handled vectorization is kind of a hack. Here’s where NVIDIA makes it easy and ATI makes it hard. You don’t (always?) need to vectorize code on NVIDIA to have good performance (although it’s better when you vectorize). There’s no option on ATI. Vectorization is required for acceptable performance.

This leads to issues with mixtures of vector lengths when the JIT is generating code. OpenCL anticipates this (as you would expect) by making it easy to slice vector elements. For example,

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void foo(__global float4 *a, __global double2 *b, __global float4 *c)
{
    a[get_global_id(0)].lo = convert_float2(b[2*get_global_id(0)]) + c[get_global_id(0)].lo;
    a[get_global_id(0)].hi = convert_float2(b[2*get_global_id(0)+1]) + c[get_global_id(0)].hi;
}

It’s not rocket science. However, if you don’t have an elegant solution, a hacked approach ends up somewhat ugly.

Another issue is with continuation. I didn’t really understand what that meant before. Actually, it was in writing the presentation that I realized there are deeper issues around this with GPUs.

When we think of an execution trace, most of us don’t initially think of it (really a boxed calculation) as having much state beyond the historical record. It’s something that returns a value when evaluated. This viewpoint is misleading. Traces, at least for this platform, have a lot of contextual state. They rely on side-effects.

Traces stick to devices. The cost of data movement with GPUs is so high that once a trace is scheduled there, the memory objects associated with it really should not move unless absolutely necessary.

This implies that traces inherit memory. A trace is really a sequence of traces (which are vectorized when scheduled). They all share the context of the compute device on which they are scheduled.
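A minimal sketch of that rule (all names here are invented, not Chai’s actual types): a trace records the device it was scheduled on, and a continuation trace inherits both the device and the device-resident memory objects instead of migrating them.

```cpp
#include <cassert>
#include <vector>

// Illustration only: a trace remembers its compute device, and the
// memory objects (buffer handles) created on it.
struct Trace {
    int deviceId;                   // -1 until the scheduler picks a device
    std::vector<int> memObjects;    // handles to buffers resident on it
    Trace() : deviceId(-1) {}
};

// Scheduling a continuation: stick to the parent's device and share its
// memory objects, so no data moves across devices.
Trace continueTrace(const Trace& parent)
{
    Trace t;
    t.deviceId = parent.deviceId;
    t.memObjects = parent.memObjects;
    return t;
}
```

The design choice being illustrated is that device affinity is part of a trace’s state, not a per-evaluation scheduling decision.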

Anyway, it’s nice to return to solving puzzles.

New presentation draft for SIAM PP12

The presentation slide deck and last rehearsal (21MB 23min Ogg Vorbis) from yesterday work much better. It’s still rough in spots. However, now there is a consistent narrative, a story, for the audience to follow.

Preparing for this presentation has really helped me understand what I’ve done over the last one to two years. So much has happened, both in real life and with the engineering involved in this project, that I lost track.

My first brain dump for this presentation reflected this. It lacked a cohesive story because that is what happened in real life. Nothing made much sense at the time, even if the technology turned out to make sense in the end.

That’s what I realized – good stories are not real life. They are the concentrated, distilled essence of experience. In the real world, there isn’t always a good story. At best, there may be many stories, disjointed and woven together.

Rehearsal for SIAM PP12

I wrote a deck of presentation slides (PDF) over the last five days. Yesterday, I spent a few hours rehearsing. Here’s the last take (22MB 25min Ogg Vorbis). Feedback from friends has been very constructive.

It’s overwhelming. People are confused. Am I just snowing the audience? Everyone will be lost.

Today, I’m going for a long walk.