there are so many cores

Monthly Archives: October 2011

A JIT is more endogenous

I am extending and rewriting the GATLAS kernel generation to support mixed precision, vector length, memory buffer, and image arguments. The virtual machine model almost forces this. When a JIT generates code dynamically, it needs more freedom than math kernels from a static library do.

The original PeakStream didn’t support mixed precision (AFAIK). Older ATI GPUs were single precision only. Double precision was exclusive to the CPU.

I don’t know what the performance implications will be from this modification. It won’t run slower. It may run quite a bit faster. OpenCL on ATI GPUs is very sensitive to register pressure. Performance can vary dramatically.

Also, I’ve been debating whether the JIT should have an AOT (ahead of time) component. The answer is absolutely yes. Here’s why.

Let’s say you have a fleet of 1000 identical hosts with GPUs. If every virtual machine is independent, then there will be a lot of wasted JIT optimization. Each JIT must relearn the same optimal kernels. It is more efficient for the JITs to share a common understanding.

More conventional JITs gain from dynamically compiling hot functions and traces to native code. This avoids interpreter overhead. The optimization is in identifying the code to compile. This is not that expensive.

A GPU JIT (at least how I approach the problem) solves a much more expensive optimization problem. It would be combinatorial and terrible except that measured throughput happens to have a convex shape. This allows much more efficient search.
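To illustrate why a single-peaked throughput curve makes the search cheap, here is a minimal Python sketch of a ternary search over one integer tuning parameter. It is not the GATLAS search. The function `search_unimodal`, the `measure` callback, and the toy throughput curve are all hypothetical, and real timings are noisy, so an actual tuner would need repeated measurements per point.

```python
def search_unimodal(measure, lo, hi):
    """Find the integer in [lo, hi] that maximizes measure(x).

    Assumes measure has a single peak (strictly rising, then strictly
    falling). Returns (best_x, number_of_distinct_measurements).
    """
    cache = {}

    def f(x):
        # Each measurement stands in for an expensive kernel benchmark,
        # so never measure the same point twice.
        if x not in cache:
            cache[x] = measure(x)
        return cache[x]

    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            lo = m1 + 1   # the peak lies to the right of m1
        else:
            hi = m2 - 1   # the peak lies to the left of m2
    best = max(range(lo, hi + 1), key=f)
    return best, len(cache)


def toy_throughput(width):
    # Hypothetical stand-in for a measured throughput curve, peaking at 37.
    return -(width - 37) ** 2


best, evaluations = search_unimodal(toy_throughput, 1, 100)
print(best, evaluations)
```

An exhaustive sweep would measure all 100 candidates. The shape assumption lets the search discard about a third of the remaining range at every step, which is what keeps a multi-parameter tuning problem from blowing up.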

Still, it is not cheap. To fully characterize a GPU for matrix multiply, that is, to know the problem dimensions, vector lengths, etc. for which it runs fastest, takes many hours. There are also issues of platform stability. In the real world, device drivers and runtimes do crash. Labeling the kernels that cause the GPU to fail can be very expensive (as crashes often force a reboot).

This implies there should be an external persistent database that allows sharing between JITs. It is natural to preload that database ahead-of-time. So the JIT is really a hybrid with some AOT optimization too.
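A minimal sketch of what such a shared store could look like, using Python's built-in `sqlite3` module as a stand-in for whatever database a real fleet would share. The table layout, the function names (`open_db`, `record`, `best_params`, `is_known_bad`), and the device and parameter strings are all assumptions made for illustration, not the project's actual design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS kernel_results (
    device  TEXT NOT NULL,               -- e.g. a GPU model name
    kernel  TEXT NOT NULL,               -- e.g. 'sgemm'
    params  TEXT NOT NULL,               -- encoded tuning parameters
    gflops  REAL,                        -- NULL if never measured
    crashed INTEGER NOT NULL DEFAULT 0,  -- 1 if this variant took the GPU down
    PRIMARY KEY (device, kernel, params)
)
"""


def open_db(path):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record(db, device, kernel, params, gflops=None, crashed=False):
    # One row per kernel variant; a later measurement overwrites an earlier one.
    with db:
        db.execute(
            "INSERT OR REPLACE INTO kernel_results VALUES (?, ?, ?, ?, ?)",
            (device, kernel, params, gflops, int(crashed)),
        )


def best_params(db, device, kernel):
    # Fastest known variant that did not crash, or None if nothing is known.
    return db.execute(
        "SELECT params, gflops FROM kernel_results "
        "WHERE device = ? AND kernel = ? AND crashed = 0 "
        "AND gflops IS NOT NULL ORDER BY gflops DESC LIMIT 1",
        (device, kernel),
    ).fetchone()


def is_known_bad(db, device, kernel, params):
    # Lets a JIT skip variants that another host already saw crash.
    row = db.execute(
        "SELECT crashed FROM kernel_results "
        "WHERE device = ? AND kernel = ? AND params = ?",
        (device, kernel, params),
    ).fetchone()
    return bool(row and row[0])


# Preload ahead of time with hypothetical results, then query like a JIT would.
db = open_db(":memory:")
record(db, "gpu-a", "sgemm", "vec=4,block=8", gflops=900.0)
record(db, "gpu-a", "sgemm", "vec=2,block=8", gflops=600.0)
record(db, "gpu-a", "sgemm", "vec=8,block=16", crashed=True)
print(best_params(db, "gpu-a", "sgemm"))
```

Storing the crash label next to the timings is the point: a reboot-inducing kernel only has to be discovered once for the whole fleet.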


Short talk at SIAM Parallel Processing 2012

I am giving a 25-minute mini-symposium talk at SIAM Parallel Processing 2012 next year. I have no idea what that really means. This is my first conference.

To be honest, when I was on an academic track as an applied/computational math PhD, I didn’t have any original ideas. I certainly did not have an opinion about the future. I was very young. And then I dropped out, twice, abandoning math entirely to work as a programmer during the Dot-com bubble.

No software development the last two weeks. I was mostly reading Hull. It’s not a book mathematicians would like. It is a good cultural overview of basic concepts in financial engineering. The narrative reminds me of Stroustrup (specifically the 2nd edition that reads like a stream of consciousness encyclopedia).

At my last real job, I was a “support quant” for a very large retail supply chain. After three months of Geanakoplos, Hull, and research papers, I have a perspective on what my job really was (and more importantly, what it was not). It was all very new for me and outside of “my cultures” of applied mathematics and software engineering. So I read a lot more into what we were doing there than was actually the case.

Anyway, I haven’t written my talk yet. I already have enough material from the work last year with GATLAS. Now that this is being integrated into an open source (clean room) PeakStream clone, there is potentially more to talk about. And last year, I had only a vague notion of auto-tuning as a statistical procedure. There’s much more today than last year.

Better get back to work. I expect it will take a week to integrate the GATLAS style auto-tuning “as-is” into the current code.

GATLAS stuff is working ok

It wasn’t as bad as I thought. The code in GitHub from last year is ok.

There were two bugs in the new stuff. Just careless details with enqueued memory transfers. Everything else was fine. In fact, the new stuff is slightly faster (as much as 5% for small matrices).

GATLAS enqueues kernels as an ordered sequence using event objects. Each kernel event depends on the previous one. The new code takes a simpler approach with independent events. There is no difference on ATI GPUs as OpenCL kernels execute sequentially anyway. (However, NVIDIA may support concurrent kernel execution so I’ll have to do things a little differently.)
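The difference between the two event strategies can be shown with a toy timing model in plain Python. It makes no OpenCL calls. The functions `chained_wait_lists`, `independent_wait_lists`, and `finish_times` are hypothetical, and the model assumes a device either runs one kernel at a time or runs any number at once, which is a simplification.

```python
def chained_wait_lists(n):
    # Kernel i waits on the event of kernel i - 1 (an ordered sequence).
    return [[] if i == 0 else [i - 1] for i in range(n)]


def independent_wait_lists(n):
    # No kernel waits on any other kernel's event.
    return [[] for _ in range(n)]


def finish_times(wait_lists, durations, concurrent):
    """Finish time of each kernel, in enqueue order.

    concurrent=False models a device that executes kernels one at a time.
    concurrent=True models a device with unlimited concurrent kernels.
    """
    finish = []
    device_free = 0.0
    for deps, duration in zip(wait_lists, durations):
        ready = max((finish[j] for j in deps), default=0.0)
        start = ready if concurrent else max(ready, device_free)
        end = start + duration
        finish.append(end)
        device_free = max(device_free, end)
    return finish


durations = [1.0, 2.0, 3.0]  # hypothetical kernel run times
for concurrent in (False, True):
    chained = finish_times(chained_wait_lists(3), durations, concurrent)
    independent = finish_times(independent_wait_lists(3), durations, concurrent)
    print(concurrent, chained, independent)
```

On the serial device both strategies finish at the same times, which matches the observation above. On the concurrent device the independent kernels overlap, so they are only safe if the kernels do not read each other's output.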

My GPU server is an unusual configuration as it typically has both ATI and NVIDIA GPUs running at the same time. Most users have a single ATI or NVIDIA GPU. If they have multiple GPUs, these will be the same model and vendor. My development environment has been at the other extreme of heterogeneous GPUs.

It’s actually a hack to get an ATI and NVIDIA GPU to run at the same time on one host. A few years ago, gamers started using ATI GPUs for graphics and NVIDIA GPUs for game physics. To stop this, the NVIDIA driver checks for an ATI GPU and won’t start if it detects one.

But wait, there’s more. The ATI driver depends on a running X server. When NVIDIA changed the driver to add the check for an ATI GPU, they also made it depend on X. Both vendors’ GPU drivers depend on the same libraries. Installing the NVIDIA driver modifies those libraries in a way that prevents the ATI driver from running (and vice versa)!

I wouldn’t say this is deliberate so much as a rare use-case. Very few users will use an ATI and NVIDIA GPU on the same host. Most users only have one GPU. If they have more than one, then it is two, three or four of the same make and model.

Anyway, I was confused earlier as I compiled the code against the ATI OpenCL SDK but was running against the NVIDIA GPU. It’s an easy mistake to make. While both vendors’ OpenCL runtimes can see all GPUs, they can only correctly work with their own hardware. A consequence: without dynamic loading and linking magic, it is not possible for a process to use an ATI and NVIDIA GPU at the same time. A process must be linked to one or the other.

I know. That was confusing.

The good news is that stuff looks like it works. Now I have to connect the XGEMM and XGEMV to the PeakStream matmul() through the JIT. On an ATI 5870, my SGEMM exceeds 1400 GFLOPS. DGEMM exceeds 350 GFLOPS. With auto-tuning, performance scales linearly across the Evergreen architecture (5000 series; I have not tested on 6000 series hardware). Of course, things slow down a lot when counting PCIe bus data transfers. One way of looking at something like PeakStream is as a very complex memory manager to minimize that data movement back and forth between host and GPU.
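A rough back-of-the-envelope sketch of how bus transfers eat into a GEMM rate. It uses the usual convention of 2·M·N·K floating point operations for a matrix multiply, takes 1400 GFLOPS as the kernel-only rate, and assumes a bus bandwidth of 8 GB/s, a 4096 square problem, and no overlap between transfer and compute. Those last three are illustrative assumptions, not measurements, and `effective_gflops` is a hypothetical helper.

```python
def gemm_flops(m, n, k):
    # One multiply and one add per inner-product term.
    return 2.0 * m * n * k


def effective_gflops(m, n, k, kernel_gflops, bus_gb_per_s, bytes_per_elem=4):
    """GFLOPS seen by the host when A and B are sent and C is read back.

    Assumes transfers and compute do not overlap.
    """
    flops = gemm_flops(m, n, k)
    compute_s = flops / (kernel_gflops * 1e9)
    transfer_bytes = bytes_per_elem * (m * k + k * n + m * n)
    transfer_s = transfer_bytes / (bus_gb_per_s * 1e9)
    return flops / (compute_s + transfer_s) / 1e9


n = 4096  # hypothetical square problem size
rate = effective_gflops(n, n, n, kernel_gflops=1400.0, bus_gb_per_s=8.0)
print(round(rate))
```

The penalty grows as matrices shrink, because compute scales with the cube of the dimension while transfer scales with the square. That is why keeping data resident on the GPU between operations matters so much.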

Astounded by GATLAS

Finished porting over GATLAS. Now I’m testing on my GPU development/test/compute server (a headless gaming PC with an ATI 5870 and NVIDIA 480 in it right now).

Here’s the crazy thing.

The GATLAS code committed to GitHub a year ago is wrong. It doesn’t work. The correct code is sitting on the GPU host. This is a big surprise. I mean, I am astounded.

It’s kind of embarrassing to admit. I haven’t booted the GPU compute server in about a year.

Although this project is a clone of PeakStream, which is all about GPU-based HPC, I have spent the last year working on the virtual machine and the JIT front and middle ends. There was never any need for a GPU device. I was struggling with more basic problems (and life stuff).

Even more amusing – the lack of any complaints about the obviously broken code in GitHub (it runs but if you check the output, it is mostly wrong) confirms what I concluded about GATLAS – it’s so complex that no one except me can use it.

Oh well, you know, I am not really promoting this project at all now. I have my own plans for the technology. It’s sort of what PeakStream envisioned except for financial quants.

So the traders, they generally keep their technology secret. Why release this? Here’s an answer an economist might like: I think there is room in the world for positive as well as negative externalities!