there are so many cores

Monthly Archives: September 2011

Matrix multiply ported

Matrix multiply is working.

This brings back bad memories… Every compute device understands a different dialect of OpenCL. A kernel that works perfectly on one compute device may fail on another, with anything from garbage output to a crashed driver. Kernel performance depends on the specific combination of compute device, runtime SDK, and device driver version.

The ugly reality of OpenCL is that while it is a step in the right direction, it is not very portable in practice. The vendors don’t like to mention this.

That’s why performance tuning is a lot of work. It’s also why auto-tuning is attractive – let a compiler do the tedious search for the fast kernels. Easier said than done.
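A minimal sketch of that search in Python. This is not GATLAS: the parameter names (`block`, `unroll`) and the `fake_measure` cost model are hypothetical stand-ins for building and timing a real kernel on a real device, and returning `None` stands in for a variant that fails on that device.

```python
import itertools


def autotune(param_space, measure):
    """Try every parameter combination and keep the cheapest one.

    param_space: dict mapping a parameter name to its candidate values.
    measure: callable taking a dict of parameters and returning a cost
             (for example seconds), or None if the variant fails.
    """
    names = sorted(param_space)
    best_params, best_cost = None, float("inf")
    for values in itertools.product(*(param_space[n] for n in names)):
        params = dict(zip(names, values))
        cost = measure(params)
        if cost is None:
            # Variant failed to build or produced bad output: skip it.
            continue
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


def fake_measure(params):
    # Hypothetical cost model standing in for a timed kernel run.
    if params["block"] == 16 and params["unroll"] == 8:
        return None  # pretend this variant crashes the driver
    return abs(params["block"] - 8) + 1.0 / params["unroll"]


space = {"block": [4, 8, 16], "unroll": [1, 2, 4, 8]}
best, cost = autotune(space, fake_measure)
```

Exhaustive search is the simplest strategy and only works for small parameter spaces. The tedious part in practice is that the winning parameters have to be found again for each device, SDK, and driver combination.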

My plan is to first port over all of GATLAS as-is. Then it will be connected to the runtime and JIT. That should be pretty cool.

Porting GATLAS

GATLAS is perhaps 30% integrated. It’s more work than I expected. My notion of rewriting this to make it slimmer now seems ridiculous.

I made some changes to the JIT so auto-tuning can happen before kernels must be scheduled. This allows the JIT to make better decisions about the use of memory buffers, images, and vectorization. These are the kinds of memory optimizations you would otherwise make by hand.

The main thing now is porting the code over. The wrapper library around OpenCL has changed. OpenCL is kind of like Xlib (which no one still working in the software industry remembers). It’s a good C API but generally too awkward to use directly. Applications use a higher-level wrapper library.

So like others, I wrote an OO wrapper around OpenCL. When I wrote GATLAS, the wrapper was very C-like with integer handles to objects maintained by the runtime. Since then, it has been rewritten to be more object-based. (I don’t want to say object-oriented as it really is not.)
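To illustrate the difference between the two wrapper styles, here is a toy sketch in Python. It does not call OpenCL, and the class and method names (`HandleRuntime`, `create_buffer`, `Buffer`) are invented for illustration, not taken from the actual wrapper.

```python
class HandleRuntime:
    """C-like style: the runtime owns the objects, callers hold integers."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 0

    def create_buffer(self, size):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def buffer_size(self, handle):
        return len(self._buffers[handle])

    def release(self, handle):
        del self._buffers[handle]


class Buffer:
    """Object-based style: the caller holds the object itself."""

    def __init__(self, size):
        self._data = bytearray(size)

    def size(self):
        return len(self._data)


runtime = HandleRuntime()
handle = runtime.create_buffer(64)   # an integer, meaningful only to runtime
buf = Buffer(64)                     # an object that carries its own state
```

With integer handles, every call goes back through the runtime's table. With objects, the state travels with the value, which tends to read more cleanly at the call site.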

That’s why GATLAS must be ported. It’s something that had to happen anyway. In terms of performance and binary size, it should be about the same. However, it’s cleaner and easier to read.

GATLAS bolts onto the kernel cache nicely

Diagramming the JIT turned out to be more useful than expected. I was imagining that plugging in GATLAS would be ugly. It looks like it naturally bolts onto the kernel cache (note the red box). Only minor surgery is required inside the JIT core.
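One way to picture an auto-tuner bolted onto a kernel cache is a cache that calls the tuner only on a miss. The sketch below is a guess at the shape, not the actual JIT code: the cache key (device plus matrix dimensions), the `toy_tuner` function, and the kernel name string are all assumptions made for illustration.

```python
class KernelCache:
    """Cache of tuned kernels that consults a tuner only on a miss."""

    def __init__(self, tuner):
        self._tuner = tuner
        self._kernels = {}
        self.misses = 0

    def lookup(self, device, m, n, k):
        key = (device, m, n, k)
        if key not in self._kernels:
            # Cache miss: pay the tuning cost once for this key.
            self.misses += 1
            self._kernels[key] = self._tuner(device, m, n, k)
        return self._kernels[key]


def toy_tuner(device, m, n, k):
    # Hypothetical: a real tuner would benchmark kernel variants on the device.
    block = 8 if min(m, n, k) >= 8 else 4
    return f"matmul_{device}_{m}x{n}x{k}_block{block}"


cache = KernelCache(toy_tuner)
first = cache.lookup("gpu0", 64, 64, 64)    # miss: runs the tuner
second = cache.lookup("gpu0", 64, 64, 64)   # hit: reuses the tuned kernel
```

The rest of the system only ever calls `lookup`, so the tuner can be swapped or removed without touching the callers.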

By the way, the code in GitHub is more than two months old and very much out of date. If all goes well, the next commit should have a baseline fast matrix multiply integrated. However, I do not expect optimal sustained performance without modifications to the scheduler and memory manager.

Diagram the design a little

After this last pause in development, I found myself asking, “How does this work?” The design has become quite complex. Any time away and it fades from memory.

So I am spending time to create design artifacts.

[Figure: flow from user application to the scheduler]

Interaction between scheduler threads:

wait(ptid, trace)        _traceMap        _boss        _work

   ----------lock---------->
   ---SingleTrace(trace)--->
   -------------------lock------------------>
   ------------------signal-----------------> WAKE UP
   ------------------unlock----------------->
   WHILE FLAG
   ----------wait---------->
                            <------lock------
                                              COLLATE
                            <-----unlock-----
                                             ----lock---->
                                              ENQUEUE
                                             --broadcast-> WAKE UP
                                             ---unlock--->
                                              SLEEP        COLLATE
                                                           DISPATCH
                                                           HISTORY
                            <------------lock-------------
                                                           REMOVE FLAG
                            <----------broadcast----------
                            <-----------unlock------------
                                                           SLEEP
   ---------unlock--------->
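The handshake in the diagram can be sketched with three condition variables in Python. This is a toy model under stated assumptions, not the real scheduler: the `Scheduler` class, the `dispatch` callback, and every field name are hypothetical, and the HISTORY step is left out. It keeps the lock order from the diagram: the caller takes the trace map lock and then the boss lock, while the boss and worker never hold two locks at once.

```python
import threading


class Scheduler:
    """Toy boss/worker scheduler following the lock and wake-up order above."""

    def __init__(self, dispatch):
        self._dispatch = dispatch                 # hypothetical: runs one trace
        self._trace_cv = threading.Condition()    # plays the role of _traceMap
        self._boss_cv = threading.Condition()     # plays the role of _boss
        self._work_cv = threading.Condition()     # plays the role of _work
        self._new = {}          # ptid -> trace, submitted but not yet collated
        self._flags = set()     # ptids whose callers are still blocked
        self._results = {}      # ptid -> result, filled in by the worker
        self._queue = []        # collated (ptid, trace) pairs for the worker
        self._signaled = False
        self._stop = False
        self._threads = [threading.Thread(target=self._boss_loop, daemon=True),
                         threading.Thread(target=self._work_loop, daemon=True)]
        for t in self._threads:
            t.start()

    def wait(self, ptid, trace):
        with self._trace_cv:                      # lock _traceMap
            self._new[ptid] = trace
            self._flags.add(ptid)
            with self._boss_cv:                   # lock, signal, unlock _boss
                self._signaled = True
                self._boss_cv.notify()
            while ptid in self._flags:            # WHILE FLAG
                self._trace_cv.wait()
            return self._results.pop(ptid)

    def _boss_loop(self):
        while True:
            with self._boss_cv:                   # SLEEP until signaled
                while not self._signaled and not self._stop:
                    self._boss_cv.wait()
                if not self._signaled:
                    return                        # stopping, nothing pending
                self._signaled = False
            with self._trace_cv:                  # COLLATE
                batch = list(self._new.items())
                self._new.clear()
            with self._work_cv:                   # ENQUEUE, then broadcast
                self._queue.extend(batch)
                self._work_cv.notify_all()

    def _work_loop(self):
        while True:
            with self._work_cv:                   # SLEEP until work arrives
                while not self._queue and not self._stop:
                    self._work_cv.wait()
                if not self._queue:
                    return                        # stopping, queue drained
                batch, self._queue = self._queue, []
            done = {ptid: self._dispatch(trace) for ptid, trace in batch}
            with self._trace_cv:                  # REMOVE FLAG, then broadcast
                self._results.update(done)
                self._flags.difference_update(done)
                self._trace_cv.notify_all()

    def shutdown(self):
        # Call only after every wait() has returned.
        with self._boss_cv:
            self._stop = True
            self._boss_cv.notify_all()
        with self._work_cv:
            self._work_cv.notify_all()
        for t in self._threads:
            t.join()


sched = Scheduler(dispatch=sum)
total = sched.wait(1, [1, 2, 3])   # blocks until the worker clears the flag
sched.shutdown()
```

The `while ptid in self._flags` loop matters: a broadcast wakes every blocked caller, so each one has to recheck its own flag before returning.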

[Figure: flow from the scheduler to devices]

The complex stuff is under the translator inside the jit/ module. I want to spend a little time diagramming the JIT before integrating GATLAS.

Integrating GATLAS with minimal modifications

No software development this last month… I was studying standard financial theory. I had to learn more about finance to validate my vision for quantitative GPGPU. If my vision were wrong, I might be working toward a dead end. (By the way, Professor Geanakoplos’ course is awesome. It’s a lot of fun. He is a lively speaker with deep practical experience. The math isn’t hard, but there is enough of it to derive results more rigorously if you want.)

The production black box systems I have seen were designed from a viewpoint of statistical filtering and clustering (and did not use any GPUs). Computationally, this required solving regression problems with training data while trying to avoid overfitting. The arithmetic intensity arose from machine learning.

Option pricing is the well-known financial application. Prices are calculated by backward induction over a tree or lattice. What are the transition probabilities due to uncertainty inside the tree? They must be found with machine learning from historical data.
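As a concrete instance of backward induction, here is a minimal sketch that prices a European call on a recombining binomial tree. The function name and parameters are illustrative assumptions. In particular, the up-move probability `p_up` is simply passed in as a given number, so the sketch says nothing about how that probability should be estimated.

```python
import math


def price_european_call(spot, strike, rate, up, down, p_up, steps, dt):
    """Price a European call by backward induction over a binomial tree.

    up, down: multiplicative price moves per step.
    p_up: probability of an up move at every node (taken as given).
    rate, dt: continuously compounded rate and length of one step.
    """
    discount = math.exp(-rate * dt)
    # Payoffs at expiry: node j has seen j up moves and steps - j down moves.
    values = [max(spot * up**j * down**(steps - j) - strike, 0.0)
              for j in range(steps + 1)]
    # Walk back toward the root, one layer at a time.
    for layer in range(steps, 0, -1):
        values = [discount * (p_up * values[j + 1] + (1.0 - p_up) * values[j])
                  for j in range(layer)]
    return values[0]


one_step = price_european_call(100.0, 100.0, 0.0, 1.1, 0.9, 0.5, 1, 1.0)
two_step = price_european_call(100.0, 100.0, 0.0, 1.1, 0.9, 0.5, 2, 1.0)
```

With one step, the payoffs are about 10 (up) and 0 (down), so the price is about 5. With two steps, only the double-up node pays (about 21), giving about 0.5 * 0.5 * 21 = 5.25. Each layer of the tree is an independent batch of multiply-adds, and the arithmetic grows with the square of the number of steps.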

I now feel confident in my machine-learning-driven vision for quantitative GPGPU.

Anyway, now I will integrate GATLAS with minimal modifications into the JIT. That’s the right thing to do first. A modular virtual machine and JIT are better even if they are less efficient. Extensibility, flexibility, and maintainability are more important at this point than optimizing for performance, so long as performance is good enough.