there are so many cores

Just another site

Monthly Archives: February 2012

Adding first class integer type support

Chai should have first-class integer support in addition to single and double precision. This extends the PeakStream API with “i32” in addition to “f32” and “f64”.

Stencils led to this.

The JIT needs to distinguish between statically known integral subscript indices and values known only at runtime on the GPU. This allows generating different code. It’s very useful for auto-tuning. Specifically, generated kernels can prefetch from global into local memory if gather subscript indices are known at compile time. (By compile time, I mean when the JIT runs, not when the C++ application is compiled.)

My first thought was adding special case support just for the gather operation. The more I looked at this, it just seemed more natural to extend the language uniformly. This begins to address the major omission of support for algorithms that operate on integer data.

Support stencils but autotune filters

I started to refactor the JIT code. It used to be in one directory. Now it is in four: jitast, jitbc, jitstmt, jitx. This reflects the structure of the JIT stages through different representations (bytecode to abstract syntax trees to statements to kernels).

What motivated this is supporting the index and gather operations.

Here’s what index.cpp generated two weeks ago with the alpha release. The JIT generated code much like an interpreter with an inefficient direct translation. That’s why it uses temporary arrays a1 and a2 with a memory barrier.

__kernel void f_2324385871888670784_1(
  __global float * a1 ,
  __global float * a2 ,
  __global float * a3 )
  float r1 ;
  a1 [(get_global_id(0)) + get_global_id(1) * 10] = (get_global_id(0)) ;
  a2 [(get_global_id(0)) + get_global_id(1) * 10] = (0.25 * get_global_id(1)) ;
  r1 = (float)(0);
  for (int i0 = 0; i0 < 10; i0++)
    for (int i1 = 0; i1 < 10; i1++)
      r1 += (exp(a1 [(i1) + i0 * 10]) + a2 [(i0) + i1 * 10]) ;
  a3 [0] = r1  ;

This is what it does now. The temporary memory is elided away and with it the memory barrier too.

__kernel void f_2324385871888670784_1(
  __global float * a1 )
  float r1 ;
  r1 = (float)(0);
  for (int i0 = 0; i0 < 10; i0++)
    for (int i1 = 0; i1 < 10; i1++)
      r1 += (exp((float)((float)(i0))) + (0.25 * (float)(i1))) ;
  a1 [0] = r1  ;

The Kirchoff migration example also works now. Here’s a code fragment from kirch.cpp.

Arrayf32 modl = zeros_f32(NT, N);
    Arrayf32 x = dx * index_f32(1, NT, N);
    Arrayf32 z = dt * index_f32(0, NT, N);
    Arrayf32 data = Arrayf32::make2(NT, N, datagpu);

    for (int iy = 0; iy < N; iy++)
        float y = float(iy) * dx;
        Arrayf32 index1 = float(iy) * ones_f32(NT, N);
        Arrayf32 it = 0.5 + sqrt(z * z + (x - y) * (x - y) * factor) * idt;
        modl += gather2_floor(data, it, index1);
modl.read1(modlgpu, NTN * sizeof(float));

It generates this OpenCL (change: N = NT = 10).

__kernel void f_14658931760346705960_1(
  __global float * a1 ,
  __global const float * a2 )
  float r1 ;
  r1 = (float)(0);
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 0) * ((1 * get_global_id(1)) - 0)) * 1)))) * 1)))) + convert_int_rtn((0 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 1) * ((1 * get_global_id(1)) - 1)) * 1)))) * 1)))) + convert_int_rtn((1 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 2) * ((1 * get_global_id(1)) - 2)) * 1)))) * 1)))) + convert_int_rtn((2 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 3) * ((1 * get_global_id(1)) - 3)) * 1)))) * 1)))) + convert_int_rtn((3 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 4) * ((1 * get_global_id(1)) - 4)) * 1)))) * 1)))) + convert_int_rtn((4 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 5) * ((1 * get_global_id(1)) - 5)) * 1)))) * 1)))) + convert_int_rtn((5 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 6) * ((1 * get_global_id(1)) - 6)) * 1)))) * 1)))) + convert_int_rtn((6 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 7) * ((1 * get_global_id(1)) - 7)) * 1)))) * 1)))) + convert_int_rtn((7 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 8) * ((1 * get_global_id(1)) - 8)) * 1)))) * 1)))) + convert_int_rtn((8 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 9) * ((1 * get_global_id(1)) - 9)) * 1)))) * 1)))) + convert_int_rtn((9 * 1)) * 10]) ;
  a1 [(get_global_id(0)) + get_global_id(1) * 10] = r1 ;

The Chai JIT doesn’t know how to loop roll these statements. (note: PeakStream could do it) I will have to add this later. However, my experience is that OpenCL generally has better performance with fully unrolled loops, so long as the kernel is small enough to compile.

People ask really good questions at conferences. One question someone had was about sparse matrix support. That’s related to stencils.

My answer was that while I recognize the importance of sparsity, my focus is on dense linear algebra like Dongarra.

Traditional numerical linear algebra problems have a few large sparse matrices. They come from solving differential equations with some kind of stencil for the derivative operators. Every element is related to its neighbors in a regular way. The result is a system matrix that is sparse with a highly regular structure.

After working on the JIT this last week, I think I have a better grasp of what this means.

Traditional problems start with a model (e.g. a system of differential equations) and use number crunching as a form of simulation. This is also like having a geometric model and rendering to an image. It’s the forward problem.

The kinds of problems I’m interested in go the other way. It’s when you are trying to learn a model from structure in data. So it’s like having an image and trying to infer a geometric model. It’s the inverse problem.

Stencils come from the forward problem. It’s when you know what the operators look like.

Filters come from the inverse problem. It’s when you don’t know what the operators look like.

Chai supports both directions. But the emphasis is on inverse problems. The auto-tuned kernels are for dense operators. Examples are matrix multiply and convolution. There is no stencil per se. The structure is in the data.

The PeakStream DSL supports stencil kernels

The PeakStream DSL supports stencil kernels with the gather1_floor() and gather2_floor() operations.

From the PeakStream Kirchoff migration example:

float dx = LX / float(N);
float dt = LT / float(NT);
float factor = 1./ (velhalf * velhalf);
float idt = 1./ dt;
Arrayf32 modl = zeros_f32(NT, N);
    Arrayf32 x = dx * index_f32(1, NT, N);
    Arrayf32 z = dt * index_f32(0, NT, N);
    Arrayf32 data = Arrayf32::make2(NT, N, datagpu);

    for (int iy = 0; iy < N; iy++)
        float y = float(iy) * dx;
        Arrayf32 index1 = float(iy) * ones_f32(NT, N);
        Arrayf32 it = 0.5 + sqrt(z * z + (x - y) * (x - y) * factor) * idt;
        modl += gather2_floor(data, it, index1);

Choose convenient values for the constants (just to show what is going on).

LX = N = 10;
LT = NT = 5;
velhalf = 1;

The statement

        modl += gather2_floor(data, it, index1);

accumulates data in this pattern to modl:

Kirchoff Migration stencil kernel

I begin to see the genius in PeakStream’s DSL. It is an imperative language for specifying functional kernels.

I did not understand how the PeakStream gather operations really worked until this morning. This is cliché – it became obvious while taking a shower.

Some significant changes in the JIT will be required to support this. This is a good thing as it forces me to refactor some of the mess I’ve made.

Build the right thing the first time

I am unsure if this will work but will try.

  1. shorter iterations, monthly public mini-releases
  2. beta quality release this summer
  3. production quality release by end of this year

If you are interested in this work, please influence it. Tell me what to build. The biggest risk for this project is that I fail to assess value correctly and build the wrong thing.

An example of what I mean is stencil kernels. It was Korneliussen’s comment to this blog and then seeing Christen’s presentation at SIAM PP12 on stencil kernels that pointed out a glaring omission. Part of this was also Robert Rossi’s presentation at the HPC & GPU Supercomputing Meetup in Silicon Valley.

Really, I have done some computer vision before. So I understand the importance of stencils for non-linearity and feature detection. Yet, it was not until I was prompted by all three of these that it became a project priority.

It is very easy to miss what is important. That’s o.k. if you have a lot of resources. I am very resource constrained. So I must build the right thing the first time.

Stencil kernels will be next major feature

The conference is over for me. My presentation went over well. It was recorded so I will post a link once it is available through SIAM/Blue Sky Broadcast.

Sincerest thanks and gratitude to Keita Teranishi for inviting me.

Jan Tore Korneliussen inquired about per-element specification of computational kernels. My answer was image convolution. I have changed my mind. General stencil kernels are needed. This will require extending the DSL.

Matthias Christen’s presentation Automatic Code Generation and Tuning on Stencil Kernels on Modern Microarchitecture changed my mind. It is not much harder to support stencils. Non-linearity is also incredibly important for computer vision and feature detection. Linear image convolution is not enough.

Stencil kernels is the highest priority major new feature in Chai.

I have seen Chai in the context of the kernel trick and scoring functions using fast matrix multiply. This is marking positions to value quantitative strategies. As with non-linear stencils versus linear convolution, my vision has changed.

In the last two weeks, two major themes have loomed larger for me.

  1. GPGPU inverts the original intent of the GPU: instead of rendering image data from a geometric model, a model is fitted to data
  2. Compilers must write software for us

The biggest theme I saw from the conference, although it seemed novel to everyone except Teranishi-san, is that supercomputing and statistical learning theory are converging.

What drives this is scale.

The human mind, even when we work together in groups, is unable to solve problems once they become too large and complex. As supercomputers grow ever larger, the ability of human beings to write software begins to break down. We must rely on machines to help us write software.

GPU compute devices may be more significant in embedded applications than in HPC. There are two reasons for this.

The risk of new technology is an extra cost. Most traditional high performance computing is done using CPU cores. Users are government and large institutions. Throwing hardware at the problem and minimizing the change to how software is written is cheaper for them.

At the same time, embedded applications are highly constrained by power and volume. The pace of design and architectural change is much faster. You can not throw hardware at this problem. Traditional software development techniques break down. Human beings can not write and optimize software fast enough to keep up.

IBM’s Watson that defeated Ken Jennings had 2880 cores. The system was almost entirely Java.

Someone in the audience asked Eric Brown (IBM Research), “Why Java?” Pose the same question to Amazon, Google or almost any enterprise IT organization with a software development engineering arm. The answer would be the same.

After his presentation about Watson, I asked Dr. Brown if he thought there was a connection between ensembles and the Markowitz portfolio. That’s what I kept thinking during his talk.

Watson was developed around the same time as the Netflix Prize. That contest ended up being BellKor’s Pragmatic Chaos versus The Ensemble. Watson is a large ensemble of scoring functions trained using a regression model.

This reminds me of standard portfolio theory in which the optimal strategy is betting on everything. The market basket of the world is the largest possible ensemble. Yet, the standard theory has broken down. This is why mutual funds have been in decline while hedge funds have been ascendant. The world is not growing but becoming more volatile.

So my thought was, will the ensemble market basket of machine learning similarly break down? I guess no one knows.

Then I said, “Ensemble of Watson. Skynet.”

I was speaking for everyone who has ever read Slashdot.

These three presentations:

  1. Evaluation of Numerical Policy Function on Generalized Auto-Tuning Interface Openatlib, Takao Sakurai
  2. Spiral: Black Belt Autotuning for Parallel Platforms, Franz Franchetti
  3. High-order DG Wave Propagation on GPUs: Infrastructure and Implementation, Andreas Klöckner

showed the need for:

  1. mixed strategy ensembles with a meta-JIT
  2. kernel specification as a tensor algebra
  3. code generation from a meta-compiler

In preparing for this conference, I learned consideration of the audience. Think of the user. I am not working for myself. I am working for you.

After attending the conference, I understand this in my bones.

Most research and development work is done with the patronage of large institutions, both public and private. The people who create new things may have a larger vision for the future. But they are also constrained to work in the narrow silo defined by their patron.

It is the cost for being in the establishment or funded by it.

What I am doing is working inside the silo defined by you. I am trying to make something that maximizes total social welfare. Build the right thing that lasts and helps people live better.

About Savannah – it makes me sad to leave – it is like a smaller, more relaxed, Seattle, except with Southern hospitality – not perfect and obviously with a legacy of social issues dating back to slavery – but no place is perfect – I really enjoyed my time here.

Alpha release

I added a download page to this blog. It points to GitHub. The alpha release is uploaded and available. The download is a simple gzipped tarball with the git meta-data removed. Some people may just want the source without converting to git.

Next post will be the alpha release

I think I’m done with this alpha release. The next post will include a link to a GitHub download. The source will be checked in there too. I’ll probably create a new WordPress page just for releases to avoid forcing people to search through this blog narrative.

This release has shifted my mindset from self absorbed virtual machine language builder to considering users. So although it’s not much and some of it makes me look bad, I decided to be more open about the state of this project and technology. I’m taking a long view – if this technology has merit and is good, then it will live. Otherwise, it will die. It doesn’t matter too much how I portray it.


Here’s my test matrix. It’s a meagre collection of small demo code applications on every compute device I have.

OK - everything works
AF - autotuning fails for batched kernels, scheduled to interpreter
IR - interpreter support only (compute device RNG not implemented)
IS - interpreter support only (single precision only compute device)
OF - everything works except may fail rarely (enough to be noticed?)
SV - segmentation fault (also bus error)
KF - segmentation fault in kernel generated by shader compiler
UC - usually correct results, sometimes wrong
UW - results generally wrong or out of order

Sample        PentiumM  Core2Duo  Corei7_920  5440  5670  5770  5870  480
Application   AMD       AMD       AMD  INTEL  AMD   AMD   AMD   AMD   NVIDIA
============  ========  ========  ===  =====  ====  ====  ====  ===   ======
cg            KF        OK        OK   OK     OK    OK    OK    OK    OK
cg64          KF        OK        OK   OK     IS    IS    IS    OK    OK
index         OK        OK        OK   OK     OK    OK    OK    OK    OK
kirch         SV        SV        SV   SV     OK    OK    OK    OK    OK
loopsum_omp   KF        UC        SV   SV     UC    UC    UW    OK    OK
loopsum_pth   KF        UC        SV   SV     UC    UC    UC    OK    OK
loopsum_uni   KF        OK        OK   OK     OK    OK    OK    OK    OK
loopsum_vec   KF        UW        UW   UW     UW    UC    UW    OK    OK
matmul_omp    KF        AF        AF   AF     OK    OK    OK    OK    OK
matmul_pth    KF        AF        AF   AF     OK    OK    OK    OK    OK
matmul_uni    KF        OK        OK   OK     OK    OK    OK    OK    OK
matmul_vec    KF        AF        AF   AF     OK    OK    OK    OK    OK
matmul64_omp  KF        AF        AF   AF     IS    IS    IS    OK    OK
matmul64_pth  KF        AF        AF   AF     IS    IS    IS    OK    OK
matmul64_uni  KF        OK        OK   OK     IS    IS    IS    OK    OK
matmul64_vec  KF        AF        AF   AF     IS    IS    IS    OK    OK
mingle        KF        OK        OK   OK     OK    OK    OK    OK    OK
mingle64      KF        OK        OK   OK     IS    IS    IS    OK    OK
monte         IR        IR        IR   IR     IR    IR    IR    IR    IR
pi            IR        IR        IR   IR     IR    IR    IR    IR    IR
sum_omp       KF        OK        SV   SV     OF    OF    OF    OF    OK
sum_pth       KF        OK        SV   SV     OF    OF    OF    OF    OK
sum_uni       KF        OK        OK   OK     OK    OK    OK    OK    OK
sum_vec       KF        OK        OK   OK     OK    OK    OK    OK    OK

You can see some of the skeletons in the virtual machine closet through this table.

Some of that is the inherently complex nature of the technology. Vendors have to market GPGPU as a solved problem and mature technology. If that were true, it wouldn’t be interesting (easy stuff generally lacks mystique).

Some of that is also bugs, especially race conditions and logic errors. I’ve come to rely on testing as an integral part of development. That is just being realistic. I can’t keep the entire design in my head at the same time any more.

Other stuff…

  1. Tomorrow, I’ll read The Java Native Interface Programmer’s Guide and Specification by Sheng Liang. I’ve done JNI before. That was ten years ago. I haven’t done any Java for over two years either. I am very rusty. My thinking is to create a simple bridge from Java to native code libraries, analogous to how BerkeleyDB works.
  2. I am flying out to Savannah for SIAM Parallel Processing 2012 this coming Tuesday (Valentine’s Day).

Big compute on small iron

The needs of the many applications outweigh the needs of the few platforms, or the one. –Spock

Applications should not serve the needs of platforms. Platforms should serve the needs of applications. –V

I should support Java. C/C++ and Java cover the major app platforms. For many applications, Java is close to C/C++ in performance. (By close, I mean C/C++ is not usually 10x faster, it might be 2x faster.)

The odd man is Python. What has held scripting languages back is that they are too slow (like 100x slower than C/C++). Python JIT technology could change this. However, this is very new. It will be years before it could approach the maturity of HotSpot.

One reason people write JVM based languages like Scala is to get the HotSpot JIT for free. They write a virtual machine in Java. The new language platform is actually a Java application and benefits from HotSpot just like any other.

I believe that HotSpot has turned out to be a key strategic reason Java is not going away. Writing JITs is very difficult. The technology is changing with LLVM. What I am doing today would be impossible without it. In a similar way, it is easier for JIT designers now. But still Sun poured years of work into HotSpot which now works very well.

A guess – last I saw, HotSpot did not perform register optimizations. It does not vectorize. If processor architecture changes enough so there are 128 cores on a CPU with an attached GPU on the same die, then I could see other platforms displacing Java.

I was at the Legion of Honor yesterday. There was the head of Medusa on a pedestal. A big crowd gathered around it. One young woman had an iPad which she used like a camera. She may have been using it as a map too.

As others have observed, the future of digital cameras is mostly devices like this. With enough computational power, an iPad could show a “heads up display” and augment reality in a compelling enough way to be useful. I’m imagining something that is dynamically doing HDR and photostitching a bundle adjusted scene geometry (ok, I know that sounds crazy today). That would take a massive amount of power but be so cool.

At the last meetup, Robert Rossi observed that he was trying to invert the use of GPUs. The GPU takes geometry and renders a scene image. But you can also solve the inverse problem and use the GPU to take a scene image and calculate geometry. That is computer vision.

So my next evolutionary step is becoming a roadmap. The virtual machine needs to be configurable. Why? To support the needs of applications, some of which will be written in languages other than C/C++.

Does this seem more like the sort of thing with speculative value? It ties into a clear future trend in hardware. Big iron is becoming the cloud. Like the mainframe or supercomputers, it will always be there. But the market is comparatively small (even though that big iron is critical to our society). The growth is in smaller devices, not just phones and tablets but also netbooks, ultraportables and laptops. Power efficiency is important. So is graphics (as almost all notebook screens are 16:9, they are really optimized for watching HD movies and playing video games).

By the way, I’m pretty close to an alpha release. I still want another round of testing first. I’m kind of amazed how many little things I’ve caught over the last few days, not so much bugs in code but usability issues. I don’t want to release something with clearly poor fit and finish.

PeakStream CG example works on AMD, Intel, and NVIDIA

Fixed the bug with the PeakStream CG example. Root caused all symptoms. Now it is working on both AMD and NVIDIA GPUs as well as Intel CPUs.

As you might guess, there were several effects which obscured the cause.

I normally leave the scheduler configured to use any compatible compute device. GPU devices have a higher priority than CPUs. This can give the appearance of intermittent success on the GPU when traces are really being scheduled to an OpenCL CPU device. This is just careless.

The configuration comes from a simple text file. Here’s what it looks like right now on my develop-integration-test GPU server:

@HD5870@AMD           Cypress Advanced Micro Devices
@GTX480@NVIDIA        GeForce GTX 480 NVIDIA
@Corei7920@AMD@INTEL  Intel Core i7 920
@Core2Duo@AMD         Intel Core 2 Duo
@PentiumM@AMD         Intel Pentium M

@HD5870        Evergreen FP64 Images
@GTX480        Evergreen FP64 ShutdownNOP
@Corei7920     Evergreen FP64
@Core2Duo      Evergreen FP64
@PentiumM      Evergreen FP64

@HD5870        Paranoid=0.1 SearchTrials=5 TimingTrials=10 Watchdog=60
@GTX480        Paranoid=0.1 SearchTrials=5 TimingTrials=10 Watchdog=60
@Corei7920     Paranoid=0.1 SearchTrials=5 TimingTrials=10 Watchdog=120
@Core2Duo      Paranoid=0.1 SearchTrials=5 TimingTrials=10 Watchdog=60
@PentiumM      Paranoid=0.1 SearchTrials=5 TimingTrials=10 Watchdog=60
@Core2Duo@AMD  PragmaFP64=cl_amd_fp64
@PentiumM@AMD  PragmaFP64=cl_amd_fp64

Just remove all of the Intel CPU lines and the scheduler will only use GPUs.

The second part of the bug was more pernicious. There are separate code paths for synthesized and autotuned kernels. Memory was managed slightly differently.

Autotuned kernels were ignoring the static analysis done by the JIT that determines memory transfers. These kernels were always sending buffer data from the CPU to GPU, even when the CPU buffer is garbage and the GPU memory object is current.

I had nightmares of scheduler race conditions dancing in my imagination. It turned out to be simple logic error. This is wonderful.