Support stencils but autotune filters

I started refactoring the JIT code. It used to live in one directory; now it is in four: jitbc, jitast, jitstmt, and jitx. This mirrors the JIT stages as code moves through different representations: bytecode to abstract syntax trees to statements to kernels.
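Schematically, the four directories hand off to each other like this. The type and function names below are placeholders for illustration, not the actual classes in the Chai source.

// Illustrative pipeline only; these names are invented for this
// sketch, not the real ones in jitbc/jitast/jitstmt/jitx.
struct Bytecode {};  // recorded API call trace          (jitbc)
struct AstTree  {};  // abstract syntax trees            (jitast)
struct StmtList {};  // scheduled kernel statements      (jitstmt)
struct Kernel   {};  // generated OpenCL source text     (jitx)

AstTree  liftToAst(const Bytecode& bc);      // bytecode -> trees
StmtList schedule(const AstTree& ast);       // trees -> statements
Kernel   emitOpenCL(const StmtList& stmts);  // statements -> kernel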

What motivated the refactoring was supporting the index and gather operations.

Here’s what index.cpp generated two weeks ago with the alpha release. The JIT translated directly, much like an interpreter would, which is inefficient: intermediate values are staged through the temporary arrays a1 and a2, which forces a memory barrier between writing them and reading them back.

__kernel void f_2324385871888670784_1(
  __global float * a1 ,
  __global float * a2 ,
  __global float * a3 )
{
  float r1 ;
  a1 [(get_global_id(0)) + get_global_id(1) * 10] = (get_global_id(0)) ;
  a2 [(get_global_id(0)) + get_global_id(1) * 10] = (0.25 * get_global_id(1)) ;
  barrier(CLK_GLOBAL_MEM_FENCE) ;
  r1 = (float)(0);
  for (int i0 = 0; i0 < 10; i0++)
  {
    for (int i1 = 0; i1 < 10; i1++)
    {
      r1 += (exp(a1 [(i1) + i0 * 10]) + a2 [(i0) + i1 * 10]) ;
    }
  }
  a3 [0] = r1  ;
}
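For reference, the index.cpp fragment behind this kernel would be along these lines. This is my reconstruction from the generated code, not the exact source; sum() and read_scalar() are the PeakStream-style names I am assuming here.

Arrayf32 x = index_f32(0, 10, 10);          // counts 0..9 along dimension 0
Arrayf32 y = 0.25f * index_f32(1, 10, 10);  // 0.0, 0.25, ... along dimension 1
float result = sum(exp(x) + y).read_scalar();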

This is what it generates now. The temporary arrays are elided: the reads of a1 and a2 are replaced by the index expressions that produced them, and with the temporaries gone, the memory barrier disappears too.

__kernel void f_2324385871888670784_1(
  __global float * a1 )
{
  float r1 ;
  r1 = (float)(0);
  for (int i0 = 0; i0 < 10; i0++)
  {
    for (int i1 = 0; i1 < 10; i1++)
    {
      r1 += (exp((float)((float)(i0))) + (0.25 * (float)(i1))) ;
    }
  }
  a1 [0] = r1  ;
}

The Kirchhoff migration example also works now. Here’s a code fragment from kirch.cpp.

Arrayf32 modl = zeros_f32(NT, N);
{
    Arrayf32 x = dx * index_f32(1, NT, N);
    Arrayf32 z = dt * index_f32(0, NT, N);
    Arrayf32 data = Arrayf32::make2(NT, N, datagpu);

    for (int iy = 0; iy < N; iy++)
    {
        float y = float(iy) * dx;
        Arrayf32 index1 = float(iy) * ones_f32(NT, N);
        Arrayf32 it = 0.5 + sqrt(z * z + (x - y) * (x - y) * factor) * idt;
        modl += gather2_floor(data, it, index1);
    }
}
modl.read1(modlgpu, NTN * sizeof(float));
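For completeness, the fragment assumes surrounding declarations along these lines. This is a sketch, not the real kirch.cpp; judging from the generated kernel below, dx, dt, idt and factor all happened to be 1 in this run.

const int   NT = 10, N = 10;   // trace length and number of traces
const int   NTN = NT * N;
const float dx = 1.0f;         // receiver spacing
const float dt = 1.0f;         // time step
const float idt = 1.0f / dt;
const float factor = 1.0f;     // velocity scaling term
float datagpu[NTN];            // input: recorded traces
float modlgpu[NTN];            // output: migrated image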

It generates this OpenCL (with the sizes changed to N = NT = 10).

__kernel void f_14658931760346705960_1(
  __global float * a1 ,
  __global const float * a2 )
{
  float r1 ;
  r1 = (float)(0);
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 0) * ((1 * get_global_id(1)) - 0)) * 1)))) * 1)))) + convert_int_rtn((0 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 1) * ((1 * get_global_id(1)) - 1)) * 1)))) * 1)))) + convert_int_rtn((1 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 2) * ((1 * get_global_id(1)) - 2)) * 1)))) * 1)))) + convert_int_rtn((2 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 3) * ((1 * get_global_id(1)) - 3)) * 1)))) * 1)))) + convert_int_rtn((3 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 4) * ((1 * get_global_id(1)) - 4)) * 1)))) * 1)))) + convert_int_rtn((4 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 5) * ((1 * get_global_id(1)) - 5)) * 1)))) * 1)))) + convert_int_rtn((5 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 6) * ((1 * get_global_id(1)) - 6)) * 1)))) * 1)))) + convert_int_rtn((6 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 7) * ((1 * get_global_id(1)) - 7)) * 1)))) * 1)))) + convert_int_rtn((7 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 8) * ((1 * get_global_id(1)) - 8)) * 1)))) * 1)))) + convert_int_rtn((8 * 1)) * 10]) ;
  r1 = (r1  + a2 [(convert_int_rtn((0.5 + (sqrt((float)((((1 * (get_global_id(0))) * (1 * (get_global_id(0)))) + ((((1 * get_global_id(1)) - 9) * ((1 * get_global_id(1)) - 9)) * 1)))) * 1)))) + convert_int_rtn((9 * 1)) * 10]) ;
  a1 [(get_global_id(0)) + get_global_id(1) * 10] = r1 ;
}

The Chai JIT doesn’t know how to re-roll these statements into a loop. (PeakStream could do it.) I will have to add this later. However, in my experience OpenCL generally performs better with fully unrolled loops, as long as the kernel stays small enough to compile.
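For comparison, a re-rolled version of the kernel above would look something like this. I wrote it by hand to illustrate; it is not JIT output.

__kernel void f_rolled(
  __global float * a1 ,
  __global const float * a2 )
{
  const int gx = (int)(get_global_id(0));
  const int gy = (int)(get_global_id(1));
  float r1 = 0.0f;
  for (int iy = 0; iy < 10; iy++)
  {
    // same address arithmetic as the unrolled statements, with the
    // constant 0..9 replaced by the loop counter iy
    const int it = convert_int_rtn(
        0.5f + sqrt((float)(gx * gx + (gy - iy) * (gy - iy))));
    r1 += a2 [it + iy * 10] ;
  }
  a1 [gx + gy * 10] = r1 ;
}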

People ask really good questions at conferences. One question was about sparse matrix support. That’s related to stencils.

My answer was that while I recognize the importance of sparsity, my focus is on dense linear algebra, in the tradition of Dongarra.

Traditional numerical linear algebra problems revolve around a few large sparse matrices. These come from solving differential equations, with some kind of stencil for the derivative operators: every element is related to its neighbors in a regular way, so the system matrix is sparse with a highly regular structure.
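For example, discretizing the 1-D second derivative with the three-point stencil gives a tridiagonal system matrix. This is standard textbook material, not Chai code.

// Apply the 1-D Laplacian stencil (1, -2, 1) / h^2 to a vector u.
// Written as a matrix A, every interior row is
//   [ ... 0  1 -2  1  0 ... ] / (h*h)
// so A is sparse and its structure is fixed by the stencil.
void laplacian(const float* u, float* Au, int n, float h)
{
    for (int i = 1; i < n - 1; i++)
        Au[i] = (u[i-1] - 2.0f * u[i] + u[i+1]) / (h * h);
}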

After working on the JIT this last week, I think I have a better grasp of what this means.

Traditional problems start with a model (e.g. a system of differential equations) and use number crunching as a form of simulation. This is also like having a geometric model and rendering to an image. It’s the forward problem.

The kinds of problems I’m interested in go the other way. It’s when you are trying to learn a model from structure in data. So it’s like having an image and trying to infer a geometric model. It’s the inverse problem.

Stencils come from the forward problem. It’s when you know what the operators look like.

Filters come from the inverse problem. It’s when you don’t know what the operators look like.

Chai supports both directions, but the emphasis is on inverse problems. The auto-tuned kernels are for dense operators such as matrix multiply and convolution. There is no stencil per se; the structure is in the data.
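A dense operator in this sense looks like the following. It uses the PeakStream-style API from kirch.cpp above; matmul() is the entry point name I am assuming, and the sizes and host buffers are illustrative.

// The JIT can dispatch this to an auto-tuned GEMM kernel. No stencil
// appears anywhere; whatever structure exists lives in the data.
Arrayf32 A = Arrayf32::make2(M, K, aHost);
Arrayf32 B = Arrayf32::make2(K, N, bHost);
Arrayf32 C = matmul(A, B);
C.read1(cHost, M * N * sizeof(float));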
