there are so many cores



Time to bolt GATLAS into the JIT back-end

My master plan has always been a JIT built around auto-tuned math library kernels. The performance comes from these. That's what GATLAS was all about: how to make matrix multiply fast in a way that generalizes across GPU architectures.

The JIT also generates application kernels dynamically. These are usually of low arithmetic intensity. The cost of reading and writing memory dominates the cost of processor operations. So the gain from tuning is marginal. It’s there. You can measure it. However, these kernels are so slow that throughput is still terrible even after performance tuning (automated or manual).
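To put a rough number on that, here's a back-of-the-envelope roofline check for a streaming kernel like an element-wise multiply. The hardware figures are illustrative assumptions, not measurements of any particular device.

#include <algorithm>
#include <iostream>

int main()
{
    // Illustrative, assumed hardware numbers (not measured).
    const double peakGflops = 1000.0;  // peak compute, GFLOP/s
    const double peakGBs    = 150.0;   // device memory bandwidth, GB/s

    // Element-wise C = A * B in single precision: 1 multiply per element,
    // 12 bytes of traffic (read A, read B, write C).
    const double intensity = 1.0 / 12.0;  // FLOP per byte

    // The memory system caps throughput long before the ALUs do.
    const double attainable = std::min(peakGflops, peakGBs * intensity);

    std::cout << "arithmetic intensity: " << intensity << " FLOP/byte\n"
              << "attainable: " << attainable << " of "
              << peakGflops << " GFLOP/s peak\n";
    return 0;
}

With those assumed numbers the kernel tops out around 12 GFLOP/s against a 1000 GFLOP/s peak, which is why tuning such kernels only moves the needle a little.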

So why have a JIT if the application kernels generated dynamically at runtime are slow?

With current GPU technology, data movement is the dominant cost. The host CPU and the GPU compute device are connected by the PCIe bus or some other relatively slow interconnect. (APUs with an integrated CPU and GPU may change this somewhat, but that's in the future.) Just as with disk drives and memory, moving data around is expensive. That's why computers try to minimize it with caches and memory hierarchies.

Computing on the GPU needs to do the same thing.

A GPU is really good at some calculations but bad at everything else. A naive implementation would always move data to whichever compute device is best for the current calculation. The problem is that the data would then move too much, and I/O would dominate total cost.

That’s where the JIT comes in. It allows generating kernels for arbitrary compute devices so the data does not have to move. It can stay where it is. Even though the compute device may be slower than others, the cost of moving the data to a faster compute device may exceed the gains.
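The scheduling decision boils down to a simple cost comparison. Here is a minimal sketch of that trade-off; the device names, times, and bandwidth figures are hypothetical, not part of the actual scheduler.

#include <iostream>

struct Device {
    double kernelSeconds;   // estimated kernel time on this device, seconds
    double pcieGBs;         // host<->device transfer bandwidth, GB/s
};

// True if it pays to move 'bytes' of data from 'here' to the faster 'there'.
bool worthMoving(const Device& here, const Device& there, double bytes)
{
    const double transferSeconds = bytes / (there.pcieGBs * 1e9);
    return there.kernelSeconds + transferSeconds < here.kernelSeconds;
}

int main()
{
    const Device cpu = { 0.020, 0.0 };   // where the data currently lives
    const Device gpu = { 0.002, 6.0 };   // faster kernel, but behind PCIe

    // A 256 MB operand: ~45 ms of PCIe transfer dwarfs the 18 ms saved.
    std::cout << std::boolalpha
              << worthMoving(cpu, gpu, 256.0 * 1024 * 1024) << std::endl;
    return 0;
}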

That’s why optimization depends so much on scheduling and the JIT working together.

Anyway, the next thing for me to do is add in an auto-tuned matrix multiply (which includes data parallel GEMV when the matrix is common to all threads). The JIT is already designed around doing this. I just need to add the GATLAS back-end into the JIT.
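From the application's point of view, nothing about the trace should change; the JIT just swaps a tuned kernel in underneath. A rough sketch of how that data parallel GEMV might look follows, where the make2(), matmul(), and read1() names are assumed from the PeakStream-style API rather than confirmed:

#include <cstddef>
#include <peakstream.h>

using namespace chai;

// y = A * x with the NxN matrix A common to all threads and a per-thread
// vector x. The auto-tuned GATLAS kernel would be selected by the JIT.
void gemvExample(const float* cpuA, const float* cpuX, float* cpuY, size_t N)
{
    Arrayf32 Y;
    {
        Arrayf32 A = make2(N, N, cpuA);   // shared matrix
        Arrayf32 X = make1(N, cpuX);      // per-thread vector
        Y = matmul(A, X);                 // tuned math library kernel
    }
    Y.read1(cpuY, N * sizeof(float));     // copy the result back to the host
}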


Less trivial JIT example

Here's an application code sample which illustrates the current state of the JIT with a data parallel scalar reduction. The PeakStream API appears as init(), shutdown(), the Arrayf32 type, make1(), zeros_f32(), mean(), sum(), and read_scalar(). The host variables cpuA, cpuB, and outValues are the data that connect the host application with the virtual machine.

#include <iostream>
#include <omp.h>
#include <peakstream.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    const size_t NUMBER_THREADS = 10;
    const size_t ARRAY_SIZE     = 10;

    // calculated results
    float outValues[NUMBER_THREADS];

    // common data to all threads
    float cpuA[ARRAY_SIZE];
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        cpuA[i] = i * i;

    init(); // initialize virtual machine

    // use OpenMP for easy concurrency in data parallel style
    #pragma omp parallel for
    for (size_t threadIdx = 0; threadIdx < NUMBER_THREADS; threadIdx++)
    {
        // data specific to each thread
        float cpuB[ARRAY_SIZE];
        for (size_t i = 0; i < ARRAY_SIZE; i++)
            cpuB[i] = i * threadIdx;

        // PeakStream trace
        Arrayf32 D;
        {
            Arrayf32 A = make1(ARRAY_SIZE, cpuA);
            Arrayf32 B = make1(ARRAY_SIZE, cpuB);
            Arrayf32 C = zeros_f32(ARRAY_SIZE);

            C = A * B;

            for (size_t i = 0; i < 2; i++)
            {
                for (size_t j = 0; j < 3; j++)
                    C -= mean(C);

                C += 1;
            }

            D = sum(C);
        }
        outValues[threadIdx] = D.read_scalar();
    }

    shutdown(); // shutdown virtual machine

    for (size_t i = 0; i < NUMBER_THREADS; i++)
        cout << "outValues[" << i << "] is " << outValues[i] << endl;

    return 0;
}

This example is still very simple as the intermediate representation in the JIT has only three parts.

The first part creates memory buffers on the device and transfers data from host to compute device.

SEND makeX_f32(10, 1, PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)) TO 10x1 var32(0.1)
SEND makeX_f32(10, 1, PTR(0x9ad180)PTR(0x9ad1a8)PTR(0x9ad1d0)PTR(0x9ad1f8)PTR(0x9ad220)PTR(0x9ad248)PTR(0x9ad270)PTR(0x9ad298)PTR(0x9ad2c0)PTR(0x9ad2e8)) TO 10x1 var32(1.1)
CREATE 10x1 var32(2.1)
CREATE 1x1 VAR32(3.1)

The second part is a kernel for the compute device.

LITERAL zerosX_f32(10, 1) TO 10x1 var32(2.1)
10x1 var32(2.2) <- (var32(0.1) * var32(1.1))
for (int i0 = 0; i0 < 2; i0++)
{
  for (int i1 = 0; i1 < 3; i1++)
  {
    CREATE 1x1 var32(0x9b1950)
    REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
    10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
  }
  10x1 var32(2.6) <- (var32(2.3) + 1)
}
for (int i2 = 0; i2 < 3; i2++)
{
  CREATE 1x1 var32(0x9b1950)
  REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
  10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
}
CREATE 1x1 var32(0x9b1950)
REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
10x1 var32(2.6) <- (var32(2.3) + 1)
BARRIER
REDUCE sum(var32(2.6)) TO 1x1 VAR32(3.1)

The third part is a data transfer from the compute device back to the host.

READ readout(PTR(0x9aad80)PTR(0x9aad84)PTR(0x9aad88)PTR(0x9aad8c)PTR(0x9aad90)PTR(0x9aad94)PTR(0x9aad98)PTR(0x9aad9c)PTR(0x9aada0)PTR(0x9aada4), VAR32(3.1)) TO 1x1 VAR32(3.2)
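For orientation, the SEND, CREATE, and READ statements correspond roughly to the usual OpenCL host-side buffer calls. The following is only a sketch of that mapping under that assumption, not the JIT's actual code path.

#include <CL/cl.h>
#include <cstddef>

void sendCreateRead(cl_context ctx, cl_command_queue queue,
                    const float* hostA, float* hostOut, size_t n)
{
    cl_int err = CL_SUCCESS;

    // SEND: create a device buffer and copy host data into it
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                 n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
                         n * sizeof(float), hostA, 0, NULL, NULL);

    // CREATE: allocate a device buffer with no host transfer
    cl_mem bufOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                   sizeof(float), NULL, &err);

    // ... generated kernel would be enqueued here ...

    // READ: copy the result back from device to host
    clEnqueueReadBuffer(queue, bufOut, CL_TRUE, 0,
                        sizeof(float), hostOut, 0, NULL, NULL);

    clReleaseMemObject(bufA);
    clReleaseMemObject(bufOut);
}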

The intermediate representation is translated by the JIT into an OpenCL kernel.

__kernel void f_524073289323672962_1(
  __global float * a1 ,
  __global float * a2 ,
  __global float * a3 ,
  __global float * a4 )
{
  float r1 ;
  float r2 ;
  for (int i0 = 0; i0 < 10; i0++)
  {
    a3 [i0 + get_global_id(0) * 10] = 0 ;
  }
  for (int i0 = 0; i0 < 10; i0++)
  {
    a3 [i0 + get_global_id(0) * 10] = (a1 [i0] * a2 [i0 + get_global_id(0) * 10]) ;
  }
  for (int i0 = 0; i0 < 2; i0++)
  {
    for (int i1 = 0; i1 < 3; i1++)
    {
      r2 = 0 ;
      for (int i2 = 0; i2 < 10; i2++)
      {
        r2 = r2 + a3 [i2 + get_global_id(0) * 10] ;
      }
      r1 = r2 / 10 ;
      for (int i2 = 0; i2 < 10; i2++)
      {
        a3 [i2 + get_global_id(0) * 10] = (a3 [i2 + get_global_id(0) * 10] - r1 ) ;
      }
    }
    for (int i1 = 0; i1 < 10; i1++)
    {
      a3 [i1 + get_global_id(0) * 10] = (a3 [i1 + get_global_id(0) * 10] + 1) ;
    }
  }
  barrier(CLK_LOCAL_MEM_FENCE) ;
  r2 = 0 ;
  for (int i0 = 0; i0 < 10; i0++)
  {
    r2 = r2 + a3 [i0 + get_global_id(0) * 10] ;
  }
  a4 [0 + get_global_id(0) * 1] = r2 ;
}

As is obvious, the JIT is still primitive. It is incomplete and still full of bugs: no vectorized memory access, only primitive use of the compute device memory hierarchy, no auto-tuning, no fast math library kernels (e.g. matrix multiply), and no random number generation support.
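For example, vectorized memory access would mean loads and stores through float4 instead of scalar floats. Here is a hand-written illustration of what the element-wise multiply loop above could look like in that form. It is not what the JIT emits today, and it assumes the 10-element arrays were padded to 12 floats so they divide evenly into float4 values.

__kernel void f_vectorized(
  __global const float4 * a1 ,
  __global const float4 * a2 ,
  __global float4 * a3 )
{
  // Three float4 loads and stores replace twelve scalar accesses.
  for (int i0 = 0; i0 < 3; i0++)
  {
    a3 [i0 + get_global_id(0) * 3] = (a1 [i0] * a2 [i0 + get_global_id(0) * 3]) ;
  }
}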

However, the design is working end-to-end, which shows the approach is viable. To be honest, there have been many times over the last year when I was unsure if this would ever work. Well, it's working now.

Now it’s becoming real

It's working. Application threads are gathered and scheduled to the JIT, which enqueues data transfers and dynamically generated kernels. OpenCL compute devices and the interpreter seamlessly intermingle with the same traces and data. Results are scattered back transparently, without extra APIs or "out-of-band" idioms beyond what is natural for C++ (a managed platform should manage this for you!).

It feels like magic again.

This makes me think of a quote from Arthur C. Clarke, “Any sufficiently advanced technology is indistinguishable from magic.” Maybe I’m being very narcissistic here? Yep.

At the start of this project, I knew nothing. It was all magic. I could project any wish-fulfillment dream onto it because I did not know any better. With progress, I learned more, so the magic faded. It became engineering. After that came a period of extreme self-doubt. Now it's becoming real.