there are so many cores


Less trivial JIT example

Here’s an application code sample that illustrates the current state of the JIT with a data parallel scalar reduction. The PeakStream API calls appear inside the trace block; the Arrayf32 variables and the arrays passed to make1() are what connect the host application with the virtual machine.

#include <iostream>
#include <omp.h>
#include <peakstream.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    const size_t NUMBER_THREADS = 10;
    const size_t ARRAY_SIZE     = 10;

    // calculated results
    float outValues[NUMBER_THREADS];

    // common data to all threads
    float cpuA[ARRAY_SIZE];
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        cpuA[i] = i * i;

    init(); // initialize virtual machine

    // use OpenMP for easy concurrency in data parallel style
    #pragma omp parallel for
    for (size_t threadIdx = 0; threadIdx < NUMBER_THREADS; threadIdx++)
    {
        // data specific to each thread
        float cpuB[ARRAY_SIZE];
        for (size_t i = 0; i < ARRAY_SIZE; i++)
            cpuB[i] = i * threadIdx;

        // PeakStream trace
        Arrayf32 D;
        {
            Arrayf32 A = make1(ARRAY_SIZE, cpuA);
            Arrayf32 B = make1(ARRAY_SIZE, cpuB);
            Arrayf32 C = zeros_f32(ARRAY_SIZE);

            C = A * B;

            for (size_t i = 0; i < 2; i++)
            {
                for (size_t j = 0; j < 3; j++)
                    C -= mean(C);

                C += 1;
            }

            D = sum(C);
        }
        outValues[threadIdx] = D.read_scalar();
    }

    shutdown(); // shutdown virtual machine

    for (size_t i = 0; i < NUMBER_THREADS; i++)
        cout << "outValues[" << i << "] is " << outValues[i] << endl;

    return 0;
}

This example is still quite simple: the intermediate representation inside the JIT has only three parts.

The first part creates memory buffers on the device and transfers data from host to compute device.

SEND makeX_f32(10, 1, PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)PTR(0x9aae20)) TO 10x1 var32(0.1)
SEND makeX_f32(10, 1, PTR(0x9ad180)PTR(0x9ad1a8)PTR(0x9ad1d0)PTR(0x9ad1f8)PTR(0x9ad220)PTR(0x9ad248)PTR(0x9ad270)PTR(0x9ad298)PTR(0x9ad2c0)PTR(0x9ad2e8)) TO 10x1 var32(1.1)
CREATE 10x1 var32(2.1)
CREATE 1x1 VAR32(3.1)

The second part is a kernel for the compute device.

LITERAL zerosX_f32(10, 1) TO 10x1 var32(2.1)
10x1 var32(2.2) <- (var32(0.1) * var32(1.1))
for (int i0 = 0; i0 < 2; i0++)
{
  for (int i1 = 0; i1 < 3; i1++)
  {
    CREATE 1x1 var32(0x9b1950)
    REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
    10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
  }
  10x1 var32(2.6) <- (var32(2.3) + 1)
}
for (int i2 = 0; i2 < 3; i2++)
{
  CREATE 1x1 var32(0x9b1950)
  REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
  10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
}
CREATE 1x1 var32(0x9b1950)
REDUCE mean(var32(2.2)) TO 1x1 var32(0x9b1950)
10x1 var32(2.3) <- (var32(2.2) - var32(0x9b1950))
10x1 var32(2.6) <- (var32(2.3) + 1)
BARRIER
REDUCE sum(var32(2.6)) TO 1x1 VAR32(3.1)

The third part is a data transfer from the compute device back to the host.

READ readout(PTR(0x9aad80)PTR(0x9aad84)PTR(0x9aad88)PTR(0x9aad8c)PTR(0x9aad90)PTR(0x9aad94)PTR(0x9aad98)PTR(0x9aad9c)PTR(0x9aada0)PTR(0x9aada4), VAR32(3.1)) TO 1x1 VAR32(3.2)

The intermediate representation is translated by the JIT into an OpenCL kernel.

__kernel void f_524073289323672962_1(
  __global float * a1 ,
  __global float * a2 ,
  __global float * a3 ,
  __global float * a4 )
{
  float r1 ;
  float r2 ;
  for (int i0 = 0; i0 < 10; i0++)
  {
    a3 [i0 + get_global_id(0) * 10] = 0 ;
  }
  for (int i0 = 0; i0 < 10; i0++)
  {
    a3 [i0 + get_global_id(0) * 10] = (a1 [i0] * a2 [i0 + get_global_id(0) * 10]) ;
  }
  for (int i0 = 0; i0 < 2; i0++)
  {
    for (int i1 = 0; i1 < 3; i1++)
    {
      r2 = 0 ;
      for (int i2 = 0; i2 < 10; i2++)
      {
        r2 = r2 + a3 [i2 + get_global_id(0) * 10] ;
      }
      r1 = r2 / 10 ;
      for (int i2 = 0; i2 < 10; i2++)
      {
        a3 [i2 + get_global_id(0) * 10] = (a3 [i2 + get_global_id(0) * 10] - r1 ) ;
      }
    }
    for (int i1 = 0; i1 < 10; i1++)
    {
      a3 [i1 + get_global_id(0) * 10] = (a3 [i1 + get_global_id(0) * 10] + 1) ;
    }
  }
  barrier(CLK_LOCAL_MEM_FENCE) ;
  r2 = 0 ;
  for (int i0 = 0; i0 < 10; i0++)
  {
    r2 = r2 + a3 [i0 + get_global_id(0) * 10] ;
  }
  a4 [0 + get_global_id(0) * 1] = r2 ;
}

As is obvious, the JIT is still primitive: incomplete and full of bugs. There is no vectorized memory access, only naive use of the compute device memory hierarchy, no auto-tuning, no fast math library kernels (e.g. matrix multiply), and no random number generation support.

However, the design is working end-to-end, which demonstrates the viability of the solution. To be honest, there have been many times over the last year when I was unsure whether this would ever work. Well, it’s working now.
