there are so many cores


End-to-end functionality after almost two years of work

It’s working end-to-end. PeakStream DSL code like this:

Arrayf64 D;
{
    Arrayf64 A = make2(N, N, cpuA);
    Arrayf64 B = make2(N, N, cpuB);

    D = sum(matmul(A, B) + 1.2 * B) + 3.4;
}
double d = D.read_scalar();

is converted to bytecode:

HASHCODE 12342043972070902001 NUMTRACES 1
0.0    
0.1    15 40 40 PTR ( 0x1b4ed00 )
1.0    
1.1    15 40 40 PTR ( 0x1b4f0d0 )
2.0    
2.1    3 25 50 25 16 0.1 1.1 33 48 1.2 1.1 48 3.4
1.4294967295    
0.4294967295    
2.2    38 2.1 PTR ( 0x1b4f630 )

which is then kernelized into five parts:

1. MMAUTO 80x80f64 = matmulNN(80x80f64, 80x80f64)

2. SEND makeX_f64(80, 80, PTR(0x1a18390)) TO 80x80 var64(0.1)
   SEND makeX_f64(80, 80, PTR(0x1a24ba0)) TO 80x80 var64(1.1)
   CREATE 80x80 var64(0x1a33130)
   CREATE 1x1 var64(0x1a33770)
   CREATE 1x1 VAR64(2.1)

3. MATMUL matmulMM(var64(0.1), var64(1.1)) TO 80x80 var64(0x1a33130)

4. REDUCE sum((var64(0x1a33130) + (1.2 * var64(1.1)))) TO 1x1 var64(0x1a33770)
   BARRIER
   1x1 VAR64(2.1) <- (var64(0x1a33770) + 3.4)

5. READ readout(PTR(0x1a31470), VAR64(2.1)) TO 1x1 VAR64(2.2)

The kernelized trace is then scheduled on an ATI HD 5870, using an auto-tuned kernel for the matrix multiply:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void evergreenmatmul_8_8_8_0_2_2_1_0_0_80_80_80_0_0_10_10_2_2_8(
    __global double2* matrixC,
    __read_only image2d_t matrixA,
    __global const double2* matrixB,
    __local double2* localB,
    const int M,
    const int N,
    const int K)
{
    const sampler_t sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;
    __global const double2* ptrMatrixB;
    __local const double2* ptrLocalB;
    double2 accum00;
    double2 accum01;
    double2 valueA00;
    double2 valueA01;
    double2 valueB00;
    double2 valueB01;
    accum00 = (double2)(0);
    accum01 = (double2)(0);
    ptrMatrixB = ((matrixB + get_global_id(0)) + (get_local_id(1) * N));
    for (int outerIdx = 0; outerIdx < (K / 20); outerIdx++)
    {
        *(localB + ((get_local_id(0) * 22) + get_local_id(1))) = *ptrMatrixB;
        *((localB + ((get_local_id(0) * 22) + get_local_id(1))) + 11) = *(ptrMatrixB + (N / 2));
        barrier(CLK_LOCAL_MEM_FENCE);
        ptrMatrixB += (N * 10);
        ptrLocalB = (localB + (get_local_id(0) * 22));
        for (int innerIdx = 0; innerIdx < 10; innerIdx++)
        {
            valueA00 = as_double2(read_imageui(matrixA, sampler, (int2)(((outerIdx * 10) + innerIdx), (get_global_id(1) * 2))));
            valueA01 = as_double2(read_imageui(matrixA, sampler, (int2)(((outerIdx * 10) + innerIdx), ((get_global_id(1) * 2) + 1))));
            valueB00 = *ptrLocalB;
            valueB01 = *(ptrLocalB + 11);
            ptrLocalB += 1;
            accum00.s0 = mad(valueA00.s0, valueB00.s0, accum00.s0);
            accum01.s0 = mad(valueA01.s0, valueB00.s0, accum01.s0);
            accum00.s0 = mad(valueA00.s1, valueB01.s0, accum00.s0);
            accum01.s0 = mad(valueA01.s1, valueB01.s0, accum01.s0);
            accum00.s1 = mad(valueA00.s0, valueB00.s1, accum00.s1);
            accum01.s1 = mad(valueA01.s0, valueB00.s1, accum01.s1);
            accum00.s1 = mad(valueA00.s1, valueB01.s1, accum00.s1);
            accum01.s1 = mad(valueA01.s1, valueB01.s1, accum01.s1);
        }
    }
    *((matrixC + (((get_group_id(1) * 10) + get_local_id(1)) * ((2 * N) / 2))) + ((get_group_id(0) * 10) + get_local_id(0))) = accum00;
    *(((matrixC + (((get_group_id(1) * 10) + get_local_id(1)) * ((2 * N) / 2))) + ((get_group_id(0) * 10) + get_local_id(0))) + (N / 2)) = accum01;
}

A second, dynamically generated kernel handles the reduction:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void f_151313832001081542_3(
  __global const double2 * a1 ,
  __global double2 * a2 ,
  __global double2 * a3 ,
  __global double2 * a4 )
{
  double2 r1 ;
  r1 = (double2)(0);
  for (int i0 = 0; i0 < 80; i0++)
  {
    for (int i1 = 0; i1 < 40; i1++)
    {
      r1 = r1 + (a3 [i1 + i0 * 40] + (1.2 * a1 [i1 + i0 * 40])) ;
    }
  }
  a4 [0] = r1 ;
  barrier(CLK_LOCAL_MEM_FENCE) ;
  for (int i0 = 0; i0 < 1; i0++)
  {
    a2 [i0 + get_global_id(0) * 0] = (a4 [i0 + get_global_id(0) * 0] + 3.4) ;
  }
}

It has taken almost two years of work to reach this point.

Two years ago, I had a vague notion of using GPUs for desktop HPC. I had never done any GPGPU development before. Around a year ago, I settled on the PeakStream-style C++ DSL with a virtual machine and JIT. The pieces are finally coming together. And they are working.

I want to make another comment about floating point error on the GPU. The mixed precision case of the last post may be egregiously bad. I notice that error for purely double precision kernels is quite low. Does IEEE floating point specify error bounds for mixed precision arithmetic? Even if it does, you have to read the fine print to know what parts of the standard your hardware conforms to.

On a CPU, a compiler that optimizes register use should give a 1.5x to 2x speedup with single precision. Some compilers (e.g. Java HotSpot, last I checked) do not perform this optimization, in which case there is little to no performance advantage to single precision arithmetic.

GPUs are sometimes different. NVIDIA GPUs (Fermi) perform single precision at 2x the rate of double precision (it's actually more complicated: this may hold only for the professional models, as double precision is locked at reduced performance on the consumer GPUs). On ATI GPUs (Evergreen), single precision runs at 5x the rate of double precision.

In addition, GPU memory is a limited resource. Anything that reduces memory use can have a dramatic effect on total throughput. So single precision, where a calculation can tolerate it, has many performance advantages.

However, most real world analytics are done in double precision for obvious reasons. So any use of single precision is likely to be mixed into the computational pipeline where we can get away with it. The thing to remember is that arithmetic error in the mixed precision case may be much higher and stranger than in the purely double precision case. So testing and validation are required before accepting a mixed precision solution.
