there are so many cores

ATI and NVIDIA OpenCL chimera binaries

If both the ATI and NVIDIA OpenCL SDKs are installed on a host, it is very easy to accidentally create a chimera binary. An application may show correct linkage to shared object libraries with ldd, but linking also happens at runtime. To be safe, it’s necessary to take precautions with shared library paths.

Specifically, an NVIDIA OpenCL application will somehow find, dynamically load, and link against the ATI shared libraries if they are on the LD_LIBRARY_PATH. It doesn’t matter that the CUDA paths come first. It’s necessary to remove the ATI paths from LD_LIBRARY_PATH entirely.

Going in the other direction, an ATI OpenCL application that mistakenly (due to a bug) tries to use an NVIDIA GPU compute device can end up linking the NVIDIA CUDA shared libraries into the running process.

These runtime chimeras will often run (although I am not sure what they are doing) but exhibit mysterious failures involving memory buffer management, usually terminating with a segmentation fault or an abort signal.

I’ve been very foolish. This should have been obvious. Mixing happens very easily and silently unless there are deliberate mechanisms to detect the incompatibility. Both vendor SDKs assume they are the only one present on the host, so naturally neither checks for linkage errors.
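
A cheap sanity check is to ask the OpenCL library that actually got loaded who it is. Here is a minimal sketch of the idea (just an illustration, not code from the framework), using only standard OpenCL 1.x host calls:

#include <CL/cl.h>
#include <cstdio>

int main()
{
    // ask the runtime which platform(s) the dynamic linker actually gave us
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    if (CL_SUCCESS != clGetPlatformIDs(8, platforms, &numPlatforms))
    {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }
    for (cl_uint i = 0; i < numPlatforms; i++)
    {
        char vendor[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                          sizeof(vendor), vendor, NULL);
        printf("platform %u vendor: %s\n", (unsigned int) i, vendor);
        // a chimera shows up here as the wrong (or an unexpected) vendor
    }
    return 0;
}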

Everything is working now. ATI works. NVIDIA works. They are both using the same JIT and auto-tuned kernel templates (templates in the sense of parameterized designs, not C++ templates). There are still JIT bugs. But the calculated results are consistent between the reference interpreter and the GPUs for both ATI and NVIDIA.

This is great, as the whole point of having a managed platform is:

  1. higher level application code
  2. automated performance tuning
  3. write once, run on ATI and NVIDIA

So it’s looking like all three points are working out.

End-to-end functionality after almost two years of work

It’s working end-to-end. PeakStream DSL code like this:

Arrayf64 D;
{
    Arrayf64 A = make2(N, N, cpuA);
    Arrayf64 B = make2(N, N, cpuB);

    D = sum(matmul(A, B) + 1.2 * B) + 3.4;
}
double d = D.read_scalar();

is converted to bytecode:

HASHCODE 12342043972070902001 NUMTRACES 1
0.0    
0.1    15 40 40 PTR ( 0x1b4ed00 )
1.0    
1.1    15 40 40 PTR ( 0x1b4f0d0 )
2.0    
2.1    3 25 50 25 16 0.1 1.1 33 48 1.2 1.1 48 3.4
1.4294967295    
0.4294967295    
2.2    38 2.1 PTR ( 0x1b4f630 )

which is then kernelized into five parts:

1. MMAUTO 80x80f64 = matmulNN(80x80f64, 80x80f64)

2. SEND makeX_f64(80, 80, PTR(0x1a18390)) TO 80x80 var64(0.1)
   SEND makeX_f64(80, 80, PTR(0x1a24ba0)) TO 80x80 var64(1.1)
   CREATE 80x80 var64(0x1a33130)
   CREATE 1x1 var64(0x1a33770)
   CREATE 1x1 VAR64(2.1)

3. MATMUL matmulMM(var64(0.1), var64(1.1)) TO 80x80 var64(0x1a33130)

4. REDUCE sum((var64(0x1a33130) + (1.2 * var64(1.1)))) TO 1x1 var64(0x1a33770)
   BARRIER
   1x1 VAR64(2.1) <- (var64(0x1a33770) + 3.4)

5. READ readout(PTR(0x1a31470), VAR64(2.1)) TO 1x1 VAR64(2.2)

and scheduled on an ATI HD 5870 using an auto-tuned kernel for the matrix multiply:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void evergreenmatmul_8_8_8_0_2_2_1_0_0_80_80_80_0_0_10_10_2_2_8(
    __global double2* matrixC,
    __read_only image2d_t matrixA,
    __global const double2* matrixB,
    __local double2* localB,
    const int M,
    const int N,
    const int K)
{
    const sampler_t sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;
    __global const double2* ptrMatrixB;
    __local const double2* ptrLocalB;
    double2 accum00;
    double2 accum01;
    double2 valueA00;
    double2 valueA01;
    double2 valueB00;
    double2 valueB01;
    accum00 = (double2)(0);
    accum01 = (double2)(0);
    ptrMatrixB = ((matrixB + get_global_id(0)) + (get_local_id(1) * N));
    for (int outerIdx = 0; outerIdx < (K / 20); outerIdx++)
    {
        *(localB + ((get_local_id(0) * 22) + get_local_id(1))) = *ptrMatrixB;
        *((localB + ((get_local_id(0) * 22) + get_local_id(1))) + 11) = *(ptrMatrixB + (N / 2));
        barrier(CLK_LOCAL_MEM_FENCE);
        ptrMatrixB += (N * 10);
        ptrLocalB = (localB + (get_local_id(0) * 22));
        for (int innerIdx = 0; innerIdx < 10; innerIdx++)
        {
            valueA00 = as_double2(read_imageui(matrixA, sampler, (int2)(((outerIdx * 10) + innerIdx), (get_global_id(1) * 2))));
            valueA01 = as_double2(read_imageui(matrixA, sampler, (int2)(((outerIdx * 10) + innerIdx), ((get_global_id(1) * 2) + 1))));
            valueB00 = *ptrLocalB;
            valueB01 = *(ptrLocalB + 11);
            ptrLocalB += 1;
            accum00.s0 = mad(valueA00.s0, valueB00.s0, accum00.s0);
            accum01.s0 = mad(valueA01.s0, valueB00.s0, accum01.s0);
            accum00.s0 = mad(valueA00.s1, valueB01.s0, accum00.s0);
            accum01.s0 = mad(valueA01.s1, valueB01.s0, accum01.s0);
            accum00.s1 = mad(valueA00.s0, valueB00.s1, accum00.s1);
            accum01.s1 = mad(valueA01.s0, valueB00.s1, accum01.s1);
            accum00.s1 = mad(valueA00.s1, valueB01.s1, accum00.s1);
            accum01.s1 = mad(valueA01.s1, valueB01.s1, accum01.s1);
        }
    }
    *((matrixC + (((get_group_id(1) * 10) + get_local_id(1)) * ((2 * N) / 2))) + ((get_group_id(0) * 10) + get_local_id(0))) = accum00;
    *(((matrixC + (((get_group_id(1) * 10) + get_local_id(1)) * ((2 * N) / 2))) + ((get_group_id(0) * 10) + get_local_id(0))) + (N / 2)) = accum01;
}

with another dynamically generated kernel for the reduction:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void f_151313832001081542_3(
  __global const double2 * a1 ,
  __global double2 * a2 ,
  __global double2 * a3 ,
  __global double2 * a4 )
{
  double2 r1 ;
  r1 = (double2)(0);
  for (int i0 = 0; i0 < 80; i0++)
  {
    for (int i1 = 0; i1 < 40; i1++)
    {
      r1 = r1 + (a3 [i1 + i0 * 40] + (1.2 * a1 [i1 + i0 * 40])) ;
    }
  }
  a4 [0] = r1 ;
  barrier(CLK_LOCAL_MEM_FENCE) ;
  for (int i0 = 0; i0 < 1; i0++)
  {
    a2 [i0 + get_global_id(0) * 0] = (a4 [i0 + get_global_id(0) * 0] + 3.4) ;
  }
}

It has taken almost two years of work to reach this point.

Two years ago, I had a vague notion of using GPUs for desktop HPC. I had never done any GPGPU development before. Around a year ago, I settled on the PeakStream-style C++ DSL with a virtual machine and JIT. The pieces are finally coming together. And they are working.

I want to make another comment about floating point error on the GPU. The mixed precision case of the last post may be egregiously bad. I notice that error for purely double precision kernels is quite low. Does IEEE floating point specify error bounds for mixed precision arithmetic? Even if it does, you have to read the fine print to know what parts of the standard your hardware conforms to.

On a CPU, a compiler that optimizes register use should give a 1.5x to 2x speedup with single precision. Some compilers (e.g. Java HotSpot, last I checked) do not perform this optimization, so with them there is little to no performance advantage to single precision arithmetic.

GPUs are sometimes different. NVIDIA GPUs (Fermi) perform single precision at 2x the rate of double precision (actually it’s more complicated: this may only be true of the professional models, since double precision is capped at reduced performance on the consumer GPUs). On ATI GPUs (Evergreen), single precision runs at 5x the rate of double precision.

In addition, GPU memory is a limited resource. Anything to reduce memory use could have a dramatic effect on total throughput. So single precision, if acceptable to a calculation, has many performance advantages.

However, most real-world analytics are done in double precision for obvious reasons. So any use of single precision is likely to be mixed into the computational pipeline where we can get away with it. The thing to remember is that arithmetic error in the mixed precision case may be much higher and stranger than in the purely double precision case. So testing and validation are required before accepting a mixed precision solution.
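
To make that concrete, here is a toy example (illustrative only, the values are arbitrary) of how much faster round-off accumulates when a single precision accumulator is mixed into a double precision sum:

#include <cstdio>
#include <cstdlib>

int main()
{
    // accumulate the same 400-term sum in float and in double
    float  sumSingle = 0.0f;
    double sumDouble = 0.0;
    srand(12345);
    for (int i = 0; i < 400; i++)
    {
        const double x = rand() / (double) RAND_MAX;  // random in [0, 1]
        sumSingle += (float) x;   // mixed precision: double data, float accumulator
        sumDouble += x;
    }
    printf("float accumulator  : %.9f\n", sumSingle);
    printf("double accumulator : %.17f\n", sumDouble);
    printf("difference         : %g\n", sumDouble - sumSingle);
    return 0;
}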

Distribution of error in ATI GPU output

ATI GPUs do not have ECC memory. NVIDIA GPUs (at least the professional models) use ECC memory. In practical terms, what difference does this make to GPGPU applications?

The mixed precision kernel evergreenmatmul_4_8_4_2_0_0_10_0_400_400_400_1_0_10_10_4_4_23 from the last post is really ten 400x400x400 matrix multiplications tiled into a kernel invocation. One test with random positive numbers gives a final error against a reference calculation on the CPU as follows.

ABSDIFF 6
(6, 224, 121) diff by 1    host array: 93.9413    gpu calc: 92.9138
(6, 224, 135) diff by 1    host array: 102.461    gpu calc: 101.153
(6, 225, 135) diff by 1    host array: 100.383    gpu calc: 101.405
(9, 184, 51) diff by 1    host array: 95.3789    gpu calc: 96.4027
(9, 184, 54) diff by 1    host array: 104.34    gpu calc: 105.444
(9, 187, 49) diff by 1    host array: 102.789    gpu calc: 101.757

Out of 1.6 million numbers in calculated output (10 * 400 * 400 = 1600000), exactly six of them are wrong. The rest are within 1e-07 of the reference calculation. Repeating this test yields a different set of errors each time. Anywhere from a half dozen to two dozen numbers will be wrong. It’s random.

There was a bug. I used abs() instead of fabs(). That truncated most errors to zero. Here is what the error distribution really looks like:

                       1.0 <= error : 9
                 0.1 <= error < 1.0 : 369
                0.01 <= error < 0.1 : 67
              0.001 <= error < 0.01 : 2
            0.0001 <= error < 0.001 : 901
          0.00001 <= error < 0.0001 : 1116457
        0.000001 <= error < 0.00001 : 318057
      0.0000001 <= error < 0.000001 : 0
    0.00000001 <= error < 0.0000001 : 0
  0.000000001 <= error < 0.00000001 : 0
0.0000000001 <= error < 0.000000001 : 0
               error < 0.0000000001 : 164138
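
For reference, the corrected comparison is nothing fancy. A minimal sketch of the host-side check (not my actual test harness) that buckets absolute errors by decade, using fabs() this time:

#include <cmath>
#include <cstdio>
#include <vector>

// compare GPU output against the CPU reference, bucketing absolute errors by decade
void errorHistogram(const std::vector<double>& cpuRef,
                    const std::vector<double>& gpuOut)
{
    int buckets[12] = { 0 };  // [0] is error >= 1.0, then one bucket per decade down
    for (size_t i = 0; i < cpuRef.size(); i++)
    {
        // fabs(), not abs(): the integer abs() truncates a double argument,
        // so every error below 1.0 becomes zero
        const double err = fabs(cpuRef[i] - gpuOut[i]);
        double bound = 1.0;
        int b = 0;
        while (b < 11 && err < bound) { bound *= 0.1; b++; }
        buckets[b]++;
    }
    printf("error >= 1.0 : %d\n", buckets[0]);
    for (int b = 1; b <= 10; b++)
        printf("1e-%d <= error < 1e-%d : %d\n", b, b - 1, buckets[b]);
    printf("error < 1e-10 : %d\n", buckets[11]);
}

int main()
{
    // tiny synthetic example; the real test compares 1.6 million values
    std::vector<double> cpuRef(4, 1.0), gpuOut(4, 1.0);
    gpuOut[0] = 2.0;        // lands in the error >= 1.0 bucket
    gpuOut[1] = 1.00002;    // lands in the 1e-5 <= error < 1e-4 bucket
    errorHistogram(cpuRef, gpuOut);
    return 0;
}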

Each element of the output array is calculated from the sum of 400 double precision random numbers. There will be some round-off error. However, the error histogram shows that much higher errors occur fairly often. I recall that PeakStream had done fairly detailed analysis of arithmetic error on the GPU and touted this as a selling point over other frameworks that largely ignored the issue. (It will be interesting to see what the error distribution looks like on NVIDIA GPUs. Despite the ECC memory, I would be surprised if there are not similar issues with arithmetic error.)

Is this acceptable for your application? It is for mine. (Note: this is the stuff the vendors never talk about!)

When I was a support quant, error metrics of the forecast strategy against realized actuals could easily fluctuate by 50 to 100 basis points just due to accumulated round-off error. That was not too surprising as the data was long-tailed. Adding small and large floating point numbers together results in significant errors. Fortunately, this was not a problem for us.
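
The classic illustration (a toy in single precision for visibility, not the actual forecast data) is that a large running total simply swallows small contributions unless the summation compensates for the lost low-order bits:

#include <cstdio>

int main()
{
    // one large value plus a million small ones
    float naive = 1.0e8f;
    float kahanSum = 1.0e8f;
    float carry = 0.0f;
    for (int i = 0; i < 1000000; i++)
    {
        const float x = 0.01f;
        naive += x;                   // each 0.01 is rounded away next to 1e8

        const float y = x - carry;    // Kahan compensated summation:
        const float t = kahanSum + y; // re-inject the low-order bits lost
        carry = (t - kahanSum) - y;   // in the previous additions
        kahanSum = t;                 // (assumes the compiler does not reassociate)
    }
    printf("naive sum : %.1f\n", naive);     // stays at 100000000.0
    printf("kahan sum : %.1f\n", kahanSum);  // close to 100010000.0
    return 0;
}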

In quantitative strategy problems, there is already so much uncertainty from data quality issues that trading some correctness in calculated output for higher number crunching performance may be a good deal. That’s probably not true of many scientific computing problems, however. That explains why NVIDIA’s trade-off of reduced performance and higher cost in exchange for correctness is really a must-have for much of the HPC market. The higher performance of ATI GPUs is generally used for problems like Bitcoin mining and password cracking.

Matrix multiply for the JIT

The new design for matrix multiply looks good. I’ve been testing on an ATI HD 5870 for several days, shaking out bugs. The code generation handles the following parameters:

  • M, N, and K matrix dimensions, optionally inlined
  • row and column data layout (transposed matrices)
  • single and double precision floating point (may be mixed together)
  • combinations of images and memory buffers of any vector length
  • square outer blocking in work group
  • rectangular inner blocking in registers (width and height)
  • inner product loop order
  • using global versus group/local index space ID functions
  • optimized for general and simple matrix multiply
  • number of matrix multiplies packed as tiles into kernel

Some of the parameters above are endogenous to the expectation-maximization auto-tuning; the rest are treated as exogenous with respect to the statistical optimization. I tried making more parameters endogenous but found this invoked the “curse of dimensionality” and gave inconsistent results. For this problem, brute force search over the exogenous parameters, with stable auto-tuning optimization for the rest, is the more practical approach.
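
Here is a hypothetical sketch of that search structure. None of these names come from the framework, and the scoring function is a stand-in for generating, compiling, and timing a real kernel, but the shape is the point: brute force over the exogenous parameters, with a separate tuning pass over the endogenous blocking for each combination:

#include <cstdio>

struct Exogenous  { int vectorLength; bool useImages; };
struct Endogenous { int groupBlock; int innerBlock; };

// stand-in for "generate, compile and time one kernel variant";
// a real implementation would build the OpenCL kernel and measure GFLOPS
static double benchmarkKernel(const Exogenous& exo, const Endogenous& endo)
{
    return (double) (exo.vectorLength * endo.groupBlock * endo.innerBlock);
}

int main()
{
    const Exogenous exoSpace[] = { { 2, true }, { 2, false },
                                   { 4, true }, { 4, false } };
    double bestScore = -1.0;
    Exogenous bestExo = exoSpace[0];
    Endogenous bestEndo = { 0, 0 };

    for (int i = 0; i < 4; i++)                  // brute force over exogenous
    {
        // stand-in for the statistical auto-tuning step: here just an
        // exhaustive pass over a tiny endogenous space, keeping the fastest
        for (int gb = 8; gb <= 16; gb += 8)
        {
            for (int ib = 1; ib <= 4; ib *= 2)
            {
                const Endogenous endo = { gb, ib };
                const double score = benchmarkKernel(exoSpace[i], endo);
                if (score > bestScore)
                {
                    bestScore = score;
                    bestExo = exoSpace[i];
                    bestEndo = endo;
                }
            }
        }
    }
    printf("best: vector length %d, %s, groupBlock=%d, innerBlock=%d\n",
           bestExo.vectorLength, bestExo.useImages ? "images" : "buffers",
           bestEndo.groupBlock, bestEndo.innerBlock);
    return 0;
}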

The current design is what I always wanted but was scared to do. It handles all combinations, mixtures of precisions, memory types, vector lengths, etc. I haven’t seen this done anywhere else. Optimized math kernel designs generally constrain the input and output arrays to specific types and formats.

Implementing this also gave me a better sense for why the auto-tuning converges. Why should the benchmark surface be convex?

Here’s my intuition. The optimization is really optimizing register use. To some degree (perhaps due to quirks of the shader compiler), every parameter mentioned above affects the amount of private GPU memory used. This affects both GPU thread scheduling and arithmetic intensity. If a kernel uses too many registers, there isn’t enough memory left on the GPU for efficient scheduling. If a kernel uses too few registers, then each work-item is doing too little work and arithmetic intensity is low. Somewhere in the middle is the optimal number of registers.

This is why the expectation-maximization converges when auto-tuning.
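
A toy model of that intuition (entirely made-up numbers, just to show the shape of the benchmark surface): throughput is capped either by memory traffic, which improves as each work-item reuses more data held in registers, or by latency hiding, which degrades once register pressure limits how many work-items are resident. The minimum of the two caps peaks at a moderate register count:

#include <algorithm>
#include <cstdio>

int main()
{
    const double peak = 100.0;          // peak ALU throughput (arbitrary units)
    const double bwFactor = 10.0;       // memory-bound throughput per unit of register reuse (made up)
    const double registerFile = 256.0;  // registers available per lane (made up)
    const double neededResident = 16.0; // resident work-items needed to hide memory latency (made up)

    for (int regs = 2; regs <= 64; regs *= 2)
    {
        const double memLimit = bwFactor * regs;      // rises with data reuse
        const double resident = registerFile / regs;  // falls with register use
        const double hideFrac = std::min(1.0, resident / neededResident);
        const double aluLimit = peak * hideFrac;      // falls once too few work-items are resident
        const double throughput = std::min(memLimit, aluLimit);
        printf("regs=%3d  memLimit=%6.1f  aluLimit=%6.1f  throughput=%6.1f\n",
               regs, memLimit, aluLimit, throughput);
    }
    return 0;
}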

I will have to make some pretty charts and graphs, maybe an animation, illustrating this for the conference.

Here’s an example kernel the JIT back-end can generate. It’s something tricky that would be very difficult to do by hand. It is ten 400x400x400 matrix multiplies with mixed precision arrays of images and memory buffers. Instead of the usual float4 quad, it uses float2 (sometimes faster). This runs at about 300 GFLOPS.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void evergreenmatmul_4_8_4_2_0_0_10_0_400_400_400_1_0_10_10_4_4_23(
    __write_only image2d_t matrixC,
    __global const float2* matrixA,
    __read_only image2d_t matrixB,
    __local float2* localA)
{
    const sampler_t sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;
    __global const float2* ptrMatrixA;
    __local const float2* ptrLocalA;
    float4 accum00;
    float4 accum01;
    float4 accum02;
    float4 accum03;
    float2 valueA00;
    float2 valueA01;
    float2 valueA02;
    float2 valueA03;
    float2 valueA10;
    float2 valueA11;
    float2 valueA12;
    float2 valueA13;
    double2 valueB00;
    double2 valueB01;
    double2 valueB02;
    double2 valueB03;
    double2 valueB10;
    double2 valueB11;
    double2 valueB12;
    double2 valueB13;
    for (int packIdx = 0; packIdx < 10; packIdx++)
    {
        accum00 = (float4)(0);
        accum01 = (float4)(0);
        accum02 = (float4)(0);
        accum03 = (float4)(0);
        ptrMatrixA = (((matrixA + (packIdx * ((400 * 400) / 2))) + (get_global_id(1) * 2)) + (get_local_id(0) * (400 * 2)));
        for (int outerIdx = 0; outerIdx < (400 / 40); outerIdx++)
        {
            *(localA + (((get_local_id(1) * 44) + get_local_id(0)) * 2)) = *ptrMatrixA;
            *((localA + (((get_local_id(1) * 44) + get_local_id(0)) * 2)) + 1) = *(ptrMatrixA + 400);
            *(localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 11) * 2)) = *(ptrMatrixA + (400 / 2));
            *((localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 11) * 2)) + 1) = *((ptrMatrixA + (400 / 2)) + 400);
            *(localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 22) * 2)) = *(ptrMatrixA + 1);
            *((localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 22) * 2)) + 1) = *((ptrMatrixA + 1) + 400);
            *(localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 33) * 2)) = *((ptrMatrixA + 1) + (400 / 2));
            *((localA + ((((get_local_id(1) * 44) + get_local_id(0)) + 33) * 2)) + 1) = *(((ptrMatrixA + 1) + (400 / 2)) + 400);
            barrier(CLK_LOCAL_MEM_FENCE);
            ptrMatrixA += (400 * 20);
            ptrLocalA = (localA + (get_local_id(1) * 88));
            for (int innerIdx = 0; innerIdx < 10; innerIdx++)
            {
                valueA00 = *ptrLocalA;
                valueA10 = *(ptrLocalA + 1);
                valueA01 = *(ptrLocalA + 22);
                valueA11 = *((ptrLocalA + 22) + 1);
                valueA02 = *(ptrLocalA + 44);
                valueA12 = *((ptrLocalA + 44) + 1);
                valueA03 = *(ptrLocalA + 66);
                valueA13 = *((ptrLocalA + 66) + 1);
                ptrLocalA += 2;
                valueB00 = as_double2(read_imageui(matrixB, sampler, (int2)((get_global_id(0) * 2), ((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)))));
                valueB10 = as_double2(read_imageui(matrixB, sampler, (int2)(((get_global_id(0) * 2) + 1), ((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)))));
                valueB01 = as_double2(read_imageui(matrixB, sampler, (int2)((get_global_id(0) * 2), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 1))));
                valueB11 = as_double2(read_imageui(matrixB, sampler, (int2)(((get_global_id(0) * 2) + 1), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 1))));
                valueB02 = as_double2(read_imageui(matrixB, sampler, (int2)((get_global_id(0) * 2), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 2))));
                valueB12 = as_double2(read_imageui(matrixB, sampler, (int2)(((get_global_id(0) * 2) + 1), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 2))));
                valueB03 = as_double2(read_imageui(matrixB, sampler, (int2)((get_global_id(0) * 2), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 3))));
                valueB13 = as_double2(read_imageui(matrixB, sampler, (int2)(((get_global_id(0) * 2) + 1), (((packIdx * 400) + (((outerIdx * 10) + innerIdx) * 4)) + 3))));
                accum00.s0 = mad(valueA00.s0, valueB00.s0, accum00.s0);
                accum01.s0 = mad(valueA00.s1, valueB00.s0, accum01.s0);
                accum02.s0 = mad(valueA02.s0, valueB00.s0, accum02.s0);
                accum03.s0 = mad(valueA02.s1, valueB00.s0, accum03.s0);
                accum00.s1 = mad(valueA00.s0, valueB00.s1, accum00.s1);
                accum01.s1 = mad(valueA00.s1, valueB00.s1, accum01.s1);
                accum02.s1 = mad(valueA02.s0, valueB00.s1, accum02.s1);
                accum03.s1 = mad(valueA02.s1, valueB00.s1, accum03.s1);
                accum00.s2 = mad(valueA00.s0, valueB10.s0, accum00.s2);
                accum01.s2 = mad(valueA00.s1, valueB10.s0, accum01.s2);
                accum02.s2 = mad(valueA02.s0, valueB10.s0, accum02.s2);
                accum03.s2 = mad(valueA02.s1, valueB10.s0, accum03.s2);
                accum00.s3 = mad(valueA00.s0, valueB10.s1, accum00.s3);
                accum01.s3 = mad(valueA00.s1, valueB10.s1, accum01.s3);
                accum02.s3 = mad(valueA02.s0, valueB10.s1, accum02.s3);
                accum03.s3 = mad(valueA02.s1, valueB10.s1, accum03.s3);
                accum00.s0 = mad(valueA01.s0, valueB01.s0, accum00.s0);
                accum01.s0 = mad(valueA01.s1, valueB01.s0, accum01.s0);
                accum02.s0 = mad(valueA03.s0, valueB01.s0, accum02.s0);
                accum03.s0 = mad(valueA03.s1, valueB01.s0, accum03.s0);
                accum00.s1 = mad(valueA01.s0, valueB01.s1, accum00.s1);
                accum01.s1 = mad(valueA01.s1, valueB01.s1, accum01.s1);
                accum02.s1 = mad(valueA03.s0, valueB01.s1, accum02.s1);
                accum03.s1 = mad(valueA03.s1, valueB01.s1, accum03.s1);
                accum00.s2 = mad(valueA01.s0, valueB11.s0, accum00.s2);
                accum01.s2 = mad(valueA01.s1, valueB11.s0, accum01.s2);
                accum02.s2 = mad(valueA03.s0, valueB11.s0, accum02.s2);
                accum03.s2 = mad(valueA03.s1, valueB11.s0, accum03.s2);
                accum00.s3 = mad(valueA01.s0, valueB11.s1, accum00.s3);
                accum01.s3 = mad(valueA01.s1, valueB11.s1, accum01.s3);
                accum02.s3 = mad(valueA03.s0, valueB11.s1, accum02.s3);
                accum03.s3 = mad(valueA03.s1, valueB11.s1, accum03.s3);
                accum00.s0 = mad(valueA10.s0, valueB02.s0, accum00.s0);
                accum01.s0 = mad(valueA10.s1, valueB02.s0, accum01.s0);
                accum02.s0 = mad(valueA12.s0, valueB02.s0, accum02.s0);
                accum03.s0 = mad(valueA12.s1, valueB02.s0, accum03.s0);
                accum00.s1 = mad(valueA10.s0, valueB02.s1, accum00.s1);
                accum01.s1 = mad(valueA10.s1, valueB02.s1, accum01.s1);
                accum02.s1 = mad(valueA12.s0, valueB02.s1, accum02.s1);
                accum03.s1 = mad(valueA12.s1, valueB02.s1, accum03.s1);
                accum00.s2 = mad(valueA10.s0, valueB12.s0, accum00.s2);
                accum01.s2 = mad(valueA10.s1, valueB12.s0, accum01.s2);
                accum02.s2 = mad(valueA12.s0, valueB12.s0, accum02.s2);
                accum03.s2 = mad(valueA12.s1, valueB12.s0, accum03.s2);
                accum00.s3 = mad(valueA10.s0, valueB12.s1, accum00.s3);
                accum01.s3 = mad(valueA10.s1, valueB12.s1, accum01.s3);
                accum02.s3 = mad(valueA12.s0, valueB12.s1, accum02.s3);
                accum03.s3 = mad(valueA12.s1, valueB12.s1, accum03.s3);
                accum00.s0 = mad(valueA11.s0, valueB03.s0, accum00.s0);
                accum01.s0 = mad(valueA11.s1, valueB03.s0, accum01.s0);
                accum02.s0 = mad(valueA13.s0, valueB03.s0, accum02.s0);
                accum03.s0 = mad(valueA13.s1, valueB03.s0, accum03.s0);
                accum00.s1 = mad(valueA11.s0, valueB03.s1, accum00.s1);
                accum01.s1 = mad(valueA11.s1, valueB03.s1, accum01.s1);
                accum02.s1 = mad(valueA13.s0, valueB03.s1, accum02.s1);
                accum03.s1 = mad(valueA13.s1, valueB03.s1, accum03.s1);
                accum00.s2 = mad(valueA11.s0, valueB13.s0, accum00.s2);
                accum01.s2 = mad(valueA11.s1, valueB13.s0, accum01.s2);
                accum02.s2 = mad(valueA13.s0, valueB13.s0, accum02.s2);
                accum03.s2 = mad(valueA13.s1, valueB13.s0, accum03.s2);
                accum00.s3 = mad(valueA11.s0, valueB13.s1, accum00.s3);
                accum01.s3 = mad(valueA11.s1, valueB13.s1, accum01.s3);
                accum02.s3 = mad(valueA13.s0, valueB13.s1, accum02.s3);
                accum03.s3 = mad(valueA13.s1, valueB13.s1, accum03.s3);
            }
        }
        write_imagef(matrixC, (int2)(get_global_id(0), ((packIdx * 400) + (get_global_id(1) * 4))), accum00);
        write_imagef(matrixC, (int2)(get_global_id(0), (((packIdx * 400) + (get_global_id(1) * 4)) + 1)), accum01);
        write_imagef(matrixC, (int2)(get_global_id(0), (((packIdx * 400) + (get_global_id(1) * 4)) + 2)), accum02);
        write_imagef(matrixC, (int2)(get_global_id(0), (((packIdx * 400) + (get_global_id(1) * 4)) + 3)), accum03);
    }
}

Rewrote auto-tuning

The auto-tuning code is better, faster, and stronger. The C++ template bloat is removed. Code generation is now fully dynamic and not constrained by specialization of templates at compile time. The convenient string hacks in kernel metaprogramming have been removed – the left hand side of a statement is always a proper variable, never a value or a string.
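
To illustrate what I mean by a proper variable on the left hand side, here is a hypothetical sketch (not the framework’s actual classes) of statements built around declared variable objects instead of pasted strings:

#include <iostream>
#include <sstream>
#include <string>

struct Variable
{
    std::string type;   // e.g. "double2"
    std::string name;   // e.g. "accum00"
    std::string declare() const { return type + " " + name + ";"; }
};

struct Assign
{
    const Variable& lhs;          // always a declared variable, never raw text
    std::string rhsExpression;    // right hand side is still free-form text
    std::string emit() const { return lhs.name + " = " + rhsExpression + ";"; }
};

int main()
{
    Variable accum = { "double2", "accum00" };
    Assign init = { accum, "(double2)(0)" };
    Assign update = { accum, "mad(valueA00, valueB00, " + accum.name + ")" };

    std::ostringstream kernel;
    kernel << accum.declare() << "\n"
           << init.emit() << "\n"
           << update.emit() << "\n";
    std::cout << kernel.str();
    return 0;
}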

The major functional change is: A matrix multiply kernel may use images and memory buffers of different precisions and vector lengths. There is no hard distinction between image (texture sampling) and memory buffer kernels. Code generation handles all combinations.

After this rewrite, I see that the PeakStream domain specific language should be extended. PeakStream was an HPC managed platform designed for GPU technology as it was five years ago.

Since then, GPUs have become stronger, increasing the imbalance between cores/ALU throughput and bus/memory bandwidth. It’s the supercomputing problem all over again. Data movement becomes the dominating cost.

The PeakStream language encapsulates compute devices and hides kernels inside the compute graph. This is good for ease of use, but it puts a heavy burden on the intelligence of the JIT. It’s unrealistic to expect the JIT to be that smart.

High arithmetic-intensity kernels should be exposed for performance reasons. It’s difficult for a JIT to determine the invariants that allow optimizations like using write-only images versus memory buffers. It’s even harder for a JIT to explore algorithm transformations. For instance, transposing a matrix (changing from row-major to column-major data layout) can have dramatic effects on performance.
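
As a small illustration of why the layout matters (a sketch, not framework code): walking a row is unit-stride in one layout and strided by the matrix height in the other, which is exactly the kind of difference that decides whether GPU memory accesses coalesce:

#include <cstdio>

// linear index of A(row, col) for an M x N matrix
static int rowMajor(int row, int col, int /*M*/, int N) { return row * N + col; }
static int colMajor(int row, int col, int M, int /*N*/) { return col * M + row; }

int main()
{
    const int M = 4, N = 4, row = 1;
    printf("walking row %d of a %dx%d matrix:\n", row, M, N);
    for (int col = 0; col < N; col++)
        printf("  col %d  row-major index %2d   col-major index %2d\n",
               col, rowMajor(row, col, M, N), colMajor(row, col, M, N));
    return 0;
}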