there are so many cores

Monthly Archives: April 2012

AMD supports images on CPU devices

The AMD OpenCL SDK supports images on CPU devices. This is a surprise. Two years ago, images only worked on GPU devices. I never bothered to retest. I only noticed by accident when testing gather operations with images.

Here is some sample code:

// N = 12 for the generated kernels shown below
void filter(float* cpuA, float* cpuB, const size_t N)
{
    Arrayf32 B;
    {
        Arrayf32 X = index_f32(0, N, N);
        Arrayf32 Y = index_f32(1, N, N);

        Arrayf32 A = make2(N, N, cpuA);

        B = zeros_f32(N, N);
        B += gather2_floor(A, X, Y);
        B += gather2_floor(A, X - 1, Y);
        B += gather2_floor(A, X + 1, Y);
        B += gather2_floor(A, X, Y - 1);
        B += gather2_floor(A, X, Y + 1);
    }
    B.read1(cpuB, N * N * sizeof(float));
}

When images are configured as a device capability, the JIT generates:

__kernel void f_9583626628478369588_1(
  __read_only image2d_t a1,
  __global float4 * a2)
{
  const sampler_t sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;
  const float4 a10 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a11 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 2) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a12 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 1) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a13 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 11) % 12)));
  const float4 a14 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 1) % 12)));
  float4 r1;
  r1= 0 ;
  r1= (r1 + (float4)(a10.s0, a10.s1, a10.s2, a10.s3)) ;
  r1= (r1 + (float4)(a11.s3, a10.s0, a10.s1, a10.s2)) ;
  r1= (r1 + (float4)(a10.s1, a10.s2, a10.s3, a12.s0)) ;
  r1= (r1 + (float4)(a13.s0, a13.s1, a13.s2, a13.s3)) ;
  r1= (r1 + (float4)(a14.s0, a14.s1, a14.s2, a14.s3)) ;
  a2[(get_global_id(0)) + get_global_id(1) * 3] = r1;
}

When the device is configured for memory buffers only, the JIT generates:

__kernel void f_9583626628478369588_1(
  __global const float4 * a1,
  __global float4 * a2)
{
  const float4 a10 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a11 = a1[((get_global_id(0) + 2) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a12 = a1[((get_global_id(0) + 1) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a13 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 11) % 12) * 3];
  const float4 a14 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 1) % 12) * 3];
  float4 r1;
  r1= 0 ;
  r1= (r1 + (float4)(a10.s0, a10.s1, a10.s2, a10.s3)) ;
  r1= (r1 + (float4)(a11.s3, a10.s0, a10.s1, a10.s2)) ;
  r1= (r1 + (float4)(a10.s1, a10.s2, a10.s3, a12.s0)) ;
  r1= (r1 + (float4)(a13.s0, a13.s1, a13.s2, a13.s3)) ;
  r1= (r1 + (float4)(a14.s0, a14.s1, a14.s2, a14.s3)) ;
  a2[(get_global_id(0)) + get_global_id(1) * 3] = r1;
}
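For reference, here is a plain scalar version of what both generated kernels compute. The `%` addressing above shows that `gather2_floor` wraps out-of-range stencil offsets modulo the array dimensions. (This helper is my own illustration, not part of the Chai API.)

```cpp
#include <cstddef>

// Scalar reference for the 5-point stencil the JIT vectorizes above.
// Out-of-range stencil offsets wrap modulo N, matching the % addressing
// in the generated kernels.
void filter_reference(const float* A, float* B, size_t N)
{
    for (size_t y = 0; y < N; ++y)
    for (size_t x = 0; x < N; ++x)
    {
        const size_t xm = (x + N - 1) % N, xp = (x + 1) % N;
        const size_t ym = (y + N - 1) % N, yp = (y + 1) % N;
        B[x + y * N] = A[x  + y  * N]
                     + A[xm + y  * N]
                     + A[xp + y  * N]
                     + A[x  + ym * N]
                     + A[x  + yp * N];
    }
}
```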

Right now, image support is configured per device in a file read by the virtual machine on startup. For gather operations, the JIT uses images or memory buffers based on the configuration file entry for the compute device.

GEMM and GEMV are auto-tuned: the JIT picks between images and memory buffers based on measured kernel execution time. This is not quite right either. There is a tradeoff between image creation overhead, the speed of texture sampling, and the length and depth of the calculation graph. The optimal choice of images or memory buffers depends on more than the speed of one kernel. It depends on the total performance across a sequence of kernels and memory transfers.

All of this is less than ideal. At this point, I will leave it alone as I intend to rewrite much of the JIT anyway in a few months. For now, I’m just adding features. Next up is random number generation.

Chickens and eggs

Today is the first day I’ve been able to really do any work in two weeks. I’m finally adding image support for the gather operation. I was able to take a walk in the sun instead of hanging around in the hospital with a laptop and books (mostly sitting on a folding chair next to a red biohazard box and anti-bacterial hand sanitizer).

This should be obvious: new technology often starts out in a nascent, transitional form. Before iTunes, there was Napster. The real innovation was a new commercial market for digital media.

Nearly everything in GPGPU involves machine learning in some form. That includes SETI@home and Folding@home. There’s a big combinatorial space to search for good solutions. That’s what data mining is about: take raw data and refine value out of it as some processed or finished product or service.

These problems are not what we learn to do in university as part of the usual CS curriculum. All of our technology and applications are designed around the von Neumann machine.

I think that applications and platforms are a chicken and egg problem. If you build a platform that does new things before there are applications, it fails because no one needs it. If you have applications, then you build platforms for these old problems. Nothing new happens.

There must be bridging technology. That’s true even for something like Java. It started out as an embedded platform (Oak). Then it became thin clients (sandboxed applet), moved to the server back-end (enterprise IT), and now it is back on clients (Android). It was flexible enough to have utility for different markets.

Something like this must happen if GPGPU is to become more significant.

I gave a ten-minute book review at a meetup. It was hard to concentrate in the hospital, so I mostly read Rob Farber’s book a few times. Then I wrote up a slide deck and practiced a few times before the meetup. It went over quite well, which surprised me.

Machine learning is a pervasive theme in Rob Farber’s book. It seems Rob Farber also perceives statistical learning theory as the egg that comes before the chickens.

I don’t have a PhD in statistics. I don’t even have a PhD. I dropped out. Twice. And I’ve only had one job with machine learning as a support quant. So I have to rely on intuition more than others.

I wonder if we haven’t seen the good GPGPU ML applications yet. My (very limited) real-world experience is that ML and GPGPU meet just about when diminishing returns have set in. Sensible use of statistical methods, heuristic scoring functions, and map-reduce techniques is sufficient to pick the low-hanging fruit. This could mean two things.

  1. Machine learning applications do not need GPGPU.
  2. GPGPU will allow new machine learning applications.

So if we try to use GPGPU for current applications, we do not realize much gain. The space of applications is what we conceive as tractable and practical for available technology. When new technology is invented, it may not yield much advantage for old problems.

Vectorized gather support added

Gather operations now support vectorized memory access. This only works for memory access patterns that are simple stencil translations: the offsets from the current item ID must be literal constants. For example:

Arrayf32 X = index_f32(0, N, N);
Arrayf32 Y = index_f32(1, N, N);

B += gather2_floor(A, X, Y);
B += gather2_floor(A, X - 1, Y);
B += gather2_floor(A, X + 1, Y);
B += gather2_floor(A, X, Y - 1);
B += gather2_floor(A, X, Y + 1);

The stencils can be arbitrarily large, though.
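Concretely, only gathers whose offsets are literal constants vectorize. A sketch using the same API as above (`D`, a hypothetical array of data-dependent offsets, is my own illustration):

```
Arrayf32 X = index_f32(0, N, N);
Arrayf32 Y = index_f32(1, N, N);

B += gather2_floor(A, X + 2, Y);  // literal stencil offset: can vectorize
B += gather2_floor(A, X + D, Y);  // data-dependent offset: not a simple
                                  // stencil translation, so no vectorization
```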

I haven’t tested this with images and texture sampling yet. I’m currently working out of a hospital on a laptop with wireless. That’s also why I haven’t updated this blog for almost two weeks. Real life stuff happens sometimes. Interesting trivia: I used to think of “code” as source code. For a doctor, “full code” and “no code” have very different meanings.

After I get this working with images, the next thing to add is random number generation on the GPU. I’ve realized the JIT really needs to be redesigned. It’s too inflexible to support the translations and optimizations I would like.

So my priority is to be feature complete for the beta in a few months from a language viewpoint. The production release will include the JIT redesign and should generate higher quality GPGPU code.

Programmers are lazy, and this may be right

Stencil and filter kernels should rely on the texture cache. Seems obvious? It’s very natural for images as they are built on top of texture sampling for graphics. I don’t know why I didn’t see this earlier. I kept thinking about the much harder problem of prefetching into local memory (more about this later).

This will mean more big changes to the JIT. Right now, it only autotunes GEMM and GEMV. Everything else is just generated in one pass.

I also made a very simple change last night so work group dimensions may be influenced by the configuration file to better fit natural warp and wavefront sizes. This is not quite right either.
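The change amounts to rounding, something like this sketch (the helper name is my own; AMD wavefronts are 64 work-items and NVIDIA warps are 32):

```cpp
#include <cstddef>

// Round the global work size up to a multiple of the device's natural
// SIMD width (warp or wavefront), as hinted by the configuration file.
// Illustrative helper, not the actual Chai code.
size_t round_up_global(size_t globalSize, size_t simdWidth)
{
    return ((globalSize + simdWidth - 1) / simdWidth) * simdWidth;
}
```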

The JIT should autotune (some) generated kernels (not only the GEMM and GEMV templates) at both the TLP and ILP levels.

  • TLP – thread level parallelism: work group dimensions
  • ILP – instruction level parallelism: reordering statements
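At the TLP level, autotuning could look something like this sketch: time each candidate work-group shape and keep the fastest. (The candidate list and timing callback are assumptions for illustration, not the actual Chai tuner.)

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Pick the work-group shape with the lowest measured kernel time.
// timeKernel would enqueue the kernel with the given shape and return
// elapsed seconds; here it is an injected callback for illustration.
std::pair<size_t, size_t> tuneWorkGroup(
    const std::vector<std::pair<size_t, size_t>>& candidates,
    const std::function<double(size_t, size_t)>& timeKernel)
{
    std::pair<size_t, size_t> best(0, 0);
    double bestTime = std::numeric_limits<double>::max();
    for (const auto& wg : candidates)
    {
        const double t = timeKernel(wg.first, wg.second);
        if (t < bestTime) { bestTime = t; best = wg; }
    }
    return best;
}
```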

This is great. I’ll be really busy for the next few weeks. Development seems to go like this, in evolutionary spurts.

I’m (re-)reading Rob Farber’s book CUDA Application Design and Development. He briefly cites an observation of Vasily Volkov that is very true in my experience. “Volkov notes that the trend in parallel architecture design is towards an inverse memory hierarchy where the number of registers is increasing compared to cache and shared memory.”

What are the reasons for this trend?

One reason is the natural imbalance between processor and memory. That has always been an issue: memory bandwidth tends to lag behind processor throughput, so at some point memory becomes the bottleneck.

The other reason is that effective programmatic use of a memory hierarchy is difficult. In practice, it’s avoided due to high software development costs. Instead, users rely on the GPU’s automatic mechanisms (e.g. the L1/L2 caches and register spilling to local memory on NVIDIA) and use more private registers.

This leads back to the JIT in Chai.

The autotuned GEMM and GEMV kernel templates support prefetching into local memory. This was really difficult and caused me no end of trouble; getting it right was very tricky.
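Stripped of the OpenCL details, the idea behind that prefetching is tiling: stage a block of each input into fast scratch memory, then compute from the staged copies. A scalar C++ sketch (the tile size and names are illustrative, not the actual kernel template):

```cpp
#include <cstddef>
#include <vector>

// Blocked matrix multiply: each tile of A and B is staged into small
// scratch buffers (standing in for __local memory) before use.
// tile must divide n and be at most 16.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, size_t n, size_t tile)
{
    for (size_t i0 = 0; i0 < n; i0 += tile)
    for (size_t j0 = 0; j0 < n; j0 += tile)
    for (size_t k0 = 0; k0 < n; k0 += tile)
    {
        // "Prefetch" the tiles (a work group would do this cooperatively).
        float tA[16][16], tB[16][16];
        for (size_t i = 0; i < tile; ++i)
        for (size_t k = 0; k < tile; ++k)
        {
            tA[i][k] = A[(i0 + i) * n + (k0 + k)];
            tB[i][k] = B[(k0 + i) * n + (j0 + k)];
        }
        // Compute from the staged copies only.
        for (size_t i = 0; i < tile; ++i)
        for (size_t j = 0; j < tile; ++j)
        {
            float acc = 0;
            for (size_t k = 0; k < tile; ++k)
                acc += tA[i][k] * tB[k][j];
            C[(i0 + i) * n + (j0 + j)] += acc;
        }
    }
}
```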

I’ve been putting off dealing with this issue of more general JIT local memory prefetching because it scares me. For the wrong reason, I may have done the right thing. Local memory prefetching may be less important. Technology is evolving around it.