there are so many cores

Just another site

AMD supports images on CPU devices

The AMD OpenCL SDK supports images on CPU devices. This is a surprise. Two years ago, images only worked on GPU devices. I never bothered to retest. I only noticed by accident when testing gather operations with images.

Here is some sample code:

const size_t N = 12;

void filter(float* cpuA, float* cpuB, const size_t N)
    Arrayf32 B;
        Arrayf32 X = index_f32(0, N, N);
        Arrayf32 Y = index_f32(1, N, N);

        Arrayf32 A = make2(N, N, cpuA);

        B = zeros_f32(N, N);
        B += gather2_floor(A, X, Y);
        B += gather2_floor(A, X - 1, Y);
        B += gather2_floor(A, X + 1, Y);
        B += gather2_floor(A, X, Y - 1);
        B += gather2_floor(A, X, Y + 1);
    B.read1(cpuB, N * N * sizeof(float));

When images are configured as a device capability, the JIT generates:

__kernel void f_9583626628478369588_1(
  __read_only image2d_t a1,
  __global float4 * a2)
  const float4 a10 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a11 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 2) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a12 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 1) % 3), ((get_global_id(1) + 0) % 12)));
  const float4 a13 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 11) % 12)));
  const float4 a14 = read_imagef(a1, sampler, (int2)(((get_global_id(0) + 0) % 3), ((get_global_id(1) + 1) % 12)));
  float4 r1;
  r1= 0 ;
  r1= (r1 + (float4)(a10.s0, a10.s1, a10.s2, a10.s3)) ;
  r1= (r1 + (float4)(a11.s3, a10.s0, a10.s1, a10.s2)) ;
  r1= (r1 + (float4)(a10.s1, a10.s2, a10.s3, a12.s0)) ;
  r1= (r1 + (float4)(a13.s0, a13.s1, a13.s2, a13.s3)) ;
  r1= (r1 + (float4)(a14.s0, a14.s1, a14.s2, a14.s3)) ;
  a2[(get_global_id(0)) + get_global_id(1) * 3] = r1;

When the device is configured for memory buffers only, the JIT generates:

__kernel void f_9583626628478369588_1(
  __global const float4 * a1,
  __global float4 * a2)
  const float4 a10 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a11 = a1[((get_global_id(0) + 2) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a12 = a1[((get_global_id(0) + 1) % 3) + ((get_global_id(1) + 0) % 12) * 3];
  const float4 a13 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 11) % 12) * 3];
  const float4 a14 = a1[((get_global_id(0) + 0) % 3) + ((get_global_id(1) + 1) % 12) * 3];
  float4 r1;
  r1= 0 ;
  r1= (r1 + (float4)(a10.s0, a10.s1, a10.s2, a10.s3)) ;
  r1= (r1 + (float4)(a11.s3, a10.s0, a10.s1, a10.s2)) ;
  r1= (r1 + (float4)(a10.s1, a10.s2, a10.s3, a12.s0)) ;
  r1= (r1 + (float4)(a13.s0, a13.s1, a13.s2, a13.s3)) ;
  r1= (r1 + (float4)(a14.s0, a14.s1, a14.s2, a14.s3)) ;
  a2[(get_global_id(0)) + get_global_id(1) * 3] = r1;

Right now, image support is configured per device in a file read by the virtual machine on startup. For gather operations, the JIT uses images or memory buffers based on the configuration file entry for the compute device.

GEMM and GEMV are auto-tuned. The JIT will pick between images and memory buffers based on kernel execution time. Note this is also not quite right. There is a tradeoff between: image creation overhead; the speed of texture sampling; the length/depth of the calculation graph. The optimal choice of images or memory buffers depends on more than the speed of one kernel. It depends on the total performance across a sequence of kernels and memory transfers.

All of this is less than ideal. At this point, I will leave it alone as I intend to rewrite much of the JIT anyway in a few months. For now, I’m just adding features. Next up is random number generation.


Chickens and eggs

Today is the first day I’ve been able to really do any work in two weeks. I’m finally adding image support for the gather operation. I was able to take a walk in the sun instead of hanging around in the hospital with a laptop and books (mostly sitting on a folding chair next to a red biohazard box and anti-bacterial hand sanitizer).

This should be obvious. New technology often has a nascent presence. For example: before iTunes, there was Napster – the innovation was a new commercial market for digital media.

Everything with GPGPU involves machine learning. That includes SETI/Folding@home. There’s a big combinatorial space to search for good stuff. That’s what data mining is about: take this raw data and refine value out of it as some processed or finished product or service.

These problems are not what we learn to do in university as part of the usual CS curriculum. All of our technology and applications are designed around the von Neumann machine.

I think that applications and platforms are a chicken and egg problem. If you build a platform that does new things before there are applications, it fails because no one needs it. If you have applications, then you build platforms for these old problems. Nothing new happens.

There must be bridging technology. That’s true even for something like Java. It started out as an embedded platform (Oak). Then it became thin clients (sandboxed applet), moved to the server back-end (enterprise IT), and now it is back on clients (Android). It was flexible enough to have utility for different markets.

Something like this must happen if GPGPU is to become more significant.

I gave a ten minute book review at a meetup. It was hard to concentrate in the hospital so I mostly read Rob Farber’s book a few times. Then I wrote up a slide deck and practiced a few times before the meetup. This went over quite well which surprised me.

Machine learning is a pervasive theme in Rob Farber’s book. It seems Rob Farber also perceives statistical learning theory as the egg that comes before the chickens.

I don’t have a PhD in statistics. I don’t even have a PhD. I dropped out. Twice. And I’ve only had one job with machine learning as a support quant. So I have to rely on intuition more than others.

I wonder if we haven’t seen the good GPGPU ML applications yet. My (very limited) real world experience is that ML and GPGPU meet just about when diminishing returns have set in. Sensible use of statistical methods, heuristic scoring functions, and map-reduce techniques are sufficient to pick the low hanging fruit. This could mean two things.

  1. Machine learning applications do not need GPGPU.
  2. GPGPU will allow new machine learning applications.

So if we try to use GPGPU for current applications, we do not realize much gain. The space of applications is what we conceive as tractable and practical for available technology. When new technology is invented, it may not yield much advantage for old problems.

Vectorized gather support added

Gathering supports vectorized memory access now. This only works for memory access patterns with simple stencil translation. The offsets to the current item ID must be literal constants. For example:

Arrayf32 X = index_f32(0, N, N);
Arrayf32 Y = index_f32(1, N, N);

B += gather2_floor(A, X, Y);
B += gather2_floor(A, X - 1, Y);
B += gather2_floor(A, X + 1, Y);
B += gather2_floor(A, X, Y - 1);
B += gather2_floor(A, X, Y + 1);

The stencils can be arbitrarily large, though.

I haven’t tested this with images and texture sampling yet. I’m currently working out of a hospital on a laptop with wireless. That’s also why I haven’t updated this blog for almost two weeks. Real life stuff happens sometimes. Interesting trivia: I used to think of “code” as source code. For a doctor, “full code” and “no code” have very different meanings.

After I get this working with images, the next thing to add is random number generation on the GPU. I’ve realized the JIT really needs to be redesigned. It’s too inflexible to support the translations and optimizations I would like.

So my priority is to be feature complete for the beta in a few months from a language viewpoint. The production release will include the JIT redesign and should generate higher quality GPGPU code.

Programmers are lazy, and this may be right

Stencil and filter kernels should rely on the texture cache. Seems obvious? It’s very natural for images as they are built on top of texture sampling for graphics. I don’t know why I didn’t see this earlier. I kept thinking about the much harder problem of prefetching into local memory (more about this later).

This will mean more big changes to the JIT. Right now, it only autotunes GEMM and GEMV. Everything else is just generated in one pass.

I also made a very simple change last night so work group dimensions may be influenced by the configuration file to better fit natural warp and wavefront sizes. This is not quite right either.

The JIT should autotune (some) generated kernels (not only the GEMM and GEMV templates) at both the TLP and ILP levels.

  • TLP – thread level parallelism: work group dimensions
  • ILP – instruction level parallelism: reordering statements

This is great. I’ll be really busy for the next few weeks. Development seems to go like this, in evolutionary spurts.

I’m (re-)reading Rob Farber’s book CUDA Application Design and Development. He briefly cites an observation of Vasily Volkov that is very true in my experience. “Volkov notes that the trend in parallel architecture design is towards an inverse memory hierarchy where the number of registers is increasing compared to cache and shared memory.”

What are the reasons for this trend?

One reason is the natural imbalance between processor and memory. That’s always been an issue. Memory bandwidth tends to lag behind processor throughput. At some point, memory falls behind.

The other reason is effective programmatic use of a memory hierarchy is difficult. In practice, it’s avoided due to high software development costs. Instead, users rely on automatic mechanisms in the GPU (e.g. L1/2 cache, register spillage into shared memory for NVIDIA) and use more private registers.

This leads back to the JIT in Chai.

The autotuned GEMM and GEMV kernel templates support prefetching into local memory. This was really difficult and caused me no end of troubles. It was very tricky.

I’ve been putting off dealing with this issue of more general JIT local memory prefetching because it scares me. For the wrong reason, I may have done the right thing. Local memory prefetching may be less important. Technology is evolving around it.

Why integers and floating point are necessary

Integer type support is needed for predictive analytics. This is far bigger than stencil kernels and filters.

Today, most unstructured data that is interesting (i.e. computationally tractable) is text. This is really integers. However, classification and regression of this integer data requires floating point for the scoring functions. What’s really going on is the integral text data is mapped into a vector space over the real numbers using a kernel trick. Then statistical learning happens there using the only complete theory we have: linear algebra.

Machine learning technology as culture:

  1. neural networks – before computers were connected in a global network called the Internet
  2. statistical learning theory – in the age of MapReduce
  3. predictive analytics – in the age of apps

The Wikipedia says, “Technology can be viewed as an activity that forms or changes culture.” This is the right way to view machine learning. It is defined by the relationship with society.

We are structuring our societies around markets for labor. When I first heard of Amazon Mechanical Turk, I thought it was twisted genius. “Artificial artificial intelligence,” said Jeff Bezos. Human intelligence is packaged and given value through a market.

I believe this cultural view of intelligence is not limited to people. It extends to machines too. If we restructure society so human activity revolves around machines, then machine activities revolve around people.

Second alpha release uploaded

The second alpha is committed to GitHub. There’s also a gzipped tarball without the git metadata. Sorry this took so long.

The most interesting new feature is the integer array support. Mixed integer and floating point calculation is supported. Also, the JIT code is completely reorganized by stages: AST; bytecode; statements; kernel related processing.

As promised, there is a vectorized MD5 sample application which runs on the GPU. I cheated with the array subscripting. Actually, my first implementation did overload operator[], extend the bytecode, interpreter, and JIT. It didn’t feel right at this point so I backed all that out. I went with a simpler and potentially more flexible approach by using STL and making the JIT a bit smarter about private registers.

I’ve been distracted lately. That’s another reason things have taken longer.

Slide from BASE meetup

Speakers: Owen Martin, Kontagent; Greg Shirakyan, Microsoft Robotics; Peter Kassakian, Quantcast

Last week, I went to the BASE meetup about ML/AI/Robotics. I caught a bug along the way and lost about two days to sickness afterwards.

This June, I will be speaking at the AMD Fusion Developer Summit. It’s actually quite a bit of work to prepare conference presentations. I feel obligated to do the best job I can to entertain, inform, and educate an audience. It’s a lot like teaching a class in university. The teacher is doing something wrong if the class is bored or lost. It’s not easy.

Array subscripts delay the second alpha

I need to extend the language to better support gather operations. This will delay the alpha 2 code drop for about a week.

Crypto is all about shuffling data. Each round requires arbitrary element-wise gathering. It’s really obvious once you start implementing these algorithms, even a simple one like MD5, on the GPU.

The way I want to do this is with an overloaded subscript operator for the Array(u32|i32|f32|f64) objects. If the Array is 1D, then the subscript indexes elements. If the Array is 2D, then the subscript indexes by rows (as memory layout is naturally row-major). All other complexities should be handled by the JIT internally. This is the most elegant and intuitively natural way. It’s how a programmer would want it to work.

This approach supports algorithms that do a lot of gathering such as cryptographic hash functions while allowing streaming at the same time. In the case of MD5, the input cleartext is then a 2D array. The rows correspond to character positions in the cleartext. Within each row, the columns correspond to different texts to be hashed. So the MD5 algorithm is vectorized.

I realize this may not be especially useful as written. This is just a demo application to show off the platform capabilities. (It’s also very necessary for me to develop the language and JIT in the first place. Straightforward examples are very helpful to get the code transformations working correctly.)

It is funny to me that I was at first dismissive of element-wise subscripted array support. Soon thereafter, I encountered several people, none of whom have any relationship, live in different places, and are working on various problems, either give presentations involving gathering or mentioning it as important. So I recognized it was very important. But I still wanted to give it second-class support.

With crypto algorithms, it’s impossible to hide from this. First-class gathering support is needed. That’s the funny part. I think I was resistant to “not invented here.” That syndrome is very dangerous. It was a combination of my limited vision and reluctance to tackle yet another problem. Now I have come around and realize this must go all the way.

MD5 on the GPU

I managed to get MD5 working with the new integer support. That means PeakStream style code (extended to support unsigned integers as well as floating point types) that runs on the GPU and calculates the MD5 hash function. This is interesting. It leads in a completely different direction than traditional science/engineering/quantitative applications.

What led to this is the need to support stencil kernels using gather operations. If the JIT knows when gather subscripts are literal constants, then it can implement more optimizations. It can prefetch into local memory and autotune. So I decided to extend the Chai DSL with first-class integer type support.

What I didn’t realize is how this opens up entirely new classes of applications (e.g. crypto stuff).

I’ll check in the second alpha in a few days and add another GitHub download. Unfortunately, this doesn’t fix anything that was broken in the first alpha release. However, to my knowledge, I haven’t introduced any regressions. If it worked before, it should still work.

The two new things in the second alpha are: reorganized JIT; first-class integer type support. The MD5 sample code will be in this release. Note: the output agrees with md5sum and the original RSA reference implementation.

Several people have asked what I am doing. My honest answer is, “I have no idea what I am doing.”

Next, they ask what I am building. The simplest response I have is, “Perl for GPGPU. It’s something that you may not like but must use anyway as it is too useful.”

What else can I say?

Also, I’m finally reading about CUDA. At the last meetup, I volunteered to give a book report on CUDA Application Design and Development by Rob Farber. So far, it reads like a good PG-13 movie that really wants to be R rated and was originally a trilogy long believed impossible to translate into film. There’s a lot of stuff in the book and even more between the lines. I promised a guy from NVIDIA that I would try out the code and play with it. So now I kind of feel on the hook. But that’s good. I’m impressed with what NVIDIA has done. I should play with their stuff more.

Added first class integer type support

There are now four basic types in the following promotion hierarchy (operations on mixed types are promoted to the highest one).

  1. double
  2. float
  3. int32_t
  4. uint32_t

This is definitely moving beyond PeakStream which supported single and double precision floating point only. The Chai API now has “u32” and “i32” functions along with the PeakStream legacy “f32” and “f64” calls. The OpenCL built-in integer functions are also added to the API. My intention is first-class support for integer, floating point, and mixed precision and type calculation. So I am trying to do everything.

Implementing this turned out to be a miniature nightmare. There are exactly 7900 more lines of code since the alpha release three weeks ago. That’s much more than anticipated.

I still plan on a monthly release cycle – the alpha2 code should be uploaded in about a week. Given the amount of change, the next week will be spent testing.

I originally assumed data could only be two types: single and double precision floating point. This cross-cutting assumption was scattered through the code. The concepts of vector length, floating point precision, and data type were aliased and confused.

The worst part was the interpreter. With only two types, the Cartesian product of types for a binary operator has four combinations (2 x 2 = 4). With four types, there are now 16 combinations (4 x 4 = 16). This is worse when an operator accepts three arguments. Now there are 4 x 4 x 4 = 64 combinations. This is even worse as operations like matrix multiply have four cases: vector * vector; vector * matrix; matrix * vector; matrix * matrix. So that makes 4 x 64 = 256 combinations.

The JIT turned out to be much less work. That was a surprise.

One issue that has become obvious after doing this is code bloat. The API has grown large enough where a single monolithic header file and library is impractical. It needs to be factored into modular components that applications can selectively use. Even if the code bloat is removed, the Chai platform language is big enough where it should be partitioned.

During the long years when C++ was a draft standard, I read of a proposal to have a single massive header file for the entire language. Applications would have a single #include and get the STL and everything else. I don’t know if this could work – but I think everyone has the same gut feeling I do – it seems very wrong.

It’s already March.

My plan for this year is a beta release sometime this summer and a production release by the end of the year. That is not much time at all. The beta should include every major feature in the production 1.0 release. From the summer beta release to the production release, the focus should be on bug fixing, stability and quality.

So from now until the beta, I will be throwing new features in rapidly.

These features include:

  • auto-tuned filter kernels with pre-fetching into local memory (original motivation behind adding integer type support – so the JIT could distinguish constant subscripts in gather operations)
  • (pseudo) random number generation
  • modularized platform

Adding first class integer type support

Chai should have first-class integer support in addition to single and double precision. This extends the PeakStream API with “i32” in addition to “f32” and “f64”.

Stencils led to this.

The JIT needs to distinguish between statically known integral subscript indices and values known only at runtime on the GPU. This allows generating different code. It’s very useful for auto-tuning. Specifically, generated kernels can prefetch from global into local memory if gather subscript indices are known at compile time. (By compile time, I mean when the JIT runs, not when the C++ application is compiled.)

My first thought was adding special case support just for the gather operation. The more I looked at this, it just seemed more natural to extend the language uniformly. This begins to address the major omission of support for algorithms that operate on integer data.