there are so many cores

Just another site

Monthly Archives: January 2011

Hand written compilers are more evil

It’s even more complicated than that. Hand-written compilers are probably worse than hand-written parsers: the data structure and algorithm problems are very difficult to get right. A hand-written compiler must be even more evil than a hand-crafted parser state machine.

I’m on the edge of getting nested reductions to work. In other words, multiple levels of nested for-loops in the JIT with OpenCL. This comes up with matrix multiply as calculated by the matmul() function (which has four distinct polymorphic variants: outer product; vector-matrix; matrix-vector; matrix-matrix). Unfortunately, I am too tired and unable to keep it all straight. I have to sleep first.
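To make the nesting concrete, here is a host-side sketch of just the matrix-vector case: an outer loop over output elements with a reduction loop nested inside, which is the shape the JIT has to emit. The function name and signature are illustrative only, not the actual matmul() API.

```cpp
#include <cstddef>
#include <vector>

// Matrix-vector product y = M * x, written as the JIT must generate it:
// an outer loop over output elements with a nested reduction loop.
// (Hypothetical helper for illustration; not the real matmul() variant.)
std::vector<float> matvec(const std::vector<float>& M,  // rows x cols, row-major
                          const std::vector<float>& x,
                          std::size_t rows,
                          std::size_t cols)
{
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; i++) {        // outer output loop
        float acc = 0;
        for (std::size_t j = 0; j < cols; j++) {    // nested reduction
            acc += M[i * cols + j] * x[j];
        }
        y[i] = acc;
    }
    return y;
}
```

The matrix-matrix variant adds a third loop level, which is exactly the multi-level nesting described above.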

My mood comes from source code

I had been sad since Monday afternoon, when a mysterious bug had me worried. It was just carelessness, something like this:

output_stream << (sizeof(float) == _sz) ? "as_float(" : "as_double(";

This inserted a 0 or 1 into the stream instead of “as_float(” or “as_double(”, and it also caused mismatched parentheses in the generated OpenCL. The current design supports mixed-precision arithmetic from the ground up, so it must insert these conversions automatically.
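The root cause is operator precedence: << binds tighter than ?:, so the comparison’s bool result is streamed and the ternary’s string result is discarded. A minimal self-contained reproduction (the helper name is hypothetical), with the parenthesized fix alongside:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>

// Returns the buggy and the fixed stream contents for a given type size.
std::pair<std::string, std::string> stream_demo(std::size_t _sz)
{
    std::ostringstream buggy, fixed;

    // Bug: parses as ((buggy << (sizeof(float) == _sz)) ? "..." : "...");
    // the bool is streamed (printed as 0 or 1) and the string is thrown away.
    buggy << (sizeof(float) == _sz) ? "as_float(" : "as_double(";

    // Fix: parenthesize the whole conditional expression.
    fixed << ((sizeof(float) == _sz) ? "as_float(" : "as_double(");

    return { buggy.str(), fixed.str() };
}
```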

My mood comes from source code.

One question posed to Papakipos during his Stanford presentation (shortly after the Google acquisition) was the possibility of opening up the API in order to allow adding new functions. He kind of groaned and didn’t say very much in response. My guess is that what I have is quite similar. To allow user extensions easily requires open sourcing the whole system. That’s fine for me (as this is a FOSS project). For PeakStream, it would have been very tricky to do.

Trivial example of the JIT with OpenCL

Well, I just got something to run through OpenCL with a true JIT. It takes this statement:

C = sqrt(B + sum(A + B * mean(A)));

and generates the equivalent OpenCL kernel:

__kernel void kx(
    __global float* a0x1d72b00,
    __global float* a0x7ffff8234f30,
    __global float* a0x7ffff82350c0)
{
    const size_t x0 = get_global_id(0);

    float a2 = 0;
    for (size_t x2 = 0; x2 < 100; x2++) {
        a2 += a0x7ffff82350c0[x2];
    }

    float a1 = 0;
    for (size_t x1 = 0; x1 < 100; x1++) {
        a1 += (a0x7ffff82350c0[x1] + (a0x7ffff8234f30[x1] * (a2 / 100)));
    }

    a0x1d72b00[x0] = sqrt((a0x7ffff8234f30[x0] + a1));
}

The runtime then enqueues the kernel, waits, and continues after the kernel output data is transferred from the device back to the host.

This is the first “end-to-end” functionality (with a trivial and completely unrealistic example as the OpenCL index space has only a single thread).
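As a sanity check on what the kernel computes, here is a host-side reference for the same statement (a sketch; the two vectors stand in for the 100-element A and B arrays, and single precision is assumed as in the generated kernel):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for C = sqrt(B + sum(A + B * mean(A))),
// mirroring what the generated kernel computes per output element.
std::vector<float> reference(const std::vector<float>& A,
                             const std::vector<float>& B)
{
    const std::size_t n = A.size();

    float meanA = 0;                       // first nested reduction: mean(A)
    for (float a : A) meanA += a;
    meanA /= n;

    float s = 0;                           // second reduction: sum(A + B * mean(A))
    for (std::size_t i = 0; i < n; i++)
        s += A[i] + B[i] * meanA;

    std::vector<float> C(n);
    for (std::size_t i = 0; i < n; i++)    // elementwise sqrt(B + s)
        C[i] = std::sqrt(B[i] + s);
    return C;
}
```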

The compiler will need a modular architecture that allows optimizations. Some of the basic but necessary compiler features I can see are:

  1. loops in the execution trace
  2. batch multiple reductions
  3. asynchronous JIT and interpreter execution
  4. JIT code cache
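Of these, the JIT code cache is the most self-contained. A minimal sketch (all types hypothetical), keyed on the generated kernel source text so that identical traces skip recompilation:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical compiled-kernel handle; in a real OpenCL runtime this
// would wrap a kernel built with clCreateProgramWithSource/clBuildProgram.
struct Kernel { std::string source; };

// Cache keyed on generated OpenCL source text: identical execution
// traces regenerate identical source, so compilation happens once.
class KernelCache {
    std::unordered_map<std::string, Kernel> _cache;
public:
    const Kernel& get(const std::string& source) {
        auto it = _cache.find(source);
        if (it == _cache.end())
            it = _cache.emplace(source, Kernel{source}).first;  // compile once
        return it->second;
    }
    std::size_t size() const { return _cache.size(); }
};
```

Keying on source text is the simplest correct choice; a hash of the trace would avoid regenerating source just to probe the cache.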

At the same time, I am wary of over-investing in the wrong problems.

An excellent JIT may not improve real-world application performance as much as a good kernel library. In my experience, high GPGPU throughput mostly comes from pre-optimized kernels for arithmetically intense algorithms like matrix multiply and n-body simulation, which do far more calculation than memory access. The benefit of a managed platform and JIT is making it easy to write software that uses those library kernels without killing performance (by minimizing PCIe bus data transfer and kernel scheduling overheads).

Progress towards a tracing JIT with OpenCL

Design has (again) turned out to be more complex than anticipated. It’s not mysterious, just involved. An intuitive working style appears to be productive. I am able to feel my way to solutions.

However, any intuition regarding time estimates is usually grossly inaccurate. It’s actually worse than that: my level-of-effort expectations are very often qualitatively wrong. The design solutions found are often surprising and very unlike my preconceptions.

This in itself is interesting. It means I do not know what I do not know. In computer science terms, it means the design process is not easily computable – the algorithm finishes when it does and a priori statements about halting are difficult if not impossible.

I’m not displeased, though.

In the last few days, the overall form of the kernel back-end (for the GPU) has taken firm shape. The same stream data stack is used when evaluating bytecode. Unlike the interpreter, which pushes and pops array data, the kernel back-end assembles a graph (right now it is really a tree, but that will change later). When it finishes (determining good stopping points is yet another problem), the interpreter continues from where the kernel back-end left off and forces evaluation of the boxed computation left on the stack.
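The boxed-computation idea can be sketched with a toy stack (all names hypothetical): where the interpreter would pop operands and compute values, the kernel back-end pops subtrees and pushes a new interior node.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// A node in the deferred-computation tree the kernel back-end assembles.
struct Node {
    std::string op;                               // e.g. "make1", "operatorADD", "sum"
    std::vector<std::shared_ptr<Node>> args;      // child subtrees
};

class BoxedStack {
    std::vector<std::shared_ptr<Node>> _stack;
public:
    void push_leaf(const std::string& op) {
        _stack.push_back(std::make_shared<Node>(Node{op, {}}));
    }
    // Pop n operands and push an interior node: how bytecode like
    // operatorADD or sum grows the tree instead of computing values.
    void apply(const std::string& op, std::size_t n) {
        Node node{op, {}};
        for (std::size_t i = 0; i < n; i++) {
            node.args.push_back(_stack.back());
            _stack.pop_back();
        }
        _stack.push_back(std::make_shared<Node>(std::move(node)));
    }
    const Node& top() const { return *_stack.back(); }
};
```

Forcing evaluation then amounts to handing the boxed tree at the top of the stack to the kernel generator.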

As a simple example, here’s some input source:

Arrayf64 C;
{
    // eagerly create buffers/images on compute device
    Arrayf64 A = make1(100, cpuA);
    Arrayf64 B = make1(100, cpuB);

    // lazy boxed calculation
    C = sum(A + B);
}
double c = C.read_scalar(); // force evaluation, read back

The corresponding bytecode stack:

0: read_scalar_f64
1: convert_f64
2: sum
3: operatorADD
4: make1_f64
5: 100
6: 0x7ffffef48d10
7: make1_f64
8: 100
9: 0x7ffffef489f0

The OpenCL source generated:

__kernel void kx(
    __global double* out_a0,
    __global double* v0x7ffffef489f0,
    __global double* v0x7ffffef48d10)
{
    double a0 = 0;
    for (int i0 = 0; i0 < 100; i0++) {
        a0 += (v0x7ffffef48d10[i0] + v0x7ffffef489f0[i0]);
    }
    out_a0[get_global_id(0) + get_global_id(1) * 1] = a0;
}

This is extremely primitive at the moment.