there are so many cores



PeakStream is not the ideal solution

This has been on my mind for a while.

Why recreate PeakStream? The technology would be five years old. Clearly, the market has changed since then. We can be fairly certain the best solution for today is different.

So to be clear: PeakStream is not the ideal solution. However, it is the lowest hanging fruit within reach and a good step towards better solutions. I believe that PeakStream, as a company faced with product design choices, came to the same conclusion.

PeakStream is a DSL inside C++. While native code wins on performance, the market has clearly moved away from it. Most application development today happens in virtual-machine-based languages (Java, JavaScript, Python, etc.). More significant is that the scientists, statisticians and quants I have known often preferred MATLAB or R as the best tool for their work.

A native code platform like PeakStream is clearly not the future. The market wants virtual machine languages.

So if PeakStream is not the future, why build it?

I believe it is a necessary intermediate step for the following reasons.

  1. Useful for solving problems today
  2. Can be done “quickly” for reasonable cost
  3. Lessons learned inform future work

This is why I see it as important to get the technology working soon. Technology rots even when it doesn’t exist. Ideas and concepts are valid for only a few years before the world changes too much.


Rewriting everything yet again, looks more like PeakStream

I am rewriting everything again. Some of the trees are the same. But the forest is very different.

A few months ago (somewhat shocking to realize how much time has passed), there was only a superficial resemblance between my design and PeakStream’s high-level architecture. I rationalized this as PeakStream’s need to present a clear narrative in marketing the technology and product. Perhaps the ugly details were too confusing. If so, then the architectural diagram from Papakipos’ Stanford presentation (YouTube video, PowerPoint slides) reflected a system “analysis” rather than “design” viewpoint. In other words, I thought I was right and didn’t believe the engineering and marketing artifacts suggesting I was wrong.

Now I believe otherwise.

Papakipos was entirely open and straightforward about the technology and product during his presentation at Stanford. He said as much as he could under the restrictions of non-disclosure. The architectural diagram was accurate.

The surprise is how well the design now fits PeakStream’s high-level architecture.

      Application Binary
 ---> API                                           GPU compiler
 |    scheduler <--------> JIT compiler <---------- math libs
 |    memory manager            |
 ---- executor ----------> instrument & analyze --> profiler & debugger

During his talk, Papakipos observed that the scheduler is at least as important, and as technically impressive, as the sexy JIT compiler. Most attention focuses on the JIT, yet the scheduler is just as vital.

In virtual machine based languages like Java and Python, runtime statistics or other heuristics are used to classify code as hot or not. Only hot code is sent to the JIT for compilation. Cold code is interpreted. Conceptually, between the interpreter and JIT is a binary classifier.
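
To make the classifier idea concrete, here is a toy sketch in C++ (my own invention, not how any particular VM actually does it): count executions per code block and promote a block to the JIT once it crosses a threshold.

#include <stdint.h>
#include <unordered_map>

// Toy hot/cold classifier: count how often each bytecode block runs.
// Blocks past a threshold are handed to the JIT; the rest stay interpreted.
class HotSpotClassifier {
    std::unordered_map<uint64_t, uint32_t> _counts; // block id -> run count
    const uint32_t _threshold;
public:
    explicit HotSpotClassifier(const uint32_t threshold = 1000)
        : _threshold(threshold) { }

    // returns true when the block has become hot and should be compiled
    bool isHot(const uint64_t blockId) {
        return ++_counts[blockId] >= _threshold;
    }
};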

PeakStream has more problems between the interpreter and JIT. It schedules across multiple compute devices of different architectures. Even if a compute device supports concurrent kernel execution, high performance may require folding code (possibly from multiple threads) into single kernels. Conceptually, between the interpreter and JIT is a scheduler.
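
Here is a cartoon of that idea (the types and the folding rule are invented for illustration, not PeakStream’s actual scheduler): collect traced statements, possibly from multiple application threads, and fold shape-compatible neighbors into one batch destined for a single kernel.

#include <cstddef>
#include <vector>

// Invented types for illustration only
struct TracedStmt { size_t width, height; /* operation description */ };
struct KernelBatch { std::vector<TracedStmt> stmts; };

// Toy scheduler: fold shape-compatible statements (possibly traced from
// different application threads) into one batch per JIT-compiled kernel.
std::vector<KernelBatch> schedule(const std::vector<TracedStmt>& trace)
{
    std::vector<KernelBatch> batches;
    for (size_t i = 0; i < trace.size(); i++) {
        const TracedStmt& s = trace[i];
        if (!batches.empty() &&
            batches.back().stmts.back().width == s.width &&
            batches.back().stmts.back().height == s.height) {
            batches.back().stmts.push_back(s); // fold into current kernel
        } else {
            batches.push_back(KernelBatch()); // start a new kernel
            batches.back().stmts.push_back(s);
        }
    }
    return batches;
}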

[Chart: coalesced versus single DGEMM on an ATI Radeon HD 5870]

I would like to justify the statement above about folding code into kernels. The chart above shows coalesced versus single DGEMM (double precision matrix multiply) using images (texture sampling) on a Radeon HD 5870 GPU. Most performance benchmarks use large dense matrices to reach peak throughput. In my experience, this is unrealistic. Real world problems more often involve either large sparse matrices (solving PDEs and gradient descent methods – find a solution that minimizes residual) or large numbers of small dense matrices (computational kernels in quantitative statistical problems – evaluate the goodness of a strategic solution).

Kernel execution overheads on GPUs are large. That’s one reason why just making the matrices bigger leads to higher throughput: it amortizes the overhead costs. Another way is to use fewer GPU kernels for the same set of calculations. In this example, combining many matrix multiplies into a single GPU kernel increased throughput by an order of magnitude for some matrix sizes.
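
Here is a back-of-the-envelope model of that amortization. The overhead and throughput constants are made up for illustration, but the qualitative conclusion stands: launch overhead dominates until kernels are batched.

#include <cstdio>

// Toy amortization model; the constants are illustrative, not measured.
int main()
{
    const double overheadSec = 20e-6;              // assumed per-kernel launch cost
    const double flopsPerMM  = 2.0 * 32 * 32 * 32; // one 32x32 matrix multiply
    const int    count       = 10000;              // number of small multiplies
    const double peakFlops   = 500e9;              // assumed device throughput

    const double computeSec = count * flopsPerMM / peakFlops;

    std::printf("one kernel per multiply: %g sec\n", count * overheadSec + computeSec);
    std::printf("single batched kernel:   %g sec\n", overheadSec + computeSec);

    return 0;
}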

Read about PyPy and Psyco last week

<gobbledygook> Last week, I realized that syntactic analysis is not entirely free of operation code semantics. </gobbledygook>

Or: Last week, I realized I am ignorant of prior art and wasting time reinventing wheels.

So I stopped coding and started reading about PyPy and Psyco which both use tracing JITs. As Mr. Spock would say, “Fascinating.” Knowing what something is not helps to understand what it is.

PeakStream is not a dynamic language. Obvious? The API suggests static typing:

RNGf32 G(SP_RNG_DEFAULT, 271828);                    // create an RNG
Arrayf32 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0); // NSET uniform f32 samples
Arrayf64 A = make1(100, cpuA);                       // 100 f64 values from host array cpuA

Every array variable is typed by precision. Single precision is “f32” while double precision is “f64”. There’s no variable polymorphism or need for type inference. All variable types are known statically. There is also no way to modify the platform runtime from within user code. Everything necessary is known from the source code.

Despite not being dynamic, PeakStream uses a JIT to compile shader kernels at runtime. Why? There are two reasons.

  1. Compiling kernels statically would require a special toolchain (as in BrookGPU). Embedding the platform in libraries allows standard C++ development tools. This is mostly a market constraint rather than a technical decision.
  2. Foreign C++ freely intermingles with the PeakStream DSL. The runtime is ignorant of program control flow; only the execution trace is available (see the sketch just below).
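
To make reason 2 concrete, here is a minimal sketch of trace capture (invented types; PeakStream’s internals are its own). Operator overloading records each array operation as it executes, so native C++ control flow never reaches the runtime; loops simply replay into a flat trace.

#include <cstdio>
#include <vector>

// Invented trace capture; PeakStream's real internals are not public.
struct Op { char kind; int lhs, rhs; };  // one recorded operation
static std::vector<Op> bytecodeTrace;    // flat execution trace

struct Arr {
    int id;                              // node id within the trace
    static int next;
    Arr() : id(next++) { }
};
int Arr::next = 0;

Arr operator+(const Arr& a, const Arr& b)
{
    Op op;                               // record the operation...
    op.kind = '+';
    op.lhs = a.id;
    op.rhs = b.id;
    bytecodeTrace.push_back(op);
    return Arr();                        // ...don't compute anything yet
}

int main()
{
    Arr x, y, z;
    for (int i = 0; i < 3; i++)          // native C++ control flow...
        z = z + x + y;                   // ...unrolls into the trace
    std::printf("trace length: %u\n",
                (unsigned) bytecodeTrace.size()); // prints 6
    return 0;
}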

The tracing JITs of Psyco and PyPy rely on control flow hints passed through the interpreter at backward jumps. The major optimization is identifying hot loops and compiling these. Hooks between the interpreter and JIT are at points where control flow changes.

PeakStream does not have control flow information like a conventional tracing JIT. Any loops and even common subexpressions must be detected using heuristics and hashing (transpositions). PeakStream knows the computational graph but has no direct knowledge of the user source code that generated it. The runtime must work backwards to infer loops, i.e. “loop rolling”.
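
Here is a toy formulation of loop rolling (my own, not PeakStream’s): hash each traced statement, then search for the smallest period over which the hash sequence repeats. A real implementation would also verify shapes and data dependencies before rolling.

#include <cstddef>
#include <vector>

// Toy loop rolling: given one hash per traced statement, find the smallest
// period p such that the trace is p-periodic. A real runtime would also
// check that array shapes and data dependencies line up before rolling.
size_t rollPeriod(const std::vector<size_t>& stmtHashes)
{
    const size_t n = stmtHashes.size();
    for (size_t p = 1; p <= n / 2; p++) {
        bool periodic = true;
        for (size_t i = p; i < n; i++) {
            if (stmtHashes[i] != stmtHashes[i - p]) { periodic = false; break; }
        }
        if (periodic) return p;          // trace looks like a rolled loop
    }
    return 0;                            // no repetition found
}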

From an abstract point of view, the PeakStream JIT is a specializer performing dynamic partial evaluation, like Psyco and PyPy. The JIT accelerates hot spots found while interpreting. However, this is misleading, as optimization is completely different for CPUs and GPUs. My experience auto-tuning GPGPU kernels with GATLAS is that specialization requires searching a neighborhood around each problem. It is not the straightforward series of transformations a human programmer would apply.
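
For a flavor of that search, here is a toy hill climb over two hypothetical kernel parameters, with a synthetic cost function standing in for real GPU timings. GATLAS does something far more thorough; this only illustrates the “search a neighborhood” character of the problem.

// Hypothetical kernel configuration; a real auto-tuner like GATLAS searches
// dimensions such as work group size, blocking and vector width.
struct Config { int groupSize, blockHeight; };

// Stand-in cost surface; a real tuner compiles and times actual kernels.
static double measure(const Config& c)
{
    return 1e3 / c.groupSize + c.groupSize / 256.0
         + 0.5 * c.blockHeight + 4.0 / c.blockHeight;
}

// Toy hill climb: repeatedly move to the fastest neighboring configuration.
Config autotune(Config c)
{
    static const int dg[] = { -64, 64, 0, 0 };
    static const int db[] = { 0, 0, -1, 1 };
    for (;;) {
        Config best = c;
        double bestTime = measure(c);
        for (int i = 0; i < 4; i++) {    // probe the four axis neighbors
            Config n = { c.groupSize + dg[i], c.blockHeight + db[i] };
            if (n.groupSize < 64 || n.blockHeight < 1) continue;
            const double t = measure(n);
            if (t < bestTime) { bestTime = t; best = n; }
        }
        if (best.groupSize == c.groupSize && best.blockHeight == c.blockHeight)
            return c;                    // local optimum reached
        c = best;
    }
}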

This project seems much smaller to me now. PeakStream is a simple language. It is not dynamic. The base language, including a reasonable JIT and scheduler, should not be that difficult to implement. Most development costs will come later with good compiler optimizations across accelerator devices (GPUs) and high performance math kernel libraries.

Chai TEA

It’s necessary to attack the hard syntactic analysis problems of loop detection and subgraph transpositions head-on. I’ve been avoiding it for weeks and only creating more work for myself. Naturally, a hash function is required for this sort of thing. I’m going to try using the Tiny Encryption Algorithm like the original Xbox.

I was just adding the code for this when I noticed something amusing.

#ifndef _CHAI_TEA_HPP_
#define _CHAI_TEA_HPP_

namespace chai_internal {

////////////////////////////////////////
// Tiny Encryption Algorithm
// by David Wheeler and Roger Needham

My name for this project is “Chai” so the preprocessor macro guard _CHAI_TEA_HPP_ elliptically fits the narrative. I drink a lot of chai tea.
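
For what it’s worth, here is the standard TEA round together with a guess at how it might be used: fold a subgraph’s opcode and operand words into a 64-bit fingerprint. The chaining and key below are mine; only the cipher itself is Wheeler and Needham’s.

#include <stdint.h>
#include <cstddef>
#include <vector>

// Standard TEA block cipher (Wheeler and Needham), usable as a mixer.
static void teaEncrypt(uint32_t v[2], const uint32_t k[4])
{
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (int i = 0; i < 32; i++) {
        sum += 0x9e3779b9;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

// Guess at usage: fold a subgraph's opcode and operand words into a
// 64 bit fingerprint by chaining TEA over the word stream.
uint64_t hashSubgraph(const std::vector<uint32_t>& words)
{
    const uint32_t key[4] = { 0x20110201, 0x271828, 0x31415926, 0x5f3759df };
    uint32_t v[2] = { 0, 0 };
    for (size_t i = 0; i < words.size(); i++) {
        v[0] ^= words[i];                // absorb the next word
        teaEncrypt(v, key);              // then mix the state
    }
    return (uint64_t(v[0]) << 32) | v[1];
}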

Need another parsing stage

Another parsing stage is required just after the front-end bytecode and before anything else. This is where duplicate subgraphs in the execution trace are identified using a memo. This is generally useful for both interpreting and kernel assembly. The interpreter avoids recalculating results using the memo as it recursively traverses the execution trace. In a similar fashion, kernel assembly avoids duplicated reductions.
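
A minimal sketch of the memo (invented structure, assuming each trace node carries a structural hash): the recursive traversal consults the memo before descending, so a duplicated subgraph is evaluated or emitted exactly once.

#include <stdint.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Invented trace node: a structural hash plus child links.
struct Node {
    uint64_t hash;                       // content hash of this subgraph
    std::vector<const Node*> kids;
};

typedef std::unordered_map<uint64_t, int> Memo; // subgraph hash -> result id

// Memoized traversal: each distinct subgraph is processed exactly once,
// so a duplicated reduction reuses one result instead of being re-emitted.
int evaluate(const Node* n, Memo& memo, int& nextResultId)
{
    Memo::const_iterator it = memo.find(n->hash);
    if (it != memo.end())
        return it->second;               // duplicate subgraph, reuse result
    for (size_t i = 0; i < n->kids.size(); i++)
        evaluate(n->kids[i], memo, nextResultId); // children first
    const int resultId = nextResultId++; // evaluate or emit code just once
    memo[n->hash] = resultId;
    return resultId;
}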

The current design goes directly from stack machine bytecode to the interpreter and JIT. While this keeps the front-end simple, it also forces syntactic analysis (identifying structure in the execution trace, like identical subgraphs) deep into the middle-end and back-end where it does not belong.

As usual, if a design does not feel right, then it is probably wrong.

Some huge machine generated kernels

The good news is that output from the JIT/OpenCL agrees with the interpreter. The bad news is that there’s a lot more work required. As I’ve never written something like this before, I didn’t see the requisite static analysis and code transformation problems beforehand. It’s a process of discovery, not only of finding solutions but learning what the (compiler) problems are and how they are interdependent.

A minute ago, the need for memoizing generation of kernel code became obvious when testing with the PeakStream conjugate gradient example.

Arrayf32 x = Arrayf32::zeros(N);
{
    Arrayf32 A = Arrayf32::make2(N, N, cpuA);
    Arrayf32 b = Arrayf32::make1(N, cpub);
    Arrayf32 residuals = b - matmul(A, x);
    Arrayf32 p = residuals;
    Arrayf32 newRR = dot_product(residuals, residuals);

    for (int iter = 0; iter < N; iter++) {
        Arrayf32 oldRR = newRR;
        Arrayf32 newX, newP, newResiduals;
        Arrayf32 Ap = matmul(A, p);
        Arrayf32 dp = dot_product(p, Ap);

        newX = x + p * oldRR / dp;
        newResiduals = residuals - Ap * oldRR / dp;
        newRR = dot_product(newResiduals, newResiduals);
        newP = newResiduals + p * newRR / oldRR;

        p = newP;
        residuals = newResiduals;

        float oldRRcpu = oldRR.read_scalar();
        if (oldRRcpu <= TOLERANCE) {
            break;
        }
        x = newX;
    }
}

The first iteration generates a simple kernel. It is just a dot product.

__kernel void kx(
    __global float* a0xeaff80,
    __global float* a0x7fff6375c1a0,
    __global float* a0x7fff6375c7e0)
{
    const size_t x0 = get_global_id(0);

    float a1 = 0;
    for (size_t x1 = 0; x1 < 20; x1++) {
        const float a2 = 0;
        const float a3 = 0;
        a1 += ((a0x7fff6375c7e0[x1] - a2) * (a0x7fff6375c7e0[x1] - a3));
    }

    a0xeaff80[x0] = a1;
}

The second iteration generates a more complex kernel.

__kernel void kx(
    __global float* a0xf6a740,
    __global float* a0x7fff6375c1a0,
    __global float* a0x7fff6375c7e0)
{
    const size_t x0 = get_global_id(0);

    float a18 = 0;
    for (size_t x18 = 0; x18 < 20; x18++) {
        const float a19 = 0;
        float a20 = 0;
        for (size_t x20 = 0; x20 < 20; x20++) {
            const float a21 = 0;
            a20 += (a0x7fff6375c1a0[x20 + x18 * 20] * (a0x7fff6375c7e0[x20] - a21));
        }

        a18 += ((a0x7fff6375c7e0[x18] - a19) * a20);
    }

    float a15 = 0;
    for (size_t x15 = 0; x15 < 20; x15++) {
        const float a16 = 0;
        const float a17 = 0;
        a15 += ((a0x7fff6375c7e0[x15] - a16) * (a0x7fff6375c7e0[x15] - a17));
    }

    float a8 = 0;
    for (size_t x8 = 0; x8 < 20; x8++) {
        const float a9 = 0;
        float a10 = 0;
        for (size_t x10 = 0; x10 < 20; x10++) {
            const float a11 = 0;
            a10 += (a0x7fff6375c1a0[x10 + x8 * 20] * (a0x7fff6375c7e0[x10] - a11));
        }

        a8 += ((a0x7fff6375c7e0[x8] - a9) * a10);
    }

    float a5 = 0;
    for (size_t x5 = 0; x5 < 20; x5++) {
        const float a6 = 0;
        const float a7 = 0;
        a5 += ((a0x7fff6375c7e0[x5] - a6) * (a0x7fff6375c7e0[x5] - a7));
    }

    float a1 = 0;
    for (size_t x1 = 0; x1 < 20; x1++) {
        const float a2 = 0;
        float a3 = 0;
        for (size_t x3 = 0; x3 < 20; x3++) {
            const float a4 = 0;
            a3 += (a0x7fff6375c1a0[x3 + x1 * 20] * (a0x7fff6375c7e0[x3] - a4));
        }

        const float a12 = 0;
        float a13 = 0;
        for (size_t x13 = 0; x13 < 20; x13++) {
            const float a14 = 0;
            a13 += (a0x7fff6375c1a0[x13 + x1 * 20] * (a0x7fff6375c7e0[x13] - a14));
        }

        a1 += (((a0x7fff6375c7e0[x1] - a2) - ((a3 * a5) / a8)) * ((a0x7fff6375c7e0[x1] - a12) - ((a13 * a15) / a18)));
    }

    a0xf6a740[x0] = a1;
}

Many reductions are duplicated in the kernel as it is generated directly from recursive traversal of the execution trace DAG. There’s no transposition detection. That’s why a memo is needed (although I’m not sure where it should go – generally, I try to avoid creating temporary stages by performing transformations as far upstream as possible). By the third iteration, the generated kernel is 756 lines. The fourth iteration results in a kernel of 9876 lines. The fifth iteration produces a kernel of 130506 lines. That’s roughly 13x growth per iteration, which is what you would expect when every use of a shared subtree is re-expanded instead of reused. This is too much for the OpenCL compiler, which segfaults.

As you might guess, there are other issues going on besides transpositions. I’m only showing off the skeletons I understand. There are many more in the closet.

Here’s another example of a crazy looking kernel.

__kernel void kx(
    __global float* a0x1da56e0,
    __global float* a0x7fffc0ceeaa0)
{
    const size_t x0 = get_global_id(0);
    const size_t y0 = get_global_id(1);

    a0x1da56e0[x0 + y0 * 100] = ((((((((((((((((((((0 + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 0) * ((1 * y0) - 0)) * 1))) * 1))))) % 100) + (as_uint(floor((0 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 1) * ((1 * y0) - 1)) * 1))) * 1))))) % 100) + (as_uint(floor((1 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 2) * ((1 * y0) - 2)) * 1))) * 1))))) % 100) + (as_uint(floor((2 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 3) * ((1 * y0) - 3)) * 1))) * 1))))) % 100) + (as_uint(floor((3 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 4) * ((1 * y0) - 4)) * 1))) * 1))))) % 100) + (as_uint(floor((4 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 5) * ((1 * y0) - 5)) * 1))) * 1))))) % 100) + (as_uint(floor((5 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 6) * ((1 * y0) - 6)) * 1))) * 1))))) % 100) + (as_uint(floor((6 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 7) * ((1 * y0) - 7)) * 1))) * 1))))) % 100) + (as_uint(floor((7 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 8) * ((1 * y0) - 8)) * 1))) * 1))))) % 100) + (as_uint(floor((8 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 9) * ((1 * y0) - 9)) * 1))) * 1))))) % 100) + (as_uint(floor((9 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 10) * ((1 * y0) - 10)) * 1))) * 1))))) % 100) + (as_uint(floor((10 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 11) * ((1 * y0) - 11)) * 1))) * 1))))) % 100) + (as_uint(floor((11 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 12) * ((1 * y0) - 12)) * 1))) * 1))))) % 100) + (as_uint(floor((12 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 13) * ((1 * y0) - 13)) * 1))) * 1))))) % 100) + (as_uint(floor((13 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 14) * ((1 * y0) - 14)) * 1))) * 1))))) % 100) + (as_uint(floor((14 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 15) * ((1 * y0) - 15)) * 1))) * 1))))) % 100) + (as_uint(floor((15 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 16) * ((1 * y0) - 16)) * 1))) * 1))))) % 100) + (as_uint(floor((16 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 17) * ((1 * y0) - 17)) * 1))) * 1))))) % 100) + (as_uint(floor((17 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + (sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 18) * ((1 * y0) - 18)) * 1))) * 1))))) % 100) + (as_uint(floor((18 * 1))) % 100) * 100]) + a0x7fffc0ceeaa0[(as_uint(floor(as_float((0.5 + 
(sqrt((((1 * x0) * (1 * x0)) + ((((1 * y0) - 19) * ((1 * y0) - 19)) * 1))) * 1))))) % 100) + (as_uint(floor((19 * 1))) % 100) * 100]);
}

Death by parentheses is possible in OpenCL too? Anyway, here’s the source code that generates it. This is the PeakStream Kirchhoff migration (seismic/depth imaging) example. Notice that each gather term in the kernel above corresponds to one iteration of the iy loop below: the runtime sees only the execution trace, so the C++ loop arrives fully unrolled.

Arrayf32 modl = zeros_f32(NT, N);
{
    Arrayf32 x = dx * index_f32(1, NT, N);
    Arrayf32 z = dt * index_f32(0, NT, N);
    Arrayf32 data = Arrayf32::make2(NT, N, datagpu);

    for (int iy = 0; iy < N; iy++)
    {
        float y = float(iy) * dx;
        Arrayf32 index1 = float(iy) * ones_f32(NT, N);
        Arrayf32 it = 0.5 + sqrt(z * z + (x - y) * (x - y) * factor) * idt;
        modl += gather2_floor(data, it, index1);
    }
}