there are so many cores


Monthly Archives: December 2010

API dispatch table and stream shape

Here is a table that connects:

  • virtual machine operation bytecodes
  • PeakStream API function points
  • CPU back-end
  • stream shape
  • computational graph

The most basic JIT optimization is finding subgraphs with homogeneous stream shape (multiple consecutive loops that can be fused together into a single loop). These subgraphs correspond to synthesized compute kernels for the GPU. That’s important as kernel scheduling overhead is very high for GPUs. Minimizing the number of kernels scheduled can have a dramatic effect on performance (my experience is about 5x higher dense matrix multiply throughput on an ATI Radeon HD 5870 in double precision).
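To make the fusion idea concrete, here is a minimal CPU-side sketch (illustrative C++ only, not the actual runtime) of two elementwise stream operations with the same shape: first as two separate loops, which on a GPU would mean two scheduled kernels, and then fused into a single loop where the temporary stays in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two elementwise operations, r = (a + b)^2, as separate "kernels".
std::vector<float> unfused(const std::vector<float>& a,
                           const std::vector<float>& b)
{
    std::vector<float> t(a.size()), r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)  // kernel 1: t = a + b
        t[i] = a[i] + b[i];
    for (std::size_t i = 0; i < a.size(); ++i)  // kernel 2: r = t * t
        r[i] = t[i] * t[i];
    return r;
}

// The same computation fused into one loop (one synthesized kernel).
std::vector<float> fused(const std::vector<float>& a,
                         const std::vector<float>& b)
{
    std::vector<float> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float t = a[i] + b[i];  // temporary never touches memory
        r[i] = t * t;
    }
    return r;
}
```

The fused form also eliminates the intermediate buffer, which matters as much as the scheduling overhead.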

code    API name                CPU back-end            stream shape    graph

0       abs                     CPUisomorph<cpu_ABS>    same            interior
1       cond                    CPUcond                 polymorphic     interior
2       construct_f32           CPUscalar               none            leaf
3       construct_f64           CPUscalar               none            leaf
4       convert_f32             CPUconvert<float>       same            interior

5       convert_f64             CPUconvert<double>      same            interior
6       dot_product             CPUdotprod<false>       reduction       interior
7       exp                     CPUisomorph<cpu_EXP>    same            interior
8       gather1_floor           CPUshuffle<1>           same            interior
9       gather2_floor           CPUshuffle<2>           same            interior

10      index1_f32              CPUidxdata<float, 1>    none            leaf
11      index1_f64              CPUidxdata<double, 1>   none            leaf
12      index2_f32              CPUidxdata<float, 2>    none            leaf
13      index2_f64              CPUidxdata<double, 2>   none            leaf
14      make1_f32               CPUmakedata<float, 1>   (W, 1)          leaf

15      make1_f64               CPUmakedata<double, 1>  (W, 1)          leaf
16      make2_f32               CPUmakedata<float, 2>   (W, H)          leaf
17      make2_f64               CPUmakedata<double, 2>  (W, H)          leaf
18      matmul                  CPUmatmul               polymorphic     interior
19      max                     CPUbinop<cpu_MAX>       same            interior

20      mean                    CPUaccum<true>          reduction       interior
21      min                     CPUbinop<cpu_MIN>       same            interior
22      negate                  CPUisomorph<cpu_NEGATE> same            interior
23      ones1_f32               CPUlitdata<float, 1>    none            leaf
24      ones1_f64               CPUlitdata<double, 1>   none            leaf

25      ones2_f32               CPUlitdata<float, 2>    none            leaf
26      ones2_f64               CPUlitdata<double, 2>   none            leaf
27      operatorADD             CPUbinop<cpu_ADD>       same            interior
28      operatorAND             CPUbinop<cpu_AND>       same            interior
29      operatorDIV             CPUbinop<cpu_DIV>       same            interior

30      operatorEQ              CPUbinop<cpu_EQ>        same            interior
31      operatorGE              CPUbinop<cpu_GE>        same            interior
32      operatorGT              CPUbinop<cpu_GT>        same            interior
33      operatorLE              CPUbinop<cpu_LE>        same            interior
34      operatorLT              CPUbinop<cpu_LT>        same            interior

35      operatorMUL             CPUbinop<cpu_MUL>       same            interior
36      operatorNE              CPUbinop<cpu_NE>        same            interior
37      operatorOR              CPUbinop<cpu_OR>        same            interior
38      operatorSUB             CPUbinop<cpu_SUB>       same            interior
39      pusharray_f32           CPUpusharray<float>     (W, H)          leaf

40      pusharray_f64           CPUpusharray<double>    (W, H)          leaf
41      read_scalar_f32         CPUnop                  -               root
42      read_scalar_f64         CPUnop                  -               root
43      read1_f32               CPUnop                  -               root
44      read1_f64               CPUnop                  -               root

45      read2_f32               CPUnop                  -               root
46      read2_f64               CPUnop                  -               root
47      rng_normal_make_f32     CPUrngnormal<float>     none            leaf
48      rng_normal_make_f64     CPUrngnormal<double>    none            leaf
49      rng_uniform_make_f32    CPUrnguniform<float>    none            leaf

50      rng_uniform_make_f64    CPUrnguniform<double>   none            leaf
51      sqrt                    CPUisomorph<cpu_SQRT>   same            interior
52      sum                     CPUaccum<false>         reduction       interior
53      zeros1_f32              CPUlitdata<float, 1>    none            leaf
54      zeros1_f64              CPUlitdata<double, 1>   none            leaf

55      zeros2_f32              CPUlitdata<float, 2>    none            leaf
56      zeros2_f64              CPUlitdata<double, 2>   none            leaf
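As a sketch of how a table like this works mechanically (illustrative only; the opcode numbering and handler names below are assumptions, not the real runtime), each bytecode indexes into an array of handlers that operate on a value stack:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Stack   = std::vector<double>;
using Handler = void (*)(Stack&);

void opABS(Stack& s) { s.back() = std::fabs(s.back()); }
void opADD(Stack& s)
{
    const double rhs = s.back();
    s.pop_back();
    s.back() += rhs;
}

// Indices are illustrative; the real table has 57 entries.
Handler dispatch[] = { opABS, opADD };

double run(const std::vector<int>& bytecode, Stack s)
{
    for (int code : bytecode)
        dispatch[code](s);  // one indirect call per instruction
    return s.back();
}
```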

Ready to start on the GPU back-end

The C++ API front-end and CPU interpreter back-end are solid now. Next up is the interesting stuff with scheduling, JIT and GPU back-end.

  • 3253 lines of C++
  • 54 files
  • keyword template appears 37 times
  • preprocessor directive #define appears 58 times
  • bytecode stack virtual machine with 57 instructions
  • reference counted array buffers
  • tunable lazy evaluation (very important!)

The last point about lazy evaluation is significant. The advantage of a JIT comes from the ability to optimize and schedule code at runtime. Code does not always execute immediately. It is queued, dynamically compiled and scheduled. Then metrics are collected to inform the system, improve management and enable higher performance over time.

There is a balance between eager and lazy evaluation. Eagerness permits less optimization but also has lower overhead. In the limit, this becomes an interpreter. Laziness allows more optimization but has higher overhead from boxing computations. In the limit, this becomes Haskell!
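A minimal sketch of the two limits, under assumed names: an eager operation computes immediately, while a lazy operation only boxes the computation in a thunk that runs when the value is finally read.

```cpp
#include <cassert>
#include <functional>

// A boxed computation: nothing runs until read() forces it.
struct Lazy {
    std::function<double()> thunk;
    double read() const { return thunk(); }  // force evaluation
};

// Eager limit: plain interpretation, result available at once.
double eager_add(double a, double b) { return a + b; }

// Lazy limit: build a bigger thunk; an optimizer could inspect
// and rewrite it before anything executes.
Lazy lazy_add(Lazy a, Lazy b)
{
    return { [a, b] { return a.read() + b.read(); } };
}
```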

Before dealing with this eager/lazy trade-off, the first thing is a back-end for the GPU. This will be a non-optimizing (no auto-tuning, that comes later) JIT that fuses streams and synthesizes OpenCL kernels dynamically. New virtual machine instructions will be added to the interpreter dispatch table corresponding to the synthesized kernels. So one way of thinking about the JIT is as an accelerator for the interpreter.
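One way to picture the JIT-as-accelerator idea is a virtual machine whose dispatch table can grow at runtime: the JIT registers each synthesized kernel as a new instruction, and the interpreter dispatches to it like any other opcode. All names below are assumptions, not PeakStream code.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Stack   = std::vector<float>;
using Handler = std::function<void(Stack&)>;

struct VM {
    std::vector<Handler> dispatch;   // grows as kernels are synthesized

    int addInstruction(Handler h)    // returns the new opcode
    {
        dispatch.push_back(std::move(h));
        return static_cast<int>(dispatch.size()) - 1;
    }

    void run(const std::vector<int>& bytecode, Stack& s)
    {
        for (int code : bytecode)
            dispatch[code](s);
    }
};
```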

Less really is more

The last two weeks were consumed by continuously rewriting the core runtime. It is now five times smaller, at just over 3,000 lines of code. The size is deceptive: template meta-programming is used throughout (along with unavoidably convenient macros).

For this project, I believe a small quantity of elegantly complex code is preferable to a simpler but larger code base. Less code, even if dense, will be easier to maintain going forward and allow faster progress.

Interpreter 2: Rise of the Stack Machine

I rewrote everything.

Original design (week before last):

  • syntax trees and statement graphs on heap
  • interpreter recursion by pointer chasing
  • values in reference counted variables

The rewrite (last week):

  • const parse trees and statements converted directly to stack bytecode
  • interpreter recursion by stack iteration
  • no values or reference counting, DAG as bytecode stack is full computation
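A toy example of the representation change, with assumed opcode names: an expression like sum(A + B) becomes a flat postfix bytecode sequence, and "recursion" over the DAG is just iteration over that sequence rather than pointer chasing over heap nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Opcode values are assumptions, purely for illustration.
enum Op { PUSH_A = 0, PUSH_B = 1, ADD = 2, SUM = 3 };

// The whole computation is just a sequence of bytecodes.
std::vector<int> buildSum() { return { PUSH_A, PUSH_B, ADD, SUM }; }

// Interpreter recursion becomes plain iteration over the stack.
std::string disassemble(const std::vector<int>& bytecode)
{
    static const char* names[] = { "A", "B", "add", "sum" };
    std::string out;
    for (int code : bytecode) {
        if (!out.empty()) out += ' ';
        out += names[code];
    }
    return out;
}
```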

The reason I moved to a bytecode stack machine is for the JIT. Pointer chasing over a DAG is fine so long as it is strictly interpreted. This becomes impractical for a JIT as the graph must be transformed. The JIT will:

  1. find subgraphs it knows how to optimize
  2. synthesize new kernel functions for these subgraphs
  3. replace subgraphs with new kernels

Another concern I had is all of the locking in the standard library allocators and loss of memory locality when building everything on the heap. It’s just not efficient and could be costly inside inner loops.

There’s one feature of the original design that must be carried over to the rewrite. Variables need mutable state and reference counting again. Once a variable has a value, the graph of computations required to calculate it can be discarded. Without this, the bytecode stack often becomes huge as the full computation is an enormous graph. There’s nothing surprising about this. Tail recursion and memoization are often necessary for recursive algorithms. Otherwise, combinatorial explosions take hold.
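The memoization point can be sketched as a variable that evaluates its computation graph once, then discards the graph and keeps only the value (illustrative code with assumed names, not the real runtime):

```cpp
#include <cassert>
#include <functional>

struct Var {
    mutable std::function<double()> graph;  // pending computation, if any
    mutable double value     = 0.0;
    mutable bool   evaluated = false;

    double read() const
    {
        if (!evaluated) {
            value = graph();
            graph = nullptr;   // drop the graph; keep only the value
            evaluated = true;
        }
        return value;
    }
};
```

Without this, every later use of the variable would drag its entire history along, which is the combinatorial explosion described above.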

One major advantage of a bytecode based virtual machine is that it decouples the front-end. The original design had no clear separation boundary between the front-end and middle/back-end. The zoo of types threatened to turn the design into a big ball of mud. With bytecode, there is clear separation between layers. This also will make it much easier to implement additional front-ends, especially in other languages.

I’m a little disappointed that I don’t have a working JIT yet, even as a primitive proof-of-concept. But the rewrite was necessary to make the JIT possible. And the design is much better now.

resurrected interpreter from visitor hell

I wrote an interpreter back-end that allows running PeakStream examples on the CPU. It supports the entire API (which will continue to evolve). This gives me a lot of confidence in the basic design. It is somewhat complex internally as mutable state is never shared (with one exception at the highest level in the executive, and that is MROW protected). This should allow lock-free concurrency most of the time (if locks inside standard library allocators are ignored – this may become an issue later).

Here are three samples from Public Google PeakStream.ppt that now work.

Conjugate gradient example from page 13:

int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
{
    int iter;
    Arrayf32 x = Arrayf32::zeros(N);
    {
        Arrayf32 A = Arrayf32::make2(N, N, cpuA);
        Arrayf32 b = Arrayf32::make1(N, cpub);
        Arrayf32 residuals = b - matmul(A, x);
        Arrayf32 p = residuals;
        Arrayf32 newRR = dot_product(residuals, residuals);

        for (iter = 0; iter < N; iter++) {
            Arrayf32 oldRR = newRR;
            Arrayf32 newX, newP, newResiduals;
            Arrayf32 Ap = matmul(A, p);
            Arrayf32 dp = dot_product(p, Ap);
            newX = x + p * oldRR / dp;
            newResiduals = residuals - Ap * oldRR / dp;
            newRR = dot_product(newResiduals, newResiduals);
            newP = newResiduals + p * newRR / oldRR;

            p = newP;
            residuals = newResiduals;

            float oldRRcpu = oldRR.read_scalar();
            if (oldRRcpu <= TOLERANCE)
                break;
            x = newX;
        }
    }
    x.read1(cpux, N * sizeof(float));
    return iter;
}

Monte-carlo π calculation from page 17:

float compute_pi(void)
{
    RNGf32 G(SP_RNG_DEFAULT, 271828); // create an RNG
    Arrayf32 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 Y = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 distance_from_zero = sqrt(X * X + Y * Y);
    Arrayf32 inside_circle = (distance_from_zero <= 1.0f);
    return 4.0f * sum(inside_circle) / NSET;
}

Arrayf32 Pi = compute_pi();  // get the answer as a 1x1 array
float pi = Pi.read_scalar(); // convert answer to a simple float
printf("Value of Pi = %f\n", pi);

Monte-carlo options pricing from page 36:

float MonteCarloAntithetic(float price,
                           float strike,
                           float vol,
                           float rate,
                           float div,
                           float T)
{
    float deltat        = T / N;
    float muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU       = 0.0f;
    Arrayf32 meanSP; // result
    {                // a new scope to hold temporary arrays
        RNGf32 rng_hndl(SP_RNG_CEICG12M6, 0);
        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++) {
            U += rng_normal_make(rng_hndl, M);
        }
        Arrayf32 values;
        {
            Arrayf32 lnS1 = log(price) + N * muDeltat
                                       + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat
                                       + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * ( max(0, S1 - strike) +
                              max(0, S2 - strike) )
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }                // all temporaries released as we exit scope
    meanCPU = meanSP.read_scalar();
    return meanCPU;
}

Getting the code samples above to work has been an interesting exercise. I can see how “Perl happens” or “C++ happens”… that’s when platform implementation comes first and language grammar is an afterthought. This tends to create dark corners (often touted as clever features by experts) in which semantics surprise the unwary. I am trying to avoid these situations.

The next major step is the JIT, the heart and soul of the system. I don’t know enough to make any accurate statements other than it will take much more time to do. So to keep things rolling, I am going to start working on multiple areas in order to pipeline work.

One of the big advantages of PeakStream over its competitors at the time was ease-of-use. The virtual machine and JIT automated work the developer would otherwise have to do by hand. However, many software developers, data analysts and quantitative scientists are still very uncomfortable with C++ and would prefer different languages.

The PeakStream platform had a language binding to C++ only. I think it would be nice to have bindings to more languages. Two I am thinking of now are Ruby and Scala. They are DSL friendly and may support the imperative syntax required for intuitively natural programming.

This sounds crazy! GPGPU in Ruby?! But I think it makes a lot of sense. Very often, scripting languages are the best tool for data extraction, transformation and reporting. The weakness is any sort of numerically intense analytical processing. Scripting languages are much slower than native code or the JVM. As a result, computational pipelines often have multiple stages in different language platforms. Systems become more complex as a whole.

More generally, native code fluency is a requirement for GPGPU even today. Wouldn’t it be nice if using GPUs for general purpose computation did not require knowing C or C++ at all? GPGPU would then become normal instead of something exotic that few people know about. It would be like floating point: part of the language, happening automatically for you.

visiting multiple dispatch hell

Hacked together an interpreter back-end, enough to run a simple PeakStream example.

Arrayf64 C;
{
    // eagerly create buffers/images on compute device
    Arrayf64 A = make1(100, cpuA);
    Arrayf64 B = make1(100, cpuB);

    // lazy boxed calculation
    C = sum(A + B);
}
double c = C.read_scalar(); // force evaluation, read back

I want a full implementation of the language in an interpreter before starting work on the JIT. That will make debugging and testing much easier. After the interpreter, work will start on the JIT and OpenCL back-end.

Another thing done last week is puzzling out the PeakStream grammar. It’s almost all there now.  The API is missing special flags like SP_ALL_DIMS and SP_MEMORY_NORMAL. I’m also unsure how stream resizing works in the gather1_floor() and gather2_floor() operations.

Several engineers have independently asked the same two questions, implying the following must be mutually exclusive.

  • Is PeakStream a language? Answer: Yes
  • Or is PeakStream a library? Answer: Yes

PeakStream is a DSL that calls a foreign virtual machine from C++. Lua is an example of another language that works this way. I think the confusion comes from the usual pattern for a managed platform like Java. The foreign call interface is from Java to native code using JNI. PeakStream works in the opposite direction. You can also do that with Java, create a JVM inside a native C/C++ application and call into it, but no one ever does.