there are so many cores


AFDS 2012 First Day

Badging booth:

Welcome Reception in the ballroom (note the DJ booth with red lights):

Hyatt Regency Bellevue venue map:


A Beta Tomorrow

I originally envisioned a beta release this summer. Well, that’s now. The new date is before next year. It’s a more real date (at least for me) as I am saying it in public at a conference in a few days.

These two talks were interesting.

  • “Evolving GPUs into a Substrate for Cloud Computing” – The takeaway is that GPUs are changing. SoC processors with integrated GPUs (e.g. AMD Fusion APU, Intel Core, Nvidia Tegra) may lack the sexy high peak throughput of big discrete cards. However, they may also have much lower penalties for data transfers. The talk showcased an example of porting memcached indexing to an AMD Fusion APU. It does run significantly faster even with the I/O costs and control flow divergence. A big discrete card like an ATI HD 5870 has terrible performance in this use case.
  • Simon Peyton Jones, “Data Parallel Haskell” – The takeaway is that pure functional code is good for parallelism but bad for locality of data. That’s bad, as data movement costs often dominate performance in practice.

The many-core accelerator designs all seem to fail in the marketplace. STI Cell. Larrabee. They are too expensive for what they do, the Swiss Army knife effect of not excelling at anything. What seems to win are the multi-core CPU and the GPU. These are very good at what they do.

It is the bazaar of CPUs and GPUs over the cathedral of many-core.

There’s a lot of interest now in “device fission”. I see this as part of virtualization, which really comes down to one goal seen two ways: sell more with less; increase device utilization. It’s like slicing a pizza and selling the slices. You make more money that way and better utilize the pizza too.

I haven’t seen any talk of “device fusion”. Why can’t a cloud look like a big stream processor? If there were virtual fused compute devices, then applications could scale arbitrarily.

But maybe the reason “device fusion” isn’t considered is that, like the nuclear counterpart, it’s something cool that is impractically difficult. That’s a little like the many-core vision of Swiss Army knife homogeneous computing devices.

So the lesson here is that heterogeneity is real. It’s ugly. But that’s diversity in the bazaar. We want cathedrals because they appeal to inner drives to seek beauty through perfection and order. That’s a trap.

Inlining OpenCL with managed code works

It’s a dumb example but shows the basic idea.

ProgramCL progCL(
    "__kernel void add_scalar(__global float* a, float s)"
    "{"
    "  a[get_global_id(0)] += s;"
    "}" );

Arrayf32 C;
{
    Arrayf32 A = make1(N, cpuA);
    Arrayf32 B = make1(N, cpuB);

    C = A + B;

    (progCL, "add_scalar", C, 1.5f)(N, 1);

    C += A - 3.3f;
}
C.read1(cpuC, N * sizeof(float));

OpenCL kernels can be inserted anywhere in the managed execution trace. The virtual machine and JIT do the right thing regarding memory transfers and generated kernel boundaries. It just works. If nothing else, this allows programming in OpenCL as a kernel language and letting the virtual machine manage memory. It’s also a very natural way to extend the platform dynamically.

OpenCL programs can have multiple kernels, local memory, images, etc. The ProgramCL object parses the source code text and does the right thing with arguments. The work space dimensions can be 1D or 2D. The example above is 1D with “(global0, local0)”. For 2D, this would be “(global0, global1, local0, local1)”.

The syntax is unusual. It is also the best compromise I could find. I did not want the verbose Java “setData()” style of interface design. However, I also did not want to expose magic macros or C++ templates in the API. The variadic function call does not work as the array variable objects are not Plain Old Data. So I overloaded the comma and function call operators. It’s a balance between flexibility, natural syntax, and compile-time safety.

I have to switch gears and work on my presentation for AFDS 2012. The final slide deck deadline with AMD is June 6th. There’s also another slide deck I’m working on too.

Inline OpenCL with the DSL

Here’s a simple example of inline OpenCL with the Chai DSL.

const string sourceText = "__kernel "
                          "void foo(__global float* a, const float b)"
                          "{"
                          "  a[get_global_id(0)] += b;"
                          "}";
Arrayf32 A = make1(1000, inputA, ARR_MEMBUF1); // constrain JIT memory choice
A += 123;                                      // regular DSL code before
CL(sourceText) << A << 1.0f;                   // inline OpenCL kernel
A += 456;                                      // regular DSL code after
A.read1(outputA, 1000 * sizeof(float));        // evaluate and read back result

This becomes more complicated with anything realistic.

It’s not hard to support local memory and work group dimensions. Note that order of insertion matters for kernel arguments only. The work group dimensions could appear in any order.

CL(sourceText) << Localf32(100) << GlobalWork(1000) << LocalWork(100);

Does the following look better? Perhaps a variadic function call is more natural and easier to use?

CL(sourceText, A, 1.0f, Localf32(100), GlobalWork(1000), LocalWork(100));

I am inclined to use the insertion operator idiom as it allows more flexibility. A programmer may wish to pass different arguments to the inline OpenCL kernel at runtime. That is more awkward with the variadic style. However, this debate reminds me of Perl’s philosophy: allow more than one solution. So I’ll probably end up supporting both ways.

One consequence of inline OpenCL is that it breaks the interpreter. The simple interpreter cannot emulate the inline calculations without going through all the machinery of OpenCL. So why not do this? Unfortunately, not all OpenCL implementations support CPU compute devices (e.g. Nvidia’s).

This means that use of inline OpenCL affects scheduling. Execution traces with inline OpenCL must go to the JIT only. They don’t have the option to be interpreted. That’s OK now, as Chai supports the full language. Random number generation was the last missing piece.

Also, multi-threaded scheduling support has been pretty badly broken for a while now. As originally designed, I had a concept of gathering similar threads into a vector, enqueuing a fused kernel, and scattering the results back. Pretty slick – a multi-thread gather/scatter vectorizing scheduler (which sounds ridiculous just writing this). After data parallel support was added, it was clear that this MT gather/scatter feature would never be used. It adds too much overhead. If you want data parallel, then just put the data into vectors directly.

Multi-threaded support is important. But the gather/scatter in the scheduler makes no sense.

Alpha3 uploaded to GitHub

I just committed the third alpha code drop to GitHub. As before, there is also a tarball download without git metadata. That’s probably more convenient for most people.

While writing my presentation slide deck for AFDS 2012, I realized that there can be no beta release without documentation. It doesn’t matter if the core engineering is done. There is no usable technology product if developers don’t know how to use it.

Writing minimal documentation will take weeks. That seems excessive. But my experience preparing for conferences is that writing is a lot more work than it appears. Good writing is storytelling. It means considering the audience and telling an interesting narrative. I suspect this is as true of technical writing as it is of fiction.

So the beta release is still a ways off. Beta will mean something that is genuinely usable. It will have rough edges. But it will be a tool you can use to solve problems.

One more major feature I am considering is more direct interoperability with OpenCL. It should be easy for applications to mix Chai managed code with explicit OpenCL. The tricky part is of course memory management. That’s why managed platforms have foreign function interfaces. It’s always clunky. I’d prefer to have something more like inlined assembly language in C.

PeakStream didn’t have anything like this. However, that also makes sense, as OpenCL didn’t exist then. Even CUDA was only released in 2007, the same year PeakStream was acquired by Google and disappeared. If there had been a portable shader language then, would PeakStream have tried to interoperate with it?

My guess is they would have. Don’t force people to learn new languages. Try to work with what they have. That’s why BrookGPU and PeakStream are internal DSLs implemented as C++ APIs. OpenCL is now part of the mainstream. So it’s the language people know and use today.

Draft slide deck for AFDS 2012

I’ve been working on my presentation for AMD Fusion Developer Summit 2012. Here’s the draft slide deck: . I’m in the Programming Languages and Models track.

The AFDS 2012 session catalog looks interesting. There’s a lot of stuff on other tracks. But I know from experience that conference organizers do a good job of sorting speakers. If you hang around with your assigned track, you are likely to learn the most.

My experience of presenting is that I learn what I’m doing when trying to explain it to an audience. I’ve heard teachers say this, that they learn as much as students do. The narrative of this talk for AFDS 2012 turned out differently than I expected. It feels much more like product marketing. The audience isn’t compiler engineers so much as developers who need to solve problems with GPGPU.

Also, I didn’t appreciate how close a clone of PeakStream this is until I saw the code samples in the slide deck. Chai doesn’t have everything PeakStream did. However, PeakStream didn’t have everything Chai does now.

The obvious work is in the JIT middle-end. I cheated and skipped a lot of safety logic so the language platform could be filled out first. Now that this is done, I have to go back and rework the JIT.

My big question, one I hope to answer from attending AFDS 2012, is if anyone cares. The world is changing. HPC and GPGPU… is this still relevant for an economy built around smart phones connected to social networks in the cloud?

PeakStream’s Monte Carlo Finance demo working

Chai has been designed to run PeakStream demo code. In fact, language grammar and platform semantics were deduced from the example code in marketing presentations and whitepapers. That’s all I had to go on. The following demos now work:

  • Conjugate gradient
  • Monte Carlo PI
  • Kirchhoff migration
  • Monte Carlo Finance (with slight modification)

The PeakStream language is cunningly designed. There’s usually a good reason the API is the way it is. So I have tried to follow it very closely.

I just got the options pricing demo working. However, to do this, I had to change the code slightly.

float MonteCarloAntithetic(float price,
                           float strike,
                           float vol,
                           float rate,
                           float div,
                           float T)
{
    float deltat        = T / N;
    float muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU       = 0.0f;
    Arrayf32 meanSP; // result
    {                // a new scope to hold temporary arrays
        RNGf32 rng_hndl(RNG_DEFAULT);
        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++) {
            U += rng_normal_make(rng_hndl, M);
        }
        Arrayf32 values;
        {
            Arrayf32 lnS1 = log(price) + N * muDeltat + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }                // all temporaries released as we exit scope
    meanCPU = meanSP.read_scalar();

    return meanCPU;
}

The curly braces around the inner scopes are original PeakStream code. They cause problems for Chai. There’s something about the random array U going out of scope before readout that affects memory management.

Generated code quality is not good enough either. At present, Chai has a lot of special case code transformations for different kinds of loops. For example, loop rolling and reductions have independent representations. There’s no way to nest a rolled loop (sum of normals in array variable U) inside a reduction loop (implied by the mean() operation) for the demo code above. As a result, the code doesn’t use enough registers and relies on global buffer memory.

Another issue is the need for more PRNG types. At present, there are only two, the Random123 Philox and Threefry generators. I have found that Threefry does not work with the Box-Muller transform. While Threefry does appear uniformly distributed, the random vector components are not independent (enough). The output from Box-Muller is clearly wrong.

Random number generation works

RNG works now.

#define NSET 100000               // number of monte carlo trials

    RNGf32 G(RNG_PHILOX, 271828); // create an RNG
    Arrayf32 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 Y = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 distance_from_zero = sqrt(X * X + Y * Y);
    Arrayf32 inside_circle = (distance_from_zero <= 1.0f);
    return 4.0f * sum(inside_circle) / NSET;

This returns 3.138920 as an estimate for PI.

The generated OpenCL includes a subset of the Random123 library with the philox PRNG. Below are just the two kernels that call into the (dynamically generated) library code.

__kernel void f_2331120269860868340_1(
  __global float4 * a1)
{
  float4 r1;
  r1= (float4)(0);
  for (int i0 = 0; i0 < 25000; i0++)
  {
    r1+= ((0 == (sqrt((float4)((((philox_uniform_4_4((i0), 271828)) * (philox_uniform_4_4((i0), 271828))) + ((philox_uniform_4_4((i0), 271829)) * (philox_uniform_4_4((i0), 271829)))))) <= (float4)(1))) ? (float4)(0) : (float4)(1)) ;
  }
  a1[0] = r1 ;
}

__kernel void f_2331120269860868340_2(
  __global float4 * a1,
  __global float4 * a2)
{
  for (int i0 = 0; i0 < 1; i0++)
  {
    a1[(i0) + get_global_id(0) * 0] = (((float4)(4) * a2[(i0) + get_global_id(0) * 0]) / (float4)(100000)) ;
  }
}

Code quality is poor. It’s a bigger issue with PRNG support.

Conceptually, RNG violates referential transparency by definition. Even though a PRNG is in reality deterministic, the code transformations must still treat it as impure: random numbers should appear random even when the process generating them is not.

The good thing is that the Random123 library is integrated into Chai. Both rng_uniform_make() and rng_normal_make() work properly with two PRNG variants, Philox and Threefry.

My presentation from SIAM PP12

Here is the presentation I gave at SIAM PP12 in February: “Inside a GPGPU Managed Platform with an Auto-tuning JIT Compiler”. This page needs the Adobe Flash Player.

Here’s a direct link to the PDF slide deck: .

After my talk, someone asked if I was writing a paper about this stuff. I had to answer no – I don’t have enough resources to write a paper and build the technology. My first priority is building.

However… I am really confused as to what I am building. Seriously, I don’t know. Technology is stuff that affects culture. It’s not only engineering.

What is the relevance of this?

Let’s assume that everything works perfectly (it doesn’t but this is a thought experiment). Then how is it useful to people? How does it change anything they do? I do not have an answer to this.

The closest analogue to this work may be something like OCaml. It’s a practical language for rapid prototyping of quantitative data munging applications (just my point of view). It is still quite obscure: too practical to be interesting in academic research; too different to be mainstream for software engineers.

Relevance is about more than engineering. Of that, I am certain.


I will try integrating the Random123 library into Chai. The Supercomputing 2011 paper “Parallel Random Numbers: As Easy as 1, 2, 3” makes a convincing case. There’s also a thread about merging this code into Boost. The license is generous and reasonable too.

To be honest, I really don’t know much about randomness. If I tried to reinvent this wheel, it is likely I would be fooled by randomness. (Excuse this pun, I could not resist.)

My past experience (somewhat vicarious at that) of machine learning never used Monte Carlo methods. There was no sampling based quadrature. Error was calculated by iteration over every member of a cluster. The learning part was finding those clusters of similar things. These clusters implied distributions based on historical data.

To give a sense of my ignorance, my understanding of random numbers was limited to:

  • the truly random (from nature, i.e. hardware generated);
  • the cryptographically secure pseudo-random (deterministically computed in software but indistinguishable from truly random with today’s technology);
  • the cryptographically insecure pseudo-random.

I did not know of quasi-random numbers (low discrepancy, even space filling).

This is ironic as I still remember Martin Billik’s Hilbert space class as an undergraduate. He was big on measure theory. At the time I took his class, I still believed I was a real math guy.

So the right thing to do is work around my ignorance by using the work of people who know far more than I do. This also happens to minimize development time and (potentially) spread good karma to the people behind Random123.