there are so many cores

Just another site

Monthly Archives: December 2011

Brain dump of slide topics

This is just a brain dump of slide topics for PP12. It’s far too much, roughly 77 slides which would mean about 20 seconds per slide to fit in a 25 minute talk.

It helps me to try and write down some of these ideas. In a past life, I was on the math professor career track. Teaching helps the teacher to understand perhaps as much as it does students.


  pedagogical and historical order:
  1. GPGPU basics
  2. parameterized kernel design
  3. auto-tuning JIT back-end
  4. application virtual machine architecture
  5. memory management
  6. JIT middle-end
  7. application
  8. performance, error, and the long view

GPGPU basics

  CPU and GPU, the von Neumann machine and stream processor
  kernelized functional closures where side-effects are important
  trade time for space rather than space for time
  arithmetic intensity is time/space complexity ratio
  six kinds of memory: host, driver, global, local, private and texture
  optimization is mostly memory efficiency
  inner and outer blocking
  inner and outer vectorization

Parameterized kernel design

  the shader compiler won't do these loop code transformations for you:
  a. unrolling
  b. fusion
  c. interchange
  d. strip mining
  e. tiling
  f. scalar expansion
  ten general kernel design principles for auto-tuning:
  1. vectorize memory access and use mad() scalar arithmetic
  2. directional bias in memory reads
  3. blocked and tiled layout is not always better than simple linear
  4. tune outer and inner blocking (problem size, work groups and registers)
  5. kernel execution time variance
  6. check correctness of data output
  7. designs generalize within but not necessarily across device architectures
  8. memory buffers can be faster than textures
  9. coalescing kernels to amortize overhead
  10. synthesize low intensity kernels JIT, optimize high intensity kernels AOT

Auto-tuning JIT back-end

  auto-tuning as statistical optimization (expectation maximization works)
  register pressure and convexity (why should optimization converge?)
  define a model family with endogenous kernel template parameters
  avoid the curse of dimensionality with exogenous brute force
  memoization and journaling as practical technology considerations
  everything fails: compilers, drivers, and kernels
  auto-tuning ahead of time and just in time (or cold start versus warm start)
  invest in arithmetic intensity to maximize returns (the 80/20 rule again)

Application virtual machine architecture

  managed platform as C++ domain specific language (inspired by PeakStream)
  bytecode stack machine
  tracing JIT and interpreter
  concurrency and data parallelism with the gather/scatter scheduler
  application, device and scheduler threads
  hashing execution traces and vectorizing threads
  JIT translator and OpenCL compute devices
  interpreter clears trace queue when the translator fails

Memory management

  managed platforms are always about memory
  levels of memory indirection: host, front, back, and compute device
  ultimate owners are arrays on the stack frame and application threads
  garbage collection through nested reference counting
  unifying memory across traces after gathering
  translator explicitly enqueues data transfers with compute devices
  interpreter implicitly swizzles back to CPU host and scatters
  compute device state makes continuation hard (data movement is expensive!)

JIT middle-end

  tracing without the instruction pointer
  constant lifting and heuristically loop rolling bytecode
  don't box too much: ASTs, statements, and variables
  don't box too little: kernelization and index spaces
  sending live variables and creating dead temporaries
  auto-tuning warm start from kernelization
  reconciling auto-tuned vector lengths
  no worries mixing images and memory buffers together


  array programming as stream processing (inspired by Brook)
  data parallel styles: OpenMP loop; Pthreads; array data vectors
  data parallel reductions
  four kinds of matrix multiplication
  concurrent loops and extra complications with stream processing
  mixed single and double precision
  intermingling with unmanaged native code
  runtime configuration: avoid the GPU chimera binary
  biased for performance: auto-tuning ahead of time
  dynamically extending the virtual machine

Performance, error and the long view

  some benchmark numbers for comparison
  compute devices are different and complementary
  thinking about error in terms of basis points
  quantitative strategy and analytics rather than solving PDEs
  classification and clustering to find structure in data
  it's a search problem again

Christmas week features

I basically ignored the feature freeze and have been coding furiously the last week. There are two big changes with new functionality.

  1. Scheduled traces do not terminate when reading back data from the compute device. This allows loops that stop depending on intermediate results from the calculation.
  2. Data parallelism is exposed directly as vectorized arrays. It is no longer necessary to rely on OpenMP or Pthreads and the gather/scatter scheduler to group execution traces across threads.

Both of these are basic language semantics. Without 1, execution traces would be analogous to a single stage of flat data parallel map-reduce. Without 2, outer vectorization relies on the scheduler and threads which in the best case is inefficient. (By outer vectorization, I mean tiling arrays into streams. Inner vectorization with the use of vectorized data types is performed by the auto-tuning JIT.)

It’s funny that for both features, the first solutions turned out to be failures. They were too slow and didn’t work. I couldn’t figure it out. Even if I could, the performance penalty was a big concern. That forced me to roll back a few days work and try again. By that time, I had enough insight to see simpler and more direct solutions which did turn out to work.

I’ve read that reducing development cycle times is a key factor in success. I agree with this. It’s not how good you are today. It’s how fast you learn and evolve. That’s why waterfall development is notorious for poor outcomes. In the context of a society that solves problems, learning is too slow.

Here’s an example of the two ways for expressing parallel array data (i.e. tiled arrays into streams).

Gathering execution traces across threads:

    #pragma omp parallel for
    for (size_t i = 0; i < 10; i++)
        Arrayf64 C;
            Arrayf64 A = make1(100, cpuA[i]); // double cpuA[10][100]
            Arrayf64 B = make1(100, cpuB[i]); // double cpuB[10][100]
            C = sum(A + B);
        result[i] = C.read_scalar();

Vectorized array data in a single execution trace:

    Arrayf64 C;
        Arrayf64 A = make1(100, vecA); // vector<double*> vecA
        Arrayf64 B = make1(100, vecB); // vector<double*> vecB
        C = sum(A + B);
    const vector< double > c = C.read_scalar(10);

Over the last week, I’ve become more aware of scheduling and JIT overhead. If the runtime spends too much time scheduling and compiling, that could seriously limit performance (Amdahl’s law). Keeping the runtime lean and fast is important to real world throughput.

I want to spend another few days performance tuning and fixing bugs. Then I really need to write the conference presentation for PP12. There’s an enormous amount of material here, far too much for 25 minutes.

Don’t be rude.

It’s past feature freeze. So I should be focused on bug fixing. This is just too sexy and easy to defer.

I added language extension support to the virtual machine and JIT. New functions may be added anytime after virtual machine initialization. This really was not that difficult. It just needed a few hooks to extend dispatch tables and pass around memory management and execution trace context.

There really is a need for this. New math kernels should not require modifying the core platform code, especially the increasingly complicated JIT middle-end. It’s starting to become rather complicated in there. For the system to scale, the core platform should be distinct from libraries.

The other thing I did was remove all potentially trademark infringing references to PeakStream from the source code. I shouldn’t imply any relationship where none exists. Don’t be rude.

When I started this project, I took the concept of PeakStream as inspiration. I searched for a copy. Fortunately, Google did a thorough job of discontinuing the product line after acquiring the company. I couldn’t find any source code file (I really wanted peakstream.h), let alone the beta download. Nothing. All I found were marketing materials and a few research papers surveying GPGPU products.

So the PeakStream DSL was my best guess at a language grammar from reading a handful of whitepapers archived on the Internet Archive Wayback Machine. Later, the picture became clearer to me. I tend to believe in convergent technical evolution – given the same problem, different groups, working independently, will often arrive at similar solutions.

However, the technology landscape today is different from five years ago when PeakStream was still in business. I have also made some very different design choices. PeakStream had to build a lot more (OpenCL did not exist yet) and address a (relatively speaking) mass market of scientific and engineering users to provide speculative private equity value to venture capital funds.

PeakStream was trying to build a more general purpose platform. They also had more resources than I do. At the end, the company was 35 people. As one guy, I’ve had to adopt a narrower vision and focus on what provides the maximum return on my investment of time.

So what I’ve built, while it started out as the same idea as PeakStream, has certainly diverged from it as the vision has changed and become clearly focused. The project has acquired an identity and life of it’s own.

Feature complete

Fixed all of the problems with the auto-tuned GEMV. It wasn’t as bad as expected. Yesterday, I was rattled because every kernel specialization was failing. Not reassuring.

I’ve learned to be paranoid about numerical correctness. Automated testing is incorporated into the auto-tuning process. A kernel specialization is stress tested with random data before accepted as good.

What makes this trickier is that extensive auto-tuning, when hundreds or thousands of kernel variations are tested, meets limitations in vendor runtimes. The GPU driver might crash or the device enter a bad state. The OpenCL compiler may hang, segfault, or fail with internal error messages. Despite coming from the same design template, some specializations work perfectly on a device while others fail.

All of this adds enough ambiguity that distinguishing your bugs from toolchain and platform issues is difficult. My experience so far is: My code often has more bugs than I think it does. It’s probably a cognitive bias to blame known vendor bugs as responsible for other, as yet undiagnosed, bugs.

There are still some serious bugs in the JIT. However, even with those, I see output that agrees between generated OpenCL on ATI, NVIDIA, x86 and a reference CPU interpreter. The numbers are all the same. That gives me confidence it is really working and not garbage output.

Added last feature for alpha release

Adding GEMV turned out to be a bit more work than expected. The old implementation was primitive. Rewriting in the current autotuning JIT framework was somewhat tricky.

There will be a few days of testing required. After that, I will consider the project feature complete for a first alpha release. Focus will shift to quality instead of adding new features.

In his Stanford talk after Google acquired PeakStream (this is an excellent presentation and reveals much about the technology if you pay attention and read between the lines), Papakipos was asked about extending the language. Can users add new functions to the platform?

At this, Papakipos appeared somewhat uncomfortable. He gave the impression it was very difficult to do. If their technology looked anything like what I’ve done, I completely understand. It’s not that it is that difficult – however, the virtual machine and JIT are monolithic in nature. So extending the platform requires access to the full source. That at least is not an issue for this project (as it is open source).

Almost at feature freeze for first release

Added this week:

  • index arrays (index_f32, index_f64 – both 1D and 2D variants along width or height)
  • gather data shuffling (gather1_floor, gather2_floor – allows array subscripts with ordinates from data instead of loop indexes and work-item IDs)
  • outer product matrix multiply in generated kernel

The last remaining major feature is auto-tuned GEMV (matrix-vector multiply). That shouldn’t be too hard as I’ve done this before (have old code) and GEMM is already integrated. I want to get this done tomorrow.

As mentioned earlier, there’s no time to work on GPU random number support in the immediate future. Any execution traces that use the RNG API will be scheduled on the CPU interpreter.

The JIT does work but has many dark corners. That’s why it is important to stop adding new features and start fixing bugs. For the first release, not everything will work – which makes it more important to know what does work.

Correctness is the prime requirement. After that is stable and consistent behavior. That’s an issue with managed platforms sometimes (e.g. unpleasant surprises with a database execution plan optimizer). Failures are o.k. if known and not silent.

That’s the mindset I have for this as technology. It must be useful. It doesn’t have to be perfect. I want to add value, not uncertainty.

Finished adding built-ins

Finished adding 84 OpenCL built-in functions. Now retesting to verify I didn’t break anything. Looks good so far.

Next up is index array support. PeakStream had a clever construction with arrays of loop indexes. This allowed something like a stream closure in the array programming language. The closure could be mapped to either the enqueued global index space or as a loop done in the kernel for each work ID. They didn’t say this in their whitepapers, but having now written a similar JIT compiler, it’s obvious.

I won’t get to random number support this year or before the conference next year. There’s not enough time. I would rather have a stable release and reliability than more features. The RNG API support will still be there. But it won’t be implemented on the GPU. Random numbers will be generated on the CPU host and transferred over the bus to the GPU – inefficient, but at least the functionality is there.

Adding built-in functions to the API

It’s time to add (most of) the OpenCL 1.1 built-in functions:

  • 9.3.2 Math Functions: acos acosh acospi asin asinh asinpi atan atan2 atanh atanpi atan2pi cbrt ceil copysign cos cosh cospi erfc erf exp exp2 exp10 expm1 fabs fdim floor fma fmax fmin fmod fract frexp hypot ilogb ldexp lgamma log log2 log10 log1p logb mad maxmag minmag modf nan nextafter pow pown powr remainder remquo rint rootn round rsqrt sin sincos sinh sinpi sqrt tan tanh tanpi tgamma trunc
  • 9.3.5 Relational Functions: isequal isnotequal isgreater isgreaterequal isless islessequal islessgreater isfinite isinf isnan isnormal isordered isunordered

These functions are part of the standard C99 POSIX math libraries. It’s easy to maintain equivalent functionality between the interpreter and GPU back-end. No rocket-science. It will be a long weekend of drone work.

JIT back-end code generation is still very incomplete. Everything is working, though. Even the gather/scatter of application thread traces by the scheduler is working with the GPU. This is really support for OpenMP and loops, although regular Pthreads work just as well.

My experience is these kinds of problems can be endless. If you try to build the complete, bullet-proof virtual machine and JIT compiler up-front, it will take too long. Worse, by the time it is finished (if that ever happens), you may find that it has been optimized for the wrong problems.

Better is building out a platform as needed to support application functionality. That’s the only way to do this.