there are so many cores
API dispatch table and stream shape
December 31, 2010
Here is a table that connects:
- virtual machine operation bytecodes
- PeakStream API function points
- CPU back-end
- stream shape
- computational graph
The most basic JIT optimization is finding subgraphs with homogeneous stream shape (multiple consecutive loops that can be fused together into a single loop). These subgraphs correspond to synthesized compute kernels for the GPU. That’s important as kernel scheduling overhead is very high for GPUs. Minimizing the number of kernels scheduled can have a dramatic effect on performance (my experience is about 5x higher dense matrix multiply throughput on an ATI Radeon HD 5870 in double precision).
code  API name               CPU back-end              stream shape  graph
   0  abs                    CPUisomorph<cpu_ABS>      same          interior
   1  cond                   CPUcond                   polymorphic   interior
   2  construct_f32          CPUscalar                 none          leaf
   3  construct_f64          CPUscalar                 none          leaf
   4  convert_f32            CPUconvert<float>         same          interior
   5  convert_f64            CPUconvert<double>        same          interior
   6  dot_product            CPUdotprod<false>         reduction     interior
   7  exp                    CPUisomorph<cpu_EXP>      same          interior
   8  gather1_floor          CPUshuffle<1>             same          interior
   9  gather2_floor          CPUshuffle<2>             same          interior
  10  index1_f32             CPUidxdata<float, 1>      none          leaf
  11  index1_f64             CPUidxdata<double, 1>     none          leaf
  12  index2_f32             CPUidxdata<float, 2>      none          leaf
  13  index2_f64             CPUidxdata<double, 2>     none          leaf
  14  make1_f32              CPUmakedata<float, 1>     (W, 1)        leaf
  15  make1_f64              CPUmakedata<double, 1>    (W, 1)        leaf
  16  make2_f32              CPUmakedata<float, 2>     (W, H)        leaf
  17  make2_f64              CPUmakedata<double, 2>    (W, H)        leaf
  18  matmul                 CPUmatmul                 polymorphic   interior
  19  max                    CPUbinop<cpu_MAX>         same          interior
  20  mean                   CPUaccum<true>            reduction     interior
  21  min                    CPUbinop<cpu_MIN>         same          interior
  22  negate                 CPUisomorph<cpu_NEGATE>   same          interior
  23  ones1_f32              CPUlitdata<float, 1>      none          leaf
  24  ones1_f64              CPUlitdata<double, 1>     none          leaf
  25  ones2_f32              CPUlitdata<float, 2>      none          leaf
  26  ones2_f64              CPUlitdata<double, 2>     none          leaf
  27  operatorADD            CPUbinop<cpu_ADD>         same          interior
  28  operatorAND            CPUbinop<cpu_AND>         same          interior
  29  operatorDIV            CPUbinop<cpu_DIV>         same          interior
  30  operatorEQ             CPUbinop<cpu_EQ>          same          interior
  31  operatorGE             CPUbinop<cpu_GE>          same          interior
  32  operatorGT             CPUbinop<cpu_GT>          same          interior
  33  operatorLE             CPUbinop<cpu_LE>          same          interior
  34  operatorLT             CPUbinop<cpu_LT>          same          interior
  35  operatorMUL            CPUbinop<cpu_MUL>         same          interior
  36  operatorNE             CPUbinop<cpu_NE>          same          interior
  37  operatorOR             CPUbinop<cpu_OR>          same          interior
  38  operatorSUB            CPUbinop<cpu_SUB>         same          interior
  39  pusharray_f32          CPUpusharray<float>       (W, H)        leaf
  40  pusharray_f64          CPUpusharray<double>      (W, H)        leaf
  41  read_scalar_f32        CPUnop                    -             root
  42  read_scalar_f64        CPUnop                    -             root
  43  read1_f32              CPUnop                    -             root
  44  read1_f64              CPUnop                    -             root
  45  read2_f32              CPUnop                    -             root
  46  read2_f64              CPUnop                    -             root
  47  rng_normal_make_f32    CPUrngnormal<float>       none          leaf
  48  rng_normal_make_f64    CPUrngnormal<double>      none          leaf
  49  rng_uniform_make_f32   CPUrnguniform<float>      none          leaf
  50  rng_uniform_make_f64   CPUrnguniform<double>     none          leaf
  51  sqrt                   CPUisomorph<cpu_SQRT>     same          interior
  52  sum                    CPUaccum<false>           reduction     interior
  53  zeros1_f32             CPUlitdata<float, 1>      none          leaf
  54  zeros1_f64             CPUlitdata<double, 1>     none          leaf
  55  zeros2_f32             CPUlitdata<float, 2>      none          leaf
  56  zeros2_f64             CPUlitdata<double, 2>     none          leaf
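To make the fusion idea concrete, here is a minimal CPU-side sketch of what fusing a subgraph of homogeneous stream shape buys. The function names are illustrative, not the runtime's actual code: the point is that three consecutive elementwise passes (three scheduled kernels on a GPU) collapse into one.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unfused: one pass per operation, as a naive interpreter would do it.
// Computes d = sqrt(a * b + c) elementwise in three separate loops,
// which on a GPU means three scheduled kernels and two temporaries.
std::vector<float> unfused(const std::vector<float>& a,
                           const std::vector<float>& b,
                           const std::vector<float>& c)
{
    const size_t n = a.size();
    std::vector<float> t1(n), t2(n), d(n);
    for (size_t i = 0; i < n; i++) t1[i] = a[i] * b[i];       // kernel 1
    for (size_t i = 0; i < n; i++) t2[i] = t1[i] + c[i];      // kernel 2
    for (size_t i = 0; i < n; i++) d[i]  = std::sqrt(t2[i]);  // kernel 3
    return d;
}

// Fused: the same computation as a single loop, i.e. one synthesized
// kernel scheduled once, with no intermediate buffers.
std::vector<float> fused(const std::vector<float>& a,
                         const std::vector<float>& b,
                         const std::vector<float>& c)
{
    const size_t n = a.size();
    std::vector<float> d(n);
    for (size_t i = 0; i < n; i++)
        d[i] = std::sqrt(a[i] * b[i] + c[i]);
    return d;
}
```

Both versions compute the same result; the fused one is what the JIT should synthesize for any run of "same" stream-shape operations in the table above.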
Ready to start on the GPU back-end
December 29, 2010
The C++ API front-end and CPU interpreter back-end are solid now. Next up is the interesting stuff with scheduling, JIT and GPU back-end.
- 3253 lines of C++
- 54 files
- keyword template appears 37 times
- preprocessor directive #define appears 58 times
- bytecode stack virtual machine with 57 instructions
- reference counted array buffers
- tunable lazy evaluation (very important!)
The last point about lazy evaluation is significant. The advantage of a JIT comes from the ability to optimize and schedule code at runtime. Code does not always execute immediately. It is queued, dynamically compiled and scheduled. Then metrics are collected to inform the system, improve management and enable higher performance over time.
There is a balance between eager and lazy evaluation. Eagerness permits less optimization but also has lower overhead. In the limit, this becomes an interpreter. Laziness allows more optimization but has higher overhead from boxing computations. In the limit, this becomes Haskell!
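The eager/lazy trade-off can be sketched in a few lines. This is a hypothetical toy, not the real runtime (which queues bytecode, not closures): operations are boxed as they arrive, and nothing runs until a read forces evaluation.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy sketch of tunable laziness: each operation is boxed (queued) as a
// closure instead of executing immediately. Evaluation happens all at
// once when read() forces it, which is where a JIT would get its chance
// to optimize and schedule the whole queued computation.
class Lazy {
    std::vector<std::function<double(double)>> ops_;  // the boxed work
    double seed_;
public:
    explicit Lazy(double seed) : seed_(seed) {}

    Lazy& then(std::function<double(double)> op) {
        ops_.push_back(op);   // lazy: record the operation, do nothing
        return *this;
    }

    double read() {           // eager: run everything queued so far
        double v = seed_;
        for (auto& op : ops_) v = op(v);
        return v;
    }
};
```

Making the system more eager corresponds to forcing `read()` more often (less to optimize, lower overhead); making it lazier corresponds to letting the queue grow (more to optimize, more boxing overhead).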
Before dealing with this eager/lazy trade-off, the first thing is a back-end for the GPU. This will be a non-optimizing (no auto-tuning, that comes later) JIT that fuses streams and synthesizes OpenCL kernels dynamically. New virtual machine instructions will be added to the interpreter dispatch table corresponding to the synthesized kernels. So one way of thinking about the JIT is as an accelerator for the interpreter.
Less really is more
December 27, 2010
The last two weeks were consumed by continuously rewriting the core runtime. It is now five times smaller at just over 3,000 lines of code. The size is deceptive. Template meta-programming is used throughout (along with unavoidably convenient macros).
For this project, I believe a small quantity of elegantly complex code is preferable to a simpler but larger code base. Less code, even if dense, will be easier to maintain going forward and allow faster progress.
Interpreter 2: Rise of the Stack Machine
December 20, 2010
I rewrote everything.
Original design (week before last):
- syntax trees and statement graphs on heap
- interpreter recursion by pointer chasing
- values in reference counted variables
The rewrite (last week):
- const parse trees and statements converted directly to stack bytecode
- interpreter recursion by stack iteration
- no values or reference counting, DAG as bytecode stack is full computation
The reason I moved to a bytecode stack machine is for the JIT. Pointer chasing over a DAG is fine so long as it is strictly interpreted. This becomes impractical for a JIT as the graph must be transformed. The JIT will:
- find subgraphs it knows how to optimize
- synthesize new kernel functions for these subgraphs
- replace subgraphs with new kernels
Another concern was all of the locking in the standard library allocators and the loss of memory locality from building everything on the heap. It’s just not efficient and could be costly inside inner loops.
There’s one feature of the original design that must be carried over to the rewrite. Variables need mutable state and reference counting again. Once a variable has a value, the graph of computations required to calculate it can be discarded. Without this, the bytecode stack often becomes huge as the full computation is an enormous graph. There’s nothing surprising about this. Tail recursion and memoization are often necessary for recursive algorithms. Otherwise, combinatorial explosions take hold.
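The value-caching point can be shown with a toy variable (hypothetical names, not the runtime's actual types): once a variable is forced, its value is cached and the computation behind it is discarded, so later uses reference the value rather than re-expanding the graph.

```cpp
#include <cassert>
#include <functional>

// Toy sketch of a memoizing variable. The boxed computation runs at
// most once; after the first force() the graph behind it is released
// and only the cached value remains.
struct Var {
    std::function<double()> compute;  // stands in for the boxed graph
    bool cached = false;
    double value = 0.0;

    double force() {
        if (!cached) {
            value = compute();        // evaluate exactly once
            cached = true;
            compute = nullptr;        // discard the computation graph
        }
        return value;
    }
};
```

Without this, every reference to a variable would drag its entire computational history back onto the bytecode stack, which is the combinatorial explosion described above.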
One major advantage of a bytecode based virtual machine is that it decouples the front-end. The original design had no clear separation boundary between the front-end and middle/back-end. The zoo of types threatened to turn the design into a big ball of mud. With bytecode, there is clear separation between layers. This also will make it much easier to implement additional front-ends, especially in other languages.
I’m a little disappointed that I don’t have a working JIT yet, even as a primitive proof-of-concept. But the rewrite was necessary to make the JIT possible. And the design is much better now.
resurrected interpreter from visitor hell
December 13, 2010
I wrote an interpreter back-end that allows running PeakStream examples on the CPU. It supports the entire API (which will continue to evolve). This gives me a lot of confidence in the basic design. It is somewhat complex internally as mutable state is never shared (with one exception at the highest level in the executive, and that is MROW protected). This should allow lock-free concurrency most of the time (if locks inside standard library allocators are ignored – this may become an issue later).
Here are three samples from Public Google PeakStream.ppt that now work.
Conjugate gradient example from page 13:
int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
{
    int iter;
    Arrayf32 x = Arrayf32::zeros(N);
    {
        Arrayf32 A = Arrayf32::make2(N, N, cpuA);
        Arrayf32 b = Arrayf32::make1(N, cpub);
        Arrayf32 residuals = b - matmul(A, x);
        Arrayf32 p = residuals;
        Arrayf32 newRR = dot_product(residuals, residuals);
        for (iter = 0; iter < N; iter++)
        {
            Arrayf32 oldRR = newRR;
            Arrayf32 newX, newP, newResiduals;
            Arrayf32 Ap = matmul(A, p);
            Arrayf32 dp = dot_product(p, Ap);
            newX = x + p * oldRR / dp;
            newResiduals = residuals - Ap * oldRR / dp;
            newRR = dot_product(newResiduals, newResiduals);
            newP = newResiduals + p * newRR / oldRR;
            p = newP;
            residuals = newResiduals;
            float oldRRcpu = oldRR.read_scalar();
            if (oldRRcpu <= TOLERANCE)
            {
                break;
            }
            x = newX;
        }
    }
    x.read1(cpux, N * sizeof(float));
    return iter;
}
Monte-carlo π calculation from page 17:
Arrayf32 compute_pi(void)
{
    RNGf32 G(SP_RNG_DEFAULT, 271828);  // create an RNG
    Arrayf32 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 Y = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 distance_from_zero = sqrt(X * X + Y * Y);
    Arrayf32 inside_circle = (distance_from_zero <= 1.0f);
    return 4.0f * sum(inside_circle) / NSET;
}

Arrayf32 Pi = compute_pi();    // get the answer as a 1x1 array
float pi = Pi.read_scalar();   // convert answer to a simple float
printf("Value of Pi = %f\n", pi);
Monte-carlo options pricing from page 36:
float MonteCarloAntithetic(float price, float strike, float vol,
                           float rate, float div, float T)
{
    float deltat = T / N;
    float muDeltat = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU = 0.0f;
    Arrayf32 meanSP;  // result
    {
        // a new scope to hold temporary arrays
        RNGf32 rng_hndl(SP_RNG_CEICG12M6, 0);
        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++)
        {
            U += rng_normal_make(rng_hndl, M);
        }
        Arrayf32 values;
        {
            Arrayf32 lnS1 = log(price) + N * muDeltat + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }  // all temporaries released as we exit scope
    meanCPU = meanSP.read_scalar();
    return meanCPU;
}
Getting the code samples above to work has been an interesting exercise. I can see how “Perl happens” or “C++ happens”… that’s when platform implementation comes first and language grammar is an afterthought. This tends to create dark corners (often touted as clever features by experts) in which semantics surprise the unwary. I am trying to avoid these situations.
The next major step is the JIT, the heart and soul of the system. I don’t know enough to make any accurate statements other than it will take much more time to do. So to keep things rolling, I am going to start working on multiple areas in order to pipeline work.
One of the big advantages of PeakStream over its competitors at the time was ease-of-use. The virtual machine and JIT automated work the developer would otherwise have to do by hand. However, many software developers, data analysts and quantitative scientists are still very uncomfortable with C++ and would prefer different languages.
The PeakStream platform had a language binding to C++ only. I think it would be nice to have bindings to more languages. Two I am thinking of now are Ruby and Scala. They are DSL friendly and may support the imperative syntax required for intuitively natural programming.
This sounds crazy! GPGPU in Ruby?! But I think it makes a lot of sense. Very often, scripting languages are the best tool for data extraction, transformation and reporting. The weakness is any sort of numerically intense analytical processing. Scripting languages are much slower than native code or the JVM. As a result, computational pipelines often have multiple stages on different language platforms, and the system as a whole becomes more complex.
More generally, native code fluency is a requirement for GPGPU even today. Wouldn’t it be nice if using GPUs for general purpose computation did not require knowing C or C++ at all? GPGPU would then become normal instead of something exotic that few people know about. It would be like floating point: part of the language, something that just happens for you automatically.
visiting multiple dispatch hell
December 5, 2010
I hacked together an interpreter back-end, enough to run a simple PeakStream example.
Arrayf64 C;
{
    // eagerly create buffers/images on compute device
    Arrayf64 A = make1(100, cpuA);
    Arrayf64 B = make1(100, cpuB);

    // lazy boxed calculation
    C = sum(A + B);
}
double c = C.read_scalar();  // force evaluation, read back
I want a full implementation of the language in an interpreter before starting work on the JIT. That will make debugging and testing much easier. After the interpreter, work will start on the JIT and OpenCL back-end.
Another thing done last week was puzzling out the PeakStream grammar. It’s almost all there now. The API is still missing special flags like SP_ALL_DIMS and SP_MEMORY_NORMAL. I’m also unsure how stream resizing works in the gather1_floor() and gather2_floor() operations.
Several engineers have independently asked the same two questions, as if the answers must be mutually exclusive.
- Is PeakStream a language? Answer: Yes
- Or is PeakStream a library? Answer: Yes
PeakStream is a DSL that calls a foreign virtual machine from C++. Lua is an example of another language that works this way. I think the confusion comes from the usual pattern for a managed platform like Java, where the foreign call interface runs from Java to native code using JNI. PeakStream works in the opposite direction. You can do the same with Java by creating a JVM inside a native C/C++ application and calling into it, but almost no one ever does.