there are so many cores



Status report: JIT middle-end looks good, but stress testing compels revisiting the scheduler and memory management

JIT middle-end stuff looks good. Some transformations are in bytecode (constants, invariants, loop rolling) while others are in object trees (reductions, kernel boundaries, reordering, variables). Everything fits into the design as-is. Compilers are cool.
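To make the split concrete, here is a minimal sketch of the two levels. All names are hypothetical, not the actual classes in the tree, although the bytecode/Visit.hpp and jit/Box*.hpp headers listed in the next post hint at a similar shape.

#include <memory>
#include <vector>

// Hypothetical sketch of the two-level middle-end described above.

// Level 1: flat bytecode statements, rewritten in place.
struct Instr
{
    int                 opcode;
    std::vector<double> operands;
};

// A bytecode pass, e.g. constant folding over the statement stream.
void foldConstants(std::vector<Instr>& trace)
{
    for (size_t i = 0; i < trace.size(); i++)
    {
        // ...match constant operand patterns and rewrite trace[i] in place...
    }
}

// Level 2: boxed object trees, walked by visitors for reductions,
// kernel boundaries, and reordering.
struct AstVisitor;

struct AstNode
{
    virtual ~AstNode() { }
    virtual void accept(AstVisitor& v) = 0;

    std::vector< std::unique_ptr<AstNode> > children;
};

struct AstVisitor
{
    virtual ~AstVisitor() { }
    virtual void visit(AstNode& node) = 0;  // e.g. decide a kernel boundary here
};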

I was hoping to get far enough to start work on the OpenCL back-end. Since the harder work is done in the middle-end (where it belongs), the back-end should be thin and relatively easy. Note that the earlier JIT code was a total hack and completely unsuitable.

Unfortunately, I need to revisit scheduling and memory management. Realistic parallel/concurrent test code (dozens or hundreds of threads) that stresses the runtime is intermittently segfaulting and crashing. My goal is a platform fit for real-world use, so I can’t look the other way and pretend it isn’t happening.
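The failing tests look roughly like this: many client threads hammering the managed runtime at once. This is only a sketch assuming a PeakStream-style API (Arrayf64, make1, sum, read_scalar); the real test harness is different.

#include <thread>
#include <vector>

#include <peakstream.h>

// Sketch of the kind of stress test that is intermittently crashing:
// many client threads driving the managed runtime at once. The API names
// are PeakStream-style assumptions, not necessarily this project's.
void worker()
{
    double buf[100] = { 0 };

    for (int iter = 0; iter < 1000; iter++)
    {
        Arrayf64 A = make1(100, buf);
        Arrayf64 B = A + 1;
        const double r = sum(B).read_scalar();  // force evaluation
        (void) r;
    }
}

int main()
{
    std::vector<std::thread> pool;

    for (int i = 0; i < 100; i++)  // dozens to hundreds of client threads
        pool.push_back(std::thread(worker));

    for (size_t i = 0; i < pool.size(); i++)
        pool[i].join();

    return 0;
}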

My experience with Java and Oracle is that managed platforms often struggle under high load, even with performance tuning. That’s acceptable most of the time: systems move to “share nothing” architectures and scale horizontally.

I think it’s different for GPUs and HPC. Applications are expected to maximize load and stress the system. If a managed platform can’t handle this, why use it? It’s too much trouble.


Progress report and more details of the first release

In a few months, the first release will be ready. It will include the full virtual-machine-based language with fast matrix multiply (capable of over 50% GPU utilization) and random number generation (which I have not yet figured out on the GPU) using an OpenCL JIT back-end. This is, in my opinion, the minimal feature set useful enough to overcome the adoption barrier.

I know… what people really want is something like PeakStream in a scripting language.

Me too.

I see four GPU-accelerated Python projects:

  1. PyGPU – designed for image processing, last updated in 2007 using OpenGL/Cg
  2. GpuPy – no download, also uses OpenGL/Cg, does not appear to optimize shader code
  3. enjing – this is interesting, very close in spirit to this project, uses CUDA
  4. PyCUDA/PyOpenCL – relatively low level foreign function interface over CUDA and OpenCL

Of the four, enjing is the most PeakStream-like. It optimizes GPU kernels underneath NumPy. I must take a closer look at it.

Anyway, here are the header files in the project as of this moment.

api/peakstream.h           interp/InterpRNGnormal.hpp   jit/TransCond.hpp
bytecode/BC.hpp            interp/InterpRNGuniform.hpp  jit/TransConvert.hpp
bytecode/ByteCodes.hpp     interp/InterpScalar.hpp      jit/TransDispatch.hpp
bytecode/EditStak.hpp      jit/BoxAccum.hpp             jit/TransDotprod.hpp
bytecode/HashBC.hpp        jit/BoxBase.hpp              jit/TransGather.hpp
bytecode/HashJIT.hpp       jit/BoxBinop.hpp             jit/TransIdxdata.hpp
bytecode/PrintBC.hpp       jit/BoxCond.hpp              jit/TransIsomorph.hpp
bytecode/RefCnt.hpp        jit/BoxConvert.hpp           jit/TransLitdata.hpp
bytecode/Stak.hpp          jit/BoxDotprod.hpp           jit/TransMakedata.hpp
bytecode/Visit.hpp         jit/BoxGather.hpp            jit/TransMatmul.hpp
data/ArrayMem.hpp          jit/BoxIdxdata.hpp           jit/TransReadout.hpp
data/ClientTrace.hpp       jit/BoxIsomorph.hpp          jit/TransRNGnormal.hpp
data/Nut.hpp               jit/BoxLitdata.hpp           jit/TransRNGuniform.hpp
data/SingleTrace.hpp       jit/BoxMakedata.hpp          jit/TransScalar.hpp
data/Stream.hpp            jit/BoxMatmulMM.hpp          jit/VectorStream.hpp
data/VectorTrace.hpp       jit/BoxMatmulMV.hpp          jit/VisitJIT.hpp
interp/InterpAccum.hpp     jit/BoxMatmulVM.hpp          misc/MemalignSTL.hpp
interp/InterpBase.hpp      jit/BoxMatmulVV.hpp          misc/SimpleFuns.hpp
interp/InterpBinop.hpp     jit/BoxReadout.hpp           misc/TEA.hpp
interp/InterpCond.hpp      jit/BoxRNGnormal.hpp         misc/UtilFuns.hpp
interp/InterpConvert.hpp   jit/BoxRNGuniform.hpp        runtime/ArrayClient.hpp
interp/InterpDispatch.hpp  jit/BoxScalar.hpp            runtime/DeviceBase.hpp
interp/InterpDotprod.hpp   jit/JITCompoundStmt.hpp      runtime/DeviceMap.hpp
interp/InterpGather.hpp    jit/JITRepeatStmt.hpp        runtime/Executor.hpp
interp/InterpIdxdata.hpp   jit/JITSingleStmt.hpp        runtime/Interpreter.hpp
interp/InterpIsomorph.hpp  jit/JITStatement.hpp         runtime/MemManager.hpp
interp/InterpLitdata.hpp   jit/JITStream.hpp            runtime/Scheduler.hpp
interp/InterpMakedata.hpp  jit/JITTrace.hpp             runtime/Translator.hpp
interp/InterpMatmul.hpp    jit/TransAccum.hpp           vendor/OCLdevice.hpp
interp/InterpReadout.hpp   jit/TransBase.hpp            vendor/OCLinit.hpp
interp/InterpRNG.hpp       jit/TransBinop.hpp

The source is approximately 12.5 KLOC. I am still porting the JIT into the current source tree. Fast matrix multiply (at least for ATI GPUs) will come from GATLAS, which is roughly 13 KLOC. More work on the runtime, profiling, and random number generation will likely push the total to about 50 KLOC for the first release.

The PeakStream beta had full support for the usual math kernel libraries. This first release will only have partial support: it will expose the OpenCL built-in functions (as far as I can). Matrix factorizations and direct methods for solving linear systems will be missing; I do not have sufficient resources (running out of time!). I could cheat by exposing BLAS/LAPACK through the interpreter only, but the user will get better performance calling those libraries directly from native code.
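To be concrete, “exposing the OpenCL built-in functions” means the element-wise math functions surface directly on arrays. A sketch in the PeakStream style; the exact signatures here are illustrative, not final.

#include <peakstream.h>

// Sketch: OpenCL built-ins (exp, sqrt, ...) exposed as element-wise
// array operations. PeakStream-style names; signatures are illustrative.
void builtins_example(double* cpuX)  // points to 100 doubles, assumed initialized
{
    Arrayf64 x = make1(100, cpuX);
    Arrayf64 y = exp(x) + sqrt(x);  // each call maps to an OpenCL built-in per element
    const double total = sum(y).read_scalar();
    (void) total;
}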

PeakStream also had extensive error bounds and round-off error characterization. At least for now, I am not even going to attempt that.

One must-have for the first release is good documentation, especially sample code. The user I have in mind is not a C++ expert and needs examples that can be copied and pasted. Ideally, someone whose primary language is MATLAB and who knows just enough C++ to “be dangerous” should be able to use this.
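The canonical PeakStream sample, Monte Carlo estimation of pi, is exactly the kind of copy-and-paste example I mean. Here is a sketch reconstructed from the published PeakStream demo; the RNG types and signatures in this project may end up different.

#include <peakstream.h>

// Estimate pi by sampling random points in the unit square and counting
// the fraction that fall inside the quarter circle. Reconstructed from
// the published PeakStream demo; details here are assumptions.
double estimate_pi()
{
    const int NSET = 1000000;

    RNGf64 G(SP_RNG_DEFAULT, 271828);                      // seeded generator
    Arrayf64 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0);   // NSET uniforms in [0, 1)
    Arrayf64 Y = rng_uniform_make(G, NSET, 1, 0.0, 1.0);

    Arrayf64 inside = sqrt(X * X + Y * Y) <= 1.0;          // 1.0 inside, 0.0 outside

    return (4.0 * sum(inside) / NSET).read_scalar();
}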

Nested loop rolling is easy

Basic loop rolling isn’t difficult.

Here’s a simple example:

#include <peakstream.h>  // the project API header (api/peakstream.h in the listing above)

double cpuA[100];  // host-side data, assumed initialized elsewhere

Arrayf64 A = make1(100, cpuA);  // 1-D array of 100 doubles from host memory
Arrayf64 B = 5;                 // scalar constant

for (size_t i = 0; i < 2; i++)
{
    for (size_t j = 0; j < 3; j++)
    {
        A = A + B;  // recorded in the trace, not executed eagerly
    }

    B = B + 1;
}

Loop rolling is done in three passes:

ORIGINAL TRACE

0.0
0.1  13 PTR ( 0x24dca40 )
1.0
1.1  50 5
0.2  3 25 0.1 1.1
0.3  3 25 0.2 1.1
0.4  3 25 0.3 1.1
1.2  3 25 1.1 49 1
0.5  3 25 0.4 1.2
0.6  3 25 0.5 1.2
0.7  3 25 0.6 1.2
1.3  3 25 1.2 49 1

PASS 1

0.0
0.1  13 PTR ( 0x24dca40 )
1.0
1.1  50 5
repeat compound stmt 3 times:
  0  3 25 0.1 1.1
1.2  3 25 1.1 49 1
0.5  3 25 0.4 1.2
0.6  3 25 0.5 1.2
0.7  3 25 0.6 1.2
1.3  3 25 1.2 49 1

PASS 2

0.0
0.1  13 PTR ( 0x24dca40 )
1.0
1.1  50 5
repeat compound stmt 3 times:
  0  3 25 0.1 1.1
1.2  3 25 1.1 49 1
repeat compound stmt 3 times:
  0  3 25 0.4 1.2
1.3  3 25 1.2 49 1

PASS 3

0.0
0.1  13 PTR ( 0x24dca40 )
1.0
1.1  50 5
repeat compound stmt 2 times:
  repeat compound stmt 3 times:
    0  3 25 0.1 1.1
  1  3 25 1.1 49 1

Pass 1 rolls the first run of three identical A = A + B statements into a repeat. Pass 2 does the same for the second run. Pass 3 then recognizes that the two compound statements, each followed by a B = B + 1 update, are themselves identical and rolls them into an outer repeat of two. The final form mirrors the source exactly: the j loop nested inside the i loop.

My sense of the first release

The interpreter now seems to function correctly with the reworked scheduling and vectorized memory management. So it’s back to the OpenCL JIT.

My sense of the first release:

  1. PeakStream API and managed platform
  2. OpenCL JIT built around high throughput auto-tuned matrix multiply
  3. useful random number generation on the GPU
  4. user documentation

The specific application domains I have in mind are business analytics, forecasting, and options pricing. I have experience with the first two from Amazon’s retail supply chain. The last is just too important to ignore.