April 28, 2011
The JIT middle-end looks good. Some transformations are done on bytecode (constants, invariants, loop rolling) while others operate on object trees (reductions, kernel boundaries, reordering, variables). Everything fits into the design as-is. Compilers are cool.
I was hoping to get far enough to start work on the OpenCL back-end. Since the harder work is done in the middle-end (where it belongs), the back-end should be thin and relatively easy. Note that the earlier JIT code was a total hack and completely unsuitable.
Unfortunately, I need to revisit scheduling and memory management. Realistic parallel/concurrent test code (dozens or hundreds of threads) that stresses the runtime is intermittently crashing with segfaults. My goal is a useful platform for real-world use, so I can’t look the other way and pretend it’s not happening.
My experience with Java and Oracle is that managed platforms often struggle under high load, even with performance tuning. That’s OK most of the time: systems move to “share nothing” architectures and scale horizontally.
I think it’s different for GPUs and HPC. Applications are expected to maximize load and stress the system. If a managed platform can’t handle this, why use it? It’s too much trouble.
April 12, 2011
In a few months, the first release will be ready. It will include the full virtual-machine-based language with fast matrix multiply (capable of over 50% GPU utilization) and random number generation (which I have not yet worked out on the GPU), using an OpenCL JIT back-end. This is, in my opinion, the minimal feature set useful enough to overcome the adoption barrier.
I know… what people really want is something like PeakStream in a scripting language.
I see these GPU accelerated Python projects:
- PyGPU – designed for image processing; last updated in 2007; uses OpenGL/Cg
- GpuPy – no download available; also uses OpenGL/Cg; does not appear to optimize shader code
- enjing – this is interesting, very close in spirit to this project; uses CUDA
- PyCUDA/PyOpenCL – relatively low-level foreign function interfaces over CUDA and OpenCL
Of the four, enjing is the most PeakStream-like. It optimizes GPU kernels underneath NumPy. I must take a closer look at it.
Anyway, here are the header files in the project as of this moment.
api/peakstream.h interp/InterpRNGnormal.hpp jit/TransCond.hpp
bytecode/BC.hpp interp/InterpRNGuniform.hpp jit/TransConvert.hpp
bytecode/ByteCodes.hpp interp/InterpScalar.hpp jit/TransDispatch.hpp
bytecode/EditStak.hpp jit/BoxAccum.hpp jit/TransDotprod.hpp
bytecode/HashBC.hpp jit/BoxBase.hpp jit/TransGather.hpp
bytecode/HashJIT.hpp jit/BoxBinop.hpp jit/TransIdxdata.hpp
bytecode/PrintBC.hpp jit/BoxCond.hpp jit/TransIsomorph.hpp
bytecode/RefCnt.hpp jit/BoxConvert.hpp jit/TransLitdata.hpp
bytecode/Stak.hpp jit/BoxDotprod.hpp jit/TransMakedata.hpp
bytecode/Visit.hpp jit/BoxGather.hpp jit/TransMatmul.hpp
data/ArrayMem.hpp jit/BoxIdxdata.hpp jit/TransReadout.hpp
data/ClientTrace.hpp jit/BoxIsomorph.hpp jit/TransRNGnormal.hpp
data/Nut.hpp jit/BoxLitdata.hpp jit/TransRNGuniform.hpp
data/SingleTrace.hpp jit/BoxMakedata.hpp jit/TransScalar.hpp
data/Stream.hpp jit/BoxMatmulMM.hpp jit/VectorStream.hpp
data/VectorTrace.hpp jit/BoxMatmulMV.hpp jit/VisitJIT.hpp
interp/InterpAccum.hpp jit/BoxMatmulVM.hpp misc/MemalignSTL.hpp
interp/InterpBase.hpp jit/BoxMatmulVV.hpp misc/SimpleFuns.hpp
interp/InterpBinop.hpp jit/BoxReadout.hpp misc/TEA.hpp
interp/InterpCond.hpp jit/BoxRNGnormal.hpp misc/UtilFuns.hpp
interp/InterpConvert.hpp jit/BoxRNGuniform.hpp runtime/ArrayClient.hpp
interp/InterpDispatch.hpp jit/BoxScalar.hpp runtime/DeviceBase.hpp
interp/InterpDotprod.hpp jit/JITCompoundStmt.hpp runtime/DeviceMap.hpp
interp/InterpGather.hpp jit/JITRepeatStmt.hpp runtime/Executor.hpp
interp/InterpIdxdata.hpp jit/JITSingleStmt.hpp runtime/Interpreter.hpp
interp/InterpIsomorph.hpp jit/JITStatement.hpp runtime/MemManager.hpp
interp/InterpLitdata.hpp jit/JITStream.hpp runtime/Scheduler.hpp
interp/InterpMakedata.hpp jit/JITTrace.hpp runtime/Translator.hpp
interp/InterpMatmul.hpp jit/TransAccum.hpp vendor/OCLdevice.hpp
interp/InterpReadout.hpp jit/TransBase.hpp vendor/OCLinit.hpp
The source is approximately 12.5 KLOCs. I am still porting the JIT over into the current source tree. Fast matrix multiply (at least for ATI GPUs) will come from GATLAS, which is roughly 13 KLOCs. More work on the runtime, profiling, and random number generation will likely push the total to about 50 KLOCs for the first release.
The PeakStream beta had full support for the usual math kernel libraries. This first release will only have partial support: it will expose the OpenCL built-in functions (as far as I can). Matrix factorizations and direct methods for solving linear systems will be missing. I do not have sufficient resources (running out of time!). (I could cheat by exposing BLAS/LAPACK through the interpreter only – but the user would get better performance calling those libraries directly from native code.)
PeakStream also had extensive error bounds and round-off error characterization. At least for now, I am not even going to attempt that.
One must-have for the first release is good documentation and especially sample code. The user I have in mind is not an expert in C++ and needs examples that can be copied and pasted. Ideally, someone whose primary language is MATLAB and who knows enough C++ to “be dangerous” should be able to use this.
April 7, 2011
Basic loop rolling isn’t difficult.
Here’s a simple example:
Arrayf64 A = make1(100, cpuA);
Arrayf64 B = 5;
for (size_t i = 0; i < 2; i++)
{
    for (size_t j = 0; j < 3; j++)
        A = A + B;
    B = B + 1;
}
Loop rolling is done in three passes:
ORIGINAL TRACE
0.0
0.1 13 PTR ( 0x24dca40 )
1.0
1.1 50 5
0.2 3 25 0.1 1.1
0.3 3 25 0.2 1.1
0.4 3 25 0.3 1.1
1.2 3 25 1.1 49 1
0.5 3 25 0.4 1.2
0.6 3 25 0.5 1.2
0.7 3 25 0.6 1.2
1.3 3 25 1.2 49 1

PASS 1
0.0
0.1 13 PTR ( 0x24dca40 )
1.0
1.1 50 5
repeat compound stmt 3 times:
    0 3 25 0.1 1.1
1.2 3 25 1.1 49 1
0.5 3 25 0.4 1.2
0.6 3 25 0.5 1.2
0.7 3 25 0.6 1.2
1.3 3 25 1.2 49 1

PASS 2
0.0
0.1 13 PTR ( 0x24dca40 )
1.0
1.1 50 5
repeat compound stmt 3 times:
    0 3 25 0.1 1.1
1.2 3 25 1.1 49 1
repeat compound stmt 3 times:
    0 3 25 0.4 1.2
1.3 3 25 1.2 49 1

PASS 3
0.0
0.1 13 PTR ( 0x24dca40 )
1.0
1.1 50 5
repeat compound stmt 2 times:
    repeat compound stmt 3 times:
        0 3 25 0.1 1.1
    1 3 25 1.1 49 1
April 3, 2011
The interpreter seems to function correctly with scheduling and vectorized memory management. Now it’s back to the OpenCL JIT.
My sense of the first release:
- PeakStream API and managed platform
- OpenCL JIT built around high throughput auto-tuned matrix multiply
- useful random number generation on the GPU
- user documentation
The specific application domains I have in mind are business analytics, forecasting, and options pricing. I have experience with the first two from Amazon’s retail supply chain. The last one is just too important to ignore.