Progress report and more details of the first release
April 12, 2011
In a few months, the first release will be ready. It will include the full virtual-machine-based language with fast matrix multiply (capable of over 50% GPU utilization) and random number generation (I have not figured this out on the GPU yet), all using an OpenCL JIT back-end. This is, in my opinion, the minimal feature set useful enough to overcome the adoption barrier.
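To make the utilization figure concrete: it is usually computed as achieved floating-point throughput divided by the theoretical peak of the device, counting 2·M·N·K operations for a dense matrix multiply. The sketch below assumes that definition. The function names are made up for illustration and the peak value is whatever the caller takes from the vendor's spec sheet.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations in a dense (M x K) * (K x N) matrix multiply:
// each of the M*N outputs needs K multiplies and K adds.
double matmulFlops(std::size_t m, std::size_t n, std::size_t k)
{
    return 2.0 * static_cast<double>(m)
               * static_cast<double>(n)
               * static_cast<double>(k);
}

// Achieved throughput in GFLOPS, given a measured kernel time in seconds.
double achievedGflops(std::size_t m, std::size_t n, std::size_t k, double seconds)
{
    assert(seconds > 0.0);
    return matmulFlops(m, n, k) / seconds / 1.0e9;
}

// Utilization as a fraction of the theoretical peak (hypothetical input:
// peakGflops is supplied by the caller, not queried from the device).
double utilization(double achieved, double peakGflops)
{
    assert(peakGflops > 0.0);
    return achieved / peakGflops;
}
```

For example, a 1000 x 1000 by 1000 x 1000 multiply that takes 2 ms achieves about 1000 GFLOPS, which would be 50% utilization on a device with a (hypothetical) 2000 GFLOPS peak.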
I know… what people really want is something like PeakStream in a scripting language.
I see these GPU-accelerated Python projects:
- PyGPU – designed for image processing, last updated in 2007 using OpenGL/Cg
- GpuPy – no download, also uses OpenGL/Cg, does not appear to optimize shader code
- enjing – this is interesting, very close in spirit to this project, uses CUDA
- PyCUDA/PyOpenCL – relatively low level foreign function interface over CUDA and OpenCL
Of the four, enjing is the most PeakStream-like. It optimizes GPU kernels underneath NumPy. I must take a closer look at it.
Anyway, here are the header files in the project as of this moment.
api/peakstream.h interp/InterpRNGnormal.hpp jit/TransCond.hpp
bytecode/BC.hpp interp/InterpRNGuniform.hpp jit/TransConvert.hpp
bytecode/ByteCodes.hpp interp/InterpScalar.hpp jit/TransDispatch.hpp
bytecode/EditStak.hpp jit/BoxAccum.hpp jit/TransDotprod.hpp
bytecode/HashBC.hpp jit/BoxBase.hpp jit/TransGather.hpp
bytecode/HashJIT.hpp jit/BoxBinop.hpp jit/TransIdxdata.hpp
bytecode/PrintBC.hpp jit/BoxCond.hpp jit/TransIsomorph.hpp
bytecode/RefCnt.hpp jit/BoxConvert.hpp jit/TransLitdata.hpp
bytecode/Stak.hpp jit/BoxDotprod.hpp jit/TransMakedata.hpp
bytecode/Visit.hpp jit/BoxGather.hpp jit/TransMatmul.hpp
data/ArrayMem.hpp jit/BoxIdxdata.hpp jit/TransReadout.hpp
data/ClientTrace.hpp jit/BoxIsomorph.hpp jit/TransRNGnormal.hpp
data/Nut.hpp jit/BoxLitdata.hpp jit/TransRNGuniform.hpp
data/SingleTrace.hpp jit/BoxMakedata.hpp jit/TransScalar.hpp
data/Stream.hpp jit/BoxMatmulMM.hpp jit/VectorStream.hpp
data/VectorTrace.hpp jit/BoxMatmulMV.hpp jit/VisitJIT.hpp
interp/InterpAccum.hpp jit/BoxMatmulVM.hpp misc/MemalignSTL.hpp
interp/InterpBase.hpp jit/BoxMatmulVV.hpp misc/SimpleFuns.hpp
interp/InterpBinop.hpp jit/BoxReadout.hpp misc/TEA.hpp
interp/InterpCond.hpp jit/BoxRNGnormal.hpp misc/UtilFuns.hpp
interp/InterpConvert.hpp jit/BoxRNGuniform.hpp runtime/ArrayClient.hpp
interp/InterpDispatch.hpp jit/BoxScalar.hpp runtime/DeviceBase.hpp
interp/InterpDotprod.hpp jit/JITCompoundStmt.hpp runtime/DeviceMap.hpp
interp/InterpGather.hpp jit/JITRepeatStmt.hpp runtime/Executor.hpp
interp/InterpIdxdata.hpp jit/JITSingleStmt.hpp runtime/Interpreter.hpp
interp/InterpIsomorph.hpp jit/JITStatement.hpp runtime/MemManager.hpp
interp/InterpLitdata.hpp jit/JITStream.hpp runtime/Scheduler.hpp
interp/InterpMakedata.hpp jit/JITTrace.hpp runtime/Translator.hpp
interp/InterpMatmul.hpp jit/TransAccum.hpp vendor/OCLdevice.hpp
interp/InterpReadout.hpp jit/TransBase.hpp vendor/OCLinit.hpp
The source is approximately 12.5 KLOCs. I am still porting over the JIT into the current source tree. Fast matrix multiply (at least for ATI GPUs) will come from GATLAS which is roughly 13 KLOCs. More work on the runtime, profiling and random number generation will likely push total size to about 50 KLOCs for the first release.
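For reference, one known way to get random numbers that parallelize well is a counter-based generator built on the Tiny Encryption Algorithm (TEA): each array element encrypts its own index, so no generator state is shared between work-items. The sketch below only illustrates that idea on the CPU. It is not necessarily what misc/TEA.hpp does, and the name uniformAt and the key layout are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Encrypt one 64-bit block in place with the Tiny Encryption Algorithm
// (32 cycles, 128-bit key).
void teaEncrypt(std::uint32_t v[2], const std::uint32_t key[4])
{
    assert(v != nullptr && key != nullptr);
    const std::uint32_t delta = 0x9E3779B9u;
    std::uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (int i = 0; i < 32; ++i) {
        sum += delta;
        v0 += ((v1 << 4) + key[0]) ^ (v1 + sum) ^ ((v1 >> 5) + key[1]);
        v1 += ((v0 << 4) + key[2]) ^ (v0 + sum) ^ ((v0 >> 5) + key[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

// Hypothetical counter-based generator: element `index` of stream `stream`
// gets its value by encrypting the pair (index, stream).
float uniformAt(std::uint32_t index, std::uint32_t stream,
                const std::uint32_t key[4])
{
    std::uint32_t block[2] = { index, stream };
    teaEncrypt(block, key);
    // Keep the top 24 bits so the result is exactly representable as a
    // float and lies in [0, 1).
    return static_cast<float>(block[0] >> 8) * (1.0f / 16777216.0f);
}
```

Because the value depends only on the index, the stream number and the key, the same element always gets the same number no matter which work-item computes it or in what order.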
The PeakStream beta had full support for the usual math kernel libraries. This first release will only have partial support. It will expose the OpenCL built-in functions (as far as I can). Matrix factorizations and direct methods to solve linear systems will be missing. I do not have sufficient resources (running out of time!). (I could cheat with BLAS/LAPACK exposed only through the interpreter – but the user will have better performance using these libraries directly from native code.)
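As a rough idea of what exposing an OpenCL built-in could involve, a JIT back-end can generate kernel source text that applies the built-in to every element of an array. This is a minimal sketch under that assumption, not the project's actual translator: makeUnaryKernel and the apply_ kernel naming are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build OpenCL C source for a kernel that applies one
// built-in unary math function (e.g. "exp" or "sqrt") to each element of a
// float array.
std::string makeUnaryKernel(const std::string& builtin)
{
    assert(!builtin.empty());
    std::string src;
    src += "__kernel void apply_" + builtin + "(\n";
    src += "    __global const float* in,\n";
    src += "    __global float* out)\n";
    src += "{\n";
    src += "    const size_t i = get_global_id(0);\n";
    src += "    out[i] = " + builtin + "(in[i]);\n";
    src += "}\n";
    return src;
}
```

The returned string is what would be handed to the OpenCL runtime (clCreateProgramWithSource followed by clBuildProgram) to compile at run time. A call such as makeUnaryKernel("sqrt") yields a kernel named apply_sqrt.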
PeakStream also had extensive error bounds and round-off error characterization. At least for now, I am not even going to attempt that.
One must-have for the first release is good documentation and especially sample code. The user I have in mind is not a C++ expert and needs examples that can be copied and pasted. Ideally, someone whose primary language is MATLAB and who knows enough C++ to “be dangerous” should be able to use this.