there are so many cores


Progress report and more details of the first release

In a few months, the first release will be ready. It will include the full virtual-machine-based language with fast matrix multiply (capable of over 50% GPU utilization) and random number generation (I have not yet figured out how to do this on the GPU), using an OpenCL JIT back-end. This is, in my opinion, the minimal feature set useful enough to overcome the adoption barrier.

I know… what people really want is something like PeakStream in a scripting language.

Me too.

I see these GPU-accelerated Python projects:

  1. PyGPU – designed for image processing, last updated in 2007 using OpenGL/Cg
  2. GpuPy – no download, also uses OpenGL/Cg, does not appear to optimize shader code
  3. enjing – this is interesting, very close in spirit to this project, uses CUDA
  4. PyCUDA/PyOpenCL – relatively low level foreign function interface over CUDA and OpenCL

Of the four, enjing is the most PeakStream-like. It optimizes GPU kernels underneath NumPy. I must take a closer look at it.

Anyway, here are the header files in the project as of this moment.

api/peakstream.h           interp/InterpRNGnormal.hpp   jit/TransCond.hpp
bytecode/BC.hpp            interp/InterpRNGuniform.hpp  jit/TransConvert.hpp
bytecode/ByteCodes.hpp     interp/InterpScalar.hpp      jit/TransDispatch.hpp
bytecode/EditStak.hpp      jit/BoxAccum.hpp             jit/TransDotprod.hpp
bytecode/HashBC.hpp        jit/BoxBase.hpp              jit/TransGather.hpp
bytecode/HashJIT.hpp       jit/BoxBinop.hpp             jit/TransIdxdata.hpp
bytecode/PrintBC.hpp       jit/BoxCond.hpp              jit/TransIsomorph.hpp
bytecode/RefCnt.hpp        jit/BoxConvert.hpp           jit/TransLitdata.hpp
bytecode/Stak.hpp          jit/BoxDotprod.hpp           jit/TransMakedata.hpp
bytecode/Visit.hpp         jit/BoxGather.hpp            jit/TransMatmul.hpp
data/ArrayMem.hpp          jit/BoxIdxdata.hpp           jit/TransReadout.hpp
data/ClientTrace.hpp       jit/BoxIsomorph.hpp          jit/TransRNGnormal.hpp
data/Nut.hpp               jit/BoxLitdata.hpp           jit/TransRNGuniform.hpp
data/SingleTrace.hpp       jit/BoxMakedata.hpp          jit/TransScalar.hpp
data/Stream.hpp            jit/BoxMatmulMM.hpp          jit/VectorStream.hpp
data/VectorTrace.hpp       jit/BoxMatmulMV.hpp          jit/VisitJIT.hpp
interp/InterpAccum.hpp     jit/BoxMatmulVM.hpp          misc/MemalignSTL.hpp
interp/InterpBase.hpp      jit/BoxMatmulVV.hpp          misc/SimpleFuns.hpp
interp/InterpBinop.hpp     jit/BoxReadout.hpp           misc/TEA.hpp
interp/InterpCond.hpp      jit/BoxRNGnormal.hpp         misc/UtilFuns.hpp
interp/InterpConvert.hpp   jit/BoxRNGuniform.hpp        runtime/ArrayClient.hpp
interp/InterpDispatch.hpp  jit/BoxScalar.hpp            runtime/DeviceBase.hpp
interp/InterpDotprod.hpp   jit/JITCompoundStmt.hpp      runtime/DeviceMap.hpp
interp/InterpGather.hpp    jit/JITRepeatStmt.hpp        runtime/Executor.hpp
interp/InterpIdxdata.hpp   jit/JITSingleStmt.hpp        runtime/Interpreter.hpp
interp/InterpIsomorph.hpp  jit/JITStatement.hpp         runtime/MemManager.hpp
interp/InterpLitdata.hpp   jit/JITStream.hpp            runtime/Scheduler.hpp
interp/InterpMakedata.hpp  jit/JITTrace.hpp             runtime/Translator.hpp
interp/InterpMatmul.hpp    jit/TransAccum.hpp           vendor/OCLdevice.hpp
interp/InterpReadout.hpp   jit/TransBase.hpp            vendor/OCLinit.hpp
interp/InterpRNG.hpp       jit/TransBinop.hpp

The source is approximately 12.5 KLOCs. I am still porting over the JIT into the current source tree. Fast matrix multiply (at least for ATI GPUs) will come from GATLAS which is roughly 13 KLOCs. More work on the runtime, profiling and random number generation will likely push total size to about 50 KLOCs for the first release.
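For readers wondering what "virtual machine based" means in practice, here is a toy sketch of the general idea: array operations are recorded as bytecode in a trace, and nothing is computed until the result is read out. This is not code from the project. Every name in it (`Trace`, `Op`, `Instr`, `readout`, `run_example`) is hypothetical and made up for illustration, and the real bytecode, interpreter and JIT are certainly more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcodes for a toy stack machine (not the project's bytecode).
enum class Op { PushArray, PushScalar, Add, Mul };

struct Instr {
    Op op;
    std::size_t arrayIndex; // used only by PushArray
    float scalar;           // used only by PushScalar
};

// Elementwise binary operation; a length-1 operand is broadcast as a scalar.
static std::vector<float> binop(Op op,
                                const std::vector<float>& x,
                                const std::vector<float>& y) {
    const std::size_t n = std::max(x.size(), y.size());
    assert((x.size() == n || x.size() == 1) &&
           (y.size() == n || y.size() == 1));
    std::vector<float> r(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float a = x[x.size() == 1 ? 0 : i];
        const float b = y[y.size() == 1 ? 0 : i];
        r[i] = (op == Op::Add) ? a + b : a * b;
    }
    return r;
}

// A trace records instructions instead of computing immediately.
class Trace {
public:
    void pushArray(const std::vector<float>& a) {
        arrays_.push_back(a);
        code_.push_back({Op::PushArray, arrays_.size() - 1, 0.0f});
    }
    void pushScalar(float s) { code_.push_back({Op::PushScalar, 0, s}); }
    void add() { code_.push_back({Op::Add, 0, 0.0f}); }
    void mul() { code_.push_back({Op::Mul, 0, 0.0f}); }

    // Interpreter: evaluate the recorded bytecode on the CPU.
    // A JIT back-end would instead translate the same bytecode into a kernel.
    std::vector<float> readout() const {
        std::vector<std::vector<float>> stack;
        for (const Instr& in : code_) {
            switch (in.op) {
            case Op::PushArray:
                stack.push_back(arrays_[in.arrayIndex]);
                break;
            case Op::PushScalar:
                stack.push_back(std::vector<float>(1, in.scalar));
                break;
            case Op::Add:
            case Op::Mul: {
                assert(stack.size() >= 2);
                std::vector<float> y = stack.back(); stack.pop_back();
                std::vector<float> x = stack.back(); stack.pop_back();
                stack.push_back(binop(in.op, x, y));
                break;
            }
            }
        }
        assert(stack.size() == 1);
        return stack.back();
    }

private:
    std::vector<Instr> code_;
    std::vector<std::vector<float>> arrays_;
};

// Records (a + b) * 2 and evaluates it only when readout() is called.
std::vector<float> run_example() {
    Trace t;
    t.pushArray({1.0f, 2.0f, 3.0f});
    t.pushArray({10.0f, 20.0f, 30.0f});
    t.add();
    t.pushScalar(2.0f);
    t.mul();
    return t.readout();
}
```

Calling `run_example()` from a `main` of your own returns the three values 22, 44 and 66. The point of deferring the work is that the whole recorded expression is visible before anything runs, which is what gives a JIT the chance to fuse it into a single kernel rather than executing one operation at a time.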

The PeakStream beta had full support for the usual math kernel libraries. This first release will have only partial support. It will expose the OpenCL built-in functions as far as I can manage. Matrix factorizations and direct methods for solving linear systems will be missing, as I do not have sufficient resources (running out of time!). I could cheat by exposing BLAS/LAPACK through the interpreter only, but the user would get better performance by calling these libraries directly from native code.

PeakStream also had extensive error bounds and round-off error characterization. At least for now, I am not even going to attempt that.

One must-have for the first release is good documentation, and especially sample code. The user I have in mind is not a C++ expert and needs examples that can be copied and pasted. Ideally, someone whose primary language is MATLAB and who knows enough C++ to “be dangerous” should be able to use this.

