there are so many cores

Monthly Archives: November 2010

What is PeakStream?

PeakStream was a managed platform and array programming language for GPGPU. The language was implemented as a DSL statically compiled into a C++ application, while a virtual machine queued syntax trees and statement graphs at runtime. A JIT then synthesized kernels after applying transformations to minimize the cost of data movement (PCIe bus transfers are slow) and device scheduling overhead (by fusing loops and streams).
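
To make the idea of runtime syntax trees concrete, here is a minimal sketch in C++. It is not PeakStream's API: the names `LazyArray`, `Node`, `make_array` and `evaluate` are all hypothetical. The overloaded operators only record tree nodes, and nothing is computed until `evaluate` is called. The evaluator interprets the tree in a single loop with no temporary arrays, which stands in for loop fusion; a real JIT would generate a device kernel from the tree instead of interpreting it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node kinds for a tiny runtime syntax tree.
enum class Op { Leaf, Add, Mul };

struct Node {
    Op op;
    std::vector<float> data;               // used only by Leaf nodes
    std::shared_ptr<const Node> lhs, rhs;  // used only by Add/Mul nodes
};

// Array handle: operators on it record nodes instead of computing values.
struct LazyArray {
    std::shared_ptr<const Node> node;
};

inline LazyArray make_array(std::vector<float> values) {
    auto n = std::make_shared<Node>();
    n->op = Op::Leaf;
    n->data = std::move(values);
    return {n};
}

inline LazyArray binary(Op op, const LazyArray& a, const LazyArray& b) {
    auto n = std::make_shared<Node>();
    n->op = op;
    n->lhs = a.node;
    n->rhs = b.node;
    return {n};
}

inline LazyArray operator+(const LazyArray& a, const LazyArray& b) {
    return binary(Op::Add, a, b);
}

inline LazyArray operator*(const LazyArray& a, const LazyArray& b) {
    return binary(Op::Mul, a, b);
}

// Number of elements the tree produces; operands must agree in length.
inline std::size_t length(const Node& n) {
    if (n.op == Op::Leaf) return n.data.size();
    const std::size_t l = length(*n.lhs);
    const std::size_t r = length(*n.rhs);
    assert(l == r);
    (void)r;
    return l;
}

// Compute one output element by walking the tree.
inline float eval_at(const Node& n, std::size_t i) {
    switch (n.op) {
        case Op::Leaf: return n.data[i];
        case Op::Add:  return eval_at(*n.lhs, i) + eval_at(*n.rhs, i);
        case Op::Mul:  return eval_at(*n.lhs, i) * eval_at(*n.rhs, i);
    }
    return 0.0f;  // not reached
}

// "Fused" evaluation: one loop over the index space, no intermediate arrays.
inline std::vector<float> evaluate(const LazyArray& a) {
    const std::size_t n = length(*a.node);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = eval_at(*a.node, i);
    return out;
}
```

With two arrays `x` and `y` created by `make_array`, the expression `x * y + x` builds a three-node tree, and `evaluate(x * y + x)` runs it in one pass. An eager implementation would instead run two loops and allocate a temporary for `x * y`.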

But doesn’t OpenCL provide a cross-platform compute language? It does, but writing OpenCL kernels is, in my experience, extremely difficult. Although the language is imperative with a C-like syntax, the semantics are functional (GPGPU kernels are like closures over index spaces)… except that side effects are important (control flow must not diverge, and memory access must coalesce and make good use of caches)… and that vectorization constrains the blocking and data layout of problems… and this is performance-optimized numerical code anyway, which is challenging to design even in an ideal situation. That’s why most engineers and scientists use math kernel libraries and try to avoid reinventing the wheel.
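
The "closure over an index space" view can be sketched on the CPU in plain C++. This is only an analogy, not OpenCL: `launch_1d` is a hypothetical helper that calls the kernel once per index, serially, where a GPU would run each index as a parallel work-item. The lambda captures its arguments the way a kernel binds its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: apply a kernel once for every index in a 1-D index space.
// Indices run serially here; on a GPU they would be parallel work-items.
template <typename Kernel>
void launch_1d(std::size_t global_size, Kernel kernel) {
    for (std::size_t gid = 0; gid < global_size; ++gid) kernel(gid);
}

// saxpy written as a closure over the index space: y[i] = a * x[i] + y[i].
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    launch_1d(x.size(), [a, &x, &y](std::size_t gid) {
        // Each index touches only its own element, so neighbouring indices
        // read neighbouring addresses (the access pattern that coalesces).
        y[gid] = a * x[gid] + y[gid];
    });
}
```

The functional part is that the kernel body depends only on its index and captured arguments. The hard part described above does not show up in a serial loop: on real hardware, branches that differ between neighbouring indices and strided memory access both cost performance.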

Unfortunately, GPU programming is very different from CPU programming. Bandwidth over the PCIe bus between CPU and GPU is very low compared with memory bandwidth. Worse, scheduling GPU kernels carries large overheads. High performance requires both minimizing data movement and writing compute kernels with high arithmetic intensity. Of course, the same is true for CPUs. However, the more cores in the processor, the greater the imbalance between memory bandwidth and processing capacity. GPUs are an extreme case, and that changes how software must be written.
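
Arithmetic intensity is the number of floating point operations performed per byte moved. The sketch below computes it for two textbook workloads and applies a roofline-style bound (a standard estimate that I am bringing in here for illustration). All names are hypothetical, the byte counts assume 4-byte floats with every operand moved exactly once, and any device numbers passed in are placeholders rather than measurements.

```cpp
#include <algorithm>
#include <cassert>

struct Workload {
    double flops;        // floating point operations performed
    double bytes_moved;  // bytes transferred over the limiting link
};

struct Device {
    double bandwidth_bytes_per_s;  // e.g. the bus between host and device
    double peak_flops_per_s;       // peak arithmetic throughput
};

// Element-wise add of two n-element float vectors:
// n flops, two reads and one write of 4 bytes each per element.
inline Workload vector_add(double n) {
    return {n, 3.0 * n * 4.0};
}

// Dense n x n float matrix multiply with ideal reuse:
// 2n^3 flops, three n x n matrices moved once each.
inline Workload matmul(double n) {
    return {2.0 * n * n * n, 3.0 * n * n * 4.0};
}

inline double arithmetic_intensity(const Workload& w) {
    assert(w.bytes_moved > 0.0);
    return w.flops / w.bytes_moved;
}

// Roofline-style bound: throughput is capped either by peak arithmetic
// or by how fast the link can feed operands, whichever is lower.
inline double attainable_flops(const Device& d, const Workload& w) {
    return std::min(d.peak_flops_per_s,
                    arithmetic_intensity(w) * d.bandwidth_bytes_per_s);
}
```

Vector addition has an intensity of 1/12 flop per byte regardless of size, so it is always limited by data movement. Matrix multiply has an intensity of n/6, which grows with the problem size. This is why fusing several element-wise operations into one kernel helps: the flops add up while the bytes moved stay roughly the same.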

That’s what PeakStream addressed in its compiler and managed platform. Applications were much easier to write because the runtime solved many of these problems for the developer. A simpler programming model, designed around human factors, was a primary objective.

PeakStream was acquired by Google in 2007 and subsequently disappeared from view. At that time, the market was somewhat different. OpenCL had not been invented yet. A major competitor was RapidMind (acquired by Intel in 2009). GPUs lacked double precision and were roughly an order of magnitude slower. FPGA-based solutions were attractive for high performance as well as low power consumption. A very good summary of historical interest is "Accelerators For High Performance Computing Investigation", dated January 24, 2007. This report directly motivated my current view that GPGPU-based HPC really should be an open source technology instead of a proprietary one.

This is my current project: to build technology similar to PeakStream, in a clean room, as free and open source software. With this goal, it is natural to start with API compatibility. However, all I have to go on are public presentations and marketing documents.

At the moment, all of the code samples from Matthew Papakipos’ 2007 presentation at Stanford (the YouTube presentation and PPT) compile. The WinHEC 2007 and tomography documents have sample code that uses additional API semantics which have not been added yet. This is harder because I do not have a copy of the PeakStream beta SDK release. That means even the language grammar, let alone the semantics, must be inferred from sample code. However, except for the math kernel libraries, I think API support for the basic language is probably 90% complete. That covers only the very front end of the language, where syntax trees and statement graphs are built at runtime.

At the other extreme are resource managers for an OpenCL back-end (buffers, images, kernels) with support for ATI and NVIDIA. This code is reasonably mature and stable. In the middle, there’s nothing for the JIT yet. A previous project, GATLAS, gives me confidence that I will be able to figure this out.

I believe it is most important to reach end-to-end functionality as early as possible, even if only in vertical slices of the full system. That will provide better visibility into the project and hopefully avoid any major mistakes. This means that optimization and auto-tuning of synthesized kernels will come later in the project. So right now, I’m pretty much working on everything in the “middle-end”.