November 1, 2011
The auto-tuning code is better, faster, and stronger. The C++ template bloat is removed. Code generation is now fully dynamic and no longer constrained by template specialization at compile time. The convenient string hacks in kernel metaprogramming have also been removed: the left-hand side of a statement is always a proper variable, never a value or a string.
The major functional change: a matrix multiply kernel may now use images and memory buffers of different precisions and vector lengths. There is no longer a hard distinction between image (texture sampling) kernels and memory buffer kernels. Code generation handles all combinations.
After this rewrite, I see that the PeakStream domain specific language should be extended. PeakStream was an HPC managed platform designed for GPU technology as it was five years ago.
Since then, GPUs have become stronger, widening the imbalance between ALU throughput and memory bandwidth. It's the supercomputing problem all over again: data movement becomes the dominant cost.
The PeakStream language encapsulates compute devices and hides kernels inside the compute graph. This is good for ease of use, but it puts a heavy burden on the intelligence of the JIT. It's unrealistic to expect the JIT to be that smart.
High arithmetic intensity kernels should be exposed for performance reasons. It's difficult for a JIT to determine the invariants that allow optimizations like using write-only images instead of memory buffers. It's even harder for a JIT to explore algorithm transformations. For instance, transposing a matrix (changing from row-major to column-major data layout) can have dramatic effects on performance.