I am extending and rewriting the GATLAS kernel generation to support mixed precision, vector length, memory buffer, and image arguments. The virtual machine model almost forces this. When a JIT is generating code dynamically, more freedom is required than with math kernels from a static library.
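To make that concrete, here is a rough sketch (in Python, with invented names, not the actual GATLAS code) of the kind of flexibility the kernel generator needs: emitting an OpenCL kernel signature from argument descriptions that mix precision, vector length, and memory buffer versus image arguments.

```python
# Hypothetical sketch, not GATLAS's real API: render an OpenCL kernel
# signature from per-argument descriptions.

def arg_decl(name, precision, vector_length, is_image):
    """Render one kernel argument declaration."""
    if is_image:
        # image arguments are opaque handles, not typed pointers
        return f"__read_only image2d_t {name}"
    base = {"single": "float", "double": "double"}[precision]
    vec = "" if vector_length == 1 else str(vector_length)
    return f"__global {base}{vec}* {name}"

def kernel_signature(kernel_name, args):
    decls = ", ".join(arg_decl(*a) for a in args)
    return f"__kernel void {kernel_name}({decls})"

sig = kernel_signature("matmul", [
    ("A", "single", 4, False),   # float4 memory buffer
    ("B", "single", 1, True),    # image argument
    ("C", "double", 2, False),   # double2 memory buffer
])
print(sig)
# __kernel void matmul(__global float4* A, __read_only image2d_t B, __global double2* C)
```

A static library would have to enumerate every combination up front; the JIT just renders the variant it needs.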
The original PeakStream didn’t support mixed precision (AFAIK). Older ATI GPUs were single precision only. Double precision was exclusive to the CPU.
I don’t know what the performance implications will be from this modification. It won’t run slower. It may run quite a bit faster. OpenCL on ATI GPUs is very sensitive to register pressure. Performance can vary dramatically.
Also, I’ve been debating whether the JIT should have an AOT (ahead of time) component. The answer is absolutely yes. Here’s why.
Let’s say you have a fleet of 1000 identical hosts with GPUs. If every virtual machine is independent, there will be a lot of duplicated JIT optimization: each JIT must relearn the same optimal kernels. It is more efficient for the JITs to share a common understanding.
More conventional JITs gain by dynamically compiling hot functions and traces to native code, avoiding interpreter overhead. The optimization cost is in identifying which code to compile, and that is not very expensive.
A GPU JIT (at least as I approach the problem) solves a much more expensive optimization problem. The search would be combinatorial and terrible, except that measured throughput happens to have a convex shape over the tuning parameters. This allows a much more efficient search.
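For example, if throughput really is unimodal in a single tuning parameter, a ternary search finds the best setting in O(log n) benchmark runs instead of O(n). A toy sketch, where the throughput function is a fake stand-in for an actual kernel benchmark:

```python
# Maximize a unimodal (roughly convex) throughput curve over an integer
# tuning parameter, e.g. a work-group size or tile dimension.

def best_setting(throughput, lo, hi):
    """Ternary search for the argmax of a unimodal function on [lo, hi]."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if throughput(m1) < throughput(m2):
            lo = m1 + 1   # peak is to the right of m1
        else:
            hi = m2 - 1   # peak is to the left of m2
    return max(range(lo, hi + 1), key=throughput)

# toy curve peaking at a work-group size of 64
fake = lambda wg: -(wg - 64) ** 2
print(best_setting(fake, 1, 256))   # 64
```

Each call to `throughput` stands for compiling and timing a kernel variant, so cutting the number of calls from hundreds to a few dozen matters. Real tuning spaces are multi-dimensional, but the same property (no bad local optima to get stuck in) is what makes the search tractable at all.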
Still, it is not cheap. To fully characterize a GPU for matrix multiply, to know the problem dimensions, vector lengths, etc. for which it runs fastest, takes many hours. There are also platform stability issues. In the real world, device drivers and runtimes do crash. Labeling the kernels that cause the GPU to fail can be very expensive, as crashes often force a reboot.
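One way to make that labeling survive a crash (my own sketch, not what GATLAS actually does) is a write-ahead intent record: persist the variant you are about to benchmark before launching it, so that if the process or host dies, the next run finds the stale record and blacklists that variant.

```python
# Crash-safe benchmarking sketch. File names are invented for illustration.
import json, os

LOG = "pending_kernel.json"        # write-ahead intent record
BLACKLIST = "bad_kernels.json"     # kernels known to crash the driver

def load(path):
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)

def save(path, data):
    with open(path, "w") as f:
        json.dump(data, f)

def benchmark_safely(variant, run):
    bad = load(BLACKLIST)
    stale = load(LOG)
    if stale:                       # a previous run died mid-benchmark
        bad += [v for v in stale if v not in bad]
        save(BLACKLIST, bad)
        os.remove(LOG)
    if variant in bad:
        return None                 # skip known-crashing kernels
    save(LOG, [variant])            # record intent *before* launching
    result = run(variant)           # may never return if the driver hangs
    os.remove(LOG)                  # completed cleanly: clear the record
    return result
```

The point is that each crashing kernel costs at most one reboot, ever, instead of one reboot per characterization run.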
This implies there should be an external persistent database that allows sharing between JITs. It is natural to preload that database ahead of time. So the JIT is really a hybrid with some AOT optimization too.
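A minimal sketch of such a database, using sqlite (the schema and names here are my assumptions, not GATLAS's): each JIT looks up a tuned configuration before falling back to its own search, and a fleet operator can ship the database file to every host ahead of time.

```python
# Shared kernel-tuning database sketch. Keyed by device, operation, and
# problem dimensions; stores the winning config and its measured rate.
import sqlite3

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS kernels (
                    device TEXT, op TEXT, dims TEXT,
                    config TEXT, gflops REAL,
                    PRIMARY KEY (device, op, dims))""")
    return db

def lookup(db, device, op, dims):
    row = db.execute("SELECT config, gflops FROM kernels "
                     "WHERE device=? AND op=? AND dims=?",
                     (device, op, dims)).fetchone()
    return row  # None means this JIT must run its own search

def record(db, device, op, dims, config, gflops):
    db.execute("INSERT OR REPLACE INTO kernels VALUES (?,?,?,?,?)",
               (device, op, dims, config, gflops))
    db.commit()

db = open_db()
record(db, "HD5870", "matmul", "1024x1024", "vec=4,wg=8x8", 512.0)
print(lookup(db, "HD5870", "matmul", "1024x1024"))
# ('vec=4,wg=8x8', 512.0)
```

On a cache hit the JIT pays nothing; on a miss it searches once and writes the result back, so the fleet converges on the hours-long characterization cost a single time per device model.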