It wasn’t as bad as I thought. The code on GitHub from last year is ok.
There were two bugs in the new stuff, just careless mistakes with enqueued memory transfers. Everything else was fine. In fact, the new code is slightly faster (as much as 5% for small matrices).
GATLAS enqueues kernels as an ordered sequence using event objects: each kernel’s event depends on the previous one. The new code takes a simpler approach with independent events. There is no difference on ATI GPUs, as OpenCL kernels execute sequentially there anyway. (However, NVIDIA GPUs may support concurrent kernel execution, so I’ll have to do things a little differently for them.)
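The two enqueue styles look roughly like this. This is a minimal sketch of the pattern, not GATLAS source: it assumes an already-created in-order command queue and built kernels, and omits all error checking.

```cpp
#include <CL/cl.h>

// GATLAS style: each kernel's event goes into the wait list of the next
// enqueue, forming an explicit dependency chain.
void enqueueChained(cl_command_queue queue, cl_kernel* kernels, int n,
                    const size_t* global)
{
    cl_event prev = 0;
    for (int i = 0; i < n; ++i) {
        cl_event ev;
        clEnqueueNDRangeKernel(queue, kernels[i], 1, 0, global, 0,
                               prev ? 1 : 0,        // wait on previous kernel
                               prev ? &prev : 0,
                               &ev);
        if (prev) clReleaseEvent(prev);
        prev = ev;
    }
    if (prev) {
        clWaitForEvents(1, &prev);   // block until the whole chain finishes
        clReleaseEvent(prev);
    }
}

// The simpler style: no wait lists at all. On an in-order queue the
// kernels still execute one after another, so the result is the same.
void enqueueIndependent(cl_command_queue queue, cl_kernel* kernels, int n,
                        const size_t* global)
{
    for (int i = 0; i < n; ++i)
        clEnqueueNDRangeKernel(queue, kernels[i], 1, 0, global, 0,
                               0, 0, 0);
    clFinish(queue);                 // wait for everything to drain
}
```

On hardware that can run kernels concurrently, the chained version still forces sequential execution while the independent version may not, which is why NVIDIA needs different handling.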
My GPU server is an unusual configuration: it typically has both ATI and NVIDIA GPUs running at the same time. Most users have a single ATI or NVIDIA GPU. If they have multiple GPUs, these will be the same model and vendor. My development environment has been at the other extreme of heterogeneous GPUs.
It’s actually a hack to get an ATI and an NVIDIA GPU running at the same time on one host. A few years ago, gamers started using ATI GPUs for graphics and NVIDIA GPUs for game physics. To stop this, the NVIDIA driver checks for an ATI GPU and won’t start if it detects one.
But wait, there’s more. The ATI driver depends on a running X server. When NVIDIA changed their driver to add the check for an ATI GPU, they also made it depend on X. Both vendors’ GPU drivers depend on the same libraries. After installing the NVIDIA driver, those libraries are modified so that the ATI driver cannot run (and vice versa)!
I wouldn’t say this is deliberate so much as a rare use-case. Very few users will run an ATI and an NVIDIA GPU on the same host. Most users only have one GPU. If they have more than one, then it is two, three, or four of the same make and model.
Anyway, I was confused earlier because I compiled the code against the ATI OpenCL SDK but was running on the NVIDIA GPU. It’s an easy mistake to make. While both vendors’ OpenCL runtimes can see all GPUs, they only work correctly with their own vendor’s hardware. A consequence: without dynamic loading and linking magic, it is not possible for a process to use an ATI and an NVIDIA GPU at the same time. A process must be linked to one or the other.
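One way to catch this mistake at startup is to check that the linked runtime actually exposes a platform from the expected vendor. Here is a minimal sketch: `pickPlatform` is a hypothetical helper, and in a real program the vendor strings would come from `clGetPlatformIDs` plus `clGetPlatformInfo(..., CL_PLATFORM_VENDOR, ...)` rather than being passed in as plain strings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the index of the first platform whose vendor
// string matches the runtime we linked against, or -1 if none does
// (i.e., we compiled against the wrong vendor's SDK for this GPU).
int pickPlatform(const std::vector<std::string>& vendors,
                 const std::string& want)
{
    for (std::size_t i = 0; i < vendors.size(); ++i)
        if (vendors[i].find(want) != std::string::npos)
            return static_cast<int>(i);
    return -1; // no platform from this vendor is visible to this process
}
```

A -1 here is exactly the ATI-SDK-but-NVIDIA-GPU situation above: the runtime may enumerate the device, but it can’t drive it.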
I know. That was confusing.
The good news is that stuff looks like it works. Now I have to connect the XGEMM and XGEMV to the PeakStream matmul() through the JIT. On an ATI 5870, my SGEMM exceeds 1400 GFLOPS and DGEMM exceeds 350 GFLOPS. With auto-tuning, performance scales linearly across the Evergreen architecture (the 5000 series; I have not tested on 6000 series hardware). Of course, things slow down a lot once PCIe bus data transfers are counted. One way of looking at something like PeakStream is as a very complex memory manager that minimizes data movement back and forth between host and GPU.
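A back-of-the-envelope calculation shows why those transfers matter so much. The 1400 GFLOPS figure is the SGEMM measurement above; the ~6 GB/s host-to-GPU bandwidth is my assumption (roughly what PCIe 2.0 x16 achieves in practice), not a measured number.

```cpp
// SGEMM on n x n matrices does about 2*n^3 floating point operations.
double sgemmComputeMs(double n, double gflops)
{
    return 2.0 * n * n * n / (gflops * 1e9) * 1e3;
}

// Moving A and B to the GPU and C back means three n x n float matrices.
double sgemmTransferMs(double n, double gbPerSec)
{
    return 3.0 * n * n * 4.0 / (gbPerSec * 1e9) * 1e3;
}
```

For n = 1024, the kernel takes about 1.5 ms at 1400 GFLOPS, while moving the three matrices at an assumed 6 GB/s takes about 2.1 ms: the bus dominates the kernel. That movement is exactly what a PeakStream-style memory manager exists to avoid, by keeping intermediate results resident on the GPU.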