December 14, 2011
Fixed all of the problems with the auto-tuned GEMV. It wasn’t as bad as expected. Yesterday, I was rattled because every kernel specialization was failing. Not reassuring.
I’ve learned to be paranoid about numerical correctness. Automated testing is built into the auto-tuning process. Each kernel specialization is stress tested with random data before it is accepted as good.
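As a rough illustration of that acceptance step, here is a minimal sketch in plain Python. It is not the actual auto-tuner code: the names gemv_reference and stress_test, the tolerance, and the problem sizes are all made up for the example, and the candidate is just a Python callable standing in for a compiled kernel specialization.

```python
import random

def gemv_reference(alpha, a, x, beta, y):
    """Plain-loop reference GEMV: returns alpha * A * x + beta * y."""
    return [alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
            for row, y_i in zip(a, y)]

def stress_test(candidate, trials=20, max_dim=32, tol=1e-9, seed=0):
    """Hypothetical acceptance check: True only if every random trial matches."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(1, max_dim)
        n = rng.randint(1, max_dim)
        a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        alpha = rng.uniform(-2.0, 2.0)
        beta = rng.uniform(-2.0, 2.0)
        expected = gemv_reference(alpha, a, x, beta, y)
        try:
            actual = candidate(alpha, a, x, beta, y)
        except Exception:
            # A specialization that blows up is rejected, not retried.
            return False
        if len(actual) != len(expected):
            return False
        for want, got in zip(expected, actual):
            # Written as "not (<=)" so that a NaN result also fails.
            if not (abs(want - got) <= tol * max(1.0, abs(want))):
                return False
    return True

def ignores_beta(alpha, a, x, beta, y):
    """Deliberately broken variant, used only to show a rejection."""
    return gemv_reference(alpha, a, x, 0.0, y)

if __name__ == "__main__":
    print(stress_test(gemv_reference))
    print(stress_test(ignores_beta))
```

The point of the design is that a specialization has to pass every trial to be kept. Anything that raises, returns the wrong length, or drifts outside the tolerance is simply thrown away.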
What makes this trickier is that extensive auto-tuning, where hundreds or thousands of kernel variations are tested, runs into limitations in vendor runtimes. The GPU driver might crash or the device might enter a bad state. The OpenCL compiler may hang, segfault, or fail with internal error messages. Despite coming from the same design template, some specializations work perfectly on a device while others fail.
All of this adds enough ambiguity that distinguishing your own bugs from toolchain and platform issues is difficult. My experience so far: my code often has more bugs than I think it does. It’s probably a cognitive bias to blame known vendor bugs for other, as yet undiagnosed, bugs.
There are still some serious bugs in the JIT. However, even with those, I see output that agrees between generated OpenCL on ATI, NVIDIA, x86 and a reference CPU interpreter. The numbers are all the same. That gives me confidence it is really working and not garbage output.
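The cross-check itself is simple to describe. Here is a small sketch, again in plain Python and not the real test harness: the function name disagreeing_backends, the backend labels, and the sample numbers are all placeholders, and the default tolerance of zero assumes the outputs are expected to match exactly.

```python
def disagreeing_backends(results, reference="interpreter", tol=0.0):
    """Return the sorted names of backends whose output differs from the reference.

    results maps a backend label to its output as a list of floats.
    """
    ref = results[reference]
    bad = []
    for name, out in results.items():
        if name == reference:
            continue
        same_length = len(out) == len(ref)
        # "not (<=)" again, so a NaN in either output counts as a mismatch.
        if not same_length or any(not (abs(a - b) <= tol) for a, b in zip(out, ref)):
            bad.append(name)
    return sorted(bad)

if __name__ == "__main__":
    # Placeholder numbers, not measured results.
    outputs = {
        "interpreter": [1.0, 2.5, -3.0],
        "ati": [1.0, 2.5, -3.0],
        "nvidia": [1.0, 2.5, -3.0],
        "x86": [1.0, 2.5, -3.0],
    }
    print(disagreeing_backends(outputs))
```

An empty list means every backend agrees with the reference interpreter. If exact agreement turns out to be too strict for a given kernel, tol can be raised, at the cost of a weaker check.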