Matrix multiply ported
September 29, 2011
Matrix multiply is working.
This brings back bad memories… every compute device understands a different dialect of OpenCL. A kernel that works perfectly on one compute device may fail on another, and failure can mean anything from garbage output to crashing the driver. Even when a kernel runs, its performance depends on the specific combination of compute device, runtime SDK, and device driver version.
The ugly reality of OpenCL is that while it is a step in the right direction, it is not very portable in practice. The vendors don’t like to mention this.
That’s why performance tuning is a lot of work. It’s also why auto-tuning is attractive – let a compiler do the tedious search for the fast kernels. Easier said than done.
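To make the idea of "searching for the fast kernels" concrete, here is a minimal sketch of an auto-tuning loop. It is not how GATLAS works; it is a CPU-only toy in plain Python, where the tunable parameter is the tile size of a blocked matrix multiply. The names `matmul_blocked` and `autotune` are made up for illustration. A real auto-tuner would generate and time OpenCL kernels on the actual device instead.

```python
import time


def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one tile at a time; min() handles ragged edges.
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time every candidate tile size and return (fastest, timings).

    Hypothetical helper: the search here is brute force over a tiny,
    hand-picked candidate list.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(a, b, n, block)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings


if __name__ == "__main__":
    fastest, timings = autotune(64, [4, 8, 16, 32])
    print("fastest tile size on this machine:", fastest)
```

The winner is whatever happens to be fastest on the machine running the search, which is exactly the point: the answer has to be measured again for each device, SDK, and driver combination.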
My plan is to first port over all of GATLAS as-is. Then it will be connected to the runtime and JIT. That should be pretty cool.