clMAGMA and AMD BLAS thoughts
July 16, 2012
At this time, I’ve decided it’s not worth integrating clMAGMA and the AMD OpenCL BLAS into the JIT back-end of Chai. It’s a lot of work for questionable gain. I’m not even sure it is the right thing to do.
Exposing the OpenCL queues managed by the Chai virtual machine is much simpler. This may be what developers really want. Then they can add clMAGMA to Chai or vice-versa as they wish. There’s no need to commit to either – give the end-user more freedom and do no harm.
This is no fault of AMD’s or UTK’s, but clMAGMA performance will be poor unless matrices are large (dimensions in the thousands). LAPACK and BLAS were designed for vector processors and CPUs, not for discrete GPUs with high per-kernel enqueue and I/O overheads. There is no support for kernel fusion (batching).
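A back-of-envelope model makes the overhead problem concrete. The numbers below are illustrative assumptions (roughly a 20 µs fixed cost per enqueued kernel and a 500 GFLOP/s device), not measurements of any particular GPU:

```python
# Rough cost model for a single SGEMM kernel launch on a discrete GPU.
# Both constants are illustrative assumptions, not measurements.
LAUNCH_OVERHEAD_S = 20e-6   # assumed fixed cost per enqueued kernel
GPU_GFLOPS = 500.0          # assumed sustained throughput

def sgemm_time(n):
    """Estimated wall time for an n x n x n matrix multiply."""
    flops = 2.0 * n ** 3
    compute_s = flops / (GPU_GFLOPS * 1e9)
    return LAUNCH_OVERHEAD_S + compute_s

def overhead_fraction(n):
    """Share of total time spent on launch overhead rather than math."""
    return LAUNCH_OVERHEAD_S / sgemm_time(n)

# At n = 100, overhead dominates; at n = 2000, it is negligible.
for n in (100, 2000):
    print(n, round(overhead_fraction(n), 3))
```

Under these assumptions, a 100×100 multiply spends over 80% of its time on launch overhead, while at dimension 2000 the overhead falls below 0.1% — which is why clMAGMA only shines on large matrices.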
This is why Chai natively supports batching over vectors of array data. A vector of arrays is tiled together as a single array, so many data transfers and kernel launches collapse into one, reducing overheads. For problems with relatively small matrices (dimensions in the hundreds) and high arithmetic intensity (like dense matrix multiplication), the effect of this optimization is significant.
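Chai’s internals aren’t shown here, but the idea can be illustrated with numpy (an assumption for the sketch; Chai does its tiling inside the JIT back-end): a vector of small matrices is stacked into one contiguous array, and a single batched operation replaces many small ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "vector of arrays": many small matrices of the same shape.
batch = [rng.standard_normal((100, 100)) for _ in range(64)]

# Unbatched: one operation per matrix -- on a GPU, one kernel
# launch and one transfer each.
unbatched = [a @ a for a in batch]

# Batched: tile the vector into a single contiguous 3-D array,
# then issue one operation over the whole tile. On a GPU this
# collapses 64 launches into one.
tiled = np.stack(batch)            # shape (64, 100, 100), contiguous
batched = np.matmul(tiled, tiled)  # one call covering all 64 matrices

assert all(np.allclose(batched[i], unbatched[i]) for i in range(64))
```

The results are identical; only the number of operations (and, on a device, the overhead paid) differs.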
I have enough experience that I could fork clMAGMA (yes, I realize this is a moving target) and implement my own auto-tuning GPGPU BLAS to support it, one portable between AMD and Nvidia. This fork would support tiled data and kernel fusion. To do it right, I would also need a porting/test lab, and I would extend auto-tuning to clMAGMA itself (which has statically configured tuning parameters). From working on GATLAS and now Chai, I have a very good idea of the level of effort required. It’s not rocket science, just a few months of full-time work.
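The auto-tuning in question amounts to an empirical search over kernel parameters on the target device. A minimal sketch of that loop, with a stand-in cost function in place of a real OpenCL build-and-time step (the parameter names, candidate values, and cost surface here are all hypothetical):

```python
import itertools

# Hypothetical tuning space: work-group tile sizes and vector widths.
space = {
    "tile": [4, 8, 16],
    "vector_width": [1, 2, 4],
}

def benchmark(params):
    """Stand-in for building, enqueuing, and timing a generated GEMM
    kernel with these parameters. This fake cost surface is minimized
    at tile=8, vector_width=4 purely for illustration."""
    return abs(params["tile"] - 8) + abs(params["vector_width"] - 4) + 1.0

def autotune(space, benchmark):
    """Exhaustively try every parameter combination, keep the fastest."""
    best_params, best_time = None, float("inf")
    for values in itertools.product(*space.values()):
        params = dict(zip(space.keys(), values))
        t = benchmark(params)
        if t < best_time:
            best_params, best_time = params, t
    return best_params

print(autotune(space, benchmark))  # fake surface picks tile=8, vector_width=4
```

A real tuner (GATLAS works this way) replaces the stand-in with actual device timings and prunes the search space; the statically configured parameters in clMAGMA are what this loop would replace.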
Anyway, I can’t afford to do it.