I submitted the final presentation slide deck to SIAM for the conference. There’s no NDA or other restriction on posting it elsewhere. However, it seems a little rude to do that before the conference.
I joined and RSVP’ed to the Mountain View HPC-GPU-Supercomputing Meetup. The next meeting is January 30th at Carnegie Mellon University Silicon Valley, which seems weird to me. Whenever I hear CMU, I think of back East.
With the presentation ready for the conference, I can get back to doing technical stuff.
The way I handled vectorization is kind of a hack. This is where NVIDIA makes it easy and ATI makes it hard. On NVIDIA, you don’t (always?) need to vectorize code to get good performance, although vectorizing helps. On ATI, there’s no choice: vectorization is required for acceptable performance.
This leads to issues when the JIT generates code that mixes vector lengths. OpenCL anticipates this (as you would expect) by making it easy to slice vector elements: the .lo and .hi suffixes select the lower and upper halves of a vector. For example,
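#pragma OPENCL EXTENSION cl_khr_fp64 : enable  // double needs fp64 support on the device
__kernel void foo(__global float4 *a, __global double2 *b, __global float4 *c) {
    const size_t i = get_global_id(0);
    // OpenCL C has no implicit vector conversions, so narrow each double2 explicitly.
    a[i].lo = convert_float2(b[2*i]) + c[i].lo;
    a[i].hi = convert_float2(b[2*i+1]) + c[i].hi;
}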
It’s not rocket science. However, if you don’t have an elegant solution, a hacked approach ends up somewhat ugly.
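To make the slicing idea concrete, here is a minimal sketch of how a code generator could emit the sliced statements instead of special-casing them by hand. It is not the JIT's actual implementation. `Operand`, `piece` and `emit_add` are hypothetical names made up for illustration, and the sketch only handles vector operands whose widths are equal or differ by a factor of two. OpenCL C does not convert between vector types implicitly, so operands whose element type differs from the destination are wrapped in the built-in `convert_` functions.

```python
from typing import List, NamedTuple


class Operand(NamedTuple):
    name: str   # kernel argument name (hypothetical)
    elem: str   # scalar element type, e.g. "float" or "double"
    width: int  # vector length, e.g. 4 for float4


HALVES = ("lo", "hi")  # OpenCL suffixes for the lower and upper half of a vector


def piece(op: Operand, narrow: int, pieces: int, k: int, idx: str) -> str:
    """Expression for the k-th narrow-width piece of op at work-item idx."""
    if pieces == 1:
        # All operands have the same width: no slicing needed.
        return f"{op.name}[{idx}]"
    if op.width == narrow:
        # Narrow operand: consecutive vectors cover one wide vector.
        return f"{op.name}[{pieces}*{idx}+{k}]"
    # Wide operand: take the lower or upper half.
    return f"{op.name}[{idx}].{HALVES[k]}"


def emit_add(dst: Operand, lhs: Operand, rhs: Operand, idx: str = "i") -> List[str]:
    """Emit OpenCL C statements computing dst = lhs + rhs piece by piece."""
    widths = {op.width for op in (dst, lhs, rhs)}
    narrow, wide = min(widths), max(widths)
    if wide not in (narrow, 2 * narrow):
        raise ValueError("this sketch only handles a 2:1 width ratio")
    pieces = wide // narrow
    lines = []
    for k in range(pieces):
        terms = []
        for op in (lhs, rhs):
            expr = piece(op, narrow, pieces, k, idx)
            if op.elem != dst.elem:
                # Explicit conversion to the destination element type.
                expr = f"convert_{dst.elem}{narrow}({expr})"
            terms.append(expr)
        target = piece(dst, narrow, pieces, k, idx)
        lines.append(f"{target} = {terms[0]} + {terms[1]};")
    return lines


a = Operand("a", "float", 4)
b = Operand("b", "double", 2)
c = Operand("c", "float", 4)
for line in emit_add(a, b, c):
    print(line)
```

For the float4/double2 mix above, this prints one statement for the .lo half and one for the .hi half. The point is only that the slice suffix and the index arithmetic both fall out of the ratio of vector widths.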
Another issue is with continuation. I didn’t really understand what that meant before. It was only while writing the presentation that I realized there are deeper issues around this with GPUs.
An execution trace is really a boxed calculation, and I think most of us don’t initially think of it as having much state beyond its historical record. It’s just something that returns a value when evaluated. This viewpoint is misleading. Traces, at least on this platform, carry a lot of contextual state. They rely on side effects.
Traces stick to devices. The cost of data movement with GPUs is so high that once a trace is scheduled there, the memory objects associated with it really should not move unless absolutely necessary.
This implies that traces inherit memory. A trace is really a sequence of traces (which are vectorized when scheduled). They all share the context of the compute device on which they are scheduled.
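Here is a toy model of that idea, not the platform's actual runtime. `Trace`, `Scheduler` and the device name `"gpu0"` are hypothetical. The sketch assumes that a continuation is scheduled on its parent's device and starts with the parent's memory objects already resident, so only arrays that are not yet on that device cost a transfer.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Trace:
    name: str
    device: Optional[str] = None  # compute device this trace is pinned to
    # array name -> device currently holding its memory object
    buffers: Dict[str, str] = field(default_factory=dict)


class Scheduler:
    def __init__(self) -> None:
        self.transfers = 0  # counts host-to-device copies, the expensive part

    def schedule(self, trace: Trace, arrays: Iterable[str],
                 device: Optional[str] = None,
                 parent: Optional[Trace] = None) -> Trace:
        if parent is not None:
            # Continuation: stick to the parent's device and inherit its memory.
            trace.device = parent.device
            trace.buffers = dict(parent.buffers)
        else:
            trace.device = device
        for name in arrays:
            if trace.buffers.get(name) != trace.device:
                # Array is not resident on this device yet: pay for one copy.
                self.transfers += 1
                trace.buffers[name] = trace.device
        return trace


sched = Scheduler()
first = sched.schedule(Trace("t0"), ["x", "y"], device="gpu0")
after_first = sched.transfers  # x and y were both copied
second = sched.schedule(Trace("t1"), ["y", "z"], parent=first)
print(second.device, after_first, sched.transfers)
```

In this model the second trace only pays for `z`, because `y` came along with the inherited context. Scheduling it as a fresh trace on another device would have moved both arrays.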
Anyway, it’s nice to return to solving puzzles.