Stencil and filter kernels should rely on the texture cache. Seems obvious? It’s a natural fit for images, since they are built on top of the texture sampling hardware used for graphics. I don’t know why I didn’t see this earlier. I kept thinking about the much harder problem of prefetching into local memory (more about this later).
This will mean more big changes to the JIT. Right now, it only autotunes GEMM and GEMV. Everything else is just generated in one pass.
I also made a very simple change last night so that work group dimensions can be set from the configuration file to better fit natural warp and wavefront sizes. This is not quite right either.
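To make the idea of fitting work group dimensions to the hardware concrete, here is a minimal Python sketch. The function name `fit_work_group` and its parameters are hypothetical and are not Chai's configuration keys. It assumes the usual SIMD widths of 32 for an NVIDIA warp and 64 for an AMD wavefront, and an arbitrary cap of 256 work items.

```python
def fit_work_group(requested, simd_width, max_size=256):
    """Round a requested 1D work group size up to a multiple of the
    SIMD width (warp or wavefront), without exceeding max_size.

    All names and the default cap are illustrative assumptions.
    """
    if requested < 1 or simd_width < 1 or max_size < simd_width:
        raise ValueError("sizes must be positive and max_size >= simd_width")
    # Ceiling division, then scale back up to a whole number of warps/wavefronts.
    rounded = -(-requested // simd_width) * simd_width
    # Largest multiple of simd_width that still fits under the cap.
    cap = (max_size // simd_width) * simd_width
    return min(rounded, cap)


print(fit_work_group(50, 32))   # warp-sized hardware
print(fit_work_group(50, 64))   # wavefront-sized hardware
```

A fixed rule like this only avoids partially filled warps or wavefronts. It says nothing about which multiple is fastest for a given kernel, which is what autotuning would have to find out.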
The JIT should autotune (some) generated kernels (not only the GEMM and GEMV templates) at both the TLP and ILP levels.
- TLP – thread-level parallelism: work group dimensions
- ILP – instruction-level parallelism: reordering statements
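The two tuning axes above can be sketched as a brute-force search. This is only an illustration in Python, not how Chai's JIT works: `autotune`, `respects_dependencies` and `toy_benchmark` are hypothetical names, and the cost model is made up so the example runs without a GPU. A real autotuner would time the generated kernel on the device instead.

```python
import itertools


def autotune(work_groups, statements, benchmark, is_valid_order):
    """Try every (work group, statement order) pair and return the
    fastest as a tuple (time, work_group, order)."""
    best = None
    for wg in work_groups:                                  # TLP axis
        for order in itertools.permutations(statements):    # ILP axis
            if not is_valid_order(order):                    # keep data dependencies
                continue
            t = benchmark(wg, order)
            if best is None or t < best[0]:
                best = (t, wg, order)
    return best


def respects_dependencies(order):
    # Each value must be loaded before it is used.
    return (order.index("load_a") < order.index("use_a")
            and order.index("load_b") < order.index("use_b"))


def toy_benchmark(work_group, order):
    # Made-up cost model (an assumption, not a measurement):
    # a use directly after its own load counts as a stall, and a
    # work group whose x dimension is not a multiple of 32 is penalized.
    x, y = work_group
    stalls = sum(1 for i in range(len(order) - 1)
                 if order[i].startswith("load_")
                 and order[i + 1] == "use_" + order[i][5:])
    misfit = 0 if x % 32 == 0 else 15
    return 100 + 20 * stalls + misfit + abs(x * y - 64) // 10


best = autotune(
    work_groups=[(8, 8), (32, 2), (16, 16)],
    statements=["load_a", "use_a", "load_b", "use_b"],
    benchmark=toy_benchmark,
    is_valid_order=respects_dependencies,
)
print(best)
```

Exhaustive search is fine for four statements but grows factorially, so a practical version would have to prune the orderings or sample them.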
This is great. I’ll be really busy for the next few weeks. Development seems to go like this, in evolutionary spurts.
I’m (re-)reading Rob Farber’s book CUDA Application Design and Development. He briefly cites an observation of Vasily Volkov that is very true in my experience. “Volkov notes that the trend in parallel architecture design is towards an inverse memory hierarchy where the number of registers is increasing compared to cache and shared memory.”
What are the reasons for this trend?
One reason is the natural imbalance between processor and memory. That’s always been an issue. Memory bandwidth tends to lag behind processor throughput, so the data has to live as close to the arithmetic units as possible, and nothing is closer than registers.
The other reason is that effective programmatic use of a memory hierarchy is difficult. In practice, it’s avoided due to high software development costs. Instead, users rely on automatic mechanisms in the GPU (e.g. the L1/L2 caches, and on NVIDIA, register spilling into cache-backed per-thread memory) and use more private registers.
This leads back to the JIT in Chai.
The autotuned GEMM and GEMV kernel templates support prefetching into local memory. This was very tricky to get right and caused me no end of trouble.
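For readers who have not written one, the general shape of local memory prefetching in a tiled matrix multiply looks roughly like the following. This is plain Python standing in for a kernel, not the Chai templates: the function name `tiled_matmul`, the `tile` parameter and the staging lists are all illustrative assumptions. Each tile of A and B is copied into a small staging buffer once and then reused for every output element in that tile.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply non-empty row-major matrices A (n x k) and B (k x m),
    staging one tile of each at a time, as a kernel would stage tiles
    in local memory. Sizes need not be multiples of the tile size."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # one "work group" per output tile
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # "Prefetch": copy the tiles into staging buffers.
                # On a GPU a barrier would follow so that every work
                # item sees the complete tiles before computing.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                # Compute only from the staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p, a in enumerate(a_row):
                            acc += a * b_tile[p][j]
                        C[i0 + i][j0 + j] += acc
    return C


print(tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

Even in this toy form the bookkeeping is visible: tile edges, index offsets, and (on real hardware) synchronization all have to line up, which is where the difficulty comes from.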
I’ve been putting off dealing with this issue of more general JIT local memory prefetching because it scares me. For the wrong reason, I may have done the right thing. Local memory prefetching may be less important. Technology is evolving around it.