Programmers are lazy, and this may be right

Stencil and filter kernels should rely on the texture cache. This seems obvious in hindsight. It's especially natural for images, since they are built on top of texture sampling for graphics. I don't know why I didn't see this earlier. I kept thinking about the much harder problem of prefetching into local memory (more on this later).
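Here is a minimal sketch of the idea (illustrative only, not Chai's generated code): a 3x3 mean filter that samples an image2d_t, so the overlapping neighborhood reads are absorbed by the texture cache with no explicit staging.

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    /* 3x3 mean filter: nine reads per work item, all through the
     * texture cache, no __local memory and no barriers */
    __kernel void mean3x3(__read_only image2d_t src,
                          __write_only image2d_t dst)
    {
        const int2 pos = (int2)(get_global_id(0), get_global_id(1));
        float4 acc = (float4)(0.0f);

        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                acc += read_imagef(src, smp, pos + (int2)(dx, dy));

        write_imagef(dst, pos, acc / 9.0f);
    }

Neighboring work items read overlapping pixels, which is exactly the access pattern the texture cache is built for.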

This will mean more big changes to the JIT. Right now, it only autotunes GEMM and GEMV. Everything else is just generated in one pass.

I also made a very simple change last night so work group dimensions may be influenced by the configuration file to better fit natural warp and wavefront sizes. This is not quite right either.
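The change amounts to a rounding rule, something like this (a sketch with a made-up function name, not the actual configuration code):

    #include <stddef.h>

    /* Round a requested work group dimension up to a multiple of the
     * device's natural SIMD width: 32 for an NVIDIA warp, 64 for an
     * AMD wavefront */
    static size_t round_up_to_simd(size_t requested, size_t simd_width)
    {
        return ((requested + simd_width - 1) / simd_width) * simd_width;
    }

So with a warp size of 32, a requested width of 48 becomes 64.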

The JIT should autotune (some) generated kernels, not only the GEMM and GEMV templates, at both the TLP and ILP levels (see the sketch after this list).

  • TLP – thread level parallelism: work group dimensions
  • ILP – instruction level parallelism: reordering statements
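As a rough sketch of the TLP half (not Chai's actual autotuner; the candidate shapes and function names are made up), the idea is to time the same generated kernel across a handful of warp and wavefront friendly work group shapes and keep the fastest. This assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and global dimensions divisible by every candidate.

    #include <CL/cl.h>

    /* Time one launch of a generated kernel using event profiling. */
    static double time_kernel(cl_command_queue q, cl_kernel k,
                              const size_t global[2], const size_t local[2])
    {
        cl_event ev;
        cl_ulong start, end;
        clEnqueueNDRangeKernel(q, k, 2, NULL, global, local, 0, NULL, &ev);
        clWaitForEvents(1, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);
        return (double)(end - start) * 1e-9; /* seconds */
    }

    /* Benchmark SIMD-friendly work group shapes, keep the fastest. */
    static void pick_local_size(cl_command_queue q, cl_kernel k,
                                const size_t global[2], size_t best[2])
    {
        static const size_t candidates[][2] =
            { {32, 1}, {64, 1}, {32, 4}, {16, 16}, {8, 8} };
        double best_t = 1e30;
        for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
            double t = time_kernel(q, k, global, candidates[i]);
            if (t < best_t) {
                best_t = t;
                best[0] = candidates[i][0];
                best[1] = candidates[i][1];
            }
        }
    }

The ILP half doesn't reduce to a single harness this neatly: it means generating several statement orderings and benchmarking each the same way.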

This is great. I’ll be really busy for the next few weeks. Development seems to go like this, in evolutionary spurts.

I’m (re-)reading Rob Farber’s book CUDA Application Design and Development. He briefly cites an observation of Vasily Volkov that is very true in my experience. “Volkov notes that the trend in parallel architecture design is towards an inverse memory hierarchy where the number of registers is increasing compared to cache and shared memory.”

What are the reasons for this trend?

One reason is the natural imbalance between processor and memory. That's always been an issue. Memory bandwidth grows more slowly than processor throughput, so at some point memory falls behind.

The other reason is that effective programmatic use of a memory hierarchy is difficult. In practice, it's avoided due to high software development costs. Instead, users rely on automatic mechanisms in the GPU (e.g. the L1/L2 caches and hardware-managed register spills on NVIDIA) and keep more data in private registers.
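In kernel code, that trend looks something like this (an illustrative sketch, not generated Chai code): each work item keeps a small tile of values in private registers, with no barriers and no __local staging, which also exposes the independent instructions that ILP autotuning wants to reorder.

    /* Each work item processes four elements held in private registers.
     * The compiler maps r0..r3 to the register file; nothing touches
     * the explicit memory hierarchy. */
    __kernel void axpy4(__global const float* x,
                        __global const float* y,
                        __global float* out,
                        const float alpha)
    {
        const size_t i = get_global_id(0) * 4;

        /* four independent statements: free ILP for the scheduler */
        float r0 = alpha * x[i + 0] + y[i + 0];
        float r1 = alpha * x[i + 1] + y[i + 1];
        float r2 = alpha * x[i + 2] + y[i + 2];
        float r3 = alpha * x[i + 3] + y[i + 3];

        out[i + 0] = r0;
        out[i + 1] = r1;
        out[i + 2] = r2;
        out[i + 3] = r3;
    }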

This leads back to the JIT in Chai.

The autotuned GEMM and GEMV kernel templates support prefetching into local memory. This was really difficult and caused me no end of trouble.
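The pattern itself is simple to state. Schematically (a toy example, not the GEMM template; it assumes the global size is a multiple of the tile):

    #define TILE 16

    /* A work group cooperatively prefetches one tile into __local
     * memory, synchronizes, then computes from the fast copy. */
    __kernel void tile_row_sum(__global const float* in,
                               __global float* out,
                               const int width)
    {
        __local float tile[TILE][TILE];

        const int lx = get_local_id(0);
        const int ly = get_local_id(1);
        const int gx = get_global_id(0);
        const int gy = get_global_id(1);

        tile[ly][lx] = in[gy * width + gx];  /* one element per work item */
        barrier(CLK_LOCAL_MEM_FENCE);

        float acc = 0.0f;
        for (int k = 0; k < TILE; ++k)
            acc += tile[ly][k];              /* read from local memory */

        out[gy * width + gx] = acc;
    }

The barrier is the essential part: every work item must finish its store before any work item reads the tile.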

I've been putting off this issue of more general JIT local memory prefetching because it scares me. I may have done the right thing for the wrong reason: local memory prefetching may be less important, as the technology is evolving around it.
