there are so many cores

Time to bolt GATLAS into the JIT back-end

My master plan has always been a JIT built around auto-tuned math library kernels. The performance comes from these. That’s what GATLAS was all about – how to make matrix multiply fast in a way that generalizes across a GPU architecture.
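To make "auto-tuned" concrete, here is a minimal sketch of the general idea in Python: build several variants of a kernel, time each one, keep the fastest. This is not how GATLAS itself is implemented. The names `autotune` and `make_blocked_matmul`, and the single `block` parameter, are hypothetical stand-ins for a real search over GPU kernel parameters.

```python
import itertools
import time


def make_blocked_matmul(block):
    """Build a matrix multiply variant that tiles all three loops by `block`.

    Hypothetical stand-in for generating one GPU kernel variant.
    """
    def matmul(a, b):
        n, k, m = len(a), len(b), len(b[0])
        c = [[0.0] * m for _ in range(n)]
        for i0 in range(0, n, block):
            for j0 in range(0, m, block):
                for k0 in range(0, k, block):
                    for i in range(i0, min(i0 + block, n)):
                        for j in range(j0, min(j0 + block, m)):
                            acc = c[i][j]
                            for kk in range(k0, min(k0 + block, k)):
                                acc += a[i][kk] * b[kk][j]
                            c[i][j] = acc
        return c
    return matmul


def autotune(build_kernel, param_space, run_args, repeats=3):
    """Time every parameter combination and return (best_params, best_seconds)."""
    best_params, best_time = None, float("inf")
    names = sorted(param_space)
    for values in itertools.product(*(param_space[n] for n in names)):
        params = dict(zip(names, values))
        kernel = build_kernel(**params)
        # Keep the fastest of a few runs to reduce timing noise.
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*run_args)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params, best_time


if __name__ == "__main__":
    size = 32
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    params, seconds = autotune(make_blocked_matmul, {"block": [4, 8, 16]}, (a, b))
    print("fastest variant:", params, "in", seconds, "seconds")
```

Which variant wins depends on the machine, which is the whole point: the tuner measures instead of guessing.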

The JIT also generates application kernels dynamically. These usually have low arithmetic intensity: the cost of reading and writing memory dominates the cost of arithmetic. So the gain from tuning is marginal. It’s there. You can measure it. However, these kernels are limited by memory traffic, so throughput is still terrible even after performance tuning (automated or manual).
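A rough back-of-the-envelope calculation shows how big the gap is. The sketch below uses the standard roofline bound (attainable rate is the smaller of peak compute and intensity times memory bandwidth). The peak and bandwidth figures are made-up round numbers, not measurements of any real device, and the traffic counts assume each operand is read or written exactly once.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating point operations per byte of memory traffic."""
    return flops / bytes_moved


def elementwise_add_intensity(n, elem_bytes=4):
    # c[i] = a[i] + b[i]: one flop per element, two reads and one write.
    return arithmetic_intensity(n, 3 * n * elem_bytes)


def matmul_intensity(n, elem_bytes=4):
    # C = A * B for n-by-n matrices: 2*n^3 flops.
    # Idealized traffic: read A and B once, write C once.
    return arithmetic_intensity(2 * n ** 3, 3 * n * n * elem_bytes)


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: limited by either compute or memory bandwidth."""
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    PEAK_GFLOPS = 1000.0    # hypothetical device peak
    BANDWIDTH_GBS = 100.0   # hypothetical memory bandwidth
    for name, ai in [("elementwise add", elementwise_add_intensity(1 << 20)),
                     ("matmul n=1024", matmul_intensity(1024))]:
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        print(f"{name}: {ai:.3f} flops/byte, at most {bound:.1f} GFLOP/s")
```

Under these assumptions the elementwise kernel is capped at under 1% of peak no matter how well it is tuned, while matrix multiply has enough work per byte to reach the compute limit.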

So why have a JIT if the application kernels generated dynamically at runtime are slow?

With current GPU technology, data movement is the dominant cost. The host CPU and compute device GPU are connected by the PCIe bus or some other relatively slow interconnect. (APUs with integrated CPU/GPU may change this somewhat but that’s in the future.) Just as with disk drives and memory, moving data around is expensive. That’s why computers try to minimize this with caches and memory hierarchies.

Computing on the GPU needs to do the same thing.

A GPU is really good at some calculations but bad at everything else. A naive implementation would move data to whichever compute device is best for the current calculation. The problem is that the data would then move too often, and I/O would dominate total cost.

That’s where the JIT comes in. It allows generating kernels for arbitrary compute devices so the data does not have to move. It can stay where it is. Even though the compute device may be slower than others, the cost of moving the data to a faster compute device may exceed the gains.
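The trade-off can be sketched as a tiny cost model: time to run where the data already is, versus time to move it plus time to run on the faster device. Everything here is hypothetical (the `Device` class, `choose_device`, and the device and link speeds are invented for illustration), and it only counts a one-way transfer. It is not the scheduler in my JIT.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    gflops: float  # sustained rate for the kernel in question (hypothetical)


def choose_device(flops, nbytes, current, devices, link_gbs):
    """Pick the device with the lowest estimated time, counting data movement.

    Data already lives on `current`; running anywhere else pays a one-way
    transfer of `nbytes` over a link of `link_gbs` gigabytes per second.
    """
    def cost(dev):
        move = 0.0 if dev.name == current.name else nbytes / (link_gbs * 1e9)
        return move + flops / (dev.gflops * 1e9)
    return min(devices, key=cost)


if __name__ == "__main__":
    cpu = Device("cpu", gflops=10.0)    # made-up numbers
    gpu = Device("gpu", gflops=500.0)
    link = 8.0                          # made-up PCIe-like bandwidth in GB/s

    # Elementwise pass over 100 MB of floats: little work per byte, so stay put.
    n = 25_000_000
    print(choose_device(n, 4 * n, cpu, [cpu, gpu], link).name)

    # 4096 x 4096 matrix multiply: plenty of work per byte, so moving pays off.
    m = 4096
    print(choose_device(2 * m ** 3, 2 * m * m * 4, cpu, [cpu, gpu], link).name)
```

With these numbers the elementwise kernel stays on the slower device and the matrix multiply moves, which is exactly the kind of decision the scheduler and the JIT have to make together.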

That’s why optimization depends so much on scheduling and the JIT working together.

Anyway, the next thing for me to do is add an auto-tuned matrix multiply (which includes data-parallel GEMV when the matrix is common to all threads). The JIT is already designed around doing this. I just need to add the GATLAS back-end into the JIT.
