there are so many cores


Rewriting everything yet again, looks more like PeakStream

I am rewriting everything again. Some of the trees are the same. But the forest is very different.

A few months ago (somewhat shocking to realize how much time has passed), there was only a superficial resemblance between the design and PeakStream’s high-level architecture. I rationalized this as PeakStream’s need to present a clear narrative in marketing the technology and product. Perhaps the ugly details were too confusing. If so, then the architectural diagram from Papakipos’ Stanford presentation (YouTube video, PowerPoint slides) reflected a system “analysis” rather than a “design” viewpoint. In other words, I thought I was right and didn’t believe the engineering and marketing artifacts suggesting I was wrong.

Now I believe otherwise.

Papakipos was entirely open and straightforward about the technology and product during his presentation at Stanford. He said as much as he could under the restrictions of non-disclosure. The architectural diagram was accurate.

The surprise is how well the design now fits PeakStream’s high-level architecture.

      Application Binary
 ---> API                                           GPU compiler
 |    scheduler <--------> JIT compiler <---------- math libs
 |    memory manager            |
 ---- executor ----------> instrument & analyze --> profiler & debugger

During his talk, Papakipos observed that the scheduler is at least as important and technically impressive as the sexy JIT compiler. Most attention goes to the JIT, yet the scheduler is just as vital.

In virtual machine based languages like Java and Python, runtime statistics or other heuristics are used to classify code as hot or not. Only hot code is sent to the JIT for compilation. Cold code is interpreted. Conceptually, between the interpreter and JIT is a binary classifier.
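
To make the classifier idea concrete, here is a minimal sketch of a call-count heuristic sitting between an interpreter and a JIT. The class name, the threshold value and the "compiled"/"interpreted" labels are all my own assumptions for illustration; real virtual machines use more elaborate statistics.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: calls before a function counts as hot


class TieredRuntime:
    """Toy model of the interpreter/JIT split; not any real VM's policy."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.call_counts = Counter()
        self.compiled = set()

    def is_hot(self, name):
        # The binary classifier: hot or not, based on a runtime statistic.
        return self.call_counts[name] >= self.threshold

    def run(self, name):
        self.call_counts[name] += 1
        if name in self.compiled:
            return "compiled"
        if self.is_hot(name):
            self.compiled.add(name)  # stand-in for sending code to the JIT
            return "compiled"
        return "interpreted"


runtime = TieredRuntime()
trace = [runtime.run("inner_loop") for _ in range(5)]
```

The first calls to `inner_loop` are interpreted. Once the count reaches the threshold, every later call takes the compiled path.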

PeakStream has more problems between the interpreter and JIT. It schedules across multiple compute devices of different architectures. Even if a compute device supports concurrent kernel execution, high performance may require folding code (possibly from multiple threads) into single kernels. Conceptually, between the interpreter and JIT is a scheduler.
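
A rough sketch of what "a scheduler instead of a classifier" could mean: pending operations, possibly from several threads, are grouped so that each group becomes one kernel launch on one device. The `Op` fields, the device labels and the grouping key are hypothetical choices of mine. This is not PeakStream's actual scheduling policy, which I do not know.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    thread: str  # which application thread issued the work
    device: str  # hypothetical device label, e.g. "gpu0" or "cpu"
    kind: str    # operation name, e.g. "dgemm"
    size: int    # matrix dimension


def schedule(pending):
    """Group pending ops so that each group becomes one kernel launch.

    Ops targeting the same device with the same kind and size are folded
    together, even when they come from different threads.
    """
    groups = defaultdict(list)
    for op in pending:
        groups[(op.device, op.kind, op.size)].append(op)
    return dict(groups)


pending = [
    Op("t1", "gpu0", "dgemm", 64),
    Op("t2", "gpu0", "dgemm", 64),
    Op("t1", "cpu", "dgemm", 8),
    Op("t2", "gpu0", "dgemm", 64),
]
kernels = schedule(pending)
```

Four operations from two threads collapse into two launches: one folded GPU kernel holding three multiplies, and one CPU kernel.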

Coalesced versus single DGEMM on ATI HD 5870

I would like to justify the statement above about folding code into kernels. The chart above compares coalesced and single DGEMM (double-precision matrix multiply) using images (texture sampling) on a Radeon HD 5870 GPU. Most performance benchmarks use large dense matrices to reach peak throughput. In my experience, this is unrealistic. Real-world problems more often involve either large sparse matrices (solving PDEs and gradient descent methods, which find a solution that minimizes a residual) or large numbers of small dense matrices (computational kernels in quantitative statistical problems, which evaluate the goodness of a strategic solution).

Large kernel execution overhead on GPUs is an issue. That’s one reason why just making the matrices larger leads to higher throughput: the fixed overhead is amortized over more work. Another way is to use fewer GPU kernels for the same set of calculations. In this example, combining many matrix multiplies into a single GPU kernel increased throughput by an order of magnitude for some matrix sizes.
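
The amortization argument can be written as a back-of-envelope model: effective throughput is the useful work divided by a fixed launch overhead plus compute time. The overhead and peak figures below are made-up values for illustration, not measurements from the HD 5870. The model also assumes the kernel computes at peak rate, which real kernels do not.

```python
def throughput_gflops(n, batch, overhead_s, peak_gflops):
    """Effective GFLOPS for `batch` n x n matrix multiplies in one kernel.

    Assumes a fixed launch overhead per kernel and compute at peak rate,
    which is a deliberate simplification.
    """
    flops = 2.0 * n ** 3 * batch  # multiplies and adds in a dense matmul
    compute_s = flops / (peak_gflops * 1e9)
    return flops / (overhead_s + compute_s) / 1e9


# Hypothetical numbers, chosen only for illustration (not measurements).
OVERHEAD_S = 50e-6
PEAK_GFLOPS = 500.0

single = throughput_gflops(64, 1, OVERHEAD_S, PEAK_GFLOPS)
coalesced = throughput_gflops(64, 100, OVERHEAD_S, PEAK_GFLOPS)
```

With these assumed numbers, folding 100 small multiplies into one kernel raises the modeled throughput by more than a factor of ten, while still staying below peak. The same formula shows why larger `n` helps: the `n ** 3` term grows while the overhead stays fixed.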

