there are so many cores

Just another site

Scheduling, memory management and sociopolitical complexity


I thought an interpreter would be in-place by now. The only work was porting old code. One week should have been enough time. Instead, last week was shaking out crashes due to interactions between scheduling and memory management.

It seems simple in hindsight.

The scheduler is a gather/scatter vectorizer:

  1. Gather work from user application threads (that block and wait).
  2. Package work into kernels (vectorization happens here, a significant optimization for GPUs).
  3. Schedule work on compute devices (which may be a mix of CPU cores, GPUs and APUs).
  4. Scatter output results back to the user application (threads unblock).

The scheduling policy is: choose the fastest compute device ready for work. For a given kernel, compute devices may be: faster; slower; not work at all.

The scheduler has no a priori way of knowing kernel and compute device compatibility. This is often unknowable. My experience with OpenCL on both ATI and nVidia GPUs is edge cases from the compiler and driver do happen in practice (bugs or similar). Testing a kernel on a compute device and checking the output data is the only safe way to know if: 1. a kernel runs on a compute device; 2. the output data is not garbage.

If a kernel fails for a compute device, the scheduler de-packages the work and waits for another device to become available. More work may arrive from the user application during this time. Once another compute device becomes available, the work is packaged again and scheduled. The compute devices of last resort, that always work, are the interpreters (one per CPU core).

This is just the solution I came up with. I am not claiming it is the best or even stable (haven’t thought about resource starvation, may need to add a priority mechanism – I’ll worry about this later).

OK, that’s nice but what about last week’s troubles?

Array memory is reference counted from user application code (well hidden behind the PeakStream API, of course, as C++ allows overloading everything). The scheduler is responsible for tiling this array memory when it packages kernels. So the actual array memory addresses are not known until kernels are scheduled. Yet, reference counting precedes and follow scheduling in user code. The solution is to box the memory allocation. The box is reference counted, not the memory buffer inside the box.

What tripped me up was the possibility of kernel failure. Once I started simulating this (some devices work and others do not), random crashes started. It was not rocket science. I just did not understand what was going on.

So today I am starting on the interpreter (like last week).



This project does not feel that complicated (even though it is not nearly working yet). I know this feeling is absolutely wrong. This project is a monster for one person.

PeakStream, at the very end when Google acquired them, was 35 people. Technology has advanced in five years, i.e. LLVM, OpenCL and OpenMP solve many problems. I have no administrative, marketing or support costs. All I have to do is design, code and update this blog. That is still a lot.

The last book I read was Joseph Tainter‘s The Collapse of Complex Societies. This reminded me very much of the concept of risk and complexity in statistical learning theory.

In both cases, past an optimal tradeoff, the risk of failure increases with solution complexity. In statistics, this is called overfitting. In societies, this is called collapse.

I wonder if software platforms follow similar patterns of development. They increase in complexity and offer more benefits. Eventually, the costs of “more” outweigh the gains. That’s when platforms stagnate and die as society seeks cheaper alternatives.

A few years ago at Amazon, I attended a talk on multi-threading and concurrency techniques in Java. The speaker predicted hundreds or thousands of processor cores would become common. Mainstream programming languages, generally imperative and based on shared memory, would not scale. The future would require something new (perhaps functional and message passing).

But hasn’t the highly concurrent and parallel software future always been just around the corner? It never seems to happen, at least not in the sense of general applications programming. It remains in specialized niches. The cost of wider use may be too high.

However, there may be a flip side to this. Even though we may not like a solution, it may still be the best one available. The alternatives are worse.

Specifically, the dominant high-level programming languages today (ignoring pure web application languages like JavaScript and PHP): C++, Java, Perl, Python, Ruby, etc are likely optimal for what they do. That’s why we use them. Also, the cost-effective approach to parallel computing is stream processing on GPUs. This is not especially elegant (unlike the STI Cell or Intel’s cancelled Larrabee). But it works and is cheap. It is a solution society has embraced as economical.

In the same way, PeakStream’s approach of embedding a domain specific language for stream processing may be an optimal one. Making a new language is crazy. Huge costs for everyone. Making an old language smart enough to use both CPUs and GPUs is also crazy. It’s too hard. Not that it can’t be done. Just that the practical costs of compiler development make it impractical.

I suppose that this project is far enough along to imagine the technology working. That leads me to wonder if this is the right problem to solve (or is a solution looking for a problem).



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: