Scheduling, memory management and sociopolitical complexity

<status>

I thought an interpreter would be in place by now. The only work was porting old code, and one week should have been enough time. Instead, last week went to shaking out crashes caused by interactions between scheduling and memory management.

It seems simple in hindsight.

The scheduler is a gather/scatter vectorizer:

  1. Gather work from user application threads (that block and wait).
  2. Package work into kernels (vectorization happens here, a significant optimization for GPUs).
  3. Schedule work on compute devices (which may be a mix of CPU cores, GPUs and APUs).
  4. Scatter output results back to the user application (threads unblock).

The scheduling policy is: choose the fastest compute device ready for work. For a given kernel, a compute device may be faster, slower, or may not work at all.
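
Here is a minimal sketch of that policy in C++. The types and member functions are hypothetical, just to illustrate the idea, not actual project code:

#include <cstddef>
#include <vector>

// Hypothetical device interface; the real scheduler's abstraction is richer.
struct ComputeDevice
{
    virtual bool   readyForWork() const = 0;
    virtual double relativeSpeed() const = 0;  // higher means faster
    virtual ~ComputeDevice() { }
};

// Pick the fastest compute device that is currently ready for work.
ComputeDevice* pickDevice(const std::vector<ComputeDevice*>& devices)
{
    ComputeDevice* best = NULL;
    for (size_t i = 0; i < devices.size(); i++)
    {
        ComputeDevice* d = devices[i];
        if (! d->readyForWork())
            continue;                          // busy, skip it
        if (NULL == best || d->relativeSpeed() > best->relativeSpeed())
            best = d;                          // fastest ready device so far
    }
    return best;                               // NULL means nothing is ready yet
}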

The scheduler has no a priori way of knowing kernel and compute device compatibility. This is often unknowable. My experience with OpenCL on both ATI and nVidia GPUs is that edge cases in the compiler and driver do happen in practice (bugs or similar). Testing a kernel on a compute device and checking the output data is the only safe way to know that: 1. the kernel runs on the device; 2. the output data is not garbage.

If a kernel fails for a compute device, the scheduler de-packages the work and waits for another device to become available. More work may arrive from the user application during this time. Once another compute device becomes available, the work is packaged again and scheduled. The compute devices of last resort, the ones that always work, are the interpreters (one per CPU core).
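
Continuing the sketch (reusing the hypothetical ComputeDevice above, with an equally hypothetical WorkPackage), the dispatch step tries the chosen device, checks the output, and on failure de-packages the work so it can be re-packaged and re-scheduled later:

// Hypothetical sketch of one scheduling attempt and its failure path.
struct WorkPackage
{
    bool runOn(ComputeDevice& dev);   // JIT and launch the kernel; false if anything fails
    bool outputLooksValid() const;    // check the output data is not garbage
    void depackage();                 // split back into individual work items
};

bool tryDispatch(WorkPackage& pkg, ComputeDevice& dev)
{
    if (pkg.runOn(dev) && pkg.outputLooksValid())
        return true;                  // success: scatter results back to user threads

    pkg.depackage();                  // failure: the work waits for another device
    return false;                     // (worst case, an interpreter always works)
}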

This is just the solution I came up with. I am not claiming it is the best or even stable (haven’t thought about resource starvation, may need to add a priority mechanism – I’ll worry about this later).

OK, that’s nice but what about last week’s troubles?

Array memory is reference counted from user application code (well hidden behind the PeakStream API, of course, as C++ allows overloading everything). The scheduler is responsible for tiling this array memory when it packages kernels. So the actual array memory addresses are not known until kernels are scheduled. Yet reference counting precedes and follows scheduling in user code. The solution is to box the memory allocation. The box is reference counted, not the memory buffer inside the box.
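
The box might look something like this (a hypothetical sketch with made-up names; it ignores thread safety, which the real thing cannot):

#include <cstddef>

// Reference counted box around a lazily allocated array buffer.
class ArrayBox
{
    size_t  _refs;     // manipulated by the user-facing array objects
    double* _buffer;   // stays NULL until the scheduler packages a kernel
    size_t  _size;     // and the tiled memory layout is finally known

public:
    ArrayBox() : _refs(1), _buffer(0), _size(0) { }
    ~ArrayBox() { delete [] _buffer; }

    void addRef()  { ++_refs; }
    void release() { if (0 == --_refs) delete this; }

    // called at scheduling time, once the actual addresses exist
    void bind(double* buffer, size_t size) { _buffer = buffer; _size = size; }
};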

What tripped me up was the possibility of kernel failure. Once I started simulating this (some devices work and others do not), random crashes started. It was not rocket science. I just did not understand what was going on.

So today I am starting on the interpreter (like last week).

</status>

<philosophy>

This project does not feel that complicated (even though it is nowhere near working yet). I know this feeling is absolutely wrong. This project is a monster for one person.

PeakStream, at the very end when Google acquired them, was 35 people. Technology has advanced in five years, e.g. LLVM, OpenCL and OpenMP solve many problems. I have no administrative, marketing or support costs. All I have to do is design, code and update this blog. That is still a lot.

The last book I read was Joseph Tainter’s The Collapse of Complex Societies. This reminded me very much of the concept of risk and complexity in statistical learning theory.

In both cases, past an optimal tradeoff, the risk of failure increases with solution complexity. In statistics, this is called overfitting. In societies, this is called collapse.

I wonder if software platforms follow similar patterns of development. They increase in complexity and offer more benefits. Eventually, the costs of “more” outweigh the gains. That’s when platforms stagnate and die as society seeks cheaper alternatives.

A few years ago at Amazon, I attended a talk on multi-threading and concurrency techniques in Java. The speaker predicted hundreds or thousands of processor cores would become common. Mainstream programming languages, generally imperative and based on shared memory, would not scale. The future would require something new (perhaps functional and message passing).

But hasn’t the highly concurrent and parallel software future always been just around the corner? It never seems to happen, at least not in the sense of general applications programming. It remains in specialized niches. The cost of wider use may be too high.

However, there may be a flip side to this. Even though we may not like a solution, it may still be the best one available. The alternatives are worse.

Specifically, the dominant high-level programming languages today (ignoring pure web application languages like JavaScript and PHP), such as C++, Java, Perl, Python and Ruby, are likely optimal for what they do. That’s why we use them. Also, the cost-effective approach to parallel computing is stream processing on GPUs. This is not especially elegant (unlike the STI Cell or Intel’s cancelled Larrabee). But it works and is cheap. It is a solution society has embraced as economical.

In the same way, PeakStream’s approach of embedding a domain specific language for stream processing may be an optimal one. Making a new language is crazy. Huge costs for everyone. Making an old language smart enough to use both CPUs and GPUs is also crazy. It’s too hard. Not that it can’t be done. Just that the practical costs of compiler development make it impractical.

I suppose that this project is far enough along to imagine the technology working. That leads me to wonder if this is the right problem to solve (or if it is a solution looking for a problem).

</philosophy>

Scheduling connects all major subsystems

Prior to this rewrite, I had focused on interpreting the PeakStream language and then on JIT compilation using an OpenCL back-end. I had to stop when it became obvious that scheduling was upstream of the JIT. In fact, it turns out the scheduler is the hub that connects all major subsystems.

I normally think of scheduling like a checkout line from which customers are directed to one of many cashiers. The objective in this case is some combination of high throughput, minimal latency or optimal utilization. Note how each customer is serviced individually and remains isolated from others.

The PeakStream scheduler is not like this.

It is more like passengers waiting at an airport gate to fly somewhere. The objective in this case is to have planes as full as possible. Empty seats lose money. Passengers are packaged together in planes and serviced together as a group when the plane takes off and flies to a destination.

Scheduling for PeakStream is a way of collecting stream computations and packaging them together in GPU kernels. Together with the scheduler, the JIT finds an efficient way to do this. Computations are performed when the kernel executes. Memory and data layouts are a side-effect of the packaging solutions found by the scheduler and JIT.

The upshot is scheduling, JIT compilation and memory management are closely related in the PeakStream approach.

At this point, I’ve written working prototypes for all of the major subsystems, the minimal set necessary for end-to-end functionality. However, I have not written them all at the same time within the same code base. This project has undergone several complete rewrites.

  1. API
  2. interpreter
  3. JIT with OpenCL
  4. scheduler
  5. memory manager
  6. executor

With the latest major rewrite (which I expect is the last one, at least for the core platform), there is now a connected API, scheduler, memory manager and executor. The next effort is merging back in the interpreter and then the very primitive JIT prototype I had written earlier.

This is not good enough to be useful.

The JIT must have auto-tuning capability with a kernel cache (I did this before in GATLAS) to be worth anything. There must be a math kernel library, even if it contains nothing else but GEMM.

One thing I am rather ignorant of is random number generation, on GPUs or otherwise. I have to admit that I have no experience with Monte Carlo methods in real life (which is often a lot dirtier, more hand-crafted and heuristic than in theory). This forces me to again rely heavily on PeakStream’s trail of artifacts: Pseudorandom Number Generation on the GPU.

Laziness in scheduling uses threads to box kernels

I’ve been working on the multi-threaded scheduler (using Pthreads) for the last two weeks. The design gathers execution traces from user application threads and scatters the work to compute device threads. I am embracing OpenMP and Pthreads equally. I don’t know if this is what PeakStream intended to do. All of the sample code from the documentation is single-threaded.

During his Stanford presentation in 2007, Papakipos said the scheduler was as complex and impressive as the JIT. Then I read this: “Right now, it only compiles to CTM or to a CPU backend…” which suggests that PeakStream had quite a bit more technology than was ever released in their public beta.

Here’s the sort-of weird aspect of PeakStream as a managed high-level language. The virtual machine is a peer to native code. You don’t have to use it. This is both a strength and a weakness.

The strength is obvious. You can freely intermingle native code and PeakStream stuff without any sort of foreign function interfacing. It’s almost seamless.

The weakness is that the virtual machine does not manage everything. You can do stuff in native code the virtual machine is completely unaware of. This limits laziness. For instance:

double cpuA[10];
A.read1(10 * sizeof(double), cpuA); // A is a managed array computed earlier
cout << cpuA[0]; // virtual machine does not know about this

The read_scalar(), read1() and read2() functions force unboxing. Read-out data is written directly into unmanaged arrays of fundamental types. There is no opportunity for indirection and laziness.

This can become a performance issue. A significant GPGPU optimization is fusing and batching kernels inside a loop. This minimizes per-kernel overheads.

Here are some of the ideas I had to address this lack of laziness.

  1. create boxed array objects (like a smart pointer) for application code to use
  2. extend the API with lazyread_scalar(), lazyread1() and lazyread2() and rely on application code to avoid reading or writing until later
  3. extend the API to return managed arrays

PeakStream didn’t do any of this. That’s why their API is pretty simple and straightforward. I decided to stick with it.

That’s why I opted for a multi-threaded solution. Application threads can act as a peculiar form of kernel boxing (which makes me think of Erlang processes). Selective scheduling can fuse and batch kernels, as each one corresponds to a waiting application thread. This is a little ugly with Pthreads. It’s not so bad with OpenMP.

// N is assumed to be a compile-time constant, so it needs no
// data-sharing clause under default(none)
#pragma omp parallel for num_threads(20) default(none)
for (size_t i = 0; i < 1000; i++)
{
    double cpuA[N*N], cpuB[N*N];        // per-thread input buffers (uninitialized here)
    Arrayf64 A = make2(N, N, cpuA);     // managed arrays wrap the buffers
    Arrayf64 B = make2(N, N, cpuB);
    Arrayf64 C = matmul(A, B);          // lazy; recorded in the thread's execution trace
    double cpuC[N*N];
    C.read1(cpuC, N*N);                 // forces evaluation and read-back
}

It takes a few iterations before all 20 threads have been seen by the scheduler. From that point on, matrix multiplies can be processed in batches of 20. (Yes, the example is dumb, but it illustrates the idiom.)
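
For comparison, here is roughly what the same idiom looks like with raw Pthreads. This is only a sketch (it assumes the same PeakStream-style API and compile-time constant N as the example above); the manual thread bookkeeping is the part I find a little ugly:

#include <pthread.h>
#include <cstddef>

void* worker(void*)
{
    for (size_t i = 0; i < 50; i++)   // 20 threads x 50 iterations = 1000 total
    {
        double cpuA[N*N], cpuB[N*N];
        Arrayf64 A = make2(N, N, cpuA);
        Arrayf64 B = make2(N, N, cpuB);
        Arrayf64 C = matmul(A, B);
        double cpuC[N*N];
        C.read1(cpuC, N*N);
    }
    return NULL;
}

int main()
{
    pthread_t tid[20];
    for (int i = 0; i < 20; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 20; i++)
        pthread_join(tid[i], NULL);
    return 0;
}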