I’ve been working on the multi-threaded scheduler (using Pthreads) for the last two weeks. The design gathers execution traces from user application threads and scatters them to compute device threads. I am embracing OpenMP and Pthreads equally. I don’t know if this is what PeakStream intended. All of the sample code in the documentation is single-threaded.
During his Stanford presentation in 2007, Papakipos said the scheduler was as complex and impressive as the JIT. Then I read this: “Right now, it only compiles to CTM or to a CPU backend…” which suggests that PeakStream had quite a bit more technology than was ever released in their public beta.
Here’s the sort-of weird aspect of PeakStream as a managed high-level language. The virtual machine is a peer to native code. You don’t have to use it. This is both a strength and a weakness.
The strength is obvious. You can freely intermingle native code and PeakStream stuff without any sort of foreign function interfacing. It’s almost seamless.
The weakness is that the virtual machine does not manage everything. You can do stuff in native code the virtual machine is completely unaware of. This limits laziness. For instance:
A.read1(10 * sizeof(double), cpuA);
cout << cpuA; // virtual machine does not know about this
The read_scalar(), read1() and read2() functions force unboxing. Read-out data is written directly into unmanaged arrays of fundamental types. There is no opportunity for indirection or laziness.
This can become a performance issue. A significant GPGPU optimization is fusing and batching kernels inside a loop. This minimizes per-kernel overheads.
Here are some of the ideas I had to address this lack of laziness.
- create boxed array objects (like a smart pointer) for application code to use
- extend the API with lazyread_scalar(), lazyread1() and lazyread2() and rely on application code to avoid reading or writing until later
- extend the API to return managed arrays
PeakStream didn’t do any of this. That’s why their API is pretty simple and straightforward. I decided to stick with this.
That’s why I opted for a multi-threaded solution. Application threads act as a peculiar form of kernel boxing (it reminds me of Erlang processes). Selective scheduling can fuse and batch kernels, as each one corresponds to a waiting application thread. This is a little ugly with Pthreads. It’s not so bad with OpenMP.
#pragma omp parallel for num_threads(20) default(none) shared(N)
for (size_t i = 0; i < 1000; i++)
{
    double cpuA[N*N], cpuB[N*N];
    Arrayf64 A = make2(N, N, cpuA);
    Arrayf64 B = make2(N, N, cpuB);
    Arrayf64 C = matmul(A, B);
}
It takes a few iterations before all 20 threads have been seen by the scheduler. From that point on, matrix multiplies can be processed in batches of 20. (Yes, the example is dumb, but it illustrates the idiom.)