Here’s a simple example of inline OpenCL with the Chai DSL.
const string sourceText = "__kernel "
"void foo(__global float* a, const float b)"
"{"
" a[get_global_id(0)] += b;"
"}";
Arrayf32 A = make1(1000, inputA, ARR_MEMBUF1); // constrain JIT memory choice
A += 123; // regular DSL code before
CL(sourceText) << A << 1.0f; // inline OpenCL kernel
A += 456; // regular DSL code after
A.read1(outputA, 1000 * sizeof(float)); // evaluate and read back result
This becomes more complicated with anything realistic.
It’s not hard to support local memory and work group dimensions. Note that order of insertion matters for kernel arguments only. The work group dimensions could appear in any order.
CL(sourceText) << Localf32(100) << GlobalWork(1000) << LocalWork(100);
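To make the ordering rule concrete, here is a minimal sketch of how an insertion-operator builder could treat the two kinds of operands differently. It is not Chai's implementation: `KernelCall`, `LocalMem`, and the simplified `GlobalWork` / `LocalWork` structs are hypothetical stand-ins. Kernel arguments are appended to a list, so their position is significant, while each work group dimension lands in its own slot.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Chai's types.
struct GlobalWork { std::size_t size; };
struct LocalWork  { std::size_t size; };
struct LocalMem   { std::size_t count; };  // plays the role of Localf32

class KernelCall {
public:
    explicit KernelCall(std::string source) : source_(std::move(source)) {}

    // Kernel arguments are appended, so insertion order fixes the argument index.
    KernelCall& operator<<(float) {
        args_.push_back("float");
        return *this;
    }
    KernelCall& operator<<(LocalMem m) {
        args_.push_back("local[" + std::to_string(m.count) + "]");
        return *this;
    }

    // Work dimensions each have a dedicated slot, so they may appear anywhere.
    KernelCall& operator<<(GlobalWork g) { global_ = g.size; return *this; }
    KernelCall& operator<<(LocalWork l)  { local_ = l.size;  return *this; }

    const std::string& source() const { return source_; }
    const std::vector<std::string>& args() const { return args_; }
    std::size_t globalWork() const { return global_; }
    std::size_t localWork() const { return local_; }

private:
    std::string source_;
    std::vector<std::string> args_;  // ordered kernel arguments
    std::size_t global_ = 0;         // global work size
    std::size_t local_ = 0;          // work group size
};
```

With this sketch, `call << 1.0f << LocalMem{100} << GlobalWork{1000} << LocalWork{100}` and `call << LocalWork{100} << 1.0f << GlobalWork{1000} << LocalMem{100}` describe the same launch. Swapping `1.0f` and `LocalMem{100}` does not.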
Does the following look better? Perhaps a variadic function call is more natural and easier to use?
CL(sourceText, A, 1.0f, Localf32(100), GlobalWork(1000), LocalWork(100));
I am inclined to use the insertion operator idiom as it allows more flexibility. A programmer may wish to pass different arguments to the inline OpenCL kernel at runtime. That is more awkward with the variadic style. However, this debate reminds me of Perl’s philosophy: allow more than one solution. So I’ll probably end up supporting both ways.
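Supporting both styles need not mean two implementations. As a rough sketch (assuming C++17, and using a hypothetical `InlineCall` class in place of whatever `CL(...)` really returns), the variadic form can be a thin wrapper that folds its arguments through the insertion operator. The second function shows the runtime flexibility that is awkward to get with a fixed variadic call.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the object returned by CL(...); not Chai's type.
class InlineCall {
public:
    explicit InlineCall(std::string source) : source_(std::move(source)) {}

    // Each insertion records one kernel argument (only its type, for brevity).
    InlineCall& operator<<(float) { args_.push_back("float"); return *this; }
    InlineCall& operator<<(int)   { args_.push_back("int");   return *this; }

    const std::string& source() const { return source_; }
    const std::vector<std::string>& args() const { return args_; }

private:
    std::string source_;
    std::vector<std::string> args_;
};

// Variadic style built on top of insertion style (C++17 fold expression).
// Expands to ((call << a1) << a2) << ... and to just `call` for no arguments.
template <typename... Args>
InlineCall makeCall(const std::string& source, Args&&... args) {
    InlineCall call(source);
    (call << ... << std::forward<Args>(args));
    return call;
}

// Insertion style: the argument list is decided while the program runs.
InlineCall makeCallAtRuntime(const std::string& source, bool withExtraArg) {
    InlineCall call(source);
    call << 1.0f;
    if (withExtraArg) {
        call << 2;  // only passed when requested
    }
    return call;
}
```

Here `makeCall("src", 1.0f, 2)` and `makeCallAtRuntime("src", true)` build the same argument list, so the variadic call is just a convenience over the more flexible form.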
One consequence of inline OpenCL is breaking the interpreter. The simple interpreter cannot emulate the inline calculations without going through all the machinery of OpenCL. So why not have the interpreter do exactly that and run the kernel on a CPU device? Unfortunately, not all OpenCL implementations support CPU compute devices (e.g. Nvidia's does not).
This means that use of inline OpenCL affects scheduling. Execution traces with inline OpenCL must go to the JIT only. They don’t have the option to be interpreted. That’s OK now, as Chai supports the full language. Random number generation was the last missing piece.
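The scheduling rule boils down to a single check. The sketch below is only an illustration of that rule. `TraceInfo`, `Backend`, and `chooseBackend` are hypothetical names, and the real scheduler surely weighs more than two flags.

```cpp
// Hypothetical backend choice for one execution trace; not Chai's scheduler.
enum class Backend { Interpreter, Jit };

struct TraceInfo {
    bool usesInlineOpenCL;    // trace contains at least one inline kernel
    bool prefersInterpreter;  // what the scheduler would otherwise pick
};

Backend chooseBackend(const TraceInfo& trace) {
    // Inline OpenCL cannot be emulated by the interpreter, so the JIT is forced.
    if (trace.usesInlineOpenCL) {
        return Backend::Jit;
    }
    // Without inline OpenCL, the scheduler keeps its usual freedom.
    return trace.prefersInterpreter ? Backend::Interpreter : Backend::Jit;
}
```

A trace with inline OpenCL always lands on the JIT, whatever the scheduler would have preferred for it.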
Also, multi-threaded scheduling support has been pretty badly broken for a while now. As originally designed, I had a concept of gathering similar threads into a vector, enqueuing a fused kernel, and scattering the results back. Pretty slick – a multi-thread gather/scatter vectorizing scheduler (which sounds ridiculous just writing this). After data parallel support was added, it was clear that this MT gather/scatter feature would never be used. It adds too much overhead. If you want data parallel, then just put the data into vectors directly.
Multi-threaded support is important. But the gather/scatter in the scheduler makes no sense.