there are so many cores

Progress towards a tracing JIT with OpenCL

Design has (again) turned out to be more complex than anticipated. It’s not mysterious, just involved. An intuitive working style appears to be productive. I am able to feel my way to solutions.

However, my intuition about time estimates is usually grossly inaccurate. It’s actually worse than that. My level-of-effort expectations are very often qualitatively wrong. The design solutions found are often surprising and very unlike my preconceptions.

This in itself is interesting. It means I do not know what I do not know. In computer science terms, it means the design process is not easily computable – the algorithm finishes when it does and a priori statements about halting are difficult if not impossible.

I’m not displeased, though.

In the last few days, the overall form of the kernel back-end (for the GPU) has taken firm shape. The same stream data stack is used when evaluating bytecode. Unlike the interpreter that pushes and pops array data, the kernel back-end assembles a graph (right now, it is really a tree but that will change later). When it finishes (determining good stopping points is yet another problem), the interpreter continues from where the kernel back-end left off and forces evaluation of the boxed computation left on the stack.
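To make the idea of a boxed computation concrete, here is a minimal CPU-only sketch of deferred evaluation. It is not the actual back-end: `Boxed`, `leaf`, `add`, `sum` and `read_scalar` are hypothetical names, and plain `std::vector<double>` stands in for device buffers. The only point is that building the expression does no arithmetic until something forces it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a boxed computation: nothing runs until force().
struct Boxed {
    std::function<std::vector<double>()> thunk;  // the deferred work
    std::vector<double> force() const { return thunk(); }
};

// Wrap existing data; forcing it just returns a copy.
Boxed leaf(std::vector<double> data) {
    return Boxed{[data = std::move(data)] { return data; }};
}

// Elementwise add: records the operation, evaluates operands only when forced.
Boxed add(Boxed a, Boxed b) {
    return Boxed{[a = std::move(a), b = std::move(b)] {
        std::vector<double> x = a.force(), y = b.force();
        assert(x.size() == y.size());
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += y[i];
        return x;
    }};
}

// Reduction to a one-element array, still deferred.
Boxed sum(Boxed a) {
    return Boxed{[a = std::move(a)] {
        double acc = 0;
        for (double v : a.force()) acc += v;
        return std::vector<double>{acc};
    }};
}

// Forces evaluation of the whole chain and reads back the scalar.
double read_scalar(const Boxed& b) { return b.force().at(0); }
```

With these definitions, `sum(add(leaf(a), leaf(b)))` only builds nested closures; the loops run when `read_scalar` is called on the result. A real back-end would keep an inspectable graph rather than opaque closures, so that it can generate a kernel from it.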

As a simple example, here’s some input source:

    Arrayf64 C;

    // eagerly create buffers/images on compute device
    Arrayf64 A = make1(100, cpuA);
    Arrayf64 B = make1(100, cpuB);

    // lazy boxed calculation
    C = sum(A + B);

    double c = C.read_scalar(); // force evaluation, read back

The corresponding bytecode stack:

0: read_scalar_f64
1: convert_f64
2: sum
3: operatorADD
4: make1_f64
5: 100
6: 0x7ffffef48d10
7: make1_f64
8: 100
9: 0x7ffffef489f0
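Read top to bottom, the listing looks like a prefix encoding: each operation is followed by its operands. Under that assumption, a tree can be assembled with a short recursive walk. The sketch below is illustrative only: `Node`, `arity` and `build` are hypothetical names, and the operand counts are inferred from the listing rather than taken from the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical tree node: an operation (or literal) plus its operands.
struct Node {
    std::string token;
    std::vector<std::unique_ptr<Node>> args;
};

// Assumed operand counts, inferred from the listing.
// Anything not in the table is treated as a literal leaf.
int arity(const std::string& token) {
    static const std::map<std::string, int> table = {
        {"read_scalar_f64", 1}, {"convert_f64", 1}, {"sum", 1},
        {"operatorADD", 2},     {"make1_f64", 2},
    };
    auto it = table.find(token);
    return it == table.end() ? 0 : it->second;
}

// Consume one complete expression starting at pos (prefix order)
// and return its subtree; pos ends up just past that expression.
std::unique_ptr<Node> build(const std::vector<std::string>& stack,
                            std::size_t& pos) {
    assert(pos < stack.size());
    auto node = std::make_unique<Node>();
    node->token = stack[pos++];
    for (int i = 0, n = arity(node->token); i < n; ++i)
        node->args.push_back(build(stack, pos));
    return node;
}
```

Calling `build` on the ten entries above with `pos` starting at 0 gives a root of `read_scalar_f64`, a chain down through `convert_f64` and `sum`, and an `operatorADD` node whose two children are the `make1_f64` calls.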

The OpenCL source generated:

__kernel void kx(__global double* out_a0, __global double* v0x7ffffef489f0, __global double* v0x7ffffef48d10) {
    double a0 = 0;
    for (int i0 = 0; i0 < 100; i0++) {
        a0 += (v0x7ffffef48d10[i0] + v0x7ffffef489f0[i0]);
    }
    out_a0[get_global_id(0) + get_global_id(1) * 1] = a0;
}

This is extremely primitive at the moment.
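For a rough idea of how source like this can fall out of a tree, here is a small host-side sketch that walks an expression tree and emits a kernel string of the same shape. It is a guess at the general technique, not the real generator: `Expr`, `buffer`, `add_expr`, `emit` and `emitSumKernel` are all hypothetical, and only addition wrapped in a sequential sum is handled.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree for the body of the reduction.
struct Expr {
    std::string name;                // buffer name when this is a leaf
    std::unique_ptr<Expr> lhs, rhs;  // both set when this is an addition
};

std::unique_ptr<Expr> buffer(std::string name) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(name);
    return e;
}

std::unique_ptr<Expr> add_expr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Emit the expression for one element, indexed by the loop variable idx.
std::string emit(const Expr& e, const std::string& idx) {
    if (!e.lhs) return e.name + "[" + idx + "]";
    assert(e.rhs && "an addition needs both operands");
    return "(" + emit(*e.lhs, idx) + " + " + emit(*e.rhs, idx) + ")";
}

// Wrap the expression in a sequential sum over n elements.
std::string emitSumKernel(const Expr& body,
                          const std::vector<std::string>& bufs, int n) {
    std::ostringstream os;
    os << "__kernel void kx(__global double* out_a0";
    for (const auto& b : bufs) os << ", __global double* " << b;
    os << ") {\n"
       << "double a0 = 0;\n"
       << "for (int i0 = 0; i0 < " << n << "; i0++) {\n"
       << "a0 += " << emit(body, "i0") << ";\n"
       << "}\n"
       << "out_a0[get_global_id(0) + get_global_id(1) * 1] = a0;\n"
       << "}\n";
    return os.str();
}
```

Given `add_expr(buffer("A"), buffer("B"))` and a length of 100, `emitSumKernel` produces a loop whose body is `a0 += (A[i0] + B[i0]);`. Like the generated kernel above, every work-item repeats the whole sum, so this says nothing yet about doing the reduction in parallel.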

