five bugs.
- reference counted array memory not thread-safe
- zero elapsed kernel run time treated as compute device failure
- scheduler assumed there is always a fastest known compute device
- memory manager confused by incremental trace with too much history
- interpreter left uninitialized data in arrays of ones and zeros
It’s kind of shocking, actually.
The last bug, the one in the interpreter, took the most work to find. Yet it was the dumbest. The for-loop looked like this:
for (int j = 0; i < W * H; i++)
m->floatPtr()[i] = _floatValue;
That is so obviously wrong: the loop declares j but tests and increments i. It so happens this works just fine in the single-threaded case.
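For comparison, here is a minimal sketch of what the loop was presumably meant to do, assuming the intent was simply to write one constant into all W * H elements. The function fillConstant and the std::vector buffer are hypothetical stand-ins for illustration, not the interpreter's actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not the interpreter's real code): build a W x H
// buffer in which every element holds the same constant.
std::vector<float> fillConstant(std::size_t W, std::size_t H, float value)
{
    std::vector<float> data(W * H);

    // The index is declared, tested, and incremented under one name,
    // so the loop cannot silently pick up some other variable called i.
    for (std::size_t i = 0; i < W * H; i++)
        data[i] = value;

    return data;
}
```

Declaring the index inside the for statement keeps it local to the loop, so nothing outside the loop can change it.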
Some lessons from this:
- there is no magic, only bugs and stuff not understood yet
- a stress testing and validation suite is really necessary
- if the initial release includes bugs like this, it will be dead on arrival
Now I can get back to the JIT.
What prompted the discovery of so many bugs was the set of middle-end JIT optimizations I’m working on. One of them is “lifting” BLAS level 2 operations (GEMV) to BLAS level 3 (GEMM). If there are N threads/traces each doing a matrix/vector multiply with the same matrix, those can be combined into a single matrix/matrix multiply. This is a huge optimization. For a discrete GPU that is I/O bound by data transfer over the PCIe bus, it can easily mean a 100x increase in throughput.
BEFORE WITH N TRACES, WHERE i = 0 .. N-1
Arrayf64 A = Arrayf64::make2(N, N, cpuA);
Arrayf64 p = Arrayf64::make1(N, cpuP[i]);
Arrayf64 Ap = matmul(A, p);

AFTER WITH 1 VECTORIZED TRACE
Arrayf64 A = Arrayf64::make2(N, N, cpuA);
Arrayf64 P = Arrayf64::make2(N, N, cpuP);
Arrayf64 AP = matmul(A, P);
It actually works a little differently than the PeakStream code above suggests. But the transformation is conceptually the same.
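To make the equivalence concrete, here is a small plain C++ sketch, independent of PeakStream and of the JIT itself. It only illustrates the math. The Matrix struct and the functions gemv, gemm, and packColumns are hypothetical names invented for this example. The idea is that if the per-trace vectors are packed as the columns of one matrix P, then column i of A * P equals A times vector i.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix, standing in for the array type.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// BLAS level 2 style: y = A * x, one call per trace.
std::vector<double> gemv(const Matrix& A, const std::vector<double>& x)
{
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t r = 0; r < A.rows; r++)
        for (std::size_t k = 0; k < A.cols; k++)
            y[r] += A.at(r, k) * x[k];
    return y;
}

// BLAS level 3 style: C = A * B, one call for all traces.
Matrix gemm(const Matrix& A, const Matrix& B)
{
    Matrix C(A.rows, B.cols);
    for (std::size_t r = 0; r < A.rows; r++)
        for (std::size_t k = 0; k < A.cols; k++)
            for (std::size_t c = 0; c < B.cols; c++)
                C.at(r, c) += A.at(r, k) * B.at(k, c);
    return C;
}

// Pack the per-trace vectors as the columns of a single matrix.
Matrix packColumns(const std::vector<std::vector<double>>& vectors)
{
    Matrix P(vectors.at(0).size(), vectors.size());
    for (std::size_t c = 0; c < vectors.size(); c++)
        for (std::size_t r = 0; r < P.rows; r++)
            P.at(r, c) = vectors[c][r];
    return P;
}
```

Calling gemm(A, packColumns(ps)) once yields the same numbers as calling gemv(A, ps[i]) for every i. The arithmetic is identical. What changes is that the matrix A is read once instead of N times, which is where the savings on a transfer-bound device would come from.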