there are so many cores

Christmas week features

I basically ignored the feature freeze and have been coding furiously the last week. There are two big changes with new functionality.

  1. Scheduled traces do not terminate when reading back data from the compute device. This allows loops whose stopping condition depends on intermediate results from the calculation.
  2. Data parallelism is exposed directly as vectorized arrays. It is no longer necessary to rely on OpenMP or Pthreads and the gather/scatter scheduler to group execution traces across threads.

Both of these are basic language semantics. Without 1, execution traces would be analogous to a single stage of flat data-parallel map-reduce. Without 2, outer vectorization relies on the scheduler and threads, which is inefficient even in the best case. (By outer vectorization, I mean tiling arrays into streams. Inner vectorization with vectorized data types is performed by the auto-tuning JIT.)
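To make the first point concrete, here is a minimal sketch of the kind of loop it allows, written in plain C++ with no Chai calls. Everything in it is hypothetical and chosen only for illustration: `newton_sqrt_step` stands in for one array computation followed by a scalar read-back, and `iterate_until_converged` is the host loop that decides whether to continue based on that intermediate result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "compute on arrays, then read back one scalar".
// Applies one Newton step toward sqrt(a[i]) for every element and returns
// the largest change made to any element (the intermediate result).
double newton_sqrt_step(const std::vector<double>& a, std::vector<double>& x)
{
    assert(a.size() == x.size());
    double residual = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double next = 0.5 * (x[i] + a[i] / x[i]);
        residual = std::max(residual, std::fabs(next - x[i]));
        x[i] = next;
    }
    return residual;
}

// The loop stops as soon as the read-back value falls below tol, so the
// number of iterations is not known before the calculation starts.
std::size_t iterate_until_converged(const std::vector<double>& a,
                                    std::vector<double>& x,
                                    double tol,
                                    std::size_t max_iters)
{
    std::size_t iters = 0;
    while (iters < max_iters) {
        ++iters;
        if (newton_sqrt_step(a, x) < tol)
            break;
    }
    return iters;
}
```

If a trace had to end at every read-back, each pass through this loop would be its own isolated computation, which is the single-stage picture described above.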

It’s funny that for both features, the first solutions turned out to be failures. They were too slow and didn’t work. I couldn’t figure it out. Even if I could, the performance penalty was a big concern. That forced me to roll back a few days’ work and try again. By that time, I had enough insight to see simpler and more direct solutions which did turn out to work.

I’ve read that reducing development cycle times is a key factor in success. I agree with this. It’s not how good you are today. It’s how fast you learn and evolve. That’s why waterfall development is notorious for poor outcomes. For a group of people trying to solve problems, it makes learning too slow.

Here’s an example of the two ways of expressing parallel array data (i.e. arrays tiled into streams).

Gathering execution traces across threads:

    #pragma omp parallel for
    for (size_t i = 0; i < 10; i++)
    {
        Arrayf64 C;
        {
            Arrayf64 A = make1(100, cpuA[i]); // double cpuA[10][100]
            Arrayf64 B = make1(100, cpuB[i]); // double cpuB[10][100]
            C = sum(A + B);
        }
        result[i] = C.read_scalar(); // one scalar per loop iteration
    }

Vectorized array data in a single execution trace:

    Arrayf64 C;
    {
        Arrayf64 A = make1(100, vecA); // vector<double*> vecA
        Arrayf64 B = make1(100, vecB); // vector<double*> vecB
        C = sum(A + B);
    }
    const vector< double > c = C.read_scalar(10); // ten scalars, one per tile
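For reference, both listings should produce the same ten numbers. The sketch below spells that out in plain C++ with no Chai calls. It rests on an assumption that is only inferred from the `read_scalar` calls above: that `sum(A + B)` reduces each 100-element tile to a single scalar. The names `tile_sum` and `all_tile_sums` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed meaning of sum(A + B) for one tile:
// the sum over k of (a[k] + b[k]).
double tile_sum(const double* a, const double* b, std::size_t n)
{
    double total = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        total += a[k] + b[k];
    return total;
}

// One result per tile. vecA[i] and vecB[i] each point at n doubles,
// mirroring the vector<double*> arguments in the second listing.
std::vector<double> all_tile_sums(const std::vector<double*>& vecA,
                                  const std::vector<double*>& vecB,
                                  std::size_t n)
{
    assert(vecA.size() == vecB.size());
    std::vector<double> result(vecA.size());
    for (std::size_t i = 0; i < vecA.size(); ++i)
        result[i] = tile_sum(vecA[i], vecB[i], n);
    return result;
}
```

The difference between the two listings is where the loop over tiles lives: in the first it is the OpenMP loop in application code, in the second it is implied by the vectorized array data.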

Over the last week, I’ve become more aware of scheduling and JIT overhead. If the runtime spends too much time scheduling and compiling, that could seriously limit performance (Amdahl’s law). Keeping the runtime lean and fast is important to real world throughput.
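A back-of-the-envelope version of that concern, as a small C++ sketch. It uses the standard form of Amdahl's law and treats scheduling plus JIT time as the serial fraction, which is a simplifying assumption for illustration and not a measurement of this runtime. The function name `amdahl_speedup` is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the fraction
// of total time that stays serial (here: scheduling and compiling) and n
// is the number of parallel workers.
double amdahl_speedup(double serial_fraction, double workers)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(workers >= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers);
}
```

For example, if the runtime overhead were 10% of the total time, ten workers would give a speedup of about 5.3, and no number of workers could push it past 10.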

I want to spend another few days performance tuning and fixing bugs. Then I really need to write the conference presentation for PP12. There’s an enormous amount of material here, far too much for 25 minutes.

