there are so many cores

Just another site

resurrected interpreter from visitor hell

I wrote an interpreter back-end that allows running PeakStream examples on the CPU. It supports the entire API (which will continue to evolve). This gives me a lot of confidence in the basic design. It is somewhat complex internally as mutable state is never shared (with one exception at the highest level in the executive, and that is MROW protected). This should allow lock-free concurrency most of the time (if locks inside standard library allocators are ignored – this may become an issue later).

Here are three samples from Public Google PeakStream.ppt that now work.

Conjugate gradient example from page 13:

int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
    int iter;
    Arrayf32 x = Arrayf32::zeros(N);
        Arrayf32 A = Arrayf32::make2(N, N, cpuA);
        Arrayf32 b = Arrayf32::make1(N, cpub);
        Arrayf32 residuals = b - matmul(A, x);
        Arrayf32 p = residuals;
        Arrayf32 newRR = dot_product(residuals, residuals);

        for (iter = 0; iter < N; iter++) {
            Arrayf32 oldRR = newRR;
            Arrayf32 newX, newP, newResiduals;
            Arrayf32 Ap = matmul(A, p);
            Arrayf32 dp = dot_product(p, Ap);
            newX = x + p * oldRR / dp;
            newResiduals = residuals - Ap * oldRR / dp;
            newRR = dot_product(newResiduals, newResiduals);
            newP = newResiduals + p * newRR / oldRR;

            p = newP;
            residuals = newResiduals;

            float oldRRcpu = oldRR.read_scalar();
            if (oldRRcpu <= TOLERANCE) {
            x = newX;
    x.read1(cpux, N * sizeof(float));
    return iter;

Monte-carlo π calculation from page 17:

    RNGf32 G(SP_RNG_DEFAULT, 271828); // create an RNG
    Arrayf32 X = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 Y = rng_uniform_make(G, NSET, 1, 0.0, 1.0);
    Arrayf32 distance_from_zero = sqrt(X * X + Y * Y);
    Arrayf32 inside_circle = (distance_from_zero <= 1.0f);
    return 4.0f * sum(inside_circle) / NSET;

Arrayf32 Pi = compute_pi();  // get the answer as a 1x1 array
float pi = Pi.read_scalar(); // convert answer to a simple float
printf("Value of Pi = %f\n", pi);

Monte-carlo options pricing from page 36:

float MonteCarloAntithetic(float price,
                           float strike,
                           float vol,
                           float rate,
                           float div,
                           float T)
    float deltat        = T / N;
    float muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU       = 0.0f;
    Arrayf32 meanSP; // result
    {                // a new scope to hold temporary arrays
        RNGf32 rng_hndl(SP_RNG_CEICG12M6, 0);
        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++) {
            U += rng_normal_make(rng_hndl, M);
        Arrayf32 values;
            Arrayf32 lnS1 = log(price) + N * muDeltat
                                       + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat
                                       + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * ( max(0, S1 - strike) +
                              max(0, S2 - strike) )
                          * exp(-rate * T));
        meanSP = mean(values);
    }                // all temporaries released as we exit scope
    meanCPU = meanSP.read_scalar();
    return meanCPU; 

Getting the code samples above to work has been an interesting exercise. I can see how “Perl happens” or “C++ happens”… that’s when platform implementation comes first and language grammar is an afterthought. This tends to create dark corners (often touted as clever features by experts) in which semantics surprise the unwary. I am trying to avoid these situations.

The next major step is the JIT, the heart and soul of the system. I don’t know enough to make any accurate statements other than it will take much more time to do. So to keep things rolling, I am going to start working on multiple areas in order to pipeline work.

One of the big advantages of PeakStream over its competitors at the time was ease-of-use. The virtual machine and JIT automated more work the developer would otherwise have to do. However, many software developers, data analysts and quantitative scientists are still very uncomfortable with C++ and would prefer different languages.

The PeakStream platform had a language binding to C++ only. I think it would be nice to have bindings to more languages. Two I am thinking of now are Ruby and Scala. They are DSL friendly and may support the imperative syntax required for intuitively natural programming.

This sounds crazy! GPGPU in Ruby?! But I think it makes a lot of sense. Very often, scripting languages are the best tool for data extraction, transforming and reporting. The weakness is any sort of numerically intense analytical processing. Scripting languages are much slower than native code or the JVM. As a result, computational pipelines often have multiple stages with different language platforms. Systems become more complex as a whole.

More generally, native code fluency is a requirement for GPGPU even today. Wouldn’t it be nice if using GPUs for general purpose computation did not require knowing C or C++ at all? GPGPU would then become normal instead of something exotic that few people know about. It would be like floating point. It is part of the language and happens automatically for you.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: