there are so many cores


Example: Monte Carlo options pricing

Here’s PeakStream’s demo code for Black-Scholes using a PRNG Monte Carlo method. (Yes, Black-Scholes is computed much more efficiently in closed form, without Monte Carlo; see Espen Haug’s entertaining page about Black-Scholes. However, this is still a good demo: it shows off the technology, and the results are easy to check.)

float MonteCarloAntithetic(float price,
                           float strike,
                           float vol,
                           float rate,
                           float div,
                           float T)
{
    // M (trials) and N (time steps) are constants defined elsewhere
    float deltat        = T / N;
    float muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU       = 0.0f;

    Arrayf32 meanSP; // result
    {
        RNGf32 rng_hndl(RNG_DEFAULT, 0);

        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++)
            U += rng_normal_make(rng_hndl, M);

        Arrayf32 values;
        {
            Arrayf32 lnS1 = log(price) + N * muDeltat + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }
    meanCPU = meanSP.read_scalar();

    return meanCPU; 
}

Let’s use Espen Haug’s parameter settings.

  • stock price is $60
  • option strike price is $65
  • volatility is 30%
  • risk-free interest rate is 8% (ridiculous today!)
  • dividend is 0%
  • time to option maturity is 3 months (.25 years)

With M=1000000 and N=100 (one million trials and one hundred time steps), the calculated price is $2.13477 for the PeakStream demo code running on Chai. Espen Haug’s C++ code returns $2.13337. The results agree.

What does the generated OpenCL kernel look like? The entire thing is quite large, as the JIT dynamically generates part of the Random123 PRNG library, so I’ll leave that part out. The interesting part is short enough to show.

//...note, leaving out the Random123 library for brevity
//...philox_normal_4_4() is a generated stub into Random123
inline float4 philox_normal_4_4(uint32_t tid, uint32_t idx) __attribute__((always_inline));
inline float4 philox_normal_4_4(uint32_t tid, uint32_t idx) {
  philox4x32_key_t rngKey = {{ tid, 0xdecafbad }};
  philox4x32_ctr_t rngCtr = {{ idx, 0xf00dcafe, 0xdeadbeef, 0xbeeff00d }};
union { philox4x32_ctr_t c; int4 ii; float4 fp; } u;
u.c = philox4x32(rngCtr, rngKey);
u.fp = convert_float4(u.ii) * (float4)(2.328306436538696289063E-10) + (float4)(0.5);
return (float4)(sqrt(-2 * log(u.fp.s0)) * sin(2 * 3.14159 * u.fp.s1), sqrt(-2 * log(u.fp.s0)) * cos(2 * 3.14159 * u.fp.s1), sqrt(-2 * log(u.fp.s2)) * sin(2 * 3.14159 * u.fp.s3), sqrt(-2 * log(u.fp.s2)) * cos(2 * 3.14159 * u.fp.s3));
}
__kernel void f_14843374243427385703_1(
  __global float4 * a0,
  const float a1,
  const double a2,
  const float a3,
  const double a4,
  const double a5,
  const float a6,
  const float a7,
  const double a8)
{
  float4 r0;
  float4 r1;
  r1= (float4)(0);
  for (int i0 = 0; i0 < 250000; i0++)
  {
    r0= (float4)(0);
    for (int i1 = 0; i1 < 100; i1++)
    {
      r0= (r0 + philox_normal_4_4((i0), 0 + i1)) ;
    }
    r1+= ((a8 * (max((float4)(0), (exp((float4)((a2 + (a1 * r0)))) - a7)) + max((float4)(0), (exp((float4)((a4 + (a3 * -((float4)(r0)))))) - a6)))) * a5) ;
  }
  a0[0] = r1 / 1000000 ;
}

I haven’t reformatted the code. The above is exactly what the JIT generates. Note that while machine-generated, it is still somewhat human-readable. This is important: I rely on being able to read the code Chai’s JIT generates in order to develop it.

Also note that the code is efficient, perhaps as efficient as if hand-written. Loops are rolled and fused. All accumulation is done in registers. (Sure, the JIT isn’t that smart, and there are lots of ugly corners inside. But this shows what it can do at its best.)

OK, this is nice. But it is still not a realistic example: it uses only one thread! Performance will be terrible on a GPU.

More significant is that a quantitative strategy must rely on pricing millions of options, not just one. And each option must have different parameters to explore the pricing structure. (Full disclosure: I have never worked as a quant in the financial industry. I am just talking here.)

Here’s a more realistic example.

vector< double > MonteCarloAntithetic(float price,
                                      float strike,
                                      vector< float >& volVec,
                                      vector< double >& rateVec,
                                      float div,
                                      float T)
{
    const size_t numExperiments = volVec.size() > rateVec.size()
                                      ? volVec.size()
                                      : rateVec.size();

    float deltat = T / N;

    Arrayf64 meanSP; // result
    {
        Arrayf32 vol           = make1(volVec);
        Arrayf64 rate          = make1(rateVec);
        Arrayf32 muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
        Arrayf32 volSqrtDeltat = vol * sqrt(deltat);

        RNGf64 rng_hndl(RNG_DEFAULT, 0);

        Arrayf64 U = zeros_f64(M);
        for (int i = 0; i < N; i++)
            U += rng_normal_make(rng_hndl, M);

        Arrayf64 values;
        {
            Arrayf64 lnS1 = log(price) + double(N) * muDeltat + volSqrtDeltat * U;
            Arrayf64 lnS2 = log(price) + double(N) * muDeltat + volSqrtDeltat * (-U);
            Arrayf64 S1 = exp(lnS1);
            Arrayf64 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }

    return meanSP.read_scalar(numExperiments);
}

The PRNG has been increased to 64 bits to avoid losing accuracy (accumulating one million normally distributed random numbers in single precision loses significant precision). There’s a mixture of single and double precision too. This is more to show off; I don’t know if mixed precision runs faster (very possible on modern GPU hardware, which is tailored for single precision).

Note the volatility and rate are now vectors instead of scalars. For example:

float price = 60;
float strike = 65;

// combinations of (volatility, rate)
vector< float > volatility;
vector< double > rate;
for (size_t i = 0; i < 100; i++)
{
    const float vol = .3 + i * .01;
    const double r = .01 - i * .001;

    volatility.push_back( vol );
    rate.push_back( r );
}

float dividend = 0;
float T = 0.25;

vector< double > meanSP
    = MonteCarloAntithetic(price, strike, volatility, rate, dividend, T);

for (size_t i = 0; i < meanSP.size(); i++)
    cout << "meanSP[" << i << "] is " << meanSP[i] << endl;

This will calculate the option price for 100 combinations of (volatility, rate) on the GPU.

I do not understand how deep this goes. How effective is auto-tuning? How flexible is this array programming with PRNG approach? Can it model exotic kinds of options? Can it handle real world business rules and events? I do not have the answers to these questions. What I do have is the basic technology to begin finding out.

The investment banks and hedge funds of the world have been using this kind of technology for years. That’s why options pricing is one of the often cited killer applications for GPGPU. What is new with Chai is a high level managed language for doing this.

My experience as a support quant in a non-financial context is that business sees engineering as overhead. Markets keep changing, so what’s the point of deep investment in technology that will soon be obsolete? This is also why mixed-paradigm functional languages like OCaml can be popular among the SDE support quant set: they feel like a scripting language for mapping and folding, but with the speed of C. That’s what I would like for Chai, except targeting the GPU.

Example: OpenCL boilerplate

First, I wish to say that I really like OpenCL. I’m not trying to make it look bad. OpenCL is good.

OpenCL is just too low-level for directly writing applications. It’s like Xlib or Win32. These are platform APIs.

Here’s a simple example of OpenCL with all of the boilerplate. This is instructive as a complete and self-contained example. It’s not a code fragment.

#include <CL/cl.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

void exitOnFail(cl_int status, const char* message)
{
    if (CL_SUCCESS != status)
    {
        printf("error: %s\n", message);
        exit(-1);
    }
}

int main(int argc, char *argv[])
{
    // return code used by OpenCL API
    cl_int status;

    // wait event synchronization handle used by OpenCL API
    cl_event event;

    ////////////////////////////////////////
    // OpenCL platforms

    // determine number of platforms
    cl_uint numPlatforms;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    exitOnFail(status, "number of platforms");

    // get platform IDs
    cl_platform_id platformIDs[numPlatforms];
    status = clGetPlatformIDs(numPlatforms, platformIDs, NULL);
    exitOnFail(status, "get platform IDs");

    ////////////////////////////////////////
    // OpenCL devices

    // look for a CPU and GPU compute device
    cl_platform_id cpuPlatformID, gpuPlatformID;
    cl_device_id cpuDeviceID, gpuDeviceID;
    int isCPU = 0, isGPU = 0;

    // iterate over platforms
    for (size_t i = 0; i < numPlatforms; i++)
    {
        // determine number of devices for a platform
        cl_uint numDevices;
        status = clGetDeviceIDs(platformIDs[i],
                                CL_DEVICE_TYPE_ALL,
                                0,
                                NULL,
                                &numDevices);
        if (CL_SUCCESS == status)
        {
            // get device IDs for a platform
            cl_device_id deviceIDs[numDevices];
            status = clGetDeviceIDs(platformIDs[i],
                                    CL_DEVICE_TYPE_ALL,
                                    numDevices,
                                    deviceIDs,
                                    NULL);
            if (CL_SUCCESS == status)
            {
                // iterate over devices
                for (size_t j = 0; j < numDevices; j++)
                {
                    cl_device_type deviceType;
                    status = clGetDeviceInfo(deviceIDs[j],
                                             CL_DEVICE_TYPE,
                                             sizeof(cl_device_type),
                                             &deviceType,
                                             NULL);
                    if (CL_SUCCESS == status)
                    {
                        // first CPU device
                        if (!isCPU && (CL_DEVICE_TYPE_CPU & deviceType))
                        {
                            isCPU = 1;
                            cpuPlatformID = platformIDs[i];
                            cpuDeviceID = deviceIDs[j];
                        }

                        // first GPU device
                        if (!isGPU && (CL_DEVICE_TYPE_GPU & deviceType))
                        {
                            isGPU = 1;
                            gpuPlatformID = platformIDs[i];
                            gpuDeviceID = deviceIDs[j];
                        }
                    }
                }
            }
        }
    }

    // pick GPU device if it exists, otherwise use CPU
    cl_platform_id platformID;
    cl_device_id deviceID;
    if (isGPU)
    {
        platformID = gpuPlatformID;
        deviceID = gpuDeviceID;
    }
    else if (isCPU)
    {
        platformID = cpuPlatformID;
        deviceID = cpuDeviceID;
    }
    else
    {
        // no devices found
        exitOnFail(CL_DEVICE_NOT_FOUND, "no devices found");
    }

    ////////////////////////////////////////
    // OpenCL context

    cl_context_properties props[] = { CL_CONTEXT_PLATFORM,
                                      (cl_context_properties) platformID,
                                      0 };
    cl_context context = clCreateContext(props,
                                         1,
                                         &deviceID,
                                         NULL,
                                         NULL,
                                         &status);
    exitOnFail(status, "create context");

    ////////////////////////////////////////
    // OpenCL command queue

    cl_command_queue queue = clCreateCommandQueue(context,
                                                  deviceID,
                                                  0,
                                                  &status);
    exitOnFail(status, "create command queue");

    ////////////////////////////////////////
    // OpenCL buffers

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    // second argument: memory buffer object for X
    cl_mem memX = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                 N * sizeof(float),
                                 cpuX,
                                 &status);
    exitOnFail(status, "create buffer for X");

    // third argument: memory buffer object for Y
    cl_mem memY = clCreateBuffer(context,
                                 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                 N * sizeof(float),
                                 cpuY,
                                 &status);
    exitOnFail(status, "create buffer for Y");

    ////////////////////////////////////////
    // OpenCL move buffer data to device

    // data transfer for array X
    status = clEnqueueWriteBuffer(queue,
                                  memX,
                                  CL_FALSE,
                                  0,
                                  N * sizeof(float),
                                  cpuX,
                                  0,
                                  NULL,
                                  &event);
    exitOnFail(status, "write X to device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for write X to device");
    clReleaseEvent(event);

    // data transfer for array Y
    status = clEnqueueWriteBuffer(queue,
                                  memY,
                                  CL_FALSE,
                                  0,
                                  N * sizeof(float),
                                  cpuY,
                                  0,
                                  NULL,
                                  &event);
    exitOnFail(status, "write Y to device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for write Y to device");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL program and kernel

    // saxpy: Y = alpha * X + Y
    const char *kernelSrc[] = {
        "__kernel void saxpy(const float alpha,",
        "                    __global const float* X,",
        "                    __global float* Y)",
        "{",
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];",
        "}" };

    // a program can have multiple kernels
    cl_program program = clCreateProgramWithSource(
                             context,
                             sizeof(kernelSrc)/sizeof(const char*),
                             kernelSrc,
                             NULL,
                             &status);
    exitOnFail(status, "create program");

    // compile the program
    status = clBuildProgram(program, 1, &deviceID, "", NULL, NULL);
    exitOnFail(status, "build program");

    // one kernel from the program
    cl_kernel kernel = clCreateKernel(program, "saxpy", &status);
    exitOnFail(status, "create kernel");

    ////////////////////////////////////////
    // OpenCL kernel arguments

    // first argument: a scalar float
    float alpha = 1.5f;

    // set first argument
    status = clSetKernelArg(kernel, 0, sizeof(float), &alpha);
    exitOnFail(status, "set kernel argument alpha");

    // set second argument
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &memX);
    exitOnFail(status, "set kernel argument X");

    // set third argument
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &memY);
    exitOnFail(status, "set kernel argument Y");

    ////////////////////////////////////////
    // OpenCL enqueue kernel and wait

    // N work-items in groups of 4
    const size_t groupsize = 4;
    const size_t global[] = { N }, local[] = { groupsize };

    // enqueue kernel
    status = clEnqueueNDRangeKernel(queue,
                                    kernel,
                                    sizeof(global)/sizeof(size_t),
                                    NULL,
                                    global,
                                    local,
                                    0,
                                    NULL,
                                    &event);
    exitOnFail(status, "enqueue kernel");

    // wait for kernel, this forces execution
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for enqueue kernel");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL read back buffer from device

    // data transfer for array Y
    status = clEnqueueReadBuffer(queue,
                                 memY,
                                 CL_FALSE,
                                 0,
                                 N * sizeof(float),
                                 cpuY,
                                 0,
                                 NULL,
                                 &event);
    exitOnFail(status, "read Y from device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for read Y from device");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL cleanup

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(memX);
    clReleaseMemObject(memY);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        printf("Y[%u] is %f\n", (unsigned)i, (double)cpuY[i]);
    }

    exit(0);
}

Here is the same example with Chai. Comparing the OpenCL and Chai code shows how much boilerplate and runtime management is done by the virtual machine and JIT.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    ////////////////////////////////////////
    // managed code

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    // saxpy: Y = alpha * X + Y
    Y += alpha * X;

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Here is the same example with inlined OpenCL. The Chai virtual machine still manages everything. However, the JIT is bypassed with a hardcoded OpenCL kernel.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).noInterpret().initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    ////////////////////////////////////////
    // managed code

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    // saxpy: Y = alpha * X + Y
    ProgramCL inlineCL(
        "__kernel void saxpy(const float alpha,"
        "                    __global const float* X,"
        "                    __global float* Y)"
        "{"
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];"
        "}" );

    // inlined OpenCL kernel
    const size_t groupsize = 4;
    (inlineCL, "saxpy", alpha, X, Y)(N, groupsize);

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Finally, here is the same example with interoperation between managed Chai code and unmanaged OpenCL. This makes it possible to use Chai with third party kernel libraries.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <CL/cl.h>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

void exitOnFail(cl_int status, const char* message)
{
    if (CL_SUCCESS != status)
    {
        cout << "error: " << message << endl;
        exit(-1);
    }
}

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).noInterpret().initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    /////////////////////////////////////////
    // managed buffers

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    ////////////////////////////////////////
    // unmanaged OpenCL

    // call before unmanaged OpenCL
    ForceSchedule();

    // saxpy: Y = alpha * X + Y
    const char *kernelSrc[] = {
        "__kernel void saxpy(const float alpha,",
        "                    __global const float* X,",
        "                    __global float* Y)",
        "{",
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];",
        "}" };

    cl_int status;

    // a program can have multiple kernels
    cl_program program = clCreateProgramWithSource(
                             GetContext(), // from VM
                             sizeof(kernelSrc)/sizeof(const char*),
                             kernelSrc,
                             NULL,
                             &status);
    exitOnFail(status, "create program");

    // compile the program
    cl_device_id deviceID = GetDevice(); // from VM
    status = clBuildProgram(program, 1, &deviceID, "", NULL, NULL);
    exitOnFail(status, "build program");

    // one kernel from the program
    cl_kernel kernel = clCreateKernel(program, "saxpy", &status);
    exitOnFail(status, "create kernel");

    // set first argument
    status = clSetKernelArg(kernel, 0, sizeof(float), &alpha);
    exitOnFail(status, "set kernel argument alpha");

    // set second argument
    cl_mem memX = X; // from VM
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &memX);
    exitOnFail(status, "set kernel argument X");

    // set third argument
    cl_mem memY = Y; // from VM
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &memY);
    exitOnFail(status, "set kernel argument Y");

    // N work-items in groups of 4
    const size_t groupsize = 4;
    const size_t global[] = { N }, local[] = { groupsize };

    cl_event event;

    // enqueue kernel
    status = clEnqueueNDRangeKernel(GetCommandQueue(), // from VM
                                    kernel,
                                    sizeof(global)/sizeof(size_t),
                                    NULL,
                                    global,
                                    local,
                                    0,
                                    NULL,
                                    &event);
    exitOnFail(status, "enqueue kernel");

    // wait for kernel, this forces execution
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for enqueue kernel");

    clReleaseEvent(event);
    clReleaseKernel(kernel);
    clReleaseProgram(program);

    /////////////////////////////////////////
    // managed read back

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Make it easier

Comments from my AFDS 12 breakout session survey:

  • The description indicates that this is something I could do, while really I can’t.
  • usage of Chai not entirely clear by code examples. Everything constantly encoded in OpenCL? Mix with C? Works for really complex programs?

I agree and am sympathetic, even though I wrote Chai. It must be easier to understand and use.

It may be that making technology useful (which means people actually use it) is much harder than solving the engineering problems.

I ran into a coworker from five years ago at AFDS 12. He works on EEG signal processing software in C++ on multi-core CPUs. Customers want more performance so naturally GPGPU is interesting.

However, the investment in GPGPU is too large for the development risk and performance return. Unlike OpenMP and threads, acceleration requires completely rewriting code. Worse, the principles of GPU programming are unlike all previous experience.

Multi-threaded and vectorizing-compiler-friendly programming may be difficult, but at least you can reuse your old code in the programming languages you already know.

After looking at primitive restart, I decided not to support OpenGL buffer sharing. It’s not that hard to do. It’s also not something most developers care about. So it is time to move on. Chai is feature complete. Work is now: documentation; testing; optimization.

Chai and Lua have similar roles

Interoperability with OpenCL works. This allows freely mixing Chai managed code with third party OpenCL libraries, even if closed source (e.g. AMD’s OpenCL BLAS). Here’s an example.

// regular OpenCL C API code
void directOpenCL(cl_mem argA, float argS, const size_t indexN)
{
    const char *src[1];
    src[0] = "__kernel void add_scalar(__global float* a, float s)"
             "{"
             "  a[get_global_id(0)] += s;"
             "}";

    cl_int status;

    const cl_program prog = clCreateProgramWithSource(
                                GetContext(), // cl_context from Chai VM
                                1,
                                src,
                                NULL,
                                &status );

    cl_device_id device = GetDevice(); // cl_device_id from Chai VM

    clBuildProgram(prog, 1, &device, "", NULL, NULL);

    const cl_kernel krnl = clCreateKernel(prog, "add_scalar", &status);

    clSetKernelArg(krnl, 0, sizeof(cl_mem), &argA);
    clSetKernelArg(krnl, 1, sizeof(float), &argS);

    size_t global[1], local[1];
    global[0] = indexN;
    local[0] = 1;

    cl_event event;

    clEnqueueNDRangeKernel( GetCommandQueue(), // cl_command_queue from Chai VM
                            krnl,
                            1,
                            NULL,
                            global,
                            local,
                            NumEvents(), // event list size from Chai VM
                            GetEventList(), // cl_event* from Chai VM
                            &event );

    clWaitForEvents(1, &event);

    clReleaseEvent(event);
    clReleaseKernel(krnl);
    clReleaseProgram(prog);
}

// Chai managed code
Arrayf32 C;
{
    Arrayf32 A = make1(N, cpuA);
    Arrayf32 B = make1(N, cpuB);
    C = A + B;

    ForceSchedule(); // schedule to compute device without read back
    directOpenCL(C, 1.5f, N);

    C += A - 3.3f;
}
C.read1(cpuC, N * sizeof(float));

directOpenCL(C, 1.5f, N);

C -= 5;
C.read1(cpuC, N * sizeof(float));

The next and possibly last major feature is OpenGL integration (i.e. buffer sharing for primitive restart). Then I will spend the next three to four months working on documentation and quality (which includes optimizing generated code from the JIT).

It’s become clear to me that Chai is really like Lua – a lightweight embeddable managed language. As Lua is used for dynamic scripting, Chai could be used for game physics and signal processing.

Five years ago, I made this autonomous robot. It used an ARM9 single-board computer running Lua as the control loop. The computer vision and motor control were done in C. However, just like in a computer game, being able to change high-level system configuration without recompiling (in this case, cross-compiling) made life much easier.

Here’s another robot from five years ago. It was the winner of the 2007 Robothon in Seattle. It relied on stereo cameras for terminal guidance to the orange cones that marked waypoints. The onboard computer ran interpreted MATLAB.

Today, that onboard computer would be an embedded SoC design capable of GPGPU. Signal processing and control requires crunching numbers on the GPU. Lua makes sense as a dynamic scripting language for the CPU. Chai makes sense as an array programming language for the GPU.

clMAGMA and AMD BLAS thoughts

At this time, I’ve decided it’s not worth integrating clMAGMA and the AMD OpenCL BLAS into the JIT back-end of Chai. It’s a lot of work for questionable gain. I’m not even sure it is the right thing to do.

Exposing the OpenCL queues managed by the Chai virtual machine is much simpler. This may be what developers really want. Then they can add clMAGMA to Chai or vice-versa as they wish. There’s no need to commit to either – give the end-user more freedom and do no harm.

It is not AMD’s or UTK’s fault – but clMAGMA performance will be poor unless matrix sizes are large (dimension in the thousands). LAPACK and BLAS were designed for vector processors and CPUs, not discrete GPUs with large overheads for enqueued kernels and I/O operations. There is no support for kernel fusion (batching).

This is why Chai natively supports batching over vectors of array data. A vector of arrays is tiled together as a single array. Multiple data transfers and kernels become singles, reducing overheads. For problems with relatively small matrices (dimension in the hundreds) and high arithmetic intensity (like dense matrix multiplication), the effect of this optimization is significant.

I have enough experience that I could fork clMAGMA (yes, I realize this is a moving target) and implement my own auto-tuning GPGPU BLAS to support it, portable between AMD and Nvidia. This fork would support tiled data and kernel fusion. To do it right, I would also need a porting/test lab, and I would extend auto-tuning to clMAGMA itself (which has statically configured tuning parameters). From working on GATLAS and now Chai, I have a very good idea of the level of effort required. It’s not rocket science, just a few months of full-time work.

Anyway, I can’t afford to do it.

Alpha 5 release: anticipating embedded configuration

The alpha 5 code is committed to GitHub and, as usual, there is a small download without the git metadata.

I spent about a week working on JIT changes to support more aggressive private register use. None of that made it into this code commit (in fact, I will throw all that away and do it differently). This drop has no functional changes. It is purely refactoring of source code.

One obvious question is the market segment for Chai:

  1. Server clusters, maybe in the cloud
  2. Mobile platforms like smart phones and tablets

After my AFDS12 technical presentation, someone asked me this question. My answer was that I had not decided. It was a lame answer but also true.

Chai was designed to mimic PeakStream from 2007 – a vision of desktop HPC using big discrete GPUs.

The world has changed. The middle of the market – servers, workstations, and PCs – is in decline. Growth is driven by smaller devices using services in big clouds. Society seeks cheaper solutions.

I haven’t been blind to this. With all of the effort to just make Chai work, I was too occupied to think about the big picture. Now that the basic engineering problems are solved (well enough for an alpha prototype), I am starting to look around at the situation.

Business thinks about total cost in terms of risk over a time horizon. The risks in any new platform are large. This is why software languages and platforms are usually given away for free. To offset the risk, the price must be zero.

This is also why business usually chooses conservative solutions. Throwing hardware at a problem or using the less efficient platform everyone knows may be inefficient and expensive – but the risk of surprises is much lower.

PeakStream deliberately chose to be a language inside C++ to reduce risk. Chai made the same choice. However, even with its performance advantage, C++ has become an embedded language. It is used when constraints on performance, memory, and power efficiency prevent using anything else.

That’s why I believe it is necessary to go farther and change the vision. The world has changed.

Here’s my argument. There are three ways GPGPU could go.

  1. (Status quo) Nothing changes as technology is mature and good enough.
  2. (STI Cell, Larrabee) Multi-core CPUs become many-core. GPGPU becomes irrelevant.
  3. (SoC CPU+GPU, APU) Heterogeneous balance of multi-core CPU with integrated GPU.

If the future is 1, then Chai will go nowhere. What happened to PeakStream and RapidMind? They were acquired and disappeared. The market is so small that no standard platform has appeared to address this need since then.

If the future is 2, then Chai will go nowhere. Why do all this rocket science to program GPUs? However, history has already voted: a many-core CPU is not necessarily easier to program than a GPU and is less efficient by design. The many-core processor is too expensive.

If the future is 3, then Chai may be useful. The GPU becomes the data parallel math coprocessor. Part of this is already happening as SoC processors with integrated GPUs are now standard. However, these are still graphics oriented (i.e. no double precision support).

So I am not claiming that future 3 will happen (although it is AMD’s existential bet with HSA). Rather, it is the only future that has Chai in it.

One last thing. Roy Spliet sent me this. It embarrassed me when I first watched it. During the DDP flashmob through the Hyatt, a camera crew stopped me for an interview. I wondered what happened to that footage.

Another alpha code drop: inline OpenCL with managed code

Alpha 4 is on GitHub. As before, there is a tarball download without the git metadata.

I changed my mind about rewriting the JIT. It all seemed so simple while in Washington state at AFDS 2012. Delusions of fun. Back in real life, I see the engineering tradeoffs involved. A rewrite is not free and likely offers little return for the end user.

So far, my hacking approach has worked. I shouldn’t alter course now just because it doesn’t please some abstract notion of purity.

There are some new viewpoints that have survived real life scrutiny.

  1. This project is not that big. I sensed this the day before the conference while at a barbecue. Everything I am doing is not that complicated. Worse, the discounted value and complexity of the PeakStream approach are much less after five years. What was speculative future value in 2007 is now our concrete present in 2012, and it is much less than imagined, I think.
  2. Adaptation and flexibility are more important than performance. There are many new programming languages and platforms with complex interrelationships. It’s risky to bet on anything, to make any kind of deep investment because the technology is changing so quickly.
  3. I need to work on more projects. Growth is about leverage in the future to sustain the present. This is why people like _why or Audrey Tang appear who work on many things. That is also about concurrency as a strategy to increase utilization and productivity. It’s really a corollary of number 2.

I have been toying with the idea of cloning the Google Prediction API. This would be named the Open Prediction API. There does seem to be a dearth of public technology for ML, commercial or free. It’s all done behind closed doors.

The Google Prediction API was really about a marketplace for ML filters, kind of like Google’s answer to Amazon Mechanical Turk except with machines instead of Turkers. In both cases, it does not appear to have worked. The marketplace became filled with lemons or was ignored.

After the fun, back to work

After the conference, I am staying with friends in Redmond for a few days.

They had to go to work. So I am just working in the living room.

I heard a series of musical notes behind me. It was the Roomba waking up. I had to erect a barrier to keep the robot away. It is persistent.

What I noticed is the combination of randomness with behavioural templates. The Roomba uses an ensemble of strategies to solve problems. This gave me some ideas for compiler optimization.

I’ve decided to rewrite the entire front-end and JIT compiler. Everything will change. It will only take a few days as the amount of code will be much less. I think I had to write it first in an ugly way to understand how to do it in a beautiful way. This may turn out to be important not only for correctness and performance. If the code is something anyone can understand and extend, then this project may go farther.

I won’t start on the rewrite until I get back to San Francisco. Today, I’m going to play with the Simon Funk (Brandyn Webb) stochastic gradient descent version of SVD (singular value decomposition). It works by minimizing the Frobenius norm of the reconstruction error over the observed entries. I also want to implement a small K-means for a collaborative filter. One of my hosts is a business guy and was talking about a recommender system. So the natural thing to do is try conventional approaches: factor and clustering filters.

AFDS 2012 Third Day

AMD surpassed the insanity of the Dot Com Bubble today. It was awesome.

Unfortunately, I lost my keynote photo of David Perry, CEO of Gaikai. Google, you should fear this man. He will disrupt your business model and make all your base belong to us.

My technical session went well.

The very last session of the day in the same room was standing room only. It was hot in there. I ended up sitting on the floor.

The evening parties were tremendous. I would have more photos but filled up my phone and lost the early images. My gods.

Dance floor next to the bowling lanes:

Start of the Distributed Dance Party outside the Bellevue Hyatt:

Flashmob on the second floor:

Pyrotechnics set off inside the Hyatt summoned police and fire units:

Skybridge to the mall:

Outside Microsoft offices in the mall:

Mall loading dock:

Me:

Reached the park:

The police were incredibly polite. They only asked that people stop dancing in the water sculpture. There was no permit. And no one complained. So a very nice officer with a shaved head explained they were there to make sure no one got hurt. Seriously, that is exactly right. You don’t want insanity to take hold of a crowd.

I asked someone about Nvidia’s GTC conference last month. They had jugglers.

AMD, you rock.

AFDS 2012 Second Day

Meydenbauer convention center lobby:

Meydenbauer elevators:

Keynote room in the Meydenbauer:

Pounding electronic dance music before the keynote to raise the blood. The introduction felt like being in an action movie that is really a music video. I sat up front with the bloggers:

Microsoft building across the street from the Meydenbauer. I used to walk past it without thinking. Maybe sometime soon, Google will have buildings like this:

Small breakout session room; the stage lights completely blind the speakers. However, someone else told me that you forget about them after the first minute:

Better food for speakers, almost Google quality:

I just ate two of these things:

Linear actuators on the corners of the chair provide motion feedback while driving. It’s really hard, as the drive-by-wire wheel oversteers:
