there are so many cores

Example: OpenCL boilerplate

First, I wish to say that I really like OpenCL. I’m not trying to make it look bad. OpenCL is good.

OpenCL is just too low-level for directly writing applications. It’s like Xlib or Win32. These are platform APIs.

Here’s a simple example of OpenCL with all of the boilerplate. It is instructive as a complete, self-contained program rather than a code fragment.

#include <CL/cl.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

void exitOnFail(cl_int status, const char* message)
{
    if (CL_SUCCESS != status)
    {
        printf("error: %s\n", message);
        exit(-1);
    }
}

int main(int argc, char *argv[])
{
    // return code used by OpenCL API
    cl_int status;

    // wait event synchronization handle used by OpenCL API
    cl_event event;

    ////////////////////////////////////////
    // OpenCL platforms

    // determine number of platforms
    cl_uint numPlatforms;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    exitOnFail(status, "number of platforms");

    // get platform IDs
    cl_platform_id platformIDs[numPlatforms];
    status = clGetPlatformIDs(numPlatforms, platformIDs, NULL);
    exitOnFail(status, "get platform IDs");

    ////////////////////////////////////////
    // OpenCL devices

    // look for a CPU and GPU compute device
    cl_platform_id cpuPlatformID, gpuPlatformID;
    cl_device_id cpuDeviceID, gpuDeviceID;
    int isCPU = 0, isGPU = 0;

    // iterate over platforms
    for (size_t i = 0; i < numPlatforms; i++)
    {
        // determine number of devices for a platform
        cl_uint numDevices;
        status = clGetDeviceIDs(platformIDs[i],
                                CL_DEVICE_TYPE_ALL,
                                0,
                                NULL,
                                &numDevices);
        if (CL_SUCCESS == status)
        {
            // get device IDs for a platform
            cl_device_id deviceIDs[numDevices];
            status = clGetDeviceIDs(platformIDs[i],
                                    CL_DEVICE_TYPE_ALL,
                                    numDevices,
                                    deviceIDs,
                                    NULL);
            if (CL_SUCCESS == status)
            {
                // iterate over devices
                for (size_t j = 0; j < numDevices; j++)
                {
                    cl_device_type deviceType;
                    status = clGetDeviceInfo(deviceIDs[j],
                                             CL_DEVICE_TYPE,
                                             sizeof(cl_device_type),
                                             &deviceType,
                                             NULL);
                    if (CL_SUCCESS == status)
                    {
                        // first CPU device
                        if (!isCPU && (CL_DEVICE_TYPE_CPU & deviceType))
                        {
                            isCPU = 1;
                            cpuPlatformID = platformIDs[i];
                            cpuDeviceID = deviceIDs[j];
                        }

                        // first GPU device
                        if (!isGPU && (CL_DEVICE_TYPE_GPU & deviceType))
                        {
                            isGPU = 1;
                            gpuPlatformID = platformIDs[i];
                            gpuDeviceID = deviceIDs[j];
                        }
                    }
                }
            }
        }
    }

    // pick GPU device if it exists, otherwise use CPU
    cl_platform_id platformID;
    cl_device_id deviceID;
    if (isGPU)
    {
        platformID = gpuPlatformID;
        deviceID = gpuDeviceID;
    }
    else if (isCPU)
    {
        platformID = cpuPlatformID;
        deviceID = cpuDeviceID;
    }
    else
    {
        // no devices found
        exitOnFail(CL_DEVICE_NOT_FOUND, "no devices found");
    }

    ////////////////////////////////////////
    // OpenCL context

    cl_context_properties props[] = { CL_CONTEXT_PLATFORM,
                                      (cl_context_properties) platformID,
                                      0 };
    cl_context context = clCreateContext(props,
                                         1,
                                         &deviceID,
                                         NULL,
                                         NULL,
                                         &status);
    exitOnFail(status, "create context");

    ////////////////////////////////////////
    // OpenCL command queue

    cl_command_queue queue = clCreateCommandQueue(context,
                                                  deviceID,
                                                  0,
                                                  &status);
    exitOnFail(status, "create command queue");

    ////////////////////////////////////////
    // OpenCL buffers

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    // second argument: memory buffer object for X
    cl_mem memX = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                 N * sizeof(float),
                                 cpuX,
                                 &status);
    exitOnFail(status, "create buffer for X");

    // third argument: memory buffer object for Y
    cl_mem memY = clCreateBuffer(context,
                                 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                 N * sizeof(float),
                                 cpuY,
                                 &status);
    exitOnFail(status, "create buffer for Y");

    ////////////////////////////////////////
    // OpenCL move buffer data to device

    // data transfer for array X
    status = clEnqueueWriteBuffer(queue,
                                  memX,
                                  CL_FALSE,
                                  0,
                                  N * sizeof(float),
                                  cpuX,
                                  0,
                                  NULL,
                                  &event);
    exitOnFail(status, "write X to device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for write X to device");
    clReleaseEvent(event);

    // data transfer for array Y
    status = clEnqueueWriteBuffer(queue,
                                  memY,
                                  CL_FALSE,
                                  0,
                                  N * sizeof(float),
                                  cpuY,
                                  0,
                                  NULL,
                                  &event);
    exitOnFail(status, "write Y to device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for write Y to device");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL program and kernel

    // saxpy: Y = alpha * X + Y
    const char *kernelSrc[] = {
        "__kernel void saxpy(const float alpha,",
        "                    __global const float* X,",
        "                    __global float* Y)",
        "{",
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];",
        "}" };

    // a program can have multiple kernels
    cl_program program = clCreateProgramWithSource(
                             context,
                             sizeof(kernelSrc)/sizeof(const char*),
                             kernelSrc,
                             NULL,
                             &status);
    exitOnFail(status, "create program");

    // compile the program
    status = clBuildProgram(program, 1, &deviceID, "", NULL, NULL);
    exitOnFail(status, "build program");

    // one kernel from the program
    cl_kernel kernel = clCreateKernel(program, "saxpy", &status);
    exitOnFail(status, "create kernel");

    ////////////////////////////////////////
    // OpenCL kernel arguments

    // first argument: a scalar float
    float alpha = 1.5f;

    // set first argument
    status = clSetKernelArg(kernel, 0, sizeof(float), &alpha);
    exitOnFail(status, "set kernel argument alpha");

    // set second argument
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &memX);
    exitOnFail(status, "set kernel argument X");

    // set third argument
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &memY);
    exitOnFail(status, "set kernel argument Y");

    ////////////////////////////////////////
    // OpenCL enqueue kernel and wait

    // N work-items in groups of 4
    const size_t groupsize = 4;
    const size_t global[] = { N }, local[] = { groupsize };

    // enqueue kernel
    status = clEnqueueNDRangeKernel(queue,
                                    kernel,
                                    sizeof(global)/sizeof(size_t),
                                    NULL,
                                    global,
                                    local,
                                    0,
                                    NULL,
                                    &event);
    exitOnFail(status, "enqueue kernel");

    // wait for kernel, this forces execution
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for enqueue kernel");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL read back buffer from device

    // data transfer for array Y
    status = clEnqueueReadBuffer(queue,
                                 memY,
                                 CL_FALSE,
                                 0,
                                 N * sizeof(float),
                                 cpuY,
                                 0,
                                 NULL,
                                 &event);
    exitOnFail(status, "read Y from device");
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for read Y from device");
    clReleaseEvent(event);

    ////////////////////////////////////////
    // OpenCL cleanup

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(memX);
    clReleaseMemObject(memY);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        printf("Y[%u] is %f\n", (int)i, (double)cpuY[i]);
    }

    exit(0);
}

Here is the same example written with Chai. Comparing the OpenCL and Chai code shows how much boilerplate and runtime management is handled by the virtual machine and JIT.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    ////////////////////////////////////////
    // managed code

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    // saxpy: Y = alpha * X + Y
    Y += alpha * X;

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Here is the same example with inlined OpenCL. The Chai virtual machine still manages everything, but the JIT is bypassed with a hardcoded OpenCL kernel. In the launch expression below, the parenthesized list binds the kernel and its arguments, and the trailing call supplies the global work size and the work-group size.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).noInterpret().initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    ////////////////////////////////////////
    // managed code

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    // saxpy: Y = alpha * X + Y
    ProgramCL inlineCL(
        "__kernel void saxpy(const float alpha,"
        "                    __global const float* X,"
        "                    __global float* Y)"
        "{"
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];"
        "}" );

    // inlined OpenCL kernel
    const size_t groupsize = 4;
    (inlineCL, "saxpy", alpha, X, Y)(N, groupsize);

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Finally, here is the same example with interoperation between managed Chai code and unmanaged OpenCL. This makes it possible to use Chai with third-party kernel libraries.

#include <chai/chai.h>
#include <chai/ParseArgs.hpp>
#include <CL/cl.h>
#include <iostream>
#include <stdlib.h>

using namespace chai;
using namespace std;

void exitOnFail(cl_int status, const char* message)
{
    if (CL_SUCCESS != status)
    {
        cout << "error: " << message << endl;
        exit(-1);
    }
}

int main(int argc, char *argv[])
{
    /////////////////////////////////////
    // start virtual machine

    // start virtual machine, exit on error
    ParseArgs(argc, argv).noInterpret().initVM();

    ////////////////////////////////////////
    // data buffers and arguments

    // N x 1 row major array buffers
    size_t N = 20;
    float cpuX[N], cpuY[N];

    // initialize array data
    for (size_t i = 0; i < N; i++)
    {
        cpuX[i] = i;
        cpuY[i] = 1;
    }

    float alpha = 1.5f;

    /////////////////////////////////////////
    // managed buffers

    Arrayf32 X = make1(N, cpuX);
    Arrayf32 Y = make1(N, cpuY);

    ////////////////////////////////////////
    // unmanaged OpenCL

    // call before unmanaged OpenCL
    ForceSchedule();

    // saxpy: Y = alpha * X + Y
    const char *kernelSrc[] = {
        "__kernel void saxpy(const float alpha,",
        "                    __global const float* X,",
        "                    __global float* Y)",
        "{",
        "    Y[get_global_id(0)] += alpha * X[get_global_id(0)];",
        "}" };

    cl_int status;

    // a program can have multiple kernels
    cl_program program = clCreateProgramWithSource(
                             GetContext(), // from VM
                             sizeof(kernelSrc)/sizeof(const char*),
                             kernelSrc,
                             NULL,
                             &status);
    exitOnFail(status, "create program");

    // compile the program
    cl_device_id deviceID = GetDevice(); // from VM
    status = clBuildProgram(program, 1, &deviceID, "", NULL, NULL);
    exitOnFail(status, "build program");

    // one kernel from the program
    cl_kernel kernel = clCreateKernel(program, "saxpy", &status);
    exitOnFail(status, "create kernel");

    // set first argument
    status = clSetKernelArg(kernel, 0, sizeof(float), &alpha);
    exitOnFail(status, "set kernel argument alpha");

    // set second argument
    cl_mem memX = X; // from VM
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &memX);
    exitOnFail(status, "set kernel argument X");

    // set third argument
    cl_mem memY = Y; // from VM
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &memY);
    exitOnFail(status, "set kernel argument Y");

    // N work-items in groups of 4
    const size_t groupsize = 4;
    const size_t global[] = { N }, local[] = { groupsize };

    cl_event event;

    // enqueue kernel
    status = clEnqueueNDRangeKernel(GetCommandQueue(), // from VM
                                    kernel,
                                    sizeof(global)/sizeof(size_t),
                                    NULL,
                                    global,
                                    local,
                                    0,
                                    NULL,
                                    &event);
    exitOnFail(status, "enqueue kernel");

    // wait for kernel, this forces execution
    status = clWaitForEvents(1, &event);
    exitOnFail(status, "wait for enqueue kernel");

    clReleaseEvent(event);
    clReleaseKernel(kernel);
    clReleaseProgram(program);

    /////////////////////////////////////////
    // managed read back

    Y.read1(cpuY, N * sizeof(float));

    /////////////////////////////////////
    // stop virtual machine

    shutdown();

    ////////////////////////////////////////
    // print computed result

    for (size_t i = 0; i < N; i++)
    {
        cout << "Y[" << i << "] is " << cpuY[i] << endl;
    }

    exit(0);
}

Make it easier

Comments from my AFDS 12 breakout session survey:

  • The description indicates that this is something I could do, while really I can’t.
  • usage of Chai not entirely clear by code examples. Everything constantly encoded in OpenCL? Mix with C? Works for really complex programs?

I agree and am sympathetic, even though I wrote Chai. It must become easier to understand and use.

It may be that solving the engineering problems is much easier than making useful technology (useful meaning that people actually use it).

I ran into a coworker from five years ago at AFDS 12. He works on EEG signal-processing software in C++ on multi-core CPUs. Customers want more performance, so naturally GPGPU is interesting.

However, the investment in GPGPU is too large for the development risk and performance return. Unlike OpenMP and threads, GPU acceleration requires completely rewriting code. Worse, the principles of GPU programming are unlike all previous experience.

Programming for multiple threads and vectorizing compilers may be difficult, but at least you can re-use your old code in the programming languages you already know.
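
To make the contrast concrete, here is a minimal OpenMP sketch of the same saxpy computation (the function is my own illustration, mirroring the saxpy kernel from the boilerplate example above; compile with -fopenmp). The existing loop survives untouched and one pragma spreads it across CPU cores, whereas the OpenCL version above needed pages of setup.

#include <stddef.h>

// saxpy: Y = alpha * X + Y
// one pragma parallelizes the existing loop across CPU cores;
// no rewrite of the surrounding code is needed
void saxpy(size_t N, float alpha, const float *X, float *Y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)N; i++)
    {
        Y[i] += alpha * X[i];
    }
}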

After looking at primitive restart, I decided not to support OpenGL buffer sharing. It’s not that hard to do, but it’s also not something most developers care about, so it is time to move on. Chai is feature complete. The work now is documentation, testing, and optimization.

Chai and Lua have similar roles

Interoperability with OpenCL works. This allows freely mixing Chai managed code with third-party OpenCL libraries, even closed-source ones (e.g. AMD’s OpenCL BLAS). Here’s an example.

// regular OpenCL C API code
void directOpenCL(cl_mem argA, float argS, const size_t indexN)
{
    const char *src[1];
    src[0] = "__kernel void add_scalar(__global float* a, float s)"
             "{"
             "  a[get_global_id(0)] += s;"
             "}";

    cl_int status;

    const cl_program prog = clCreateProgramWithSource(
                                GetContext(), // cl_context from Chai VM
                                1,
                                src,
                                NULL,
                                &status );

    cl_device_id device = GetDevice(); // cl_device_id from Chai VM

    clBuildProgram(prog, 1, &device, "", NULL, NULL);

    const cl_kernel krnl = clCreateKernel(prog, "add_scalar", &status);

    clSetKernelArg(krnl, 0, sizeof(cl_mem), &argA);
    clSetKernelArg(krnl, 1, sizeof(float), &argS);

    size_t global[1], local[1];
    global[0] = indexN;
    local[0] = 1;

    cl_event event;

    clEnqueueNDRangeKernel( GetCommandQueue(), // cl_command_queue from Chai VM
                            krnl,
                            1,
                            NULL,
                            global,
                            local,
                            NumEvents(), // event list size from Chai VM
                            GetEventList(), // cl_event* from Chai VM
                            &event );

    clWaitForEvents(1, &event);

    clReleaseEvent(event);
    clReleaseKernel(krnl);
    clReleaseProgram(prog);
}

// Chai managed code
Arrayf32 C;
{
    Arrayf32 A = make1(N, cpuA);
    Arrayf32 B = make1(N, cpuB);
    C = A + B;

    ForceSchedule(); // schedule to compute device without read back
    directOpenCL(C, 1.5f, N);

    C += A - 3.3f;
}
C.read1(cpuC, N * sizeof(float));

directOpenCL(C, 1.5f, N);

C -= 5;
C.read1(cpuC, N * sizeof(float));

The next and possibly last major feature is OpenGL integration (i.e. buffer sharing for primitive restart). Then I will spend the next three to four months working on documentation and quality (which includes optimizing generated code from the JIT).

It’s become clear to me that Chai is really like Lua – a lightweight embeddable managed language. As Lua is used for dynamic scripting, Chai could be used for game physics and signal processing.

Five years ago, I made this autonomous robot. It used an ARM9 single-board computer running Lua as the control loop. The computer vision and motor control were done in C. However, just like in a computer game, being able to easily change the high-level system configuration without recompiling (in this case, cross-compiling) made life much easier.

Here’s another robot from five years ago. It was the winner of the 2007 Robothon in Seattle. It relied on stereo cameras for terminal guidance to the orange cones that marked waypoints. The onboard computer ran interpreted MATLAB.

Today, that onboard computer would be an embedded SoC design capable of GPGPU. Signal processing and control requires crunching numbers on the GPU. Lua makes sense as a dynamic scripting language for the CPU. Chai makes sense as an array programming language for the GPU.

clMAGMA and AMD BLAS thoughts

At this time, I’ve decided it’s not worth integrating clMAGMA and the AMD OpenCL BLAS into the JIT back-end of Chai. It’s a lot of work for questionable gain. I’m not even sure it is the right thing to do.

Exposing the OpenCL queues managed by the Chai virtual machine is much simpler. This may be what developers really want. Then they can add clMAGMA to Chai, or vice versa, as they wish. There’s no need to commit to either: give the end user more freedom and do no harm.

It is not AMD’s or UTK’s fault – but clMAGMA performance will be poor unless matrix sizes are large (dimension in the thousands). LAPACK and BLAS were designed for vector processors and CPUs, not discrete GPUs with large overheads for enqueued kernels and I/O operations. There is no support for kernel fusion (batching).

This is why Chai natively supports batching over vectors of array data. A vector of arrays is tiled together into a single array, so multiple data transfers and kernel launches collapse into one of each, reducing overhead. For problems with relatively small matrices (dimension in the hundreds) and high arithmetic intensity (like dense matrix multiplication), the effect of this optimization is significant.
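
To make the batching idea concrete, here is a minimal sketch in plain OpenCL. It is my own illustration, not Chai’s actual generated code, and it reuses the context, queue, saxpy kernel, and exitOnFail helper from the boilerplate example above; the batch count M is a hypothetical value. M saxpy problems of length N are tiled into single buffers of M * N floats, so one transfer and one launch replace M of each.

// hypothetical batch of M saxpy problems, each of length N,
// tiled into single X and Y buffers of M * N floats
enum { M = 8, N = 20 };
float cpuX[M * N], cpuY[M * N];
for (size_t i = 0; i < M * N; i++)
{
    cpuX[i] = i % N;
    cpuY[i] = 1;
}

cl_int status;

// one buffer and one transfer per operand instead of M of each
cl_mem memX = clCreateBuffer(context,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             M * N * sizeof(float),
                             cpuX,
                             &status);
exitOnFail(status, "create tiled X");
cl_mem memY = clCreateBuffer(context,
                             CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                             M * N * sizeof(float),
                             cpuY,
                             &status);
exitOnFail(status, "create tiled Y");

// the elementwise saxpy kernel is unchanged, only the index space grows
float alpha = 1.5f;
clSetKernelArg(kernel, 0, sizeof(float), &alpha);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &memX);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &memY);

// one launch over M * N work-items instead of M launches over N,
// paying the enqueue overhead once
const size_t global[] = { M * N }, local[] = { 4 };
status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                global, local, 0, NULL, NULL);
exitOnFail(status, "enqueue batched saxpy");

// blocking read back of all M results in one transfer
status = clEnqueueReadBuffer(queue, memY, CL_TRUE, 0,
                             M * N * sizeof(float), cpuY,
                             0, NULL, NULL);
exitOnFail(status, "read tiled Y");

clReleaseMemObject(memX);
clReleaseMemObject(memY);

For an elementwise kernel like saxpy, the fused code is trivially the same; the saving is paying the transfer and enqueue overheads once instead of M times. Applying the same tiling to small matrix kernels is what Chai’s native batching automates.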

I have enough experience that I could fork clMAGMA (yes, I realize it is a moving target) and implement my own auto-tuning GPGPU BLAS to support it, one that is also portable between AMD and Nvidia. This fork would support tiled data and kernel fusion. To do it right, I would also need a porting and test lab, and the auto-tuning would have to extend to clMAGMA itself (which has statically configured tuning parameters). From working on GATLAS and now Chai, I have a very good idea of the level of effort required. It’s not rocket science, just a few months of full-time work.

Anyway, I can’t afford to do it.

Alpha 5 release: anticipating embedded configuration

The alpha 5 code is committed to GitHub and, as usual, there is a small download without metadata.

I spent about a week working on JIT changes to support more aggressive private register use. None of that made it into this commit (in fact, I will throw it all away and do it differently). This drop has no functional changes; it is purely a refactoring of the source code.

One obvious question is the market segment for Chai:

  1. Server clusters, maybe in the cloud
  2. Mobile platforms like smart phones and tablets

After my AFDS 12 technical presentation, someone asked me this question. My answer was that I had not decided. It was a lame answer, but also true.

Chai was designed to mimic PeakStream from 2007 – a vision of desktop HPC using big discrete GPUs.

The world has changed. The middle of the market (servers, workstations, and PCs) is in decline. Growth is driven by smaller devices using services in big clouds. Society seeks cheaper solutions.

I haven’t been blind to this. With all of the effort to just make Chai work, I was too occupied to think about the big picture. Now that the basic engineering problems are solved (well enough for an alpha prototype), I am starting to look around at the situation.

Business thinks about total cost in terms of risk over a time horizon. The risks in any new platform are large. This is why software languages and platforms are usually given away for free. To offset the risk, the price must be zero.

This is also why business usually chooses conservative solutions. Throwing hardware at a problem or using the less efficient platform everyone knows may be inefficient and expensive – but the risk of surprises is much lower.

PeakStream deliberately chose to be a language inside C++ to reduce risk. Chai made the same choice. However, even with its performance advantage, C++ has become an embedded language. It is used when constraints on performance, memory, and power efficiency prevent using anything else.

That’s why I believe it is necessary to go farther and change the vision. The world has changed.

Here’s my argument. There are three ways GPGPU could go.

  1. (Status quo) Nothing changes as technology is mature and good enough.
  2. (STI Cell, Larrabee) Multi-core CPUs become many-core. GPGPU becomes irrelevant.
  3. (SoC CPU+GPU, APU) Heterogeneous balance of multi-core CPU with integrated GPU.

If the future is 1, then Chai will go nowhere. What happened to PeakStream and RapidMind? They were acquired and disappeared. The market is so small that no standard platform has appeared to address this need since then.

If the future is 2, then Chai will go nowhere. Why do all this rocket science to program GPUs? But again, history has voted: the many-core processor is not necessarily easier to program than a GPU, it is less efficient by design, and it is too expensive.

If the future is 3, then Chai may be useful. The GPU becomes the data-parallel math coprocessor. Part of this is already happening, as SoC processors with integrated GPUs are now standard. However, these are still graphics-oriented (e.g. no double precision support).

Future 3 may not be the one that happens (although it is AMD’s existential bet with HSA). Rather, it is the only future that has Chai in it.

One last thing. Roy Spliet sent me this. It embarrassed me when I first watched it. During the DDP flashmob through the Hyatt, a camera crew stopped me for an interview. I wondered what happened to that footage.