there are so many cores


Monthly Archives: June 2012

Another alpha code drop: inline OpenCL with managed code

Alpha 4 is on GitHub. As before, there is a tarball download without the git metadata.

I changed my mind about rewriting the JIT. It all seemed so simple while in Washington state at AFDS 2012. Delusions of fun. Back in real life, I see the engineering tradeoffs involved. A rewrite is not free and likely offers little return for the end user.

So far, my hacking approach has worked. I shouldn’t alter course now just because it doesn’t please some abstract notion of purity.

There are some new viewpoints that have survived real life scrutiny.

  1. This project is not that big. I sensed this the day before the conference while at a barbecue. Everything I am doing is not that complicated. Worse, the discounted value and complexity of the PeakStream approach are much less after five years. What was speculative future value in 2007 is now our concrete present in 2012 and much less than imagined, I think.
  2. Adaptation and flexibility are more important than performance. There are many new programming languages and platforms with complex interrelationships. It’s risky to bet on anything, to make any kind of deep investment because the technology is changing so quickly.
  3. I need to work on more projects. Growth is about leverage in the future to sustain the present. This is why people like _why or Audrey Tang, who work on many things, appear. Working on many things is also concurrency as a strategy to increase utilization and productivity. It’s really a corollary of number 2.

I have been toying with the idea of cloning the Google Prediction API. This would be named the Open Prediction API. There does seem to be a dearth of public technology for ML, commercial or free. It’s all done behind closed doors.

The Google Prediction API was really about a marketplace for ML filters, kind of like Google’s answer to Amazon Mechanical Turk except with machines instead of Turkers. In both cases, it does not appear to have worked. The marketplace became filled with lemons or was ignored.


After the fun, back to work

After the conference, I am staying with friends in Redmond for a few days.

They had to go to work. So I am just working in the living room.

I heard a series of musical notes behind me. It was the Roomba waking up. I had to erect a barrier to keep the robot away. It is persistent.

What I noticed is the combination of randomness with behavioural templates. The Roomba uses an ensemble of strategies to solve problems. This gave me some ideas for compiler optimization.

I’ve decided to rewrite the entire front-end and JIT compiler. Everything will change. It will only take a few days as the amount of code will be much less. I think I had to write it first in an ugly way to understand how to do it in a beautiful way. This may turn out to be important not only for correctness and performance. If the code is something anyone can understand and extend, then this project may go farther.

I won’t start on the rewrite until I get back in San Francisco. Today, I’m going to play with the Simon Funk (Brandyn Webb) stochastic gradient descent version of SVD (singular value decomposition). This works by minimizing the Frobenius norm. I also want to implement a small K-means for a collaborative filter. One of my hosts is a business guy and was talking about a recommender system. So the natural thing to do is try conventional approaches: factor and clustering filters.

AFDS 2012 Third Day

AMD surpassed the insanity of the Dot Com Bubble today. It was awesome.

Unfortunately, I lost my keynote photo of David Perry, CEO of Gaikai. Google, you should fear this man. He will disrupt your business model and make all your base belong to us.

My technical session went well.

The very last session of the day in the same room was standing room only. It was hot in there. I ended up sitting on the floor.

The evening parties were tremendous. I would have more photos but filled up my phone and lost the early images. My gods.

Dance floor next to the bowling lanes:

Start of the Distributed Dance Party outside the Bellevue Hyatt:

Flashmob on the second floor:

Pyrotechnics set off inside the Hyatt summoned police and fire units:

Skybridge to the mall:

Outside Microsoft offices in the mall:

Mall loading dock:


Reached the park:

The police were incredibly polite. They only asked that people stop dancing in the water sculpture. There was no permit. And no one complained. So a very nice officer with a shaved head explained they were there to make sure no one got hurt. Seriously, that is exactly right. You don’t want insanity to take hold of a crowd.

I asked someone about Nvidia’s GTC conference last month. They had jugglers.

AMD, you rock.

AFDS 2012 Second Day

Meydenbauer convention center lobby:

Meydenbauer elevators:


Keynote room in the Meydenbauer:

Pounding electronic dance music before the keynote to raise the blood. The introduction felt like being in an action movie that is really a music video. I sat up front with the bloggers:

Microsoft building across the street from the Meydenbauer. I used to walk past it without thinking. Maybe sometime soon, Google will have buildings like this:

Small breakout session room; the stage lights completely blind the speakers. However, someone else told me that you forget about them after the first minute:

Better food for speakers, almost Google quality:

I just ate two of these things:

Linear actuators on the corners of the chair provide motion feedback while driving. It’s really hard as the drive-by-wire wheel oversteers:

AFDS 2012 First Day

Badging booth:

Welcome Reception in the ballroom (note the DJ booth with red lights):

Hyatt Regency Bellevue venue map:


A Beta Tomorrow

I originally envisioned a beta release this summer. Well, that’s now. The new date is before next year. It’s a more real date (at least for me) as I am saying it in public at a conference in a few days.

These two talks are interesting.

  1. “Evolving GPUs into a Substrate for Cloud Computing” – The takeaway is that GPUs are changing. SoC processors with integrated GPUs (e.g. AMD Fusion APU, Intel Core, Nvidia Tegra) may lack the sexy high peak throughput of big discrete cards. However, they may also have much lower penalties for data transfers. The talk showcased an example of porting memcached indexing to an AMD Fusion APU. Yes, it does run significantly faster even with the I/O costs and control flow divergence. A big discrete like an ATI HD 5870 has terrible performance in this use case.
  2. “Simon Peyton Jones: Data Parallel Haskell” – The takeaway is that pure functional is good for parallel code but bad for locality of data. That’s bad as data movement costs often dominate performance in practice.

The many-core accelerator designs all seem to fail in the marketplace. STI Cell. Larrabee. They are too expensive for what they do, the Swiss Army knife effect of not excelling at anything. What seems to win are the multi-core CPU and the GPU. These are very good at what they do.

It is the bazaar of CPUs and GPUs over the cathedral of many-core.

There’s a lot of interest now in “device fission”. I see this as part of virtualization, which is really about two sides of the same thing: selling more with less and increasing device utilization. It’s like slicing the pizza and selling the slices. You make more money that way and better utilize the pizza too.

I haven’t seen any talk of “device fusion”. Why can’t a cloud look like a big stream processor? If there were virtual fused compute devices, then applications could scale arbitrarily.

But maybe the reason “device fusion” isn’t considered is that, like the nuclear counterpart, it’s something cool that is impractically difficult. That’s a little like the many-core vision of Swiss Army knife homogeneous computing devices.

So the lesson here is that heterogeneity is real. It’s ugly. But that’s diversity in the bazaar. We want cathedrals because they appeal to inner drives to seek beauty through perfection and order. That’s a trap.

Inlining OpenCL with managed code works

It’s a dumb example but shows the basic idea.

ProgramCL progCL(
    "__kernel void add_scalar(__global float* a, float s) {"
    "  a[get_global_id(0)] += s;"
    "}" );

Arrayf32 C;
{
    Arrayf32 A = make1(N, cpuA);
    Arrayf32 B = make1(N, cpuB);

    C = A + B;

    (progCL, "add_scalar", C, 1.5f)(N, 1);

    C += A - 3.3f;
}
C.read1(cpuC, N * sizeof(float));

OpenCL kernels can be inserted anywhere in the managed execution trace. The virtual machine and JIT do the right thing regarding memory transfers and generated kernel boundaries. It just works. If nothing else, this allows programming in OpenCL as a kernel language and letting the virtual machine manage memory. It’s also a very natural way to extend the platform dynamically.

OpenCL programs can have multiple kernels, local memory, images, etc. The ProgramCL object parses the source code text and does the right thing with arguments. The work space dimensions can be 1D or 2D. The example above is 1D with “(global0, local0)”. For 2D, this would be “(global0, global1, local0, local1)”.

The syntax is unusual. It is also the best compromise I could find. I did not want the verbose Java “setData()” style of interface design. However, I also did not want to expose magic macros or C++ templates in the API. The variadic function call does not work as the array variable objects are not Plain Old Data. So I overloaded the comma and function call operators. It’s a balance between flexibility, natural syntax, and compile-time safety.

I have to switch gears and work on my presentation for AFDS 2012. The final slide deck deadline with AMD is June 6th. There’s also another slide deck I’m working on too.