there are so many cores

Stencil kernels will be the next major feature

The conference is over for me. My presentation went over well. It was recorded so I will post a link once it is available through SIAM/Blue Sky Broadcast.

Sincerest thanks and gratitude to Keita Teranishi for inviting me.

Jan Tore Korneliussen inquired about per-element specification of computational kernels. My answer was image convolution. I have changed my mind. General stencil kernels are needed. This will require extending the DSL.

Matthias Christen’s presentation, “Automatic Code Generation and Tuning on Stencil Kernels on Modern Microarchitecture”, changed my mind. Supporting general stencils is not much harder than supporting convolution. Non-linearity is also incredibly important for computer vision and feature detection, so linear image convolution is not enough.
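To make the difference concrete, here is a minimal sketch in plain Python. It is not Chai code and not anything from the presentation; the names `apply_stencil`, `box_blur` and `median_filter` are made up for illustration. A linear convolution and a non-linear filter are both just a function applied to each pixel's neighborhood, which is why a general stencil covers both.

```python
from statistics import median

def apply_stencil(image, radius, fn):
    """Apply fn to the (2*radius+1)^2 neighborhood of every interior pixel.

    Border pixels are copied unchanged to keep the sketch short.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(radius, rows - radius):
        for c in range(radius, cols - radius):
            window = [image[r + dr][c + dc]
                      for dr in range(-radius, radius + 1)
                      for dc in range(-radius, radius + 1)]
            out[r][c] = fn(window)
    return out

def box_blur(window):
    # Linear: a fixed weighted sum of the neighborhood (a convolution).
    return sum(window) / len(window)

def median_filter(window):
    # Non-linear: cannot be written as a fixed weighted sum.
    return median(window)

image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 9.0  # a single noisy pixel

blurred = apply_stencil(image, 1, box_blur)       # smears the spike
cleaned = apply_stencil(image, 1, median_filter)  # removes the spike
```

The box blur spreads the noisy pixel over its neighbors, while the median drops it entirely. Only the per-element function changes, not the traversal.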

Stencil kernels are the highest-priority major new feature in Chai.

Until now, I have seen Chai in the context of the kernel trick and scoring functions built on fast matrix multiply, that is, marking positions to value quantitative strategies. As with non-linear stencils versus linear convolution, my vision has changed.

In the last two weeks, two major themes have loomed larger for me.

  1. GPGPU inverts the original intent of the GPU: instead of rendering image data from a geometric model, a model is fitted to data
  2. Compilers must write software for us

The biggest theme I saw at the conference, although it seemed novel to everyone except Teranishi-san, is that supercomputing and statistical learning theory are converging.

What drives this is scale.

The human mind, even when we work together in groups, is unable to solve problems once they become too large and complex. As supercomputers grow ever larger, the ability of human beings to write software begins to break down. We must rely on machines to help us write software.

GPU compute devices may be more significant in embedded applications than in HPC. There are two reasons for this.

The risk of new technology is an extra cost. Most traditional high performance computing is done using CPU cores. The users are governments and large institutions. For them, throwing hardware at the problem and minimizing changes to how software is written is cheaper.

At the same time, embedded applications are highly constrained by power and volume. The pace of design and architectural change is much faster. You cannot throw hardware at this problem. Traditional software development techniques break down. Human beings cannot write and optimize software fast enough to keep up.

The IBM Watson system that defeated Ken Jennings had 2880 cores. The system was almost entirely Java.

Someone in the audience asked Eric Brown (IBM Research), “Why Java?” Pose the same question to Amazon, Google or almost any enterprise IT organization with a software development engineering arm. The answer would be the same.

After his presentation about Watson, I asked Dr. Brown if he thought there was a connection between ensembles and the Markowitz portfolio. That’s what I kept thinking during his talk.

Watson was developed around the same time as the Netflix Prize. That contest ended up being BellKor’s Pragmatic Chaos versus The Ensemble. Watson is a large ensemble of scoring functions trained using a regression model.
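As a rough sketch of what an ensemble of scoring functions combined by a regression model can look like: each candidate answer gets a score from several scorers, and a regression learns how much to trust each one. This is not Watson's code. Logistic regression is my assumption for the regression model, the three scorers are hypothetical, and the numbers are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, steps=2000, lr=0.5):
    """Fit one weight per scorer (plus a bias) by batch gradient descent."""
    n = len(features[0])
    m = len(features)
    weights = [0.0] * n
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

def confidence(weights, bias, x):
    """Combined confidence that a candidate answer is correct."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Each row: outputs of three hypothetical scorers for one candidate answer.
features = [
    [0.9, 0.2, 0.5],
    [0.8, 0.7, 0.4],
    [0.7, 0.9, 0.6],
    [0.2, 0.8, 0.5],
    [0.1, 0.3, 0.6],
    [0.3, 0.1, 0.4],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = the candidate was the right answer

weights, bias = train_logistic(features, labels)
strong = confidence(weights, bias, [0.9, 0.5, 0.5])
weak = confidence(weights, bias, [0.1, 0.5, 0.5])
```

In this toy data only the first scorer separates right from wrong answers, so the regression learns to lean on it, and `strong` ends up larger than `weak`.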

This reminds me of standard portfolio theory in which the optimal strategy is betting on everything. The market basket of the world is the largest possible ensemble. Yet, the standard theory has broken down. This is why mutual funds have been in decline while hedge funds have been ascendant. The world is not growing but becoming more volatile.

So my thought was, will the ensemble market basket of machine learning similarly break down? I guess no one knows.

Then I said, “Ensemble of Watson. Skynet.”

I was speaking for everyone who has ever read Slashdot.

These three presentations:

  1. Evaluation of Numerical Policy Function on Generalized Auto-Tuning Interface Openatlib, Takao Sakurai
  2. Spiral: Black Belt Autotuning for Parallel Platforms, Franz Franchetti
  3. High-order DG Wave Propagation on GPUs: Infrastructure and Implementation, Andreas Klöckner

showed the need for:

  1. mixed strategy ensembles with a meta-JIT
  2. kernel specification as a tensor algebra
  3. code generation from a meta-compiler

In preparing for this conference, I learned to consider the audience. Think of the user. I am not working for myself. I am working for you.

After attending the conference, I understand this in my bones.

Most research and development work is done with the patronage of large institutions, both public and private. The people who create new things may have a larger vision for the future. But they are also constrained to work in the narrow silo defined by their patron.

That is the cost of being in the establishment or being funded by it.

What I am doing is working inside the silo defined by you. I am trying to make something that maximizes total social welfare. Build the right thing that lasts and helps people live better.

About Savannah: it makes me sad to leave. It is like a smaller, more relaxed Seattle, except with Southern hospitality. It is not perfect, and obviously it has a legacy of social issues dating back to slavery, but no place is perfect. I really enjoyed my time here.

