there are so many cores


Monthly Archives: March 2012

Why integers and floating point are necessary

Integer type support is needed for predictive analytics. This is far bigger than stencil kernels and filters.

Today, most unstructured data that is interesting (i.e. computationally tractable) is text, which is really integers. However, classification and regression of this integer data require floating point for the scoring functions. What’s really going on is that the integer text data is mapped into a vector space over the real numbers using a kernel trick. Then statistical learning happens there using the only complete theory we have: linear algebra.

Machine learning technology as culture:

  1. neural networks – before computers were connected in a global network called the Internet
  2. statistical learning theory – in the age of MapReduce
  3. predictive analytics – in the age of apps

Wikipedia says, “Technology can be viewed as an activity that forms or changes culture.” This is the right way to view machine learning. It is defined by its relationship with society.

We are structuring our societies around markets for labor. When I first heard of Amazon Mechanical Turk, I thought it was twisted genius. “Artificial artificial intelligence,” said Jeff Bezos. Human intelligence is packaged and given value through a market.

I believe this cultural view of intelligence is not limited to people. It extends to machines too. If we restructure society so that human activity revolves around machines, then machine activity comes to revolve around people.

Second alpha release uploaded

The second alpha is committed to GitHub. There’s also a gzipped tarball without the git metadata. Sorry this took so long.

The most interesting new feature is integer array support. Mixed integer and floating point calculation is supported. Also, the JIT code is completely reorganized by stage: AST; bytecode; statements; kernel-related processing.

As promised, there is a vectorized MD5 sample application which runs on the GPU. I cheated with the array subscripting. Actually, my first implementation did overload operator[] and extend the bytecode, interpreter, and JIT. It didn’t feel right at this point, so I backed all that out. I went with a simpler and potentially more flexible approach, using the STL and making the JIT a bit smarter about private registers.

I’ve been distracted lately. That’s another reason things have taken longer.

Slide from BASE meetup

Speakers: Owen Martin, Kontagent; Greg Shirakyan, Microsoft Robotics; Peter Kassakian, Quantcast

Last week, I went to the BASE meetup about ML/AI/Robotics. I caught a bug along the way and lost about two days to sickness afterwards.

This June, I will be speaking at the AMD Fusion Developer Summit. It’s actually quite a bit of work to prepare conference presentations. I feel obligated to do the best job I can to entertain, inform, and educate an audience. It’s a lot like teaching a class in university. The teacher is doing something wrong if the class is bored or lost. It’s not easy.

Array subscripts delay the second alpha

I need to extend the language to better support gather operations. This will delay the alpha 2 code drop by about a week.

Crypto is all about shuffling data. Each round requires arbitrary element-wise gathering. It’s really obvious once you start implementing these algorithms, even a simple one like MD5, on the GPU.

The way I want to do this is with an overloaded subscript operator for the Array(u32|i32|f32|f64) objects. If the Array is 1D, then the subscript indexes elements. If the Array is 2D, then the subscript indexes by rows (as memory layout is naturally row-major). All other complexities should be handled by the JIT internally. This is the most elegant and intuitively natural way. It’s how a programmer would want it to work.
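A minimal sketch of those subscript semantics (these are hypothetical stand-in classes, not the real Chai Array(u32|i32|f32|f64) objects, and there is no JIT here, just plain host containers):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1D array: operator[] selects a single element.
class Array1D {
    std::vector<float> _data;
public:
    explicit Array1D(size_t n) : _data(n) {}
    float& operator[](size_t i) { return _data[i]; }
};

// 2D array: stored row-major, operator[] selects a whole row.
// Chaining a second subscript then reaches an element: a[row][col].
class Array2D {
    size_t _w;                  // columns per row
    std::vector<float> _data;   // row-major storage
public:
    Array2D(size_t w, size_t h) : _w(w), _data(w * h) {}
    float* operator[](size_t r) { return _data.data() + r * _w; }
};
```

The point of the sketch is the asymmetry: the 1D subscript is element-wise, while the 2D subscript follows the natural row-major layout and yields a row, with everything else left to the JIT.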

This approach supports algorithms that do a lot of gathering such as cryptographic hash functions while allowing streaming at the same time. In the case of MD5, the input cleartext is then a 2D array. The rows correspond to character positions in the cleartext. Within each row, the columns correspond to different texts to be hashed. So the MD5 algorithm is vectorized.
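The transposed layout above can be sketched as a host-side packing step (a hypothetical helper, not code from the alpha release): row r holds character position r of every message, so each column is one message, and the GPU hashes all columns in lockstep.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Pack equal-length messages into a 2D row-major buffer where
// rows are character positions and columns are messages.
// Element (r, t) is character r of message t.
std::vector<char> packMessages(const std::vector<std::string>& msgs,
                               size_t msgLen)
{
    const size_t numMsgs = msgs.size();
    std::vector<char> buf(msgLen * numMsgs, 0);
    for (size_t t = 0; t < numMsgs; t++)        // column = message
        for (size_t r = 0; r < msgLen; r++)     // row = char position
            buf[r * numMsgs + t] = msgs[t][r];
    return buf;
}
```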

I realize this may not be especially useful as written. This is just a demo application to show off the platform capabilities. (It’s also very necessary for me to develop the language and JIT in the first place. Straightforward examples are very helpful to get the code transformations working correctly.)

It is funny to me that I was at first dismissive of element-wise subscripted array support. Soon thereafter, I encountered several unrelated people, living in different places and working on different problems, who either gave presentations involving gathering or mentioned it as important. So I recognized it was very important. But I still wanted to give it second-class support.

With crypto algorithms, it’s impossible to hide from this. First-class gathering support is needed. That’s the funny part. I think this was “not invented here” syndrome, which is very dangerous. It was a combination of my limited vision and reluctance to tackle yet another problem. Now I have come around and realize this must go all the way.

MD5 on the GPU

I managed to get MD5 working with the new integer support. That means PeakStream-style code (extended to support unsigned integers as well as floating point types) that runs on the GPU and calculates the MD5 hash function. This is interesting. It leads in a completely different direction than traditional science/engineering/quantitative applications.

What led to this is the need to support stencil kernels using gather operations. If the JIT knows when gather subscripts are literal constants, then it can implement more optimizations. It can prefetch into local memory and autotune. So I decided to extend the Chai DSL with first-class integer type support.
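Here is a plain host-side sketch of why constant subscripts matter (no JIT involved, just an illustration): a 3-point stencil gathers at the fixed offsets -1, 0, +1. When a JIT can prove the gather offsets are literal constants, it knows the exact memory footprint of each work-group, so it can prefetch that window into local memory and autotune tile sizes. With runtime offsets, the footprint is unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 3-point smoothing stencil: gathers neighbors at constant
// offsets -1, 0, +1. The constant offsets are what a JIT would
// exploit for local-memory prefetch and autotuning.
std::vector<float> stencil3(const std::vector<float>& in)
{
    std::vector<float> out(in.size(), 0.0f);
    for (size_t i = 1; i + 1 < in.size(); i++)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    return out;
}
```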

What I didn’t realize is how this opens up entirely new classes of applications (e.g. crypto stuff).

I’ll check in the second alpha in a few days and add another GitHub download. Unfortunately, this doesn’t fix anything that was broken in the first alpha release. However, to my knowledge, I haven’t introduced any regressions. If it worked before, it should still work.

The two new things in the second alpha are: reorganized JIT; first-class integer type support. The MD5 sample code will be in this release. Note: the output agrees with md5sum and the original RSA reference implementation.

Several people have asked what I am doing. My honest answer is, “I have no idea what I am doing.”

Next, they ask what I am building. The simplest response I have is, “Perl for GPGPU. It’s something that you may not like but must use anyway as it is too useful.”

What else can I say?

Also, I’m finally reading about CUDA. At the last meetup, I volunteered to give a book report on CUDA Application Design and Development by Rob Farber. So far, it reads like a good PG-13 movie that really wants to be R rated and was originally a trilogy long believed impossible to translate into film. There’s a lot of stuff in the book and even more between the lines. I promised a guy from NVIDIA that I would try out the code and play with it. So now I kind of feel on the hook. But that’s good. I’m impressed with what NVIDIA has done. I should play with their stuff more.

Added first class integer type support

There are now four basic types in the following promotion hierarchy (operations on mixed types are promoted to the highest one).

  1. double
  2. float
  3. int32_t
  4. uint32_t
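The promotion rule above can be sketched as a rank comparison (the enum names here are hypothetical, not the Chai API): each basic type gets a rank with double highest, and a mixed-type operation takes the higher rank of its operands.

```cpp
#include <cassert>

// Rank order mirrors the hierarchy above: uint32_t lowest,
// then int32_t, float, and double highest.
enum TypeRank { U32 = 0, I32 = 1, F32 = 2, F64 = 3 };

// A mixed-type operation promotes to the higher-ranked operand type.
TypeRank promote(TypeRank a, TypeRank b)
{
    return a > b ? a : b;
}
```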

This is definitely moving beyond PeakStream which supported single and double precision floating point only. The Chai API now has “u32” and “i32” functions along with the PeakStream legacy “f32” and “f64” calls. The OpenCL built-in integer functions are also added to the API. My intention is first-class support for integer, floating point, and mixed precision and type calculation. So I am trying to do everything.

Implementing this turned out to be a miniature nightmare. There are exactly 7900 more lines of code since the alpha release three weeks ago. That’s much more than anticipated.

I still plan on a monthly release cycle – the alpha2 code should be uploaded in about a week. Given the amount of change, the next week will be spent testing.

I originally assumed data could only be two types: single and double precision floating point. This cross-cutting assumption was scattered through the code. The concepts of vector length, floating point precision, and data type were aliased and confused.

The worst part was the interpreter. With only two types, the Cartesian product of types for a binary operator has four combinations (2 x 2 = 4). With four types, there are now 16 combinations (4 x 4 = 16). This is worse when an operator accepts three arguments. Now there are 4 x 4 x 4 = 64 combinations. This is even worse as operations like matrix multiply have four cases: vector * vector; vector * matrix; matrix * vector; matrix * matrix. So that makes 4 x 64 = 256 combinations.
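The combinatorics above, spelled out in a few lines (a hypothetical helper, not interpreter code): with T supported types, an n-argument operator needs T^n type combinations in the interpreter, and shape variants multiply that count again.

```cpp
#include <cassert>
#include <cstddef>

// Number of interpreter dispatch cases for an operator with
// numArgs arguments drawn from numTypes basic types: numTypes^numArgs.
size_t dispatchCases(size_t numTypes, size_t numArgs)
{
    size_t cases = 1;
    for (size_t i = 0; i < numArgs; i++)
        cases *= numTypes;
    return cases;
}
```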

The JIT turned out to be much less work. That was a surprise.

One issue that has become obvious after doing this is code bloat. The API has grown large enough that a single monolithic header file and library is impractical. It needs to be factored into modular components that applications can selectively use. Even if the code bloat is removed, the Chai platform language is big enough that it should be partitioned.

During the long years when C++ was a draft standard, I read of a proposal to have a single massive header file for the entire language. Applications would have a single #include and get the STL and everything else. I don’t know if this could work – but I think everyone has the same gut feeling I do – it seems very wrong.

It’s already March.

My plan for this year is a beta release sometime this summer and a production release by the end of the year. That is not much time at all. The beta should include every major feature in the production 1.0 release. From the summer beta release to the production release, the focus should be on bug fixing, stability and quality.

So from now until the beta, I will be throwing new features in rapidly.

These features include:

  • auto-tuned filter kernels with pre-fetching into local memory (original motivation behind adding integer type support – so the JIT could distinguish constant subscripts in gather operations)
  • (pseudo) random number generation
  • modularized platform