there are so many cores


Matrix multiply ported

Matrix multiply is working.

This brings back bad memories: in practice, every compute device understands a slightly different dialect of OpenCL. A kernel that works perfectly on one device may fail on another, and failure can mean anything from garbage output to a crashed driver. Even when a kernel runs correctly, its performance depends on the specific combination of compute device, runtime SDK, and device driver version.

The ugly reality of OpenCL is that while it is a step in the right direction, it is not very portable in practice. The vendors don’t like to mention this.

That’s why performance tuning is a lot of work. It’s also why auto-tuning is attractive – let a compiler do the tedious search for the fast kernels. Easier said than done.
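To make the idea of auto-tuning concrete, here is a minimal sketch in plain Python. It is not GATLAS and it does not touch OpenCL: it tunes a single hypothetical parameter, the tile size of a blocked matrix multiply on the CPU. Every function name here (`matmul_reference`, `matmul_blocked`, `autotune`) is made up for illustration. The shape of the search is the point: try each candidate, throw away any that produce wrong output, time the rest, and keep the fastest.

```python
import random
import time


def matmul_reference(a, b):
    """Plain triple-loop matrix multiply, used as the trusted baseline."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_blocked(a, b, block):
    """Blocked (tiled) matrix multiply; `block` is the tunable parameter."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def close_enough(x, y, tol=1e-9):
    """Element-wise comparison with a relative tolerance for rounding."""
    return all(
        abs(u - v) <= tol * max(1.0, abs(u), abs(v))
        for row_x, row_y in zip(x, y)
        for u, v in zip(row_x, row_y)
    )


def autotune(a, b, candidates, repeats=3):
    """Return (best_block, best_seconds) among candidates that are correct."""
    expected = matmul_reference(a, b)
    best_block, best_time = None, float("inf")
    for block in candidates:
        # A variant that gives wrong output is rejected, however fast it is.
        if not close_enough(matmul_blocked(a, b, block), expected):
            continue
        # Take the best of several runs to reduce timing noise.
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(a, b, block)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block, best_time


if __name__ == "__main__":
    random.seed(0)
    n = 32
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    block, seconds = autotune(a, b, candidates=[2, 4, 8, 16])
    print(f"fastest correct block size: {block} ({seconds:.6f} s)")
```

The winning block size is only valid for the machine and software stack it was measured on, which is the same portability problem described above. A real tuner would have to redo the search (or keep separate results) for each combination of device, SDK, and driver, and it would search over many more parameters than one.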

My plan is to first port over all of GATLAS as-is. Then it will be connected to the runtime and JIT. That should be pretty cool.

