Brain dump of slide topics

This is just a brain dump of slide topics for PP12. It’s far too much: roughly 77 slides, which works out to about 20 seconds per slide in a 25 minute talk.

It helps me to write down some of these ideas. In a past life, I was on the math professor career track. Teaching helps the teacher understand, perhaps as much as it helps the students.

Introduction

  pedagogical and historical order:
  1. GPGPU basics
  2. parameterized kernel design
  3. auto-tuning JIT back-end
  4. application virtual machine architecture
  5. memory management
  6. JIT middle-end
  7. application
  8. performance, error, and the long view


GPGPU basics

  CPU and GPU, the von Neumann machine and stream processor
  kernelized functional closures where side-effects are important
  trade time for space rather than space for time
  arithmetic intensity is the time/space complexity ratio (see the kernel sketch after this list)
  six kinds of memory: host, driver, global, local, private and texture
  optimization is mostly memory efficiency
  inner and outer blocking
  inner and outer vectorization
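
To make the memory kinds and arithmetic intensity concrete, here is a minimal sketch: an OpenCL C SAXPY kernel held as a C++ string literal, the way a JIT back-end would carry it. The annotations are mine; the 1/6 flop per byte figure just follows from counting 2 flops against 12 bytes moved.

    // SAXPY: z = a*x + y, annotated with the memory kinds above.
    const char* saxpySrc = R"(
    __kernel void saxpy(const float a,            /* private memory */
                        __global const float* x,  /* global memory  */
                        __global const float* y,
                        __global float* z)
    {
        const int i = get_global_id(0);           /* private memory */
        z[i] = mad(a, x[i], y[i]);                /* 2 flops per 12 bytes moved:
                                                     intensity 1/6, memory bound */
    }
    )";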


Parameterized kernel design

  the shader compiler won't do these loop code transformations for you (see the sketch after this list):
  a. unrolling
  b. fusion
  c. interchange
  d. strip mining
  e. tiling
  f. scalar expansion
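
Since the shader compiler won't help, the kernel generator has to emit the transformed loops itself. A hedged sketch of what (a) unrolling combined with (d) strip mining looks like in generated OpenCL C; TILE stands in for a template parameter the auto-tuner would choose, and n is assumed to be a multiple of TILE.

    // Strip mining: split the iteration space into TILE-sized strips.
    // Unrolling: the strip body is emitted TILE times with no inner loop.
    const char* stripMinedSrc = R"(
    #define TILE 4
    __kernel void scale(__global float* x, const float a, const int n)
    {
        /* assumes n is a multiple of TILE */
        for (int base = TILE * get_global_id(0);
             base < n;
             base += TILE * get_global_size(0))
        {
            x[base + 0] *= a;   /* unrolled TILE times */
            x[base + 1] *= a;
            x[base + 2] *= a;
            x[base + 3] *= a;
        }
    }
    )";
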
  ten general kernel design principles for auto-tuning:
  1. vectorize memory access and use mad() scalar arithmetic (sketch after this list)
  2. directional bias in memory reads
  3. blocked and tiled layout is not always better than simple linear
  4. tune outer and inner blocking (problem size, work groups and registers)
  5. kernel execution time variance
  6. check correctness of data output
  7. designs generalize within but not necessarily across device architectures
  8. memory buffers can be faster than textures
  9. coalescing kernels to amortize overhead
  10. synthesize low intensity kernels JIT, optimize high intensity kernels AOT
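
A minimal sketch of principle 1, again as OpenCL C inside a C++ string literal: float4 loads and stores keep the memory traffic vectorized while the arithmetic stays scalar mad() calls. Identifier names are illustrative.

    const char* axpy4Src = R"(
    __kernel void axpy4(const float a,
                        __global const float4* x,  /* vectorized reads */
                        __global const float4* y,
                        __global float4* z)
    {
        const int i = get_global_id(0);
        const float4 xv = x[i];
        const float4 yv = y[i];
        float4 zv;
        zv.s0 = mad(a, xv.s0, yv.s0);              /* scalar mad() arithmetic */
        zv.s1 = mad(a, xv.s1, yv.s1);
        zv.s2 = mad(a, xv.s2, yv.s2);
        zv.s3 = mad(a, xv.s3, yv.s3);
        z[i] = zv;                                 /* vectorized write */
    }
    )";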

Auto-tuning JIT back-end

  auto-tuning as statistical optimization (expectation maximization works)
  register pressure and convexity (why should optimization converge?)
  define a model family with endogenous kernel template parameters
  avoid the curse of dimensionality with exogenous brute force (see the sketch after this list)
  memoization and journaling as practical technology considerations
  everything fails: compilers, drivers, and kernels
  auto-tuning ahead of time and just in time (or cold start versus warm start)
  invest in arithmetic intensity to maximize returns (the 80/20 rule again)
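
A hedged sketch of the back-end's outer loop under those headings: brute force over a small exogenous parameter grid, a correctness check on every variant, and a memoized journal for warm starts. The helper functions are assumptions standing in for the real synthesis and timing machinery, not an actual API.

    #include <limits>
    #include <map>
    #include <string>
    #include <vector>

    struct Params { int blockH, blockW, vecLen; };    // endogenous template knobs

    // assumed helpers: synthesize a kernel variant, time it, verify its output
    std::string synthesize(const Params&);
    double      timeKernel(const std::string& src);   // average wall clock seconds
    bool        checkOutput(const std::string& src);  // everything fails sometimes

    Params autotune(const std::vector<Params>& grid,  // assumes a non-empty grid
                    std::map<std::string, Params>& journal,
                    const std::string& key)
    {
        if (journal.count(key))                       // warm start: memoized result
            return journal[key];
        double best = std::numeric_limits<double>::max();
        Params winner = grid.front();
        for (const Params& p : grid) {
            const std::string src = synthesize(p);
            if (!checkOutput(src))                    // reject broken variants
                continue;
            const double t = timeKernel(src);
            if (t < best) { best = t; winner = p; }
        }
        journal[key] = winner;                        // journal for the next cold start
        return winner;
    }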


Application virtual machine architecture

  managed platform as C++ domain specific language (inspired by PeakStream)
  bytecode stack machine (see the sketch after this list)
  tracing JIT and interpreter
  concurrency and data parallelism with the gather/scatter scheduler
  application, device and scheduler threads
  hashing execution traces and vectorizing threads
  JIT translator and OpenCL compute devices
  interpreter clears trace queue when the translator fails
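
A hedged sketch of how the front of this pipeline could look: operator overloading on a managed array handle records stack machine bytecode onto a per-thread trace instead of computing eagerly, and reading a scalar back is what forces a flush. All names are illustrative, not the platform's real API.

    #include <cstdint>
    #include <vector>

    enum Bytecode : std::uint8_t { MAKE1, ADD, MUL, SUM, READ_SCALAR };

    thread_local std::vector<std::uint8_t> traceQueue;  // one per application thread

    struct Arrayf32 {                                   // managed array handle
        static Arrayf32 make1(int /*n*/, const float* /*cpu*/)
        {
            traceQueue.push_back(MAKE1);                // record, don't evaluate
            return Arrayf32{};
        }
        float read_scalar()
        {
            traceQueue.push_back(READ_SCALAR);          // forces the trace to run:
            return 0.0f;                                // stub; JIT or interpreter
        }
    };

    inline Arrayf32 operator+(Arrayf32, Arrayf32)
    {
        traceQueue.push_back(ADD);                      // recorded for the scheduler
        return Arrayf32{};
    }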


Memory management

  managed platforms are always about memory
  levels of memory indirection: host, front, back, and compute device
  ultimate owners are arrays on the stack frame and application threads
  garbage collection through nested reference counting (see the sketch after this list)
  unifying memory across traces after gathering
  translator explicitly enqueues data transfers with compute devices
  interpreter implicitly swizzles back to CPU host and scatters
  compute device state makes continuation hard (data movement is expensive!)
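
A minimal sketch of the nesting, with std::shared_ptr standing in for explicit reference counts: the array variable on the application stack frame owns a front memory object, which in turn owns any back memory object on a compute device, so releases cascade from the outside in.

    #include <memory>

    struct BackMem  { /* compute device buffer */ };

    struct FrontMem {                      // host-side mirror of device data
        std::shared_ptr<BackMem> back;     // inner count: freed with the front
    };

    struct Arrayf32 {                      // ultimate owner on the stack frame
        std::shared_ptr<FrontMem> front;   // outer count: application references
    };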


JIT middle-end

  tracing without the instruction pointer
  constant lifting and heuristic loop rolling of bytecode (see the sketch after this list)
  don't box too much: ASTs, statements, and variables
  don't box too little: kernelization and index spaces
  sending live variables and creating dead temporaries
  auto-tuning warm start from kernelization
  reconciling auto-tuned vector lengths
  no worries mixing images and memory buffers together
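
Since tracing discards the instruction pointer, loop structure has to be recovered heuristically. A hedged sketch of the idea: scan the boxed trace statements for back-to-back repeats of the same hash and roll them into a single counted loop node. Names are illustrative.

    #include <cstddef>
    #include <vector>

    struct Stmt { std::size_t hash; };       // one boxed trace statement

    struct Rolled { Stmt body; std::size_t count; };

    // assumes start < trace.size()
    Rolled rollLoop(const std::vector<Stmt>& trace, std::size_t start)
    {
        std::size_t n = 1;
        while (start + n < trace.size() &&
               trace[start + n].hash == trace[start].hash)
            ++n;                             // count identical repeats
        return Rolled{ trace[start], n };    // one loop body, executed n times
    }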

Application

  array programming as stream processing (inspired by Brook; see the sketch after this list)
  data parallel styles: OpenMP loop; Pthreads; array data vectors
  data parallel reductions
  four kinds of matrix multiplication
  concurrent loops and extra complications with stream processing
  mixed single and double precision
  intermingling with unmanaged native code
  runtime configuration: avoid the GPU chimera binary
  biased for performance: auto-tuning ahead of time
  dynamically extending the virtual machine
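
A hedged sketch of the application-facing style, after PeakStream's published examples: a data parallel dot product reduction against the Arrayf32 handle sketched earlier. The operator* and sum() declarations are assumptions, shown declaration-only.

    struct Arrayf32 {                                    // managed handle, as before
        static Arrayf32 make1(int n, const float* cpu);  // wrap host data
        float read_scalar();                             // force evaluation
    };
    Arrayf32 operator*(Arrayf32, Arrayf32);              // assumed: elementwise multiply
    Arrayf32 sum(Arrayf32);                              // assumed: reduction to scalar

    float dotProduct(const float* x, const float* y, int n)
    {
        Arrayf32 a = Arrayf32::make1(n, x);              // enters the trace
        Arrayf32 b = Arrayf32::make1(n, y);
        return sum(a * b).read_scalar();                 // flush: JIT or interpreter runs
    }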


Performance, error, and the long view

  some benchmark numbers for comparison
  compute devices are different and complementary
  thinking about error in terms of basis points (see the sketch after this list)
  quantitative strategy and analytics rather than solving PDEs
  classification and clustering to find structure in data
  it's a search problem again
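
A basis point is one hundredth of one percent, so 1e-4 in relative terms. One way to frame single precision GPU results against a double precision reference:

    #include <cmath>

    // relative error expressed in basis points (assumes reference != 0)
    double errorInBps(double reference, double computed)
    {
        return 1e4 * std::fabs(computed - reference) / std::fabs(reference);
    }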