there are so many cores

API dispatch table and stream shape

Here is a table that connects:

  • virtual machine operation bytecodes
  • PeakStream API function points
  • CPU back-end
  • stream shape
  • computational graph

The most basic JIT optimization is finding subgraphs with homogeneous stream shape, that is, multiple consecutive loops that can be fused together into a single loop. Each such subgraph corresponds to one synthesized compute kernel for the GPU. This matters because kernel scheduling overhead is very high on GPUs, so minimizing the number of kernels scheduled can have a dramatic effect on performance. In my experience, the difference is about 5x higher dense matrix multiply throughput on an ATI Radeon HD 5870 in double precision.
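
To make the fusion idea concrete, here is a minimal sketch that groups consecutive operations with the same stream shape into one kernel. It is an illustration under simplifying assumptions, not the actual JIT: the names `Shape`, `Op`, `Kernel` and `fuse` are hypothetical, and the input is a flat trace of operations rather than a real computational graph with data dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stream shape: width and height of the array an operation produces.
struct Shape {
    std::size_t w, h;
    bool operator==(const Shape& o) const { return w == o.w && h == o.h; }
};

// Hypothetical traced operation: an API name plus the shape of its output.
struct Op {
    std::string name;
    Shape shape;
};

// A fused kernel: a run of consecutive operations that share one stream
// shape, so they can all execute inside a single loop.
struct Kernel {
    Shape shape;
    std::vector<std::string> ops;
};

// Greedily group consecutive operations with identical shape into kernels.
// A new kernel starts whenever the stream shape changes.
std::vector<Kernel> fuse(const std::vector<Op>& trace) {
    std::vector<Kernel> kernels;
    for (const Op& op : trace) {
        // This sketch treats a stream with zero extent as a tracing bug.
        assert(op.shape.w > 0 && op.shape.h > 0);
        if (kernels.empty() || !(kernels.back().shape == op.shape)) {
            kernels.push_back(Kernel{op.shape, {}});
        }
        kernels.back().ops.push_back(op.name);
    }
    return kernels;
}
```

With this sketch, a trace of `abs`, `exp` and `sqrt` on a 4x4 stream followed by one operation of a different shape produces two kernels instead of four, which is the kind of reduction in scheduled kernels described above.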

code    API name                CPU back-end            stream shape    graph node

0       abs                     CPUisomorph<cpu_ABS>    same            interior
1       cond                    CPUcond                 polymorphic     interior
2       construct_f32           CPUscalar               none            leaf
3       construct_f64           CPUscalar               none            leaf
4       convert_f32             CPUconvert<float>       same            interior

5       convert_f64             CPUconvert<double>      same            interior
6       dot_product             CPUdotprod<false>       reduction       interior
7       exp                     CPUisomorph<cpu_EXP>    same            interior
8       gather1_floor           CPUshuffle<1>           same            interior
9       gather2_floor           CPUshuffle<2>           same            interior

10      index1_f32              CPUidxdata<float, 1>    none            leaf
11      index1_f64              CPUidxdata<double, 1>   none            leaf
12      index2_f32              CPUidxdata<float, 2>    none            leaf
13      index2_f64              CPUidxdata<double, 2>   none            leaf
14      make1_f32               CPUmakedata<float, 1>   (W, 1)          leaf

15      make1_f64               CPUmakedata<double, 1>  (W, 1)          leaf
16      make2_f32               CPUmakedata<float, 2>   (W, H)          leaf
17      make2_f64               CPUmakedata<double, 2>  (W, H)          leaf
18      matmul                  CPUmatmul               polymorphic     interior
19      max                     CPUbinop<cpu_MAX>       same            interior

20      mean                    CPUaccum<true>          reduction       interior
21      min                     CPUbinop<cpu_MIN>       same            interior
22      negate                  CPUisomorph<cpu_NEGATE> same            interior
23      ones1_f32               CPUlitdata<float, 1>    none            leaf
24      ones1_f64               CPUlitdata<double, 1>   none            leaf

25      ones2_f32               CPUlitdata<float, 2>    none            leaf
26      ones2_f64               CPUlitdata<double, 2>   none            leaf
27      operatorADD             CPUbinop<cpu_ADD>       same            interior
28      operatorAND             CPUbinop<cpu_AND>       same            interior
29      operatorDIV             CPUbinop<cpu_DIV>       same            interior

30      operatorEQ              CPUbinop<cpu_EQ>        same            interior
31      operatorGE              CPUbinop<cpu_GE>        same            interior
32      operatorGT              CPUbinop<cpu_GT>        same            interior
33      operatorLE              CPUbinop<cpu_LE>        same            interior
34      operatorLT              CPUbinop<cpu_LT>        same            interior

35      operatorMUL             CPUbinop<cpu_MUL>       same            interior
36      operatorNE              CPUbinop<cpu_NE>        same            interior
37      operatorOR              CPUbinop<cpu_OR>        same            interior
38      operatorSUB             CPUbinop<cpu_SUB>       same            interior
39      pusharray_f32           CPUpusharray<float>     (W, H)          leaf

40      pusharray_f64           CPUpusharray<double>    (W, H)          leaf
41      read_scalar_f32         CPUnop                  -               root
42      read_scalar_f64         CPUnop                  -               root
43      read1_f32               CPUnop                  -               root
44      read1_f64               CPUnop                  -               root

45      read2_f32               CPUnop                  -               root
46      read2_f64               CPUnop                  -               root
47      rng_normal_make_f32     CPUrngnormal<float>     none            leaf
48      rng_normal_make_f64     CPUrngnormal<double>    none            leaf
49      rng_uniform_make_f32    CPUrnguniform<float>    none            leaf

50      rng_uniform_make_f64    CPUrnguniform<double>   none            leaf
51      sqrt                    CPUisomorph<cpu_SQRT>   same            interior
52      sum                     CPUaccum<false>         reduction       interior
53      zeros1_f32              CPUlitdata<float, 1>    none            leaf
54      zeros1_f64              CPUlitdata<double, 1>   none            leaf

55      zeros2_f32              CPUlitdata<float, 2>    none            leaf
56      zeros2_f64              CPUlitdata<double, 2>   none            leaf
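
As a rough sketch of how a table like this could be held in code, the snippet below indexes an array by bytecode and stores the API name, stream shape class and graph position for the first five rows. The types `ShapeClass`, `GraphRole` and `DispatchEntry` and the functions `lookup` and `fusable` are hypothetical names chosen for illustration. The CPU back-end column is left out because the definitions of those template classes are not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical classification of the "stream shape" column.
enum class ShapeClass { Same, Polymorphic, Reduction, None, Explicit, NotApplicable };

// Hypothetical classification of the last column: position in the graph.
enum class GraphRole { Leaf, Interior, Root };

struct DispatchEntry {
    const char* api_name;
    ShapeClass shape;
    GraphRole role;
};

// First five rows of the table; the array index is the bytecode.
const std::array<DispatchEntry, 5> kDispatch = {{
    {"abs",           ShapeClass::Same,        GraphRole::Interior},
    {"cond",          ShapeClass::Polymorphic, GraphRole::Interior},
    {"construct_f32", ShapeClass::None,        GraphRole::Leaf},
    {"construct_f64", ShapeClass::None,        GraphRole::Leaf},
    {"convert_f32",   ShapeClass::Same,        GraphRole::Interior},
}};

// Returns nullptr for bytecodes outside the rows stored in this sketch.
const DispatchEntry* lookup(std::size_t code) {
    return code < kDispatch.size() ? &kDispatch[code] : nullptr;
}

// Assumption of this sketch: only operations marked "same" keep the stream
// shape of their input, so only those may join the enclosing fused loop.
bool fusable(std::size_t code) {
    const DispatchEntry* entry = lookup(code);
    assert(entry != nullptr);
    return entry->shape == ShapeClass::Same;
}
```

Keeping the stream shape class next to each bytecode means a fusion pass can decide whether an operation continues the current loop with a single array lookup.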
