
Distribution of error in ATI GPU output

ATI GPUs do not have ECC memory. NVIDIA GPUs (at least the professional models) use ECC memory. In practical terms, what difference does this make to GPGPU applications?

The mixed precision kernel evergreenmatmul_4_8_4_2_0_0_10_0_400_400_400_1_0_10_10_4_4_23 from the last post is really ten 400x400x400 matrix multiplications tiled into a single kernel invocation. One test with random positive inputs gives the following errors against a reference calculation on the CPU.

ABSDIFF 6
(6, 224, 121) diff by 1    host array: 93.9413    gpu calc: 92.9138
(6, 224, 135) diff by 1    host array: 102.461    gpu calc: 101.153
(6, 225, 135) diff by 1    host array: 100.383    gpu calc: 101.405
(9, 184, 51) diff by 1    host array: 95.3789    gpu calc: 96.4027
(9, 184, 54) diff by 1    host array: 104.34    gpu calc: 105.444
(9, 187, 49) diff by 1    host array: 102.789    gpu calc: 101.757

Out of the 1.6 million numbers in the calculated output (10 * 400 * 400 = 1,600,000), exactly six are wrong. The rest are within 1e-07 of the reference calculation. Repeating the test yields a different set of errors each time: anywhere from half a dozen to two dozen numbers will be wrong. It’s random.
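For context, this is roughly the kind of check involved. It is a minimal sketch with names and array layout of my own choosing, not the original test harness: a plain triple-loop reference multiply on the CPU, compared element by element against the values copied back from the GPU.

/* sketch only: compare GPU output against a CPU reference for ten 400x400x400
 * matrix multiplies, printing elements that differ by more than a tolerance */
#include <math.h>
#include <stdio.h>

#define BATCH 10
#define N     400
#define TOL   1e-7

int count_mismatches(const double *A, const double *B, const double *gpuC)
{
    static double ref[N * N];
    int bad = 0;
    for (int b = 0; b < BATCH; b++) {
        /* plain triple-loop reference multiply on the CPU */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[(b*N + i)*N + k] * B[(b*N + k)*N + j];
                ref[i*N + j] = sum;
            }
        /* compare against the values copied back from the GPU */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double h = ref[i*N + j];
                double g = gpuC[(b*N + i)*N + j];
                if (fabs(h - g) > TOL) {
                    printf("(%d, %d, %d) host array: %g    gpu calc: %g\n",
                           b, i, j, h, g);
                    bad++;
                }
            }
    }
    return bad;
}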

There was a bug in my comparison code: I used abs() instead of fabs(), which truncated most errors to zero (see the sketch after the histogram). Here is what the error distribution really looks like:

                       1.0 <= error : 9
                 0.1 <= error < 1.0 : 369
                0.01 <= error < 0.1 : 67
              0.001 <= error < 0.01 : 2
            0.0001 <= error < 0.001 : 901
          0.00001 <= error < 0.0001 : 1116457
        0.000001 <= error < 0.00001 : 318057
      0.0000001 <= error < 0.000001 : 0
    0.00000001 <= error < 0.0000001 : 0
  0.000000001 <= error < 0.00000001 : 0
0.0000000001 <= error < 0.000000001 : 0
               error < 0.0000000001 : 164138
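The bug itself is a classic C pitfall. abs() from stdlib.h takes and returns an int, so a double difference passed to it is silently truncated toward zero; only mismatches of magnitude 1 or more survive, which is why the first run reported just six errors, each "diff by 1". A tiny demonstration, using numbers taken from the listing above:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double host = 102.461, gpu = 101.153;

    /* abs() takes an int: the 1.308 difference is truncated to 1
       (most compilers warn about this, but it compiles) */
    printf("abs:  %d\n", abs(host - gpu));

    /* fabs() keeps the fractional part: the real error is 1.308 */
    printf("fabs: %g\n", fabs(host - gpu));

    return 0;
}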

Each element of the output array is the sum of 400 products of double precision random numbers, so some round-off error is expected. However, the histogram shows that much larger errors occur fairly often. I recall that PeakStream had done fairly detailed analysis of arithmetic error on the GPU and touted this as a selling point over other frameworks that largely ignored the issue. (It will be interesting to see what the error distribution looks like on NVIDIA GPUs. Despite ECC memory, I would be surprised if there are not similar issues with arithmetic error.)
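Part of the discrepancy is expected even with perfect hardware: floating point addition is not associative, and a tiled GPU kernel accumulates its 400 products in a different order than a straight CPU loop does. Whether that accounts for the whole error band in the histogram depends on how the kernel reduces, but the mechanism itself is easy to show. This is a generic illustration, not the kernel's actual reduction order:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double x[400], forward = 0.0, blocked = 0.0;
    for (int i = 0; i < 400; i++)
        x[i] = (double)rand() / RAND_MAX;

    /* straight left-to-right sum, like the CPU reference loop */
    for (int i = 0; i < 400; i++)
        forward += x[i];

    /* partial sums in blocks of 4, then combined, loosely like a tiled kernel */
    for (int b = 0; b < 400; b += 4) {
        double partial = 0.0;
        for (int i = b; i < b + 4; i++)
            partial += x[i];
        blocked += partial;
    }

    /* the two results typically differ in the low-order bits */
    printf("forward: %.17g\nblocked: %.17g\ndiff: %g\n",
           forward, blocked, forward - blocked);
    return 0;
}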

Is this acceptable for your application? It is for mine. (Note: this is the stuff the vendors never talk about!)

When I was a support quant, error metrics of the forecast strategy against realized actuals could easily fluctuate by 50 to 100 basis points just due to accumulated round-off error. That was not too surprising, as the data was long-tailed: adding floating point numbers of very different magnitudes loses precision in the smaller terms (see the sketch below). Fortunately, this was not a problem for us.
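Here is the magnitude-mixing problem in isolation, along with Kahan (compensated) summation, the standard mitigation. This is a general illustration, not code from the forecast system:

#include <stdio.h>

int main(void)
{
    double naive = 1e16, kahan = 1e16, c = 0.0;

    /* add a million small values to a large running total */
    for (int i = 0; i < 1000000; i++) {
        double v = 1.0;

        /* naive: each 1.0 is below the rounding granularity at 1e16 and is lost */
        naive += v;

        /* Kahan: carry the lost low-order bits forward in c */
        double y = v - c;
        double t = kahan + y;
        c = (t - kahan) - y;
        kahan = t;
    }

    printf("naive: %.17g\n", naive);   /* stays at 1e16 */
    printf("kahan: %.17g\n", kahan);   /* close to 1e16 + 1e6 */
    return 0;
}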

In quantitative strategy problems, there is already so much uncertainty from data quality issues that trading some correctness in the calculated output for higher number crunching performance may be a good deal. That’s probably not true of many scientific computing problems, however, which explains why NVIDIA’s trade-off of reduced performance and higher cost in exchange for correctness is really a must-have for much of the HPC market. The higher raw performance of ATI GPUs tends to be used for problems like Bitcoin mining and password cracking, where an occasional wrong answer costs almost nothing.
