there are so many cores

Just another WordPress.com site

Monthly Archives: August 2012

Example: Monte Carlo options pricing

Here’s PeakStream’s demo code for Black-Scholes using a PRNG Monte Carlo method. (Yes, Black-Scholes is much more efficiently computed without Monte Carlo, e.g. Espen Haug’s entertaining page about Black-Scholes. However, this is still a good demo as it shows the technology and is easy to check results.)

float MonteCarloAntithetic(float price,
                           float strike,
                           float vol,
                           float rate,
                           float div,
                           float T)
{
    float deltat        = T / N;
    float muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
    float volSqrtDeltat = vol * sqrt(deltat);
    float meanCPU       = 0.0f;

    Arrayf32 meanSP; // result
    {
        RNGf32 rng_hndl(RNG_DEFAULT, 0);

        Arrayf32 U = zeros_f32(M);
        for (int i = 0; i < N; i++)
            U += rng_normal_make(rng_hndl, M);

        Arrayf32 values;
        {
            Arrayf32 lnS1 = log(price) + N * muDeltat + volSqrtDeltat * U;
            Arrayf32 lnS2 = log(price) + N * muDeltat + volSqrtDeltat * (-U);
            Arrayf32 S1 = exp(lnS1);
            Arrayf32 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }
    meanCPU = meanSP.read_scalar();

    return meanCPU; 
}

Let’s use Espen Haug’s parameter settings.

  • stock price is $60
  • option strike price is $65
  • volatility is 30%
  • risk-free interest rate is 8% (ridiculous today!)
  • dividend is 0%
  • time to option maturity is 3 months (.25 years)

With M=1000000 and N=100 (one million trials and one hundred time steps), the calculated price is $2.13477 for the PeakStream demo code running on Chai. Espen Haug’s C++ code returns $2.13337. The results agree.

What does the generated OpenCL kernel look like? The entire thing is quite large as the JIT dynamically generates part of the Random123 PRNG library. So I’ll leave that out. The interesting part isn’t too much.

//...note, leaving out the Random123 library for brevity
//...philox_normal_4_4() is a generated stub into Random123
inline float4 philox_normal_4_4(uint32_t tid, uint32_t idx) __attribute__((always_inline));
inline float4 philox_normal_4_4(uint32_t tid, uint32_t idx) {
  philox4x32_key_t rngKey = {{ tid, 0xdecafbad }};
  philox4x32_ctr_t rngCtr = {{ idx, 0xf00dcafe, 0xdeadbeef, 0xbeeff00d }};
union { philox4x32_ctr_t c; int4 ii; float4 fp; } u;
u.c = philox4x32(rngCtr, rngKey);
u.fp = convert_float4(u.ii) * (float4)(2.328306436538696289063E-10) + (float4)(0.5);
return (float4)(sqrt(-2 * log(u.fp.s0)) * sin(2 * 3.14159 * u.fp.s1), sqrt(-2 * log(u.fp.s0)) * cos(2 * 3.14159 * u.fp.s1), sqrt(-2 * log(u.fp.s2)) * sin(2 * 3.14159 * u.fp.s3), sqrt(-2 * log(u.fp.s2)) * cos(2 * 3.14159 * u.fp.s3));
}
__kernel void f_14843374243427385703_1(
  __global float4 * a0,
  const float a1,
  const double a2,
  const float a3,
  const double a4,
  const double a5,
  const float a6,
  const float a7,
  const double a8)
{
  float4 r0;
  float4 r1;
  r1= (float4)(0);
  for (int i0 = 0; i0 < 250000; i0++)
  {
    r0= (float4)(0);
    for (int i1 = 0; i1 < 100; i1++)
    {
      r0= (r0 + philox_normal_4_4((i0), 0 + i1)) ;
    }
    r1+= ((a8 * (max((float4)(0), (exp((float4)((a2 + (a1 * r0)))) - a7)) + max((float4)(0), (exp((float4)((a4 + (a3 * -((float4)(r0)))))) - a6)))) * a5) ;
  }
  a0[0] = r1 / 1000000 ;
}

I haven’t reformatted the code. The above is exactly what the JIT generates. Note that while machine generated, it is still somewhat human readable. This is important. I rely on being able to read the code Chai’s JIT generates in order to develop it.

Also note that the code is efficient. It is perhaps as efficient as if hand-crafted. Loops are rolled and fused. All accumulation is done in registers. (Sure, the JIT isn’t that smart and there are lots of ugly corners inside. But it does show what it can do when it is at its best.)

Ok, this is nice. But it is still not a realistic example. It only uses one thread! Performance will be terrible on a GPU.

More significant is that a quantitative strategy must rely on pricing millions of options, not just one. And each option must have different parameters to explore the pricing structure. (Full disclosure: I have never worked as a quant in the financial industry. I am just talking here.)

Here’s a more realistic example.

vector< double > MonteCarloAntithetic(float price,
                                      float strike,
                                      vector< float >& volVec,
                                      vector< double >& rateVec,
                                      float div,
                                      float T)
{
    const size_t numExperiments = volVec.size() > rateVec.size()
                                      ? volVec.size()
                                      : rateVec.size();

    float deltat = T / N;

    Arrayf64 meanSP; // result
    {
        Arrayf32 vol           = make1(volVec);
        Arrayf64 rate          = make1(rateVec);
        Arrayf32 muDeltat      = (rate - div - 0.5 * vol * vol) * deltat;
        Arrayf32 volSqrtDeltat = vol * sqrt(deltat);

        RNGf64 rng_hndl(RNG_DEFAULT, 0);

        Arrayf64 U = zeros_f64(M);
        for (int i = 0; i < N; i++)
            U += rng_normal_make(rng_hndl, M);

        Arrayf64 values;
        {
            Arrayf64 lnS1 = log(price) + double(N) * muDeltat + volSqrtDeltat * U;
            Arrayf64 lnS2 = log(price) + double(N) * muDeltat + volSqrtDeltat * (-U);
            Arrayf64 S1 = exp(lnS1);
            Arrayf64 S2 = exp(lnS2);
            values = (0.5 * (max(0, S1 - strike) + max(0, S2 - strike))
                          * exp(-rate * T));
        }
        meanSP = mean(values);
    }

    return meanSP.read_scalar(numExperiments);
}

The PRNG has been increased to 64 bit to avoid overflow (adding one million normally distributed random numbers very often overflows). There’s a mixture of single and double precision too. This is more to show off. I don’t know if mixed precision runs faster (very possible on modern GPU hardware which is tailored for single precision).

Note the volatility and rate are now vectors instead of scalars. For example:

float price = 60;
float strike = 65;

// combinations of (volatility, rate)
vector< float > volatility;
vector< double > rate;
for (size_t i = 0; i < 100; i++)
{
    const float vol = .3 + i * .01;
    const double r = .01 - i * .001;

    volatility.push_back( vol );
    rate.push_back( r );
}

float dividend = 0;
float T = 0.25;

vector< double > meanSP
    = MonteCarloAntithetic(price, strike, volatility, rate, dividend, T);

for (size_t i = 0; i < meanSP.size(); i++)
    cout << "meanSP[" << i << "] is " << meanSP[i] << endl;

This will calculate the option price for 100 combinations of (volatility, rate) on the GPU.

I do not understand how deep this goes. How effective is auto-tuning? How flexible is this array programming with PRNG approach? Can it model exotic kinds of options? Can it handle real world business rules and events? I do not have the answers to these questions. What I do have is the basic technology to begin finding out.

The investment banks and hedge funds of the world have been using this kind of technology for years. That’s why options pricing is one of the often cited killer applications for GPGPU. What is new with Chai is a high level managed language for doing this.

My experience as a support quant in a non-financial context is that business sees engineering as overhead. Markets keep changing so what’s the point of deep investments in technology when it will become obsolete soon? This is also why mixed-paradigm functional programming languages like Ocaml can be popular among the SDE support quant set. It’s more like a scripting language for mapping and folding but with the speed of C. That’s what I would like for Chai except targeting the GPU.