Tuesday, June 23, 2009

CUDA programming in F#: Part 2, Accelerating Mandelbrot sample

In the previous post (CUDA programming in F#: Part 1), we talked about how to setup dev environment to get started with CUDA programming in F#. BitonicSort sample used there is a classical parallel sorting algorithm, with non trivial GPU kernel (taken from CUDA SDK) . Unfortunately, it's not good enough to evaluate benefits of GPU acceleration as it has execution time comparable to CPU.

This time, as I promised, we will see an example problem which can be solved on GPU with 10x speedup. This is a calculation of widely known Mandelbrot set.

I've added Mandelbrot sample to FsGPU project so you can get it there.

Mandelbrot sample

There is a fascinating Expert F# book which I highly recommend to read, especially if you are new to functional programming world. It contains a lot of real world samples, and Mandelbrot is one of them. Here is book web site where you can get source code for all the samples in the book. And here is Mandelbrot source code that I selected for acceleration on GPU.

Originally, Mandelbrot sample uses straightforward F# function to calculate whether given point belongs to mandelbrot set M or not

let sqrMod (x:complex) = x.r * x.r + x.i * x.i

let rec mandel maxit (z:Complex) (c: Complex) count =
if (sqrMod(z) < 4.0) && (count < maxit) then
mandel maxit ((z * z) + c) c (count + 1)
else count

> mandel 50 Complex.Zero Complex.One 0;;
val it : int = 2

> mandel 50 Complex.Zero Complex.OneI 0;;
val it : int = 50
This function calculates the number of iterations before we move outside closed disk of radius 2 around the origin. If we can reach maxit iterations, then c belongs to set M. The less value mandel function returns the more quickly we move from the set M which is represented by different colors.

The point is that this function is executed for each pixel on the screen. For 800x600 screen we have at least 800x600=480000 calculations and 800x600x50=24000000 when all screen pixels belong to M. If we want to implement real-time navigation and visualization we need to make this calculations as fast as possible.

Mandelbrot acceleration on GPU

If we look at mandel function above we notice that it can be executed for every pixel in parallel. This is ideal case for data parallel computations, so our implementation can scale up to thousands of independently executing parallel GPU threads.

That is how we need to change original sample:
  • Introduce a new function CalcMandelbrot like this
let threadsX, threadsY = 32,16  (* only 512 threads per block allowed *)
(* GPU execution code for mandelbrot calculation.
NOTE: This method is not thread safe, so you need to call it from
appropriate thread. (see CudaExecutor class) *)
let MandelbrotExecution(ctx, 
   (width : int, height : int, crunch : int, 
    xOff : float32, yOff: float32, scale : float32)) =
let exec = ctx.Execution

let num = width * height
let colors = Array.zeroCreate<int> num
exec.AddParameter("colors", colors, ParameterDirection.Out) |> ignore
exec.AddParameter("width", uint32(width)) |> ignore
exec.AddParameter("height", uint32(height)) |> ignore
exec.AddParameter("crunch", uint32(crunch)) |> ignore
exec.AddParameter("xOff", float32(xOff)) |> ignore
exec.AddParameter("yOff", float32(yOff)) |> ignore
exec.AddParameter("scale", float32(scale)) |> ignore

let blocksX, blocksY = iDivUp(width, threadsX), iDivUp(height, threadsY)
exec.Launch(blocksX, blocksY, threadsX, threadsY, 1) |> ignore
     exec.ReadData(colors, 0)

(* Calculate mandelbrot set for specified parameters using GPU kernel execution.
NOTE: this is thread-safe invoker of corresponding GPU execution *)
let CalcMandelbrot =
let executor = new CudaExecutor<_, int[]>(0, "Mandelbrot_sm10.cubin",
           "Mandelbrot0_sm10", MandelbrotExecution)
  • Implement this function using GPU kernel which I borrowed from CUDA SDK
template<class T>
__device__ inline int CalcMandelbrot(const T xPos, const T yPos, const int crunch)
T y = yPos;
T x = xPos;
T yy = y * y;
T xx = x * x;

int i = crunch;
while (--i && (xx + yy < T(4.0))) {
y = x * y * T(2.0) + yPos;
x = xx - yy + xPos;
yy = y * y;
xx = x * x;
return i;

// The Mandelbrot CUDA GPU kernel function
extern "C" __global__
void Mandelbrot0_sm10(int *dst, const int imageW, const int imageH, const int crunch,
const float xOff, const float yOff, const float scale)
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;

if ((ix < < imageW) && (iy < imageH)) // Calculate the location
const float xPos = xOff + (float)ix * scale;
const float yPos = yOff - (float)iy * scale;

// Calculate the Mandelbrot index for the current location
int m = CalcMandelbrot<float>(xPos, yPos, crunch);
m = m > 0 ? crunch - m : crunch;

// Output the pixel
int pixel = imageW * iy + ix;
dst[pixel] = m;
  • Integrate this function into UI menu and use it to render Bitmap object on the screen
let calcColors (bw:int) (bh:int) maxit map =
    let tl : Complex = map 0 0       // top left conner
let bl = map 0 (bh-1) // bottom left conner
let xOff, yOff = float32(tl.r), float32(tl.i)
let scale = (float32(tl.i) - float32(bl.i)) / float32(bh)
let num = bw * bh
CalcMandelbrot(bw, bh, maxit, xOff, yOff, scale)

let linearFillGPU (bw:int) (bh:int) maxit map =
let colors = calcColors bw bh maxit map
let rect = new Rectangle(0, 0, bw, bh);
lock bitmap (fun () ->
let mutable bmpData = null
bmpData <- bitmap.LockBits(rect,
let ptr = bmpData.Scan0
let bytes = bmpData.Stride * bh
let rgbValues = Array.create bytes 0uy
Marshal.Copy(ptr, rgbValues, 0, bytes)
for y = 0 to bh - 1 do
for x = 0 to bw - 1 do
let index = (y * bw + x)
let offset = index * 3
let c = colors.[index]
let color = pickColor maxit c
rgbValues.[offset] <- color.B
rgbValues.[offset + 1] <- color.G
rgbValues.[offset + 2] <- color.R

Marshal.Copy(rgbValues, 0, ptr, bytes)
You can compare Mandelbrot.fs file in my sample with original version from Expert F# to see all the changes.

Threading problem

One significant problem with CUDA interop should be mentioned here as it is closely relate to implementation of CalcMandelbrot sample above.
Here is how it is described in CUDA Programming Guide:

"... Several host threads can execute device code on the same device, but by design, a
host thread can execute device code on only one device. As a consequence, multiple
host threads are required to execute device code on multiple devices. Also, any
CUDA resources created through the runtime in one host thread cannot be used by
the runtime from another host thread. ..."

Thus if we need to call GPU kernels from multiple threads, we cannot do it directly, insteed we need to wrap actual GPU execution code with some thread safe wrapper. When we call wrapper from any our working thread it will provide that GPU code will be called from appropriate thread.
That is exactly what CudaExecutor helper type allow us to do.

Summing up

When you start Mandelbrot sample it runs two tests to compare GPU and CPU performance.
Here is the output on my hardware

Stress test mandelbrot calculation on GPU
20 times: 0 min, 0 sec, 335 ms
Stress test mandelbrot calculation on CPU
20 times: 0 min, 16 sec, 646 ms

Feel the difference!