Tuesday, June 23, 2009

CUDA programming in F#: Part 2, Accelerating Mandelbrot sample

In the previous post (CUDA programming in F#: Part 1), we talked about how to setup dev environment to get started with CUDA programming in F#. BitonicSort sample used there is a classical parallel sorting algorithm, with non trivial GPU kernel (taken from CUDA SDK) . Unfortunately, it's not good enough to evaluate benefits of GPU acceleration as it has execution time comparable to CPU.

This time, as I promised, we will see an example problem which can be solved on GPU with 10x speedup. This is a calculation of widely known Mandelbrot set.

I've added Mandelbrot sample to FsGPU project so you can get it there.

Mandelbrot sample

There is a fascinating Expert F# book which I highly recommend to read, especially if you are new to functional programming world. It contains a lot of real world samples, and Mandelbrot is one of them. Here is book web site where you can get source code for all the samples in the book. And here is Mandelbrot source code that I selected for acceleration on GPU.

Originally, Mandelbrot sample uses straightforward F# function to calculate whether given point belongs to mandelbrot set M or not

let sqrMod (x:complex) = x.r * x.r + x.i * x.i

let rec mandel maxit (z:Complex) (c: Complex) count =
if (sqrMod(z) < 4.0) && (count < maxit) then
mandel maxit ((z * z) + c) c (count + 1)
else count

> mandel 50 Complex.Zero Complex.One 0;;
val it : int = 2

> mandel 50 Complex.Zero Complex.OneI 0;;
val it : int = 50
This function calculates the number of iterations before we move outside closed disk of radius 2 around the origin. If we can reach maxit iterations, then c belongs to set M. The less value mandel function returns the more quickly we move from the set M which is represented by different colors.

The point is that this function is executed for each pixel on the screen. For 800x600 screen we have at least 800x600=480000 calculations and 800x600x50=24000000 when all screen pixels belong to M. If we want to implement real-time navigation and visualization we need to make this calculations as fast as possible.

Mandelbrot acceleration on GPU

If we look at mandel function above we notice that it can be executed for every pixel in parallel. This is ideal case for data parallel computations, so our implementation can scale up to thousands of independently executing parallel GPU threads.

That is how we need to change original sample:
  • Introduce a new function CalcMandelbrot like this
let threadsX, threadsY = 32,16  (* only 512 threads per block allowed *)
(* GPU execution code for mandelbrot calculation.
NOTE: This method is not thread safe, so you need to call it from
appropriate thread. (see CudaExecutor class) *)
let MandelbrotExecution(ctx, 
   (width : int, height : int, crunch : int, 
    xOff : float32, yOff: float32, scale : float32)) =
let exec = ctx.Execution

let num = width * height
let colors = Array.zeroCreate<int> num
exec.AddParameter("colors", colors, ParameterDirection.Out) |> ignore
exec.AddParameter("width", uint32(width)) |> ignore
exec.AddParameter("height", uint32(height)) |> ignore
exec.AddParameter("crunch", uint32(crunch)) |> ignore
exec.AddParameter("xOff", float32(xOff)) |> ignore
exec.AddParameter("yOff", float32(yOff)) |> ignore
exec.AddParameter("scale", float32(scale)) |> ignore

let blocksX, blocksY = iDivUp(width, threadsX), iDivUp(height, threadsY)
exec.Launch(blocksX, blocksY, threadsX, threadsY, 1) |> ignore
     exec.ReadData(colors, 0)

(* Calculate mandelbrot set for specified parameters using GPU kernel execution.
NOTE: this is thread-safe invoker of corresponding GPU execution *)
let CalcMandelbrot =
let executor = new CudaExecutor<_, int[]>(0, "Mandelbrot_sm10.cubin",
           "Mandelbrot0_sm10", MandelbrotExecution)
  • Implement this function using GPU kernel which I borrowed from CUDA SDK
template<class T>
__device__ inline int CalcMandelbrot(const T xPos, const T yPos, const int crunch)
T y = yPos;
T x = xPos;
T yy = y * y;
T xx = x * x;

int i = crunch;
while (--i && (xx + yy < T(4.0))) {
y = x * y * T(2.0) + yPos;
x = xx - yy + xPos;
yy = y * y;
xx = x * x;
return i;

// The Mandelbrot CUDA GPU kernel function
extern "C" __global__
void Mandelbrot0_sm10(int *dst, const int imageW, const int imageH, const int crunch,
const float xOff, const float yOff, const float scale)
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;

if ((ix < < imageW) && (iy < imageH)) // Calculate the location
const float xPos = xOff + (float)ix * scale;
const float yPos = yOff - (float)iy * scale;

// Calculate the Mandelbrot index for the current location
int m = CalcMandelbrot<float>(xPos, yPos, crunch);
m = m > 0 ? crunch - m : crunch;

// Output the pixel
int pixel = imageW * iy + ix;
dst[pixel] = m;
  • Integrate this function into UI menu and use it to render Bitmap object on the screen
let calcColors (bw:int) (bh:int) maxit map =
    let tl : Complex = map 0 0       // top left conner
let bl = map 0 (bh-1) // bottom left conner
let xOff, yOff = float32(tl.r), float32(tl.i)
let scale = (float32(tl.i) - float32(bl.i)) / float32(bh)
let num = bw * bh
CalcMandelbrot(bw, bh, maxit, xOff, yOff, scale)

let linearFillGPU (bw:int) (bh:int) maxit map =
let colors = calcColors bw bh maxit map
let rect = new Rectangle(0, 0, bw, bh);
lock bitmap (fun () ->
let mutable bmpData = null
bmpData <- bitmap.LockBits(rect,
let ptr = bmpData.Scan0
let bytes = bmpData.Stride * bh
let rgbValues = Array.create bytes 0uy
Marshal.Copy(ptr, rgbValues, 0, bytes)
for y = 0 to bh - 1 do
for x = 0 to bw - 1 do
let index = (y * bw + x)
let offset = index * 3
let c = colors.[index]
let color = pickColor maxit c
rgbValues.[offset] <- color.B
rgbValues.[offset + 1] <- color.G
rgbValues.[offset + 2] <- color.R

Marshal.Copy(rgbValues, 0, ptr, bytes)
You can compare Mandelbrot.fs file in my sample with original version from Expert F# to see all the changes.

Threading problem

One significant problem with CUDA interop should be mentioned here as it is closely relate to implementation of CalcMandelbrot sample above.
Here is how it is described in CUDA Programming Guide:

"... Several host threads can execute device code on the same device, but by design, a
host thread can execute device code on only one device. As a consequence, multiple
host threads are required to execute device code on multiple devices. Also, any
CUDA resources created through the runtime in one host thread cannot be used by
the runtime from another host thread. ..."

Thus if we need to call GPU kernels from multiple threads, we cannot do it directly, insteed we need to wrap actual GPU execution code with some thread safe wrapper. When we call wrapper from any our working thread it will provide that GPU code will be called from appropriate thread.
That is exactly what CudaExecutor helper type allow us to do.

Summing up

When you start Mandelbrot sample it runs two tests to compare GPU and CPU performance.
Here is the output on my hardware

Stress test mandelbrot calculation on GPU
20 times: 0 min, 0 sec, 335 ms
Stress test mandelbrot calculation on CPU
20 times: 0 min, 16 sec, 646 ms

Feel the difference!

Friday, May 15, 2009

CUDA programming in F#: Part 1, Getting started

If you develop computationaly intensive applications, you will probably benefit from massive parallelization on CUDA devices. CUDA is a technology that can be used to harness full power of NVIDIA GPU device(s) installed on usual computer (server or desktop) for general purpose computations. Many applications can be parallelized using data-parallel algorithms and executed on CUDA enabled devices with performance increased in order of magnitude.

In this post I'm going to give you a brief description of steps and links to get started with CUDA programming in F#.

Here is a list of NVIDIA devices that support CUDA technology. If you have any one, you can install SDK and develop your own applications. All the tools you need can be downloaded from nVidia site and installed on different platforms for free.

Programming language for CUDA is C (with simple extensions) and its knowledge is required to write CUDA applications. Many samples available in SDK use Visial C++ 2005 and 2008 solutions (for Windows platform).

Typical CUDA application is a combination of:
  1. GPU Kernels - functions written in extended C that execute on GPU devices performing computaionally intensive tasks. (See GPGPU)
  2. CPU code writen in C++ that calls GPU kernels through the CUDA API.
In very simple CUDA scenario, CPU code prepares some array of input values, then it calls one of the GPU kernels to process array elements in parallel on GPU and finally recieves results in output array. See CUDA documentation for details.

If you are .Net developer then you can write CPU code in any .Net language with native interop features. I used CUDA.NET (v2.1) - a third party library as a wrapper around CUDA 2.1 native API and F# language for CPU code.
Note: that CUDA.NET is free a library but it is not open source.

Installing prerequisite tools:
1) Download and install CUDA tools following the instructions on NVIDIA site.
2) Launch NVIDIA CUDA SDK Browser under Start menu
3) Search for "Device Query" sample and run it. You will see console window with your device properties.
4) Download and unpack CUDA.NET library which is simply zip file with binary, samples and docs.
5) Open one of the C# samples, say "bitonic"
6) Check that the project references CUDA.NET.dll
7) Open bitonic project properties and check that "Post-build command line:" contain following command:
C:\CUDA\bin\nvcc.exe "..\..\bitonic.cu" --cubin -ccbin "c:\Program Files\Microsoft Visual Studio 9.0\VC\bin"
8) Now you can build and run the sample to make sure that CUDA.NET works fine.

Running F# sample:
1) You need F# for VS2008
2) FsGPU project where I've created F# version of bitonic sample.
3) Open solution in VS2008 and follow steps 5)-8) to start.

Because F# is full featured .Net language you can consume CUDA.NET library exactly the same way you do it in C#.
My sample also contains some performance measurements and comparisons with CPU sorting.
You will see that excessive data transfer and GPU memory allocations can result in even worse execution times than CPU sorting.

Implementation constraints

While bitonic sort is a classical parallel sorting algoritm, its execution time (at least on my GPU hardware) is comparable to sequential CPU algorithm. This is due to some constraints of CUDA architecture on bitonic sort implementation:
  1. Maximum array size = 512 elements, it's a maximum number of parallel threads that can be run in one execution block. (see CUDA Programming Guide)
  2. All threads of a block are expected to reside on the same multiprocessor core. The number of threads per block is therefore restricted by the limited memory resources of a multiprocessor core. On current GPUs, a thread block may contain up to 512 threads
  3. Current implementation of algorithm doesn't support launching the kernel with many execution blocks, as it relies on barrier syncronization of threads.
  4. Thus we can utilize only one multiprocessor core of GPU and has limited parallelizm.
For this reasons this is not very good example to evaluate benefits or GPU computations.
Next time I'll show an example where these benefits are obvious.