This time, as I promised, we will see an example problem which can be solved on GPU with 10x speedup. This is a calculation of widely known Mandelbrot set.

I've added Mandelbrot sample to FsGPU project so you can get it there.

**Mandelbrot sample**

There is a fascinating Expert F# book which I highly recommend to read, especially if you are new to functional programming world. It contains a lot of real world samples, and Mandelbrot is one of them. Here is book web site where you can get source code for all the samples in the book. And here is Mandelbrot source code that I selected for acceleration on GPU.

Originally, Mandelbrot sample uses straightforward F# function to calculate whether given point belongs to mandelbrot set M or not

**let** sqrMod **(**x:complex**)** **=** x.r ***** x.r + x.i ***** x.i

**let** **rec** mandel maxit **(**z:Complex**)** **(**c: Complex**)** count **=**

** if** **(**sqrMod**(**z**)** **<** 4.0**)** **&&** **(**count **<** maxit**)** **then**

mandel maxit **((**z ***** z**)** + c**)** c **(**count + 1**)**

** else** count

**>** mandel 50 Complex.Zero Complex.One 0**;;**

**val** it : **int** **=** 2

**>** mandel 50 Complex.Zero Complex.OneI 0**;;**

**val** it : **int** **=** 50

This function calculates the number of iterations before we move outside closed disk of radius 2 around the origin. If we can reach *iterations, then*

**maxit***belongs to set M. The less value*

**c**

**mandel**function returns the more quickly we move from the set M which is represented by different colors.The point is that this function is executed for each pixel on the screen. For 800x600 screen we have at least 800x600=480000 calculations and 800x600x50=24000000 when all screen pixels belong to M. If we want to implement real-time navigation and visualization we need to make this calculations as fast as possible.

**Mandelbrot acceleration on GPU**

If we look at

*function above we notice that it can be executed for every pixel in parallel. This is ideal case for data parallel computations, so our implementation can scale up to thousands of independently executing parallel GPU threads.***mandel**That is how we need to change original sample:

- Introduce a new function CalcMandelbrot like this

**let** threadsX, threadsY **=** 32,16 (* only 512 threads per block allowed *)

`(* GPU execution code for mandelbrot calculation.`

NOTE: This method is not thread safe, so you need to call it from

appropriate thread. (see CudaExecutor class) *)

**let** MandelbrotExecution**(**ctx,

** (**width : **int**, height : **int**, crunch : **int**,

` xOff : float32, yOff: float32, scale : float32`**)) =**

** let** exec **=** ctx.Execution

** let** num **=** width ***** height

** let** colors **=** Array.zeroCreate<**int>** num

exec.AddParameter**(**"colors", colors, ParameterDirection.Out**)** **|>** ignore

exec.AddParameter**(**"width", uint32**(**width**))** **|>** ignore

exec.AddParameter**(**"height", uint32**(**height**))** **|>** ignore

exec.AddParameter**(**"crunch", uint32**(**crunch**))** **|>** ignore

exec.AddParameter**(**"xOff", float32**(**xOff**))** **|>** ignore

exec.AddParameter**(**"yOff", float32**(**yOff**))** **|>** ignore

exec.AddParameter**(**"scale", float32**(**scale**))** **|>** ignore

** let** blocksX, blocksY **=** iDivUp**(**width, threadsX**)**, iDivUp**(**height, threadsY**)**

exec.Launch**(**blocksX, blocksY, threadsX, threadsY, 1**)** **|****>** ignore

` exec.ReadData`**(**colors, 0**)**

colors

(* Calculate mandelbrot set for specified parameters using GPU kernel execution.

NOTE: this is thread-safe invoker of corresponding GPU execution *)

**let** CalcMandelbrot **=**

** let** executor **=** **new** CudaExecutor**<****_**, **int**[]**>****(**0, "Mandelbrot_sm10.cubin",

` "Mandelbrot0_sm10", MandelbrotExecution`**)**

executor.GetInvoker()

- Implement this function using GPU kernel which I borrowed from CUDA SDK

template<classT>

__device__inlineintCalcMandelbrot(constT xPos,constT yPos,constintcrunch)

{

T y = yPos;

T x = xPos;

T yy = y * y;

T xx = x * x;inti = crunch;while(--i && (xx + yy < T(4.0))) {

y = x * y * T(2.0) + yPos;

x = xx - yy + xPos;

yy = y * y;

xx = x * x;

}returni;

}

// The Mandelbrot CUDA GPU kernel functionextern"C" __global__

voidMandelbrot0_sm10(int*dst,constintimageW,constintimageH,constintcrunch,

constfloatxOff,constfloatyOff,constfloatscale)

{constintix = blockDim.x * blockIdx.x + threadIdx.x;constintiy = blockDim.y * blockIdx.y + threadIdx.y;if((ix < < imageW) && (iy < imageH)) // Calculate the location

constfloatxPos = xOff + (float)ix * scale;

constfloatyPos = yOff - (float)iy * scale;

// Calculate the Mandelbrot index for the current location

intm = CalcMandelbrot<float>(xPos, yPos, crunch);

m = m > 0 ? crunch - m : crunch;

// Output the pixel

intpixel = imageW * iy + ix;

dst[pixel] = m;

}

}

- Integrate this function into UI menu and use it to render Bitmap object on the screen

**let**calcColors

**(**bw:

**int**

**)**

**(**bh:

**int**

**)**maxit map

**=**

` `**let** tl : Complex **=** map 0 0 // top left conner

**let** bl **=** map 0 **(**bh-1**)** // bottom left conner

**let** xOff, yOff **=** float32**(**tl.r**)**, float32**(**tl.i**)**

**let** scale **=** **(**float32**(**tl.i**)** - float32**(**bl.i**))** / float32**(**bh**)**

**let** num **=** bw ***** bh

CalcMandelbrot**(**bw, bh, maxit, xOff, yOff, scale**)**

**let** linearFillGPU **(**bw:**int****)** **(**bh:**int****)** maxit map **=**

**let** colors **=** calcColors bw bh maxit map

**let** rect **=** **new** Rectangle**(**0, 0, bw, bh**)****;**

lock bitmap **(fun** () **->**

**let** **mutable** bmpData **=** null

**try**

bmpData **<-** bitmap.LockBits**(**rect,

` System.Drawing.Imaging.ImageLockMode.ReadWrite, `

` Imaging.PixelFormat.Format24bppRgb`**)****;**

**let** ptr **=** bmpData.Scan0

**let** bytes **=** bmpData.Stride ***** bh

**let** rgbValues **=** Array.create bytes 0uy

Marshal.Copy**(**ptr, rgbValues, 0, bytes**)**

**for** y **=** 0 **to** bh - 1 **do**

**for** x **=** 0 **to** bw - 1 **do**

**let** index **=** **(**y ***** bw + x**)**

**let** offset **=** index ***** 3

**let** c **=** colors.**[**index**]**

**let** color **=** pickColor maxit c

rgbValues.**[**offset**]** **<-** color.B

rgbValues.**[**offset + 1**]** **<-** color.G

rgbValues.**[**offset + 2**]** **<-** color.R

Marshal.Copy**(**rgbValues, 0, ptr, bytes**)**

finally

bitmap.UnlockBits**(**bmpData**)**

form.Invalidate()

)

You can compare Mandelbrot.fs file in my sample with original version from Expert F# to see all the changes.

**Threading problem**

One significant problem with CUDA interop should be mentioned here as it is closely relate to implementation of CalcMandelbrot sample above.

Here is how it is described in CUDA Programming Guide:

"... Several host threads can execute device code on the same device, but by design, a

host thread can execute device code on only one device. As a consequence, multiple

host threads are required to execute device code on multiple devices. Also, any

CUDA resources created through the runtime in one host thread cannot be used by

the runtime from another host thread. ..."

host thread can execute device code on only one device. As a consequence, multiple

host threads are required to execute device code on multiple devices. Also, any

CUDA resources created through the runtime in one host thread cannot be used by

the runtime from another host thread. ..."

Thus if we need to call GPU kernels from multiple threads, we cannot do it directly, insteed we need to wrap actual GPU execution code with some thread safe wrapper. When we call wrapper from any our working thread it will provide that GPU code will be called from appropriate thread.

That is exactly what CudaExecutor helper type allow us to do.

**Summing up**

When you start Mandelbrot sample it runs two tests to compare GPU and CPU performance.

Here is the output on my hardware

Stress test mandelbrot calculation on GPU

20 times: 0 min, 0 sec, 335 ms

Stress test mandelbrot calculation on CPU

20 times: 0 min, 16 sec, 646 ms

Feel the difference!

Very cool project.

ReplyDeleteAre you planning to attend the GPU Technology Conference in San Jose, CA? (www.nvidia.com/gtc)

- wramey (at) nvidia.com