Friday, May 15, 2009

CUDA programming in F#: Part 1, Getting started


If you develop computationally intensive applications, you will probably benefit from massive parallelization on CUDA devices. CUDA is a technology that can be used to harness the full power of the NVIDIA GPU device(s) installed in an ordinary computer (server or desktop) for general-purpose computations. Many applications can be parallelized using data-parallel algorithms and executed on CUDA-enabled devices with performance gains of an order of magnitude.

In this post I'm going to give you a brief description of steps and links to get started with CUDA programming in F#.

Here is a list of NVIDIA devices that support CUDA technology. If you have one of them, you can install the SDK and develop your own applications. All the tools you need can be downloaded from the NVIDIA site and installed on different platforms for free.

The programming language for CUDA is C (with simple extensions), and knowledge of it is required to write CUDA applications. Many samples available in the SDK use Visual C++ 2005 and 2008 solutions (for the Windows platform).

A typical CUDA application is a combination of:
  1. GPU kernels - functions written in extended C that execute on GPU devices, performing computationally intensive tasks. (See GPGPU)
  2. CPU code written in C++ that calls the GPU kernels through the CUDA API.
In a very simple CUDA scenario, the CPU code prepares an array of input values, then calls one of the GPU kernels to process the array elements in parallel on the GPU, and finally receives the results in an output array. See the CUDA documentation for details.

If you are a .NET developer, you can write the CPU code in any .NET language with native interop features. I used CUDA.NET (v2.1) - a third-party library that wraps the CUDA 2.1 native API - and the F# language for the CPU code.
Note that CUDA.NET is a free library, but it is not open source.

Installing prerequisite tools:
1) Download and install CUDA tools following the instructions on NVIDIA site.
2) Launch the NVIDIA CUDA SDK Browser from the Start menu
3) Search for the "Device Query" sample and run it. You will see a console window with your device properties.
4) Download and unpack the CUDA.NET library, which is simply a zip file with the binary, samples, and docs.
5) Open one of the C# samples, say "bitonic"
6) Check that the project references CUDA.NET.dll
7) Open the bitonic project properties and check that "Post-build command line:" contains the following command:
C:\CUDA\bin\nvcc.exe "..\..\bitonic.cu" --cubin -ccbin "c:\Program Files\Microsoft Visual Studio 9.0\VC\bin"
8) Now you can build and run the sample to make sure that CUDA.NET works fine.

Running F# sample:
1) Install F# for VS2008
2) Download the FsGPU project, where I've created an F# version of the bitonic sample
3) Open the solution in VS2008 and follow steps 5)-8) above to start

Because F# is a full-featured .NET language, you can consume the CUDA.NET library exactly the same way you do in C#.
My sample also contains some performance measurements and comparisons with CPU sorting.
You will see that excessive data transfers and GPU memory allocations can result in execution times even worse than those of CPU sorting.

Implementation constraints

While bitonic sort is a classical parallel sorting algorithm, its execution time (at least on my GPU hardware) is comparable to that of a sequential CPU algorithm. This is due to some constraints the CUDA architecture places on the bitonic sort implementation:
  1. The maximum array size is 512 elements, the maximum number of parallel threads that can run in one execution block. (See the CUDA Programming Guide.)
  2. All threads of a block are expected to reside on the same multiprocessor core. The number of threads per block is therefore restricted by the limited memory resources of a multiprocessor core. On current GPUs, a thread block may contain up to 512 threads.
  3. The current implementation of the algorithm doesn't support launching the kernel with multiple execution blocks, as it relies on barrier synchronization of threads.
  4. Thus we can utilize only one multiprocessor core of the GPU and have limited parallelism.
For these reasons this is not a very good example for evaluating the benefits of GPU computations.
Next time I'll show an example where these benefits are obvious.