Things I have learnt about porting algorithms to GPUs (using CUDA)

2020-06-08 1466 words 7 minutes

Contents

I’ve recently ported one of my algorithms onto a GPU using CUDA. Here are some things I’ve learnt about the process (geared towards an algorithm dealing with genomic data).

Firstly, the documentation that helped me most:

Getting started: https://devblogs.nvidia.com/even-easier-introduction-cuda/ https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
Understanding device memory: https://devblogs.nvidia.com/unified-memory-cuda-beginners/ https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
Putting it all together: https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
Optimising your own code: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Start small, add complexity in slowly

I started off following the ’even easier introduction to cuda’ guide to get a basic version of my algorithm working. The overall workflow was:

Flatten my data structures down 1D arrays, storing their strides as a struct.
Copy the data (input and output arrays) to the device memory.
Write a grid-stride loop which calls the kernel function (in my case, distance between genomes) on the right part of the data, using the thread and block index.
Make a __device__ version of the kernel function, using the documentation to convert function calls as necessary.
Copy the results now in the output array from device back to the host.

Overall, this isn’t too difficult to start with, especially if you can mostly follow the guide. If you have a more complex kernel function, consider a simple algorithm to get started with.

Managing your own flattening and striding of data is not too hard, but you can look at standard matrix libraries to get an idea of what this means (e.g. Eigen, Armadillo, Numpy). For memory copies I opted to use vectors from thrust rather than arrays, as it feels much more like C++11 rather than C (and you don’t need to remember your free() calls). Just use a raw_pointer_cast(&vec[0]) when you pass it to the kernel, where it will need to become a pointer.

Mainly, it’s useful at this stage just to get some code working and testable on your GPU. It will be a lot easier to iron out compiler or basic memory issues now. For example, with my own code I knew that at some point loading everything into the device memory would fail as it would exceed the amount available. Dealing with this was much more involved, and it was the right choice to leave this until the end.

Sort out your compiler workflow early on

CUDA code (files ending .cu, which contain the triple bracket «<»>, or global / device / host decorators) needs to be compiled with the nvidia compiler (nvcc). This compiler can in theory be used to compile your C++ code too, but in my experience is likely to run into issues with more complex libraries (e.g. Eigen). It is possible to compile your .cpp files with a standard C++ compiler such as g++ and your .cu files with nvcc, and link everything together at the end. For any larger project with a non-trivial amount of CPU code I would recommend this. My tips here are:

Separate your CPU C++ code and GPU CUDA code into files with appropriate extensions, and compile them with g++/nvcc respectively.
CMake has excellent support for CUDA, and allows you to turn it on/off depending on whether nvcc is available on the system. It will also deal with linking for you. There’s a guide here: https://devblogs.nvidia.com/building-cuda-applications-cmake/
If CMake isn’t working for you (I had this issue with conda) you’ll want to write your own Makefile. The key thing you need to know is that at the end you need to link twice, first with nvcc (with the -dlink flag), then as normal with g++. Enable position independent code in nvcc with -Xcompiler -fPIC --relocatable-device-code=true .
I found it useful to statically link the nvidia libraries with --cudart static and -lcudadevrt -lcudart_static.
Use -gencode arch=compute_35,code=compute_35 -gencode arch=compute_50,code=sm_50 flags to compile code for different GPU device versions all in one go. My card was newer (v7.5), but I also added backward compatibility this way.
If you are planning to release your code on anaconda, conda-forge has support for CUDA, though it may be difficult to get to work. You can test your code by changing the CI config files to pull the CUDA image, then change these back before merging. They will be automatically activated on your feedstock.

Work out a way to test your results

Adding printf() calls into kernels is an easy way to investigate your code, and will synchronise threads making concurrency easier to follow.
Setting both the blockSize and blockCount to 1 is really helpful if your kernel isn’t working at all.
Compare results to their CPU algorithm, or at least a partial version of this.

Consider optimisations earlier than you would usually

There’s a famous Donald Knuth quote about premature optimisation which I ended up violating (as I often do, but this time I think with more justification). Once your basic algorithm is working, it’s worth looking through the best practises guide to see if you can speed it up.

Try to think of the kernel in SIMD terms, such that adjacent runs (in the same warp) will execute each line at the same time, and in doing so whether that introduces significant overhead.

In particular:

Be careful with branches. You can use if statements or loops of variable length, but if another thread doesn’t follow the same path it will block until the conditional is over. So writing two big blocks of an if statement can be as slow as if you had to execute both in their entirety every time.
Make sure your accesses to global device memory are coalesced. This has a huge effect on performance, as memory access takes ~100x cycles and blocks until complete. You need to have adjacent threads accessing adjacent memory addresses so they can fit on the bus together, or be broadcast to all threads. In my case, this meant picking my strides of flattened data correctly. Adjacent threads dealt with adjacent samples, so the sample stride needed to be 1 (think of this like row-major vs column-major in 2D). For more details, see https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c-aa__optimized-version-coalesced-reads-global-memory
Take a look at the range of blockSizes from 32-512 in steps of 32. In some cases, this can make a big difference.
Use floats rather than doubles. This doesn’t make a difference with CPU code, but is more efficient on a GPU (especially with global memory accesses).

Understand how the device memory is organised

Most likely, you’ll be working closer to the hardware than you normally do, and need to pay more attention to how you organise your code and memory, making sure it matches the underlying architecture. You have three memory areas to use, in descending order of preference:

Registers - very fast access to local variables, which can be used for intermediate results of computation each thread. ~1000 available I think.
__shared__ memory. Each processing unit (which runs a warp of 32 threads) has 48kiB of on-chip memory which is fast to access, and shared between threads. If you have small pieces of global memory which are shared between all threads in a warp/block, it is much faster to read them into here, have threads operate on this memory, and then write back to global memory. You need to use the third argument of the <<<>>> launch to specify the amount of size needed in bytes for your shared array. By default, some of this is used as L1 cache (storing the most recent access to global memory). You can manually control the amount used as cache and the amount used for your own arrays, and even make this be controlled by function.
Global memory. A large amount of RAM on the device which can be accessed by all threads (my card has 11Gb, but the top end devices have 24 or even 40Gb). Typically you will load into and out of this memory from the host RAM at the beginning and end of your computation, as this is a slow operation. Note that threads accessing global memory is also slow (as noted above), so you wish to lay out your memory so that multiple threads can have their values loaded at each call, otherwise you’ll end up with 32 threads waiting 100 cycles each, all blocking.

Other notes

__device__ functions are automatically inlined. This may affect similar function prototypes in host code.
As with CPU threads, you should use atomic functions where threads may want to read/write from the same address in global memory.
Use the CUDA math functions: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html#group__CUDA__MATH__SINGLE
CUDA-GDB is available should you need a proper debugger.

Examples

Examples are less widely available than for many programming languages. You may find my attempt useful reference: https://github.com/johnlees/pp-sketchlib

CMakeLists.txt – combining CUDA and C++ compilers automatically.
src/Makefile – combining CUDA and C++ compilers manually.
src/dists.cu – CUDA code implementing my algorithm following the principles above.
src/gpu_api.cpp – C++ code which manages data flow between the CUDA functions and the rest of the C++ code.