Things I have learnt about porting algorithms to GPUs (using CUDA)
I’ve recently ported one of my algorithms onto a GPU using CUDA. Here are some things I’ve learnt about the process (geared towards an algorithm dealing with genomic data).
Firstly, the documentation that helped me most:
- Getting started: https://devblogs.nvidia.com/even-easier-introduction-cuda/ https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
- Understanding device memory: https://devblogs.nvidia.com/unified-memory-cuda-beginners/ https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
- Putting it all together: https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
- Optimising your own code: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
Start small, add complexity slowly
I started off following the ‘Even Easier Introduction to CUDA’ guide to get a basic version of my algorithm working. The overall workflow was much like the guide’s: allocate unified memory, initialise it on the host, launch the kernel over the data, synchronise, and check the results.
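A minimal sketch of that pattern, based on the guide’s vector-add example (the kernel, sizes, and names here are illustrative, not the genomic algorithm itself):

```cuda
#include <iostream>
#include <cmath>

// Kernel: each thread processes a strided subset of the elements,
// so the same code works for any launch configuration.
__global__ void add(int n, const float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main()
{
    int N = 1 << 20;
    float *x, *y;

    // Allocate unified memory, accessible from both the CPU and the GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // Initialise the inputs on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch enough blocks of 256 threads to cover all N elements
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for the GPU to finish before reading the results on the host
    cudaDeviceSynchronize();

    // Every element should now be 3.0f
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Compiled with `nvcc add.cu -o add`, something this small is enough to confirm the toolchain works before swapping in real kernels and data.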