
John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Things I have learnt about porting algorithms to GPUs (using CUDA)

I’ve recently ported one of my algorithms onto a GPU using CUDA. Here are some things I’ve learnt about the process (geared towards an algorithm dealing with genomic data).

Firstly, the documentation that helped me most:

I started off following the ‘even easier introduction to CUDA’ guide to get a basic version of my algorithm working. The overall workflow was:

Things I have learnt about using R (as a python/C++ programmer)

  • For loops are ok. I thought these were impossibly slow in R, and you always had to vectorise code – but not so. A common error is to try to append to arrays inside the loop. It is always best to preallocate an array/list with result <- rep(NA, length) or similar, and then write to it by index.
  • Strings aren’t the easiest thing to deal with, but stringr might help, and regular expressions are built in. Use paste0 to concatenate.
  • That being said, if you can easily vectorise code you should do so.
  • It’s really easy to set a function as a variable and pass these around, bind its arguments (a partial in python) etc. Use this feature!
  • A matrix has to contain a single data type, so if you have different data types and convert to a matrix, they will all be silently converted to a compatible type.
  • Use %*% for matrix multiplication.
  • Use %in% for set membership.
  • Use array[array$column == value, ] or similar, for selecting values from a data.frame.
  • Use a single & or | when combining conditions in the above. The double && and || are meant for single values (as in an if statement) and will short circuit.
  • Each data.frame column must have a single type. If you want to mix types within a column you’ll likely need a list.
  • Don’t use apply on a data.frame, as the first step is to convert to a matrix, making all your types the same. Use vapply, which defines the expected return type.
  • R will silently accept partial matches for argument names, for list elements accessed with $, and for attribute names (!?!). Add options(warnPartialMatchAttr=TRUE, warnPartialMatchDollar=TRUE, warnPartialMatchArgs=TRUE) to ~/.Rprofile to get a warning whenever this strange default kicks in.
  • A numeric (float) and an integer are different types. Use 1L to make an integer ‘1’.
  • furrr is a nice library for parallelisation.
  • In general, tidyverse packages offer good alternatives to many data science functions in base R.
  • Extremely basic OOP is available using ‘S3’ objects (pretty much inheritance, and overriding of some typical functions such as print, summary based on type). ‘S4’ is to be avoided, apparently. ‘R6’ gives you more typical features.
  • Use the devtools package for your development, as it automates most building and testing of the code.
  • RStudio has a nice built-in profiler.
  • R has a good FFI with C and can automatically sort out compiling for you. C++ is also possible with Rcpp, but I believe it is a little more involved.
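
As a rough sketch of a few of the points above, here is some base R that should run as-is. The data frame, its column names and the small helper functions are all made up for illustration; they are not from any real analysis.

```r
# Preallocate the result and write to it by index, rather than appending.
n <- 5
squares <- rep(NA_real_, n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}

# vapply states the expected type and length of each return value,
# and fails loudly if a result does not match.
name_lengths <- vapply(c("a", "bb", "ccc"), nchar, integer(1),
                       USE.NAMES = FALSE)

# paste0 concatenates strings without a separator.
labels <- paste0("sample_", 1:2)

# Functions are ordinary values: binding an argument is just a wrapper.
# (add and add_ten are hypothetical names.)
add <- function(x, y) x + y
add_ten <- function(x) add(x, 10)

# A hypothetical data.frame: each column has its own single type.
df <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  serotype = c("19A", "23F", "19A", "6B"),
  coverage = c(30L, 12L, 45L, 50L),
  stringsAsFactors = FALSE
)

# %in% tests set membership; a single & combines conditions element-wise.
selected <- df[df$serotype %in% c("19A", "6B") & df$coverage > 40L, ]

# Converting to a matrix silently coerces every column to one type.
m <- as.matrix(df)
typeof(m)        # "character"

# A plain 1 is a numeric (double); 1L is an integer.
is.integer(1)    # FALSE
is.integer(1L)   # TRUE
```

Here selected keeps only rows s3 and s4, and the coverage column of m has become character along with everything else, which is the kind of silent conversion that apply would also trigger on a data.frame.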

Bot or not? Anti-research/University accounts on twitter

tl;dr I think it’s really hard to tell whether a twitter account is run by who it looks like on first glance, and what its intentions are. Maybe we should try and take more care before amplifying opinions of unknown actors.

Recently, when scrolling through my twitter feed, I noticed some fairly aggressive tweets from the account @Help4StudentsUK, criticising Wellcome and some of its management. A typical example:

Some thoughts on bioinformatics software maintenance

I released my first software package for bioinformatics about four years ago. I now have four, all of which see some usage, but certainly nothing like the heavy usage of the most popular utilities. Despite this, I would guess on average I spend around 20-30% of my time maintaining these packages.

I love that people find our software of some use, and it’s still exciting getting messages from users from countries all around the world. I want to maintain and improve our software, and help people use it as much as I can. After all, one of the great things about working at a University is that everything I work on goes into the public domain, rather than being kept closed-source.

Plots that went wrong

There are loads of these on https://twitter.com/accidental__aRt. Here are some of mine from the past couple of years:

/images/intergenic_tchal24.png
The beach?
/images/boxplot_and_violin_idiocy.png
You can never have too many violins
/images/likelihood_surface.png
A nice smooth likelihood surface
/images/nice_contours-1024x745.png
Always good to see the zero contour
/images/terrible_dendrogram.png
‘terrible_dendrogram.png’
/images/isogenic_tau.png
Apply some smoothing then I’m sure it will be fine, right?
/images/regression_separate.png
R² = 0.11, p < 10E-10
/images/score_contours.png
1) Let’s cram everything into 1/10 of the space. 2) The ever so informative labels: ‘1’, ‘2’, ‘3’, …
/images/SPARC_twoComponentBGMM-1024x910.png
Machine learning (or would we call this AI now?). Bonus: the large white bit at the bottom which I couldn’t get rid of
/images/perfect_separation.png
Pong
/images/classification_error_0.png
Cheap Rothko?
/images/classification_error.png
Sheet metal

Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

This is a summary of our paper on a joint pathogen and human GWAS that has just been published in Nature Communications: https://doi.org/10.1038/s41467-019-09976-3

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!

We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.