John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Thoughts on 'Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species'

I was happy to see that this paper, which originally appeared as a preprint back in April 2019 (!), was published earlier this month. I thought it was one of the most thought-provoking papers I’ve read recently, so I suggested a journal club on the final version (it’s a long paper – over 80 pages). There were some parts that I liked a lot, and some parts I didn’t like, which I wanted to summarise here.

Things I have learnt about porting algorithms to GPUs (using CUDA)

I’ve recently ported one of my algorithms onto a GPU using CUDA. Here are some things I’ve learnt about the process (geared towards an algorithm dealing with genomic data).

Firstly, the documentation that helped me most:

Getting started:
https://devblogs.nvidia.com/even-easier-introduction-cuda/
https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/

Understanding device memory:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
https://devblogs.nvidia.com/using-shared-memory-cuda-cc/

Putting it all together:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

Optimising your own code:
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Start small, add complexity in slowly

I started off following the ‘even easier introduction to cuda’ guide to get a basic version of my algorithm working.

Things I have learnt about using R (as a python/C++ programmer)

For loops are ok. I thought these were impossibly slow in R, and you always had to vectorise code – but not so. A common error is to try and append to arrays. It is always best to preallocate an array/list with result <- rep(NA, length) or similar, and then write to it by index. Strings aren’t the easiest thing to deal with, but stringr might help, and there is a regular expression package built in.
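As a rough illustration of the preallocation tip, here is a minimal sketch in base R. The names `n`, `grown` and `result`, and the squaring calculation, are made up for the example; it only contrasts the two patterns and makes no claim about how large the speed difference is in practice.

```r
n <- 1000  # hypothetical number of results

# Pattern to avoid: growing a vector by appending.
# c() builds a new, longer vector on every iteration.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preferred: preallocate the full length, then write by index.
result <- rep(NA_real_, n)
for (i in seq_len(n)) {
  result[i] <- i^2
}

# Both approaches give the same values
stopifnot(identical(grown, result))
```

Filling with NA also has the side benefit that any slot the loop never wrote to is easy to spot afterwards with anyNA(result).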

Bot or not? Anti-research/University accounts on twitter

tl;dr: I think it’s really hard to tell whether a twitter account is run by who it appears to be at first glance, and what its intentions are. Maybe we should try and take more care before amplifying opinions of unknown actors.

Just another angry tweet

Recently, when scrolling through my twitter feed, I noticed some fairly aggressive tweets from the account @Help4StudentsUK, criticising Wellcome and some of its management. A typical example:

Some thoughts on bioinformatics software maintenance

Overall thoughts

I released my first software package for bioinformatics about four years ago. I now have four, all of which see some usage, but certainly nothing like the heavy usage of the most popular utilities. Despite this, I would guess on average I spend around 20-30% of my time maintaining these packages. I love that people find our software of some use, and it’s still exciting getting messages from users from countries all around the world.

Plots that went wrong

There are loads of these on https://twitter.com/accidental__aRt; here are some of mine from the past couple of years:

The beach?
You can never have too many violins
A nice smooth likelihood surface
Always good to see the zero contour
‘terrible_dendrogram.png’
Apply some smoothing then I’m sure it will be fine, right?
R2 = 0.11, p < 10E-10
Let’s cram everything into 1/10 of the space. 2) The ever so informative labels: ‘1’, ‘2’, ‘3’, …
Machine learning (or would we call this AI now?