John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Screamadelica, Primal Scream

Why is it only in 2021 that I am listening to Primal Scream’s Screamadelica for the first time?

A lot of critically acclaimed music from the 1980s maintains a pop appeal that means it still gets radio play, is featured in club nights, and is heavily promoted on my YouTube home page. However, perhaps the post-rock, trip-hop and grunge of the early 1990s doesn’t have the same enduring commercial appeal. Whatever the reason, I’ve been missing out.

Porting a bioinformatics tool to the web using WebAssembly, React and JavaScript

We recently released a beta version of PopPUNK-web (https://web.poppunk.net). It uses a WebAssembly (WASM) version of pp-sketchlib to sketch a user-input genome assembly in the browser, transmits this sketch as JSON to a server running PopPUNK using gunicorn and flask, runs query assignment against a large database of genomes from the GPS project, and returns JSON containing the strain assignment, a tree and a network, which are then displayed using a React app.

Thoughts on 'Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species'

I was happy to see that this paper, which originally appeared as a preprint back in April 2019 (!), was published earlier this month. I thought it was one of the most thought-provoking papers I’ve read recently, so I suggested a journal club on the final version (it’s a long paper – over 80 pages).

There were some parts that I liked a lot, and some parts I didn’t like, which I wanted to summarise here. Overall, I thought the paper brought an interesting ‘outsider’ approach to the problem of bacterial population genomics, and quantified some issues in new and useful ways. However, I was less keen on the presentation, which to me was overly confrontational, and failed to put the research within a proper modern context. I’ve summarised some of the discussion here, which I note is not meant to be a thorough review, and is subjective.

Things I have learnt about porting algorithms to GPUs (using CUDA)

I’ve recently ported one of my algorithms onto a GPU using CUDA. Here are some things I’ve learnt about the process (geared towards an algorithm dealing with genomic data).

Firstly, the documentation that helped me most:

I started off following the ‘An Even Easier Introduction to CUDA’ guide to get a basic version of my algorithm working. The overall workflow was:

Things I have learnt about using R (as a python/C++ programmer)

  • For loops are ok. I thought these were impossibly slow in R, and you always had to vectorise code – but not so. A common error is to try and append to arrays. It is always best to preallocate an array/list with result <- rep(NA, length) or similar, and then write to it by index.
  • Strings aren’t the easiest thing to deal with, but stringr might help, and regular expressions are built in. Use paste0 to concatenate.
  • That being said, if you can easily vectorise code you should do so.
  • It’s really easy to set a function as a variable and pass these around, bind its arguments (a partial in python) etc. Use this feature!
  • A matrix type has to contain the same data type, so if you have different data types and convert to a matrix, they will all be converted to a compatible type silently.
  • Use %*% for matrix multiplication.
  • Use %in% for set membership.
  • Use array[array$column == value, ] or similar, for selecting values from a data.frame.
  • Use a single & or | when combining conditions in the above. Double && will short circuit.
  • Each data.frame column must hold a single type. If you want to mix types within a column you’ll likely need a list.
  • Don’t use apply on a data.frame, as the first step is to convert it to a matrix, making all your types the same. Use vapply, which defines the expected return type.
  • R will silently accept partial matches for argument names, $ element names and attribute names, without commenting on this behaviour (!?!). Add options(warnPartialMatchAttr=TRUE, warnPartialMatchDollar=TRUE, warnPartialMatchArgs=TRUE) to ~/.Rprofile to get a warning whenever this strange default kicks in.
  • A numeric (float) and an integer are different types. Use 1L to make an integer ‘1’.
  • furrr is a nice library for parallelisation.
  • In general, tidyverse packages offer good alternatives to many data science functions in base R.
  • Extremely basic OOP is available using ‘S3’ objects (pretty much inheritance, and overriding of some typical functions such as print, summary based on type). ‘S4’ is to be avoided, apparently. ‘R6’ gives you more typical features.
  • Use the devtools package for your development; it automates most building and testing of the code.
  • RStudio has a nice built-in profiler.
  • R has a good FFI with C and can automatically sort out compiling for you. C++ is also possible with Rcpp, but I believe a little more involved.
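
A few of the tips above are easier to see in code. This is only a minimal sketch using base R: the toy data.frame, its column names and the helper functions add and add_ten are made up for illustration and don't come from any real project.

```r
# Preallocate the result vector, then fill it by index inside a for loop.
n <- 5L                      # the L suffix makes an integer rather than a numeric
squares <- rep(NA_real_, n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}

# vapply declares the expected return type and length up front.
lengths_of <- vapply(c("a", "bb", "ccc"), nchar, integer(1), USE.NAMES = FALSE)

# %in% tests set membership; paste0 concatenates strings.
is_member <- c(2, 7) %in% c(1, 2, 3)
label <- paste0("sample_", 1L)

# A toy data.frame (hypothetical columns) to show row selection.
df <- data.frame(strain = c("A", "B", "A"), count = c(10L, 3L, 7L))
# A single & is element-wise, so it combines conditions over whole columns.
selected <- df[df$strain == "A" & df$count > 8L, ]

# Functions are ordinary values: bind an argument, then pass the result around.
add <- function(x, y) x + y
add_ten <- function(x) add(x, 10)
shifted <- vapply(c(1, 2, 3), add_ten, numeric(1))

# %*% is matrix multiplication; a plain * would be element-wise.
m <- matrix(1:4, nrow = 2)
product <- m %*% m
```

If vapply gets back something that doesn't match the declared type or length it stops with an error, which is the main reason to prefer it over sapply.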

Bot or not? Anti-research/University accounts on Twitter

tl;dr I think it’s really hard to tell whether a Twitter account is run by who it looks like on first glance, and what its intentions are. Maybe we should try and take more care before amplifying opinions of unknown actors.

Recently, when scrolling through my Twitter feed, I noticed some fairly aggressive tweets from the account @Help4StudentsUK, criticising Wellcome and some of its management. A typical example: