Some thoughts on bioinformatics software maintenance

John Lees published on 2019-09-16 included in Bioinformatics

Overall thoughts

I released my first software package for bioinformatics about four years ago. I now have four, all of which see some usage, but certainly nothing like the heavy usage of the most popular utilities. Despite this, I would guess on average I spend around 20-30% of my time maintaining these packages.

I love that people find our software of some use, and it’s still exciting getting messages from users from countries all around the world. I want to maintain and improve our software, and help people use it as much as I can. After all, one of the great things about working at a University is that everything I work on goes into the public domain, rather than being kept closed-source.

Plots that went wrong

John Lees published on 2019-09-13 included in Python Statistics

There are loads of these on https://twitter.com/accidental__aRt; here are some of mine from the past couple of years:

/images/intergenic_tchal24.png — The beach?

/images/boxplot_and_violin_idiocy.png — You can never have too many violins

/images/likelihood_surface.png — A nice smooth likelihood surface

/images/nice_contours-1024x745.png — Always good to see the zero contour

/images/terrible_dendrogram.png — ‘terrible_dendrogram.png’

/images/isogenic_tau.png — Apply some smoothing then I’m sure it will be fine, right?

/images/regression_separate.png — R² = 0.11, p < 10E-10

/images/score_contours.png — 1) Let’s cram everything into 1/10 of the space. 2) The ever so informative labels: ‘1’, ‘2’, ‘3’, …

/images/SPARC_twoComponentBGMM-1024x910.png — Machine learning (or would we call this AI now?). Bonus: the large white bit at the bottom which I couldn’t get rid of

/images/classification_error_0.png — Cheap Rothko?

/images/classification_error.png — Sheet metal

Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

John Lees published on 2019-05-15 included in Bioinformatics Paper-Summary Pneumococcus

This is a summary of our paper on a joint pathogen and human GWAS that has just been published in Nature Communications: https://doi.org/10.1038/s41467-019-09976-3

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!

Overview

We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.

Readthedocs failing to build: module 'setuptools.build_meta' has no attribute 'legacy'

John Lees published on 2019-04-26 included in Bioinformatics Python

As we all know, it’s critical that your code’s github README.md

However, to my horror, I noticed one of my many nice green badges signalling my professionalism had turned an alarming shade of red. What were potential users to think, seeing that the docs for my most recent commit had failed to build on readthedocs?

Conservation of core genes in S. pneumoniae

John Lees published on 2019-04-17 included in Bioinformatics Pneumococcus

A question I am sometimes asked is whether a gene of interest, usually being studied in vitro or in vivo, is conserved. Although the availability of population genomic datasets allows this question to be answered, it can be hard to find this kind of analysis in the literature, and doing it yourself is not trivial. This post hopes to be an easy way to access this information for S. pneumoniae.

Using unitigs for bacterial GWAS with pyseer

John Lees published on 2019-04-08 included in Bioinformatics

This post briefly explains how you can now use unitigs, nodes of sequence in a compressed de Bruijn graph enumerated using DBGWAS, in the pyseer software. Broadly, this has the following advantages over k-mer based association:

Computational burden: fewer resources used in counting the unitigs, and fewer unitigs that need to have their association tested.
Lower multiple testing burden, as unitigs reduce redundancy present in k-mer counting.
Easier to interpret: unitigs are usually longer than k-mers, and further context (surrounding sequence) can be analysed by using the graph structure they come from.
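To make the redundancy point concrete, here is a toy sketch of how k-mers collapse into unitigs. It is not how DBGWAS or pyseer work internally: the function names `kmers` and `unitigs` are made up for illustration, the sequences are invented, k is tiny, and reverse complements are ignored (real tools handle both strands and typically use a much larger k). It only shows the idea of merging non-branching paths in a de Bruijn graph.

```python
def kmers(seqs, k):
    """Return the set of distinct k-mers found in a list of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def unitigs(seqs, k):
    """Merge k-mers along non-branching paths of a single-strand de Bruijn graph.

    Hypothetical helper for illustration only; ignores reverse complements.
    """
    nodes = kmers(seqs, k)
    # Neighbours: k-mers overlapping by k-1 bases that are actually present
    succ = {n: [n[1:] + b for b in "ACGT" if n[1:] + b in nodes] for n in nodes}
    pred = {n: [b + n[:-1] for b in "ACGT" if b + n[:-1] in nodes] for n in nodes}
    seen = set()

    def extends_previous(n):
        # True if n is the only successor of its only predecessor,
        # i.e. n sits inside a path rather than at the start of one
        return len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1

    def walk(start):
        # Extend the path while there is exactly one way forward and back
        path, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            path += nxt[-1]
            seen.add(nxt)
            cur = nxt
        return path

    result = []
    for n in sorted(nodes):
        if not extends_previous(n):
            result.append(walk(n))
    # Anything left over lies on an isolated cycle; start anywhere on it
    for n in sorted(nodes):
        if n not in seen:
            result.append(walk(n))
    return result


# Two invented sequences differing by a single base
toy = ["ACGTTGCA", "ACGTAGCA"]
print(len(kmers(toy, 4)), "k-mers")    # 9 k-mers
print(len(unitigs(toy, 4)), "unitigs") # 3 unitigs
print(unitigs(toy, 4))
```

In this toy case nine k-mers become three unitigs: one shared prefix and one unitig per variant. Fewer sequence elements to test is where the lower computational and multiple testing burden comes from, and the longer unitigs carry more surrounding sequence than any single k-mer.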

More details below.

John Lees' blog