Some thoughts on bioinformatics software maintenance

Overall thoughts

I released my first software package for bioinformatics about four years ago. I now have four, all of which see some usage, but certainly nothing like the heavy usage of the most popular utilities. Despite this, I would guess on average I spend around 20-30% of my time maintaining these packages.

I love that people find our software of some use, and it’s still exciting getting messages from users from countries all around the world. I want to maintain and improve our software, and help people use it as much as I can. After all, one of the great things about working at a University is that everything I work on goes into the public domain, rather than being kept closed-source.

I do feel like this effort is not likely to be rewarded by typical indicators such as publications or funding. With analysis papers I’ve published, the paper itself has usually been the end of the story, and it is then time to move on to the next thing, whereas with software it is really only the start of the work. That’s ok for me right now, mostly because it’s one of the things I enjoy most about my job, but maybe that will change as I spend more time doing it. Perhaps this is a broader problem in genomics/bioinformatics that needs change from funders, publishers etc? (I enjoyed this recent article which talked about this and proposed some solutions.)

Here are some things that have helped me lighten this load, or that I think I’ve gotten better at, so far:

Make sure people can install it

I really think this is the number one thing to do! Spending a little time on this at the start has saved me so much more time later, and it’s one of the hardest things to diagnose via email. The first package I wrote was really hard to install, and realistically you probably needed to be a C++ developer to do it. Possibly, you needed to be me. This was a big mistake!

On conda

I’d come to view bioconda as a panacea for all such woes: you just need to spend a bit of time sorting out your compile lines and dependencies, the maintainers are really helpful in getting it to work and maintaining good standards within their project, and it’s easy to use.

But I’ve seen a lot of posts/banter on twitter about conda being bad or not working, so clearly it doesn’t solve these issues for everyone. As more gets installed, dependencies get more complex, and things like version pinning come to the forefront. I’ve so far managed to resolve any issues I’ve encountered, but I presume this is because I have become competent with the system from having written so many recipes for it. It’s probably incorrect to assume that most users have the same level of familiarity with the system. I still think conda is the best solution out there, but I think perhaps we need better guides and documentation for when things go wrong.

My biggest tips for fixing problems with it:

  • Create a new environment if something isn’t working when installing:
    conda create -n pyseer_env pyseer
  • If the environment solving is taking a while, try with a new environment too. Another option is mamba (itself installable through conda) which worked well when I tried it.
  • If you are getting a library error, run conda update on the package listed. Or, try a new environment.
  • Run on linux (if you are using OS X, you can use a virtual machine)

Write documentation

It’s a false economy not to. I’ve spent at least 3-5 days on this for anything I want other people to use. reST, sphinx and readthedocs make it really easy to produce something that’s easy to write, looks nice, and has a website. Update the docs when you change the code too!

Fully documenting/commenting code (e.g. including full valid docstrings, maintaining PEP standards) has felt like a smaller benefit for most projects – I think this becomes useful when you have more people working on the code. Comment it however works best for you.

Write some tests, add continuous integration

Probably obvious, and certainly not the most exciting revelation I have had. This has however saved me from the biggest issues when I’ve been updating code. travis-CI, circleCI or azure pipelines are all easy ways to automatically run these with each new push.

Though, I don’t feel I have time to write unit tests.

I’m sure they would help prevent some regressions I’ve seen in my code. I think these would be easiest to write if I made them each time I wrote a new function. But at what point do I decide/know this code is going to be used enough to make this time worthwhile? It feels like something for projects with multiple developers. (Perhaps this is blasphemy?)
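For what it’s worth, the per-function habit can be very lightweight. A minimal sketch (the function and test here are made up for illustration, not from any of my packages):

```python
def gc_content(seq):
    """Return the proportion of G/C bases in a DNA sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# The matching unit test, written at the same time as the function.
# Runnable with pytest, or just called directly.
def test_gc_content():
    assert gc_content("GGCC") == 1.0
    assert gc_content("atat") == 0.0
    assert abs(gc_content("ATGC") - 0.5) < 1e-9

test_gc_content()
```

Even a handful of these, run automatically on each push, would catch the sort of regressions mentioned above.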

We are not all Heng Li (or, Make Use of Dependencies)

Heng Li is clearly a legendary bioinformatician, and his software is some of the best. Heng has advocated avoiding maintenance of code by not using external dependencies, making a simple and clean user interface, and duplicating code bases.

I agree with two of these points, but not with not using dependencies. If, like me, you are not Heng Li, you may find it hard to bash out an aligner from scratch rather than including a library as a dependency. It saves me so much time relying heavily on dependencies for many common tasks, and there are some things I simply don’t think I could reimplement (either as efficiently, or correctly). If you are making a file format or something fundamental this seems more doable, but for most things, why reinvent the wheel?

With conda, I’m happy knowing that I can sort out any dependency installation difficulties when I write the recipe to install the package. I save a lot of time overall.

Engage users

Previously I have ‘advertised’ software through papers, conference presentations and twitter. This seems to work quite well, but doesn’t really help beyond the initial release. I have recently tried to get feedback on a more ongoing basis. I wish I’d done this earlier.

Most of the feedback I get is through github issues or emails saying something doesn’t work or is missing. But I rarely hear from people for whom the code worked but who had suggestions to improve things, or from people who were put off trying it in the first place. And maybe I’m missing out on some positive feedback through this too!

I tried running a survey, which although probably far from ideal, definitely helped fill some of these gaps. I did try promoting it to see if this got more responses – I felt a bit icky giving money to twitter, but at least it would have pushed some other ads out of people’s feeds. I am still unsure whether this was good value for money or not.

A little money can go a long way

While I can give things like graphic design the old college try, it generally looks ok at best, and takes me all day. I tried paying for some images from the noun project ($2.99 each, or $39.99 for unlimited use) and a html/css template ($29). These both saved me loads and loads of time, and looked much better than anything I could have done.

I’m pretty sure these costs pale in comparison to most other things in science, and definitely your own time, so I didn’t find them hard to justify.

There’s also lots of free stuff to use out there. I am a growing fan of using emojis in figures – universal, well-designed images that put across a concept efficiently to a wide range of audiences 🙏.

Write your package in R

R is one of my less favourite languages, but I’ve never had a problem installing something from cran. Maybe R developers are just better coders?

Plots that went wrong

There are loads of these around; here are some of mine from the past couple of years:

  • The beach?
  • You can never have too many violins
  • A nice smooth likelihood surface
  • Always good to see the zero contour
  • Apply some smoothing then I’m sure it will be fine, right?
  • R² = 0.11, p < 10⁻¹⁰
  • 1) Let’s cram everything into 1/10 of the space. 2) The ever so informative labels: ‘1’, ‘2’, ‘3’, …
  • Machine learning (or would we call this AI now?). Bonus: the large white bit at the bottom which I couldn’t get rid of
  • A more breezy Rothko
  • Sheet metal


Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

This is a summary of our paper on a joint pathogen and human GWAS, which has just been published in Nature Communications:

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!


We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.

This allows us to look for the effect of any genetic variant across the species (without needing a prior hypothesis of an interesting gene), in the natural population (humans not mice, and natural variation not lab strains). I hope this generates some useful hypotheses for further testing in more controlled experiments, for example using isogenic mutants in a relevant model organism.

I think these were our key findings:

    1. Bacterial genetics did not seem to affect disease severity, but was a big factor in whether invasive disease would be caused after carriage.
    2. Pneumococcal serotype is important in determining invasiveness, but other genetic factors are also involved.
    3. We found some candidate genetic factors other than serotype, including some previously implicated proteins, and some new ones.
    4. Human genetics however affected both disease severity and susceptibility, and we found a couple of specific candidate loci.
    5. We performed an interaction (genome-to-genome) analysis, but were limited in power due to the small number of samples (N = 460) versus the large number of possible interactions (~10¹⁰).

If you’re interested in the biological ramifications of any of these points I’d encourage you to have a look through the paper. I was particularly interested by 1) which seems to rule out pathogen genotype as a diagnostic of disease severity, and 2) and 3) which suggest that targeting only serotype is maybe not the optimal way to a) define pneumococci or b) design vaccines which stop invasive disease.

Technical summary

I’ll also try and describe some of the more technical aspects here, which were interesting to me, but ended up being only briefly covered in the paper.

Bacterial GWAS (pGWAS)

We tried to get as much as we could out of the whole-genome sequence data here, calling as many genetic features as possible for later use in the association:

  • SNPs, genes and k-mers, as have been used in bacterial GWAS studies before.
  • INDELs. These can be less well called by some methods; here I just looked for short INDELs which might result in frameshifts.
  • Antigen alleles. As noted in various papers before, these might not get well tagged by the SNPs. I built a new machine-learning classifier for three genes (pspC, pspA, zmpA), but it’s probably quite similar to other methods (and was relatively labour- and computationally intensive).
  • Copy number variants (including large deletions).
  • A phase-variable inverting locus (ivr). This requires using the read pair information due to repeats breaking up assembly and mapping.

We then fed these into some association models:

  • SEER/pyseer. Used to try and find individual genetic associations with meningitis. We found the mixed effects model was required to get anything sensible, as phenotype was strongly correlated with genetic background.
  • limix. Used to obtain heritability estimates, including partitioning by genome segment.
  • A hierarchical Bayesian model was used for the ivr, as it contains variation within each sample.

Ultimately, only the use of SNPs, INDELs and k-mers led to significant findings. The SNPs were also useful for the heritability analysis with limix. This was also the first time I tried to include association of rare variation (aggregating through gene and operon burden tests). This seemed to be one of the more successful approaches, perhaps due to lower confounding with background, and I would certainly recommend it to anyone trying a bacterial GWAS in future (and you can easily do it in pyseer!). It definitely benefited from the calling of short INDELs.

One further point is that some of the strongest associations we found were to transposons and BOX elements. While BOX elements have been associated with gene expression and competence, none of the transposons had any obvious biological effect (no cargo, nothing consistently interrupted, as far as we could tell). I am not sure whether these associations are a biological effect which we may have ignored, or if this is an artefact of the association model, where the p-value is elevated by these elements apparently occurring independently multiple times on different genetic backgrounds, with uneven case and control numbers. I’d like to investigate this further at some point.

Human GWAS (hGWAS)

What a pleasure it was to work with human genotypes and GWAS methods! It was really noticeable how smooth this analysis was, and how it got easier over the years we added cohorts. I think the main reasons for this compared to pGWAS are:

  • Many long-standing methods and software packages which work directly with your data, with no need for format conversion (e.g. plink, bolt-lmm, locuszoom, LDAK).
  • Excellent web-services and databases which help enrich the interpretation of your findings (e.g. the imputation servers, GTEx).
  • It’s easier than bacterial GWAS. The well-designed genotyping arrays, with SNPs that cover most of the common variation (after imputation), and the lack of strong population structure/fast LD decay definitely made a big difference.

This didn’t make it into the final paper, but I also looked into some models of effect sizes with bayesR, which suggested as well as a few high effect loci (oligogenic), much of the heritability explained is from smaller effect loci (polygenic). So collecting more samples would certainly be useful, though this is hard due to the low incidence.

Have a look at the Data Availability section if you want to use our GWAS summary statistics in any future meta-analysis.


It’s been a long time in the works mostly due to the time taken to accumulate all of this data: meningitis is a fairly rare disease, with only about 200 cases each year reported in the Netherlands. As new patients were enrolled and genotyped I ended up repeating the analysis (which got easier each time, as human GWAS methods became more and more impressive and easy to use). We also put a lot of effort into expanding the dataset – we were eventually able to gather three separate cohorts of human genotypes (from the Netherlands, Denmark and the UK Biobank) and two cohorts with bacterial sequences (from the Netherlands and South Africa). Consequently we have a nice long author list, with almost as many affiliations as names. I am particularly grateful to all of these authors who each contributed data, analysis and their time.

Particular thanks to Bart Ferwerda, Philip Kremer and Nicole Wheeler who worked with me on much of the analysis; Thomas Benfield, Anne von Gottberg, Arie van der Ende and Mattijs Brouwer who set up and ran the largest cohorts included; and Jeff Barrett, Stephen Bentley and Diederik van de Beek who supervised the work. Also to the reviewers, who provided lots of useful suggestions for improving the paper (you can see the reviews here).

Readthedocs failing to build: module ‘setuptools.build_meta’ has no attribute ‘__legacy__’

As we all know, it’s critical that your code’s github uses badges.

However, to my horror, I noticed one of my many nice green badges signalling my professionalism had turned an alarming shade of red. What were potential users to think, seeing that the docs for my most recent commit had failed to build on readthedocs?

[Screenshot: ‘docs failing’ badge]
The horror

Looking at the build logs, one of my dependencies (in this case hdbscan) was failing with the error message:

AttributeError: module 'setuptools.build_meta' has no attribute '__legacy__'

This issue is apparently caused by certain versions of pip and setuptools in a virtual environment, which appear to be satisfied by the readthedocs build environment.

Only after trying unsuccessfully (with an embarrassingly long series of pushed commits) to fix this through a readthedocs.yml file did I question why installing all of the dependencies was required to build the docs. The answer is my use of autodoc (:automodule:) to build the API documentation from my docstrings. I believe this can also incorporate documentation from external modules, but in this case (and I would guess most cases) this is not necessary. Indeed, sphinx has a setting you can add to conf.py which pretends the modules are installed when you don’t actually need them.

So, this was caused by:

  • Having a ‘requirements.txt’ file being picked up by readthedocs.
  • Some of the dependencies causing installation issues within their virtualenv.
  • Needing these dependencies to be installed through use of autodoc in places.

I was able to fix it by:

  • Adding a readthedocs.yml file which specified a separate dependencies.txt file for the docs.
  • Adding autodoc_mock_imports = ['hdbscan'] to avoid installation of the dependencies.
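For reference, the conf.py side of that fix is a single line (the module list is specific to my project; substitute your own troublesome dependencies):

```python
# In docs/conf.py: ask autodoc to mock these modules instead of
# importing them, so they don't need to be installed to build the docs.
autodoc_mock_imports = ['hdbscan']
```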

Conservation of core genes in S. pneumoniae

A question I am sometimes asked is whether a gene of interest, usually being studied in vitro or in vivo, is conserved. Although the availability of population genomic datasets allows this question to be answered, it can be hard to find this kind of analysis in the literature, and doing it yourself is not trivial. This post hopes to be an easy way to access this information for S. pneumoniae.

I think three useful measures of conservation are:

  • Is it a core gene? If not, what is the frequency in the population?
  • What is the omega value (dN/dS) of the gene, which is the ratio of non-synonymous to synonymous changes? This tells us something about how it appears to be evolving, and the selection pressures it is under.
  • What is the pi value of the gene, which is the average number of sequence differences if two sequences from the population are compared. This tells us about the amount of sequence diversity in the population, and whether multiple alleles are maintained. (The pis used here are for amino acid rather than nucleotide differences).

Tajima’s D can also be useful, but I haven’t included it here as I think its interpretation is less straightforward.
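As a toy illustration of the pi calculation (a minimal pure-Python sketch; the actual values here were calculated with dendropy, as described in the methods):

```python
from itertools import combinations

def pi(alignment):
    """Average number of pairwise differences between aligned sequences."""
    pairs = list(combinations(alignment, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs)

# Two identical sequences and one differing at a single site:
# pairwise differences are 1, 0 and 1, so pi = 2/3
diversity = pi(["ACGT", "ACGA", "ACGT"])
```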


The plots below are a re-presentation of some data first published in
Croucher, N. J. et al. Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proc. Natl. Acad. Sci. U. S. A. (2017). doi:10.1073/pnas.1613937114

In this paper, using 616 genomes isolated from carriage in children in Massachusetts, Croucher et al. defined and annotated core gene clusters, produced MSAs and calculated many useful statistics for each gene. This information can all be found in Dataset_S01.xlsx from the paper. I’ve listed my methods at the end.


The data is all available in the supplementary information of the cited paper. I also re-analysed it using the following steps:

  1. Extract each core gene DNA sequence using the annotations from the paper.
  2. Translate sequences.
  3. Align protein sequences with MUSCLE.
  4. Use dendropy.calculate.popgenstat.nucleotide_diversity to calculate pi from 3).
  5. Use RevTrans to produce a nucleotide alignment from the output of 1) and 3).
  6. Use SLAC in HyPhy to calculate dN, dS and omega from 3).

See for the code.

Is it a core gene?

If it is in this table/plots below, yes.
If not, no, it is an accessory gene.

I don’t have accessory gene frequencies included here, as these were only partially analysed in the cited paper.

dN/dS and pi

Interactive scatter plot (full size here):

Table of data:


Using unitigs for bacterial GWAS with pyseer

This post briefly explains how you can now use unitigs, nodes of sequence in a compressed de Bruijn graph enumerated using DBGWAS, in the pyseer software. Broadly, this has the following advantages over k-mer based association:

  • Computational burden: fewer resources used in counting the unitigs, and fewer unitigs that need to have their association tested.
  • Lower multiple testing burden, as unitigs reduce redundancy present in k-mer counting.
  • Easier to interpret: unitigs are usually longer than k-mers, and further context (surrounding sequence) can be analysed by using the graph structure they come from.

More details below.


A recent, excellent paper by Jaillard, Lima et al. (full citation below) showed how unitigs (explained below) can be used instead of k-mers in bacterial GWAS.

The authors present their software DBGWAS which is an end-to-end bacterial GWAS solution to go from assemblies to associated graph elements (using bugwas/gemma to perform the association).

We are currently adding some new association models to pyseer, and I wanted to follow DBGWAS’ example and use unitigs in these new approaches. While we’re not ready to release the new models quite yet, it is now possible to use unitigs instead of k-mers with the existing association models implemented in pyseer (fixed effects, mixed effects, lineage effects). This has been achieved through minor modifications to the DBGWAS code, so I am indebted to these authors for making this possible.


I have updated the documentation to include these details.

Count the unitigs

  1. Install unitig-counter using bioconda (‘conda install -c bioconda unitig-counter’).
  2. Create a list of assemblies and their names as input (see for details).
  3. Run the counting step, using multiple cores if available.

Run the association

  1. Simply use the same options as for a k-mer association, but drop in the output/unitigs.txt.gz produced above as the --kmers option.
  2. As the number of tests for significance threshold, use the number of unique patterns reported (or count the lines in output/unitigs.unique_rows.Rtab).
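The threshold itself is just a Bonferroni correction over the number of unique patterns, e.g. (the pattern count below is invented for illustration):

```python
def significance_threshold(n_patterns, alpha=0.05):
    """Bonferroni-corrected p-value threshold for the association."""
    return alpha / n_patterns

# If output/unitigs.unique_rows.Rtab contained a million lines,
# unitigs would need p < 0.05 / 1e6 to be reported as significant.
threshold = significance_threshold(1_000_000)
```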

Interpret the results

  1. As you would for k-mers, use the included scripts to map to a reference and produce a Manhattan plot in Phandango, or annotate the significant sequences.
  2. To provide extra context, or lengthen short sequences, unitigs can be extended leftwards and rightwards following the graph using ‘cdbg-ops extend’.


If you use unitigs in your association, please cite the DBGWAS paper:
Jaillard, M., Lima L, et al. A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLoS Genet. 14, e1007758 (2018). doi:10.1371/journal.pgen.1007758.

If you find pyseer useful, citation would be appreciated:
Lees, John A., Galardini, M., et al. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34:4310–4312 (2018). doi:10.1093/bioinformatics/bty539.

Paper summary – PopPUNK for bacterial epidemiology

A paper describing our recent method for bacterial epidemiology PopPUNK has just been published in Genome Research, which you can read here:

You can install our software by running
conda install poppunk
and full details and documentation can be found at

In this blog post I will attempt to describe some of our key features and findings in a shorter format. Broadly, I think there are three main parts:

  1. Core and accessory distances can be estimated using k-mer distances.
  2. In many species, finding clusters in core-accessory space gives good quality clusters of strains. The core and accessory distances from 1) provide further resolution within clusters of strains.
  3. These clusters can be expanded online as new samples are added, and their names stay consistent between studies.

I’ve also noted some of the work we added in our revision, for those that might have seen the first version as a pre-print on bioRxiv. We added more direct comparisons with phylogenies and cg/wgMLST schemes, showing that PopPUNK was preferable to wg/cgMLST, while still fulfilling the criteria desirable for an epidemiological typing system laid out by Nadon et al.

Core and accessory distances can be estimated using k-mer distances

The importance of accessory genome evolution and divergence has been increasingly recognised over the past few years. To analyse the accessory genome, one typically attempts to find clusters of orthologous genes (COGs) using roary, panX or another similar method. These methods compare all annotated genes to all others, which results in a number of comparisons which increases with the square of the number of sequences. Though efficiencies in these pieces of software keep this computation possible, for larger populations this takes a significant amount of time, especially if reruns are needed due to new samples or poorly chosen clustering thresholds.

For some downstream purposes just extracting the core and accessory distances between pairs of samples is sufficient, as information on individual COGs and annotations is not needed. We wanted to use a k-mer based approach to do this, so that we:

  • wouldn’t be reliant on gene annotations, which differ in quality between species and sample collection, and can be hard to standardise across labs and studies.
  • do not require an alignment step, which takes a long time, and again can be hard to standardise.
  • could take advantage of recent developments in efficient k-mer based software. Specifically, we used mash, which rapidly calculates distances between sequences by taking the lowest scoring subset (a sketch) of k-mers from a sequence assembly.

Noting that longer k-mers are more likely to mismatch between samples due to a SNP in a shared (core) region, but that k-mers of all sizes are equally likely to mismatch between samples due to a missing accessory element (longer than the k-mer length), we were able to formulate a relation between mash distances at various k-mer lengths and core and accessory distances. Ultimately, this allows us to calculate core and accessory distances between all sequences in a population tens or hundreds of times faster than from clustering and aligning genes. In a population of 128 Listeria monocytogenes samples PopPUNK took about ten minutes, whereas a run of roary alone took 31 hrs. We also compared our results to this method in simulations (figure 2 in the paper) and in ten varied species (figure 3 in the paper) and found our faster estimates to be consistent with both our simulations and the real data.
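To make that relation concrete, here is a toy version of the idea (not PopPUNK's actual estimator, and with invented numbers): if a proportion a of k-mers mismatch because of accessory content (independent of k), and each core base mismatches with probability c, then the proportion of matching k-mers at length k is roughly (1 - a)(1 - c)^k, so a straight-line fit of log(match) against k recovers both distances.

```python
import math

def fit_core_accessory(ks, match_props):
    """Least-squares fit of log(match) = log(1 - a) + k * log(1 - c)."""
    ys = [math.log(p) for p in match_props]
    n = len(ks)
    k_mean, y_mean = sum(ks) / n, sum(ys) / n
    slope = (sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys))
             / sum((k - k_mean) ** 2 for k in ks))
    intercept = y_mean - slope * k_mean
    # Slope gives the core distance, intercept the accessory distance
    return 1 - math.exp(slope), 1 - math.exp(intercept)

# Synthetic matching proportions generated from known distances
a_true, c_true = 0.2, 0.01
ks = [13, 17, 21, 25, 29]
props = [(1 - a_true) * (1 - c_true) ** k for k in ks]
core, accessory = fit_core_accessory(ks, props)  # recovers c_true, a_true
```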

We can then plot the core and accessory distances for all pairwise comparisons of samples, adding density contours where many points overlap. Here is the L. monocytogenes example:

L. monocytogenes distance distribution
Core and accessory distances between every pair of 128 L. monocytogenes samples, with density contours

This distribution is useful for a number of things, particularly clustering – the focus of the rest of the paper – but can also tell us about overall core-accessory evolution, and can be used to pick out samples which have unexpected divergence in either core or accessory content (see supplementary figure 11 for a detailed example of this).

Using core-accessory distances to define sequence clusters (strains)

A good, widely used method to define clusters of closely related sequences in a population is hierBAPS, or the recent upgrade fastbaps (both fit the same model, but the newer version is significantly faster at doing so and can also use a phylogeny to constrain the possible clusters). While this approach has many nice features such as being able to cluster recursively and being able to extract likelihoods for fits and assignments, the following limitations make it challenging to directly apply in all the places where subtyping of a population is useful for epidemiology:

  • Requires a good quality alignment as input (time-consuming).
  • Is difficult to update (every new genome requires a complete refit).
  • Difficult to keep cluster names consistent (different studies and runs will have different cluster IDs, and possibly assignments).
  • Usually forms a bin cluster of outlier sequences (putting all low frequency clusters together, rather than separating them).
  • Can be slow to fit (though perhaps no longer, with the development of fastbaps).

These drawbacks make a species-level definition of subtypes potentially challenging. Additionally, for some species (of particular interest to us was the Streptococcus genus) the solution found by optimising the BAPS likelihood does not provide great quality clusters across the tree, I think due to unmodelled recombination events and many small clusters.

This is what we set out to try and improve with PopPUNK, hoping that our fast estimation of core and accessory distances could be used for this purpose.

By finding clusters that are clearly separated in core-accessory space (using one of two standard machine-learning methods) we are able to determine a cutoff for which distances are within the same strain. Applying this to the same distances as above:

GMM fit to L. monocytogenes
Four component Gaussian mixture model fitted to pairwise distances in L. monocytogenes

The light-blue cluster closest to the origin is the within-strain cluster – distances in this cluster represent comparisons between samples in the same strain. We can then draw links between any pair of samples less than this distance apart. Linked samples form the clusters of strains in a network, as its connected components (see figure 5A,B). In the network, even if samples A and B are more than the cutoff distance apart, they will be in the same cluster if both are close enough to a third sample C. For most of the species we applied our method to, this approach gave good clusters very quickly (table 1, figure 4). For two Streptococcal species where extensive recombination blurred the separation between the components, we needed to apply a final step adjusting the position of the boundary. Optimising properties of the network – avoiding straggly clusters linked through only a few samples, and reducing the overall number of links – then gave good results in these species without further extensive computation.
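The transitive clustering described here is just the connected components of the network, which can be sketched in a few lines (sample names and pairs invented):

```python
def strain_clusters(samples, within_strain_pairs):
    """Group samples into strains: connected components of the network
    whose edges are pairs closer than the within-strain cutoff."""
    neighbours = {s: set() for s in samples}
    for a, b in within_strain_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    clusters, seen = [], set()
    for s in samples:
        if s in seen:
            continue
        stack, component = [s], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node])
        seen |= component
        clusters.append(component)
    return clusters

# A and B are each within the cutoff of C, but not of each other:
# all three still end up in the same strain; D is its own cluster.
clusters = strain_clusters(["A", "B", "C", "D"], [("A", "C"), ("B", "C")])
```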

Adding new isolates in to a fitted cluster model – quickly and consistently

We found some very useful advantages to representing the clusters as a network, which solve many of the above issues:

  • Outlier sequences or small groups are correctly placed in their own clusters, rather than being grouped in a bin.
  • New isolates can be added in by calculating distances to samples already in the network (avoiding calculating everything again).
  • Removing fully connected sets of samples (cliques) removes redundancy, and further improves speed/memory/storage.
  • Cluster names are by size, and can be constrained to be the same between studies, or as a database is added to.

This means that you can download a PopPUNK database (usually 10-100Mb) and run using --assign-samples with new assemblies. This will cluster new samples within the context of an existing population, without having to redo/care about the model fit. The databases can be expanded without having to refit the model, or worry about cluster names changing (which is one of the nice features of MLST). We tested this with an emerging E. coli strain not seen in the database at the first time point in a longitudinal series, and PopPUNK was able to track its emergence and expansion (see figure 5D,E).

Comparison with gene-by-gene methods

For the second version of this paper we were asked to add in a more explicit comparison with gene-by-gene methods and phylogenies. My understanding of how MLST and cgMLST/wgMLST schemes are applied in epidemiology is:

  1. Download the MLST allele database, or upload your sample to an online database.
  2. BLAST (or another alignment/matching tool) is used to find the typing genes in the uploaded sample.
  3. These genes are compared to a list of previously seen sequences for that gene (alleles). If they exactly match they are assigned the same identifier, otherwise a new one.
  4. A sequence type (ST) is assigned to each unique combination of alleles in the typed genes.
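Steps 3 and 4 can be illustrated with a toy two-gene scheme, skipping the BLAST step and assuming the genes have already been extracted (all gene names, sequences and identifiers below are invented):

```python
# toy allele databases for a two-gene scheme (sequences invented)
alleles = {"gyrB": {"ATGC": 1, "ATGA": 2},
           "recA": {"GGCC": 1}}
st_db = {(1, 1): 1, (2, 1): 2}  # known allele combinations -> STs

def assign_st(sample_genes):
    profile = []
    for gene, seq in sample_genes.items():
        db = alleles[gene]
        if seq not in db:            # step 3: exact match, else a new allele ID
            db[seq] = max(db.values()) + 1
        profile.append(db[seq])
    profile = tuple(profile)
    if profile not in st_db:         # step 4: new combination -> new ST
        st_db[profile] = max(st_db.values()) + 1
    return st_db[profile]

print(assign_st({"gyrB": "ATGA", "recA": "GGCC"}))  # known combination -> ST 2
print(assign_st({"gyrB": "ATGG", "recA": "GGCC"}))  # new gyrB allele -> new ST 3
```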

In step 3, counting any number of changes within a gene as a single change loses some resolution, but has the advantage that it does not overcount recombination events (a single recombination can import many SNPs at once). With a good choice of genes making up the scheme, MLST has been shown to capture population structure very well. It is faster than alignment and model fitting with hierBAPS, a single sample can easily be added, and with centralised databases it can also keep the names of clusters (STs) consistent.

However, some drawbacks are:

  • MLST lacks resolution, and throws data from whole genome sequencing away. Extended schemes (using core genes or all genes) improve this, but still ignore intergenic regions and group any number of changes together as a single allele difference.
  • Defining allele databases is difficult and requires continual curation for people to add new alleles, and avoid duplicate genes. There are some impressive efforts to do this (e.g. enterobase, BIGSdb), but these only cover some species.
  • It is unclear how to form clusters from allele assignments. How many allele changes should separate clusters? The common interpretation, that a single change defines a new cluster, is far too specific for extended schemes, and seems to be a choice of convenience rather than a biological one.

In practice, I found that downloading a cgMLST scheme and applying it to my own data was quite challenging, both because of how the gene database needed to be formatted and because of getting all the dependencies working (thanks to João Carriço and Mickael Silva for helping me with this, and for their chewBBACA software which made the comparison possible). MLST methods and databases have been around for longer, and so were easier to work with. Defining and maintaining a new scheme for a species which doesn’t have one yet seems like it would be a significant undertaking, though I didn’t try this myself.

To directly compare PopPUNK and these methods, we performed MLST and cgMLST assignment on two different species with good typing schemes (L. monocytogenes and E. coli). We then calculated pairwise distances in terms of the number of allele changes, which gave a gene distance matrix rather than a core and accessory distance matrix. By using these with the PopPUNK network we could find how many allele changes to connect to form similar clusters, and how good projections at various cutoffs are (see supplementary tables 6 and 7). We could make the clusters similar between PopPUNK and (cg)MLST, but only by manually testing many values of the cutoff for number of allele changes.
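The allele distance matrix used here is just a pairwise count of differing allele assignments; any number of nucleotide changes within a gene counts as a single difference. A hypothetical three-sample sketch (profiles invented):

```python
import numpy as np

# cgMLST-style allele profiles: one integer allele ID per gene (made up)
profiles = np.array([[1, 1, 2, 1],
                     [1, 1, 2, 3],
                     [4, 2, 2, 3]])

# pairwise allele distances: the number of genes whose alleles differ
allele_dist = (profiles[:, None, :] != profiles[None, :, :]).sum(axis=2)
print(allele_dist)
# → [[0 1 3]
#    [1 0 2]
#    [3 2 0]]
```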

I don’t have lots of experience using gene-by-gene methods or analysing surveillance datasets, but from these tests I ended up concluding that PopPUNK has the following advantages over gene-by-gene methods:

  • The clusters are likely to be real biological entities, based on their separation in core-accessory space. With cgMLST/MLST it is unclear how many allele changes to choose as a cutoff, whereas PopPUNK optimises this.
  • The calculation of core distances allows trees to be drawn within clusters (e.g. with the --microreact option), giving further resolution and relationships within clusters.
  • PopPUNK is faster to run.
  • You don’t need a curated database of genes (difficult to format/curate where one exists, more difficult to create where one doesn’t).

So we concluded that PopPUNK also retains the advantages of gene-by-gene approaches, and meets the criteria of Nadon et al. for a genomic surveillance scheme.

Challenging species – low diversity

The main place we found PopPUNK’s clusters to be worse than those from RhierBAPS was in populations with limited genetic diversity, for example within a single identified strain. The calculation of core and accessory distances will in theory work at any resolution (though one may need to increase the sketch size to around the genome length divided by the number of variants per genome). But if there is no clear within-strain versus between-strain separation in the distances, just a single cloud of points, the spatial clustering methods are unlikely to converge on a good solution. Network-based model refinement is needed in this case, though it is likely to split the strain into many substrains.
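The sketch-size rule of thumb mentioned above is simple to apply; for instance (hypothetical numbers), a 2 Mb genome where closely related pairs differ by around ten variants would need a sketch of roughly 200,000 hashes:

```python
# rule of thumb from the text: sketch size ~ genome length / variants per genome
def min_sketch_size(genome_length, variants_per_genome):
    return genome_length // variants_per_genome

# hypothetical low-diversity example: 2 Mb genome, ~10 variants between close pairs
print(min_sketch_size(2_000_000, 10))  # → 200000
```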

One example of this was Neisseria gonorrhoeae, which is essentially a strain of Neisseria meningitidis (which did work using default settings). Using refinement of core distances we did get a reasonable fit, and were able to use this to find accessory elements moving within strains (we also looked at this within a well studied strain of S. pneumoniae). In Mycobacterium tuberculosis diversity was even more limited, so while the core distance based phylogeny PopPUNK produced was consistent with the lineages estimated by the first level of hierBAPS, PopPUNK’s clustering split the population into many more substrains (comparable to spoligotyping).

See the supplementary text S1 and figure S11 for a full discussion of this point.

Recommendations for running PopPUNK

Some tips:

  • The most important part of the model fit is to uniquely select the cluster closest to the origin.
  • If you have points with high accessory distances, check the assemblies for contamination and remove any contaminated samples.
  • Run using one of the extra output options (e.g. --microreact) and you’ll get a lot more information out, and can make interactive visualisations.

See the quickstart guide for a walkthrough, the tutorial for full details on a more complex example, and the troubleshooting page for common issues.


PopPUNK was the result of a collaboration between many people, but I’d particularly like to thank Nick Croucher who jointly worked on the method, code and paper with me.

Creating a conda package with compilation and dependencies

I’ve just finished what was, for me, a difficult compiler/packaging attempt – creating a working bioconda package for seer. You can look at the pull request to see how many times I failed:

(I would note that I made this package for legacy users. I would direct anyone interested in the software itself to the reimplementation, pyseer.)

The reason this was difficult was my own inclusion of a number of dependencies, all of which also need compilation, further adding to the complexity. While most of them were already available in some form on anaconda, it turned out there were some issues with using the defaults. I want to note some of the things I had to modify here.

gcc and glibc version

seer requires gcc >4.8 to compile, and a correspondingly recent libstdc++/glibc to run, but the default compiler version in conda builds was gcc 4.8. To request the newer conda compiler toolchain, add a ‘conda_build_config.yaml’ file containing:

  c_compiler:
    - gcc # [linux]
  cxx_compiler:
    - gxx # [linux]


Submodules

I had dlib and gzstream as submodules. If you use a git_url as the source these are cloned recursively, but not with a standard url in meta.yaml. I needed to do ‘git clone --recursive’ with the repository and tarball it myself to include these directories in the GitHub release.


gzstream

gzstream is not available on the bioconda channels, so I had to compile it myself. I included it as a submodule but, rather than using its default Makefile, I needed to add the conda-defined compiler flags to ensure these were consistent with later steps (particularly -fPIC in CPPFLAGS).



boost

I was attempting to link boost_program_options using either the boost or boost-cpp anaconda packages; unlike most boost libraries, program_options requires compilation. This led to undefined symbols at the linking stage, which I think were due to incompatible compiler options used to make the dynamic libraries in the versions currently available on anaconda. This turned out to be the most difficult thing to fix, requiring me to compile boost as part of the recipe.

Rather than downloading and compiling everything, I followed the boost github examples and made a shallow clone, with a full copy of only the boost library I’m using:

git clone --depth 1 boost
cd boost
rmdir libs/program_options
git clone --depth 50 libs/program_options
git submodule update -q --init tools/boostdep
git submodule update -q --init tools/build
git submodule update -q --init tools/inspect

I then included this in the release tarball. A better way may be to use submodules so this is done automatically using –recursive.

This library needed to be built, which I did in a work directory to avoid installing unexpected packages with the recipe, following the conda-forge recipe for boost-cpp:

pushd boost
python2 tools/boostdep/depinst/depinst.py program_options --include example
cat > tools/build/src/site-config.jam <<EOF
using gcc : custom : ${CXX} ;
EOF
./bootstrap.sh --prefix="${BOOST_BUILT}" --with-libraries=program_options --with-toolset=gcc
./b2 -q \
variant=release \
address-model="${ARCH}" \
architecture=x86 \
debug-symbols=off \
threading=multi \
runtime-link=shared \
link=static,shared \
toolset=gcc-custom \
include="${INCLUDE_PATH}" \
cxxflags="${CXXFLAGS}" \
linkflags="${LINKFLAGS}" \
--layout=system \
-j"${CPU_COUNT}"
The python2 line pulls in the header-only libraries required to compile, which are not included in the shallow clone. The rest are the standard steps to build boost, ensuring the same compiler and flags as the other compiled code by using the conda compiler.

I then needed to link this boost library statically (leaving the rest dynamic), so modified the make line as follows:

  SEER_LDLIBS="-L../gzstream -L${BOOST_BUILT}/lib -L/usr/local/hdf5/lib \
  -lhdf5 -lgzstream -lz -larmadillo -Wl,-Bstatic -lboost_program_options \
  -Wl,-Bdynamic -lopenblas -llapack -lpthread"


armadillo

The final trick was linking armadillo correctly. Confusingly, it built and linked ok and tested ok locally, but on the bioconda CI I got undefined symbols from lapack at runtime:

seer: symbol lookup error: seer: undefined symbol: wrapper_dgbsv_

This was due to armadillo’s run-time wrapper around its includes, which links in the versions of blas/openblas and lapack defined at the time it was compiled – I think these must be slightly different from what is now included with the armadillo package dependencies on conda. Easy enough to fix: use a compiler flag (-DARMA_DONT_USE_WRAPPER) to turn the wrapper off and link the libraries manually:

  LDFLAGS="${LDFLAGS} -larmadillo -lopenblas -llapack"

After all of that, it finally worked!

conda build: libarchive: cannot open shared object file: No such file or directory

I was getting the following error when attempting to run conda-build on a package from within a conda env:

Traceback (most recent call last):
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/bin/conda-build", line 7, in <module>
from conda_build.cli.main_build import main
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/cli/", line 18, in <module>
import conda_build.api as api
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/", line 22, in <module>
from conda_build.config import Config, get_or_merge_config, DEFAULT_PREFIX_LENGTH as _prefix_length
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/", line 17, in <module>
from .variants import get_default_variant
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/", line 15, in <module>
from conda_build.utils import ensure_list, trim_empty_keys, get_logger
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/", line 10, in <module>
import libarchive
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/", line 1, in <module>
from .entry import ArchiveEntry
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/", line 6, in <module>
from . import ffi
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/", line 27, in <module>
libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/", line 426, in LoadLibrary
return self._dlltype(name)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive: cannot open shared object file: No such file or directory

The /lib directory does contain libarchive, both as a dynamic (.so) and static (.a) library. There turned out to be two relevant environment variables:


A workaround is then to run

export LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/

There is probably a proper reason why this has happened and a permanent solution to the issue, but this works for now.

Linear scaling of covariances

In our software PopPUNK we draw a plot of a Gaussian mixture model, using both the implementation and the excellent plotting example from the scikit-learn documentation:

[Figure: Gaussian mixture model with mixture components plotted as ellipses]

My input is a set of 2D distances, where I first scale each axis to between 0 and 1 by dividing by its maximum, which helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is trivial to convert back into the original co-ordinates:

# scale X so each axis lies in [0, 1]
scale = np.amax(X, axis=0)
scaled_X = X / scale
# plot in the scaled space
plt.scatter(scaled_X[Y == i, 0], scaled_X[Y == i, 1], .4, color=color)
# plot in the original co-ordinates by undoing the scaling
plt.scatter((scaled_X * scale)[Y == i, 0], (scaled_X * scale)[Y == i, 1], .4, color=color)

The only thing that wasn’t obvious was how to scale the covariances, which are used to draw the ellipses. I knew they needed to be multiplied by the scale twice, as they are variances (squared quantities), and had a vague memory of something like xAx^T from transforming ellipses/conic sections in one of my first-year maths courses. Happily, that was enough to search with, turning up an explanation on stackoverflow which confirmed this vague memory:
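For reference, the rule this corresponds to is the linear transformation of a covariance matrix: if the original co-ordinates are recovered from the scaled ones as x = Sy, where S = diag(s1, s2) holds the per-axis scale factors, then

Cov(Sy) = S Σ S^T

so entry Σ_ij is multiplied by s_i * s_j: the diagonal variances pick up the scale squared, and each off-diagonal covariance one factor from each axis. This is the np.diag(scale) matrix product applied to covar in the plotting code.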

Here is the code for making the plot and incorporating the linear scaling:

import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold', 'darkorange'])

fig = plt.figure(figsize=(11, 8), dpi=160, facecolor='w', edgecolor='k')
splot = plt.subplot(1, 1, 1)
for i, (mean, covar, color) in enumerate(zip(means, covariances, color_iter)):
    # scale the covariance back into the original co-ordinates: S * covar * S^T
    scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T)
    v, w = np.linalg.eigh(scaled_covar)
    v = 2. * np.sqrt(2.) * np.sqrt(v)  # ellipse axis lengths (2 s.d.)
    u = w[0] / np.linalg.norm(w[0])
    # as the DP will not use every component it has access to
    # unless it needs it, we shouldn't plot the redundant components
    if not np.any(Y == i):
        continue
    plt.scatter(X[Y == i, 0], X[Y == i, 1], .4, color=color)

    # plot an ellipse to show the Gaussian component
    angle = np.arctan(u[1] / u[0])
    angle = 180. * angle / np.pi  # convert to degrees
    ell = mpl.patches.Ellipse(mean * scale, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(0.5)
    splot.add_artist(ell)