There are loads of these on https://twitter.com/accidental__aRt; here are some of mine from the past couple of years:
As we all know, it’s critical that your code’s github README.md
However, to my horror, I noticed one of my many nice green badges signalling my professionalism had turned an alarming shade of red. What were potential users to think, seeing that the docs for my most recent commit had failed to build on readthedocs?
Looking at the build logs, one of my dependencies (in this case hdbscan) was failing with the error message:
AttributeError: module 'setuptools.build_meta' has no attribute '__legacy__'
This issue is apparently caused by certain versions of pip and setuptools in a virtual environment, which appear to be satisfied by the readthedocs build environment: https://github.com/pypa/setuptools/issues/1694
After trying unsuccessfully (with an embarrassingly long series of pushed commits) to fix this through a readthedocs.yml file, I finally questioned why all of the dependencies had to be installed just to build the docs. The answer is my use of autodoc (:automodule:) to make the API documentation from my docstrings. I believe this can also incorporate documentation from external modules, but in this case (and I would guess most cases) this is not necessary. Indeed, sphinx has a parameter you can add to the conf.py file to pretend the modules are installed when you don’t actually need to install them: https://github.com/sphinx-doc/sphinx/issues/4182.
So, this was caused by:
I was able to fix it by:
readthedocs.yml file which specified a separate dependencies.txt file for the docs.
autodoc_mock_imports = ['hdbscan'] to avoid installation of the dependencies.
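As a rough sketch of that second fix, the relevant fragment of a Sphinx conf.py might look like the following. The file location and the extensions list are assumptions about a typical project layout; only the autodoc_mock_imports line comes from the fix described above.

```python
# docs/conf.py (fragment). The path and the extensions list are assumptions
# about a typical Sphinx project, not copied from the real repository.
extensions = ['sphinx.ext.autodoc']

# Modules named here are replaced by mock objects when autodoc imports the
# package, so they do not have to be installed in the docs build environment.
autodoc_mock_imports = ['hdbscan']
```

With this in place, autodoc can still import the package to read its docstrings even when hdbscan is missing from the build environment.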
A paper describing our recent method for bacterial epidemiology PopPUNK has just been published in Genome Research, which you can read here:
You can install our software by running
conda install poppunk
and full details and documentation can be found at https://poppunk.readthedocs.io
In this blog post I will attempt to describe some of our key features and findings in a shorter format. Broadly, I think there are three main parts:
I’ve also noted some of the work we added in our revision, for those that might have seen the first version as a pre-print on bioRxiv. We added more direct comparisons with phylogenies and cg/wgMLST schemes, showing that PopPUNK was preferable to wg/cgMLST, while still fulfilling the criteria desirable for an epidemiological typing system laid out by Nadon et al.
The importance of accessory genome evolution and divergence has been increasingly recognised over the past few years. To analyse the accessory genome, one typically attempts to find clusters of orthologous genes (COGs) using panX or another similar method. These methods compare all annotated genes to all others, which results in a number of comparisons which increases with the square of the number of sequences. Though efficiencies in these pieces of software keep this computation possible, for larger populations this takes a significant amount of time, especially if reruns are needed due to new samples or poorly chosen clustering thresholds.
For some downstream purposes just extracting the core and accessory distances between pairs of samples is sufficient, as information on individual COGs and annotations is not needed. We wanted to use a k-mer based approach to do this, so that we:
Noting that longer k-mers are more likely to mismatch between samples due to a SNP in a shared (core) region, but that k-mers of all sizes are equally likely to mismatch between samples due to a missing accessory element (longer than the k-mer length), we were able to formulate a relation between mash distances at various k-mer lengths and core and accessory distances. Ultimately, this allows us to calculate core and accessory distances between all sequences in a population tens or hundreds of times faster than from clustering and aligning genes. In a population of 128 Listeria monocytogenes samples, PopPUNK took about ten minutes, whereas a run of roary alone took 31 hours. We also compared our results to this method in simulations (figure 2 in the paper) and in ten varied species (figure 3 in the paper) and found our faster estimates to be consistent with both our simulations and the real data.
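To make the k-mer reasoning concrete, here is a minimal sketch of one simple model consistent with it. This is an assumption for illustration, not PopPUNK's code: a k-mer of length k matches if it avoids a SNP at each of its k bases (per-base core distance pi) and is not in a missing accessory element (accessory distance a), so p_match(k) = (1 - a)(1 - pi)^k. Taking logs gives a straight line in k, so the two distances can be read off the slope and intercept. The function name and the synthetic inputs are hypothetical; see the paper for the exact formulation.

```python
import math

def fit_core_accessory(kmer_lengths, match_probs):
    """Least-squares fit of log(p) = log(1 - a) + k * log(1 - pi).

    Returns (core, accessory) = (pi, a) under the assumed model.
    """
    n = len(kmer_lengths)
    ys = [math.log(p) for p in match_probs]
    mean_k = sum(kmer_lengths) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in kmer_lengths)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(kmer_lengths, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    core = 1.0 - math.exp(slope)           # slope = log(1 - pi)
    accessory = 1.0 - math.exp(intercept)  # intercept = log(1 - a)
    return core, accessory

# Synthetic, noise-free match proportions generated from known distances.
true_core, true_accessory = 0.01, 0.2
ks = [13, 17, 21, 25, 29]
probs = [(1 - true_accessory) * (1 - true_core) ** k for k in ks]

core, accessory = fit_core_accessory(ks, probs)
print(round(core, 6), round(accessory, 6))
```

On real data the match proportions are noisy estimates from sketches, so the fit would not recover the distances exactly as it does here.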
We can then plot the core and accessory distances for all pairwise comparisons of samples, adding density contours where many points overlap. Here is the L. monocytogenes example:
This distribution is useful for a number of things, particularly clustering – the focus of the rest of the paper – but can also tell us about overall core-accessory evolution, and can be used to pick out samples which have unexpected divergence in either core or accessory content (see supplementary figure 11 for a detailed example of this).
A good, widely used method to define clusters of closely related sequences in a population is hierBAPS, or the recent upgrade fastbaps (both fit the same model, but the newer version is significantly faster at doing so and can also use a phylogeny to constrain the possible clusters). While this approach has many nice features such as being able to cluster recursively and being able to extract likelihoods for fits and assignments, the following limitations make it challenging to directly apply in all the places where subtyping of a population is useful for epidemiology:
These drawbacks make a species-level definition of subtypes potentially challenging. Additionally, for some species (of particular interest to us was the Streptococcus genus) the solution found by optimising the BAPS likelihood does not provide great quality clusters across the tree, I think due to unmodelled recombination events and many small clusters.
This is what we set out to try and improve with PopPUNK, hoping that our fast estimation of core and accessory distances could be used for this purpose.
By finding clusters that are clearly separated in core-accessory space (using one of two standard machine-learning methods) we are able to determine a cutoff for which distances are within the same strain. Applying this to the same distances as above:
The light-blue cluster closest to the origin is the within-strain cluster – distances in this cluster represent comparisons between samples in the same strain. We can then draw links between any pair of samples less than this distance apart. Linked samples form the clusters of strains in a network, the connected components (see figure 5A,B). In the network, samples A and B may be more than the cutoff distance apart, but if both are close enough to a third sample C they will still be in the same cluster. For most of the species we applied our method to, this approach gave good clusters very quickly (table 1, figure 4). For two Streptococcal species where extensive recombination blurred the separation between the components, we needed to apply a final step to adjust the position of the boundary. Optimising properties of the network, avoiding clusters which are straggly and linked by only a few samples connected to many things, and reducing the overall number of links, then gave good results in these species without further extensive computation.
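The A, B and C case can be sketched in a few lines. This is a simplified illustration rather than PopPUNK's implementation: it assumes a single scalar distance and cutoff (the real boundary lives in core-accessory space), and the sample names and distances are made up.

```python
def cluster_by_cutoff(samples, distances, cutoff):
    """Link pairs closer than cutoff; return the connected components."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), d in distances.items():
        if d < cutoff:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=lambda c: sorted(c))

samples = ["A", "B", "C", "D"]
distances = {
    ("A", "B"): 0.05,   # further apart than the cutoff
    ("A", "C"): 0.02,   # both A and B are close to C
    ("B", "C"): 0.025,
    ("A", "D"): 0.30,
    ("B", "D"): 0.31,
    ("C", "D"): 0.29,
}
clusters = cluster_by_cutoff(samples, distances, cutoff=0.03)
print(clusters)
```

A and B end up in the same cluster through C even though they are not directly linked, while D stays on its own.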
We found some very useful advantages to representing the clusters as a network, which solve many of the above issues:
This means that you can download a PopPUNK database (usually 10-100Mb) and run using --assign-samples with new assemblies. This will cluster new samples within the context of an existing population, without having to redo/care about the model fit. The databases can be expanded without having to refit the model, or worry about cluster names changing (which is one of the nice features of MLST). We tested this with an emerging E. coli strain not seen in the database at the first time point in a longitudinal series, and PopPUNK was able to track its emergence and expansion (see figure 5D,E).
For the second version of this paper we were asked to add in a more explicit comparison with gene-by-gene methods and phylogenies. My understanding of how MLST and cgMLST/wgMLST schemes are applied in epidemiology is:
In step 3, counting any number of changes within a gene as a single change loses some resolution, but has the advantage that it does not overcount recombination events. With a good choice of genes making up the scheme, MLST schemes have been shown to capture population structure very well. It is faster than alignment and modelling with hierBAPS, a single sample can easily be added, and with centralised databases it can also deal with keeping names of clusters (STs) consistent.
However, some drawbacks are:
In practice, I found that downloading a cgMLST scheme and applying it to my own data was quite challenging due to how the gene database needed to be formatted, and to make sure all the dependencies worked (thanks to João Carriço and Mickael Silva for helping me with this, and for their chewBBACA software which made the comparison possible). MLST methods and databases have been around for longer, and so this was easier to work with. Defining and maintaining a new scheme for a species which doesn’t have one yet seems like it would be a significant undertaking, though I didn’t try this myself.
To directly compare PopPUNK and these methods, we performed MLST and cgMLST assignment on two different species with good typing schemes (L. monocytogenes and E. coli). We then calculated pairwise distances in terms of the number of allele changes, which gave a gene distance matrix rather than a core and accessory distance matrix. By using these with the PopPUNK network we could find the number of allele changes at which samples need to be connected to form similar clusters, and how good projections at various cutoffs are (see supplementary tables 6 and 7). We could make the clusters similar between PopPUNK and (cg)MLST, but only by manually testing many values of the cutoff for number of allele changes.
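The allele-change distance used here is simple to compute. The following is a minimal sketch with made-up profiles and hypothetical locus names; skipping loci missing from either profile is an assumption, and real schemes may treat missing alleles differently.

```python
from itertools import combinations

def allele_distance(profile_a, profile_b):
    """Count loci typed in both profiles whose allele numbers differ."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])

# Made-up allele profiles: locus name -> allele number.
profiles = {
    "s1": {"gene1": 1, "gene2": 4, "gene3": 2, "gene4": 7},
    "s2": {"gene1": 1, "gene2": 5, "gene3": 2, "gene4": 9},
    "s3": {"gene1": 3, "gene2": 5, "gene3": 2, "gene4": 9},
}

# Pairwise gene distance matrix, stored as a dict keyed by sample pair.
gene_distances = {
    (a, b): allele_distance(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
print(gene_distances)
```

Any number of changes within a gene counts as one difference, so a recombination event replacing a whole gene adds one to the distance, as does a single SNP.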
I don’t have lots of experience using gene-by-gene methods or analysing surveillance datasets, but from these tests I ended up concluding that PopPUNK has the following advantages over gene-by-gene methods:
--microreact option), giving further resolution and relationships within clusters.
So we ended up concluding that PopPUNK also retains the advantages of gene-by-gene approaches, and meets the criteria of Nadon et al. for a genomic surveillance scheme.
The main place we found PopPUNK’s clusters to be worse than those from RhierBAPS was for populations with limited genetic diversity, for example within an identified strain. The calculation of core and accessory distances will in theory work to any resolution (but one may need to increase the sketch size to the genome length divided by the variants per genome). But if there is no clear within-strain versus between-strain separation in the distances and instead just a cloud of points, the spatial clustering methods are not likely to converge on a good solution. Network-based model refinement is needed in this case, though it is likely to split the strain into many substrains.
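As a worked example of the sketch size rule of thumb, here is the arithmetic for a hypothetical case. The genome length and variant count are made-up numbers, and the function name is only for illustration.

```python
import math

def minimum_sketch_size(genome_length, variants_per_genome):
    """Rule of thumb from the text: genome length / variants per genome."""
    if variants_per_genome <= 0:
        raise ValueError("variants_per_genome must be positive")
    return math.ceil(genome_length / variants_per_genome)

# Hypothetical example: a 2 Mb genome with about 100 variants between samples.
sketch_size = minimum_sketch_size(2_000_000, 100)
print(sketch_size)
```

The fewer variants there are between samples, the larger the sketch has to be before any of them are likely to be sampled.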
One example of this was Neisseria gonorrhoeae, which is essentially a strain of Neisseria meningitidis (which did work using default settings). Using refinement of core distances we did get a reasonable fit, and were able to use this to find accessory elements moving within strains (we also looked at this within a well studied strain of S. pneumoniae). In Mycobacterium tuberculosis diversity was even more limited, so while the core distance based phylogeny PopPUNK produced was consistent with the lineages estimated by the first level of hierBAPS, PopPUNK’s clustering split the population into many more substrains (comparable to spoligotyping).
See the supplementary text S1 and figure S11 for a full discussion of this point.
--microreact) and you’ll get a lot more information out, and can make interactive visualisations.
PopPUNK was the result of a collaboration between many people, but I’d particularly like to thank Nick Croucher who jointly worked on the method, code and paper with me.
Trying to install PyVCF under a python (3) virtual environment gave me the following error:
(venv)johnlees@hpc:~$ pip install pyvcf
Downloading/unpacking pyvcf
Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
Running setup.py egg_info for package pyvcf
Traceback (most recent call last):
File "", line 16, in
FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 16, in
FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
The solution was to upgrade setuptools:
pip install --upgrade setuptools
I went from 0.6.34 to 36.8.0 (apparently!)