Paper summary – PopPUNK for bacterial epidemiology

A paper describing our recent method for bacterial epidemiology PopPUNK has just been published in Genome Research, which you can read here:
https://dx.doi.org/10.1101/gr.241455.118

You can install our software by running
conda install poppunk
and that full details and documentation can be found at https://poppunk.readthedocs.io

In this blog post I will attempt to describe some of our key features and findings in a shorter format. Broadly, I think there are three main parts:

  1. Core and accessory distances can be estimated using k-mer distances.
  2. In many species, finding clusters in core-accessory space gives good quality clusters of strains. The core and accessory distances from 1) provide further resolution within clusters of strains.
  3. These clusters can be expanded online as new samples are added, and their names stay consistent between studies.

I’ve also noted some of the work we added in our revision, for those that might have seen the first version as a pre-print on bioRxiv. We added more direct comparisons with phylogenies and cg/wgMLST schemes, showing that PopPUNK was preferable to wg/cgMLST, while still fulfilling the criteria desirable for an epidemiological typing system laid out by Nadon et al.

Core and accessory distances can be estimated using k-mer distances

The importance of accessory genome evolution and divergence has been increasingly recognised over the past few years. To analyse the accessory genome, one typically attempts to find clusters of orthologous genes (COGs) using roary, panX or another similar method. These methods compare all annotated genes to all others, which results in a number of comparisons which increases with the square of the number of sequences. Though efficiencies in these pieces of software keep this computation possible, for larger populations this takes a significant amount of time, especially if reruns are needed due to new samples or poorly chosen clustering thresholds.

For some downstream purposes just extracting the core and accessory distances between pairs of samples is sufficient, as information on individual COGs and annotations is not needed. We wanted to use a k-mer based approach to do this, so that we:

  • wouldn’t be reliant on gene annotations, which differ in quality between species and sample collection, and can be hard to standardise across labs and studies.
  • do not require an alignment step, which takes a long time, and again can be hard to standardise.
  • could take advantage of recent developments in efficient k-mer based software. Specifically, we used mash, which rapidly calculates distances between sequences by taking the lowest scoring subset (a sketch) of k-mers from a sequence assembly.

Noting that longer k-mers are more likely to mismatch between samples due to SNP in a shared (core) region, but that k-mers of all sizes are equally likely to mismatch between samples due to a missing accessory element (longer than the k-mer length), we were able to formulate a relation between mash distances at various k-mer lengths and core and accessory distances. Ultimately, this allows us to calculate core and accessory distances between all sequences in a population tens or hundreds of times faster than from clustering and aligning genes. In a population of 128 Listeria monocytogenes PopPUNK took about ten minutes, whereas a run of roary alone took 31 hrs. We also compared our results to this method in simulations (figure 2 in the paper) and in ten varied species (figure 3 in the paper) and found our faster estimates to be consistent with both our simulations and the real data.

We can then plot the core and accessory distances for all pairwise comparisons of samples, adding density countours where many points overlap. Here is the L. monocytogenes example:

L. monocytogenes distance distribution
Core and accessory distances between every pair of 128 L. monocytogenes samples, with density contours

This distribution is useful for a number of things, particularly clustering – the focus of the rest of the paper – but can also tell us about overall core-accessory evolution, and can be used to pick out samples which have unexpected divergence in either core or accessory content (see supplementary figure 11 for a detailed example of this).

Using core-accessory distances to define sequence clusters (strains)

A good, widely used method to define clusters of closely related sequences in a population is hierBAPS, or the recent upgrade fastbaps (both fit the same model, but the newer version is significantly faster at doing so and can also use a phylogeny to constrain the possible clusters). While this approach has many nice features such as being able to cluster recursively and being able to extract likelihoods for fits and assignments, the following limitations make it challenging to directly apply in all the places where subtyping of a population is useful for epidemiology:

  • Requires a good quality alignment as input (time-consuming).
  • Is difficult to update (every new genome requires a complete refit).
  • Difficult to keep cluster names consistent (different studies and runs will have different cluster IDs, and possibly assignments).
  • Usually forms a bin cluster of outlier sequences (putting all low frequency clusters together, rather than separating them).
  • Can be slow to fit (though perhaps no longer, with the development of fastbaps).

These drawbacks make a species-level definition of subtypes potentially challenging. Additionally, for some species (of particular interest to us was the Streptococcus genus) the solution found by optimising the BAPS likelihood does not provide great quality clusters across the tree, I think due to unmodelled recombination events and many small clusters.

This is what we set out to try and improve with PopPUNK, hoping that our fast estimation of core and accessory distances could be used for this purpose.

By finding clusters that are clearly separated in core-accessory space (using one of two standard machine-learning methods) we are able to determine a cutoff for which distances are within the same strain. Applying this to the same distances as above:

GMM fit to L. monocytogenes
Four component Gaussian mixture model fitted to pairwise distances in L. monocytogenes

The light-blue cluster closest to the origin is the within-strain cluster – distances in this cluster represent comparisons between samples in the same strain. We can then draw links between any pair of samples less than this distance apart. Linked samples form the clusters of strains in a network, the connected components (see figure 5A,B). In the network samples A and B may be greater than the cutoff distance apart, if both are close enough to a third sample C they will be in the same cluster. For most of the species we applied our method to this approach gave good clusters very quickly (table 1, figure 4). For two Streptococcal species where extensive recombination blurred the separation between the components, we needed to apply a final step to adjust the position of the boundary. Optimising properties of the network, avoiding clusters which are straggly and linked by only a few samples connected to many things, and reducing the overall number of links, then gave good results in these species without further extensive computation.

Adding new isolates in to a fitted cluster model – quickly and consistently

We found some very useful advantages to representing the clusters as a network, which solve many of the above issues:

  • Outlier sequences or small groups are correctly placed in their own clusters, rather than being grouped in a bin.
  • New isolates can be added in by calculating distances to samples already in the network (avoiding calculating everything again).
  • Removing fully connected sets of samples (cliques) removes redundancy, and further improves speed/memory/storage.
  • Cluster names are by size, and can be constrained to be the same between studies, or as a database is added to.

This means that you can download a PopPUNK database (usually 10-100Mb) and run using --assign-samples with new assemblies. This will cluster new samples within the context of an existing population, without having to redo/care about the model fit. The databases can be expanded without having to refit the model, or worry about cluster names changing (which is one of the nice features of MLST). We tested this with an emerging E. coli strain not seen in the database at the first time point in a longitudinal series, and PopPUNK was able to track its emergence and expansion (see figure 5D,E).

Comparison with gene-by-gene methods

For the second version of this paper we were asked to add in a more explicit comparison with gene-by-gene methods and phylogenies. My understanding of how MLST and cgMLST/wgMLST schemes are applied in epidemiology is:

  1. Download the MLST allele database, or upload your sample to an online database.
  2. BLAST (or another alignment/matching tool) is used to find the typing genes in the uploaded sample.
  3. These genes are compared to a list of previously seen sequences for that gene (alleles). If they exactly match they are assigned the same identifier, otherwise a new one.
  4. A sequence type (ST) is assigned to each unique combination of alleles in the typed genes.

In step 3, counting any number of changes within a gene as a single change loses some resolution, but has the advantage that it does not overcount recombination events. With a good choice of genes making up the scheme, MLST schemes have been shown to capture population structure very well. It is faster than alignment and modelling with hierBAPS, a single sample can easily be added, and with centralised databases it can also deal with keeping names of clusters (STs) consistent.

However, some drawbacks are:

  • MLST lacks resolution, and throws data from whole genome sequencing away. Extended schemes (using core genes or all genes) improve this, but still ignore intergenic regions and group any number of changes together as a single allele difference.
  • Defining allele databases is difficult and requires continual curation for people to add new alleles, and avoid duplicate genes. There are some impressive efforts to do this (e.g. enterobase, BIGSdb), but these only cover some species.
  • It is unclear how to form clusters from allele assignment. How many changes of allele should be in the same cluster? Common interpretations of a single change is far too specific for extended schemes, and seems to be a convenience-based choice rather than a biological one.

In practice, I found that downloading a cgMLST scheme and applying it to my own data was quite challenging due to how the gene database needed to be formatted, and to make sure all the dependencies worked (thanks to João Carriço and Mickael Silva for helping me with this, and for their chewBBACA software which made the comparison possible). MLST methods and databases have been around for longer, and so this was easier to work with. Defining and maintaining a new scheme for a species which doesn’t have one yet seems like it would be a significant undertaking, though I didn’t try this myself.

To directly compare PopPUNK and these methods, we performed MLST and cgMLST assignment on two different species with good typing schemes (L. monocytogenes and E. coli). We then calculated pairwise distances in terms of the number of allele changes, which gave a gene distance matrix rather than a core and accessory distance matrix. By using these with the PopPUNK network we could find how many allele changes to connect to form similar clusters, and how good projections at various cutoffs are (see supplementary tables 6 and 7). We could make the clusters similar between PopPUNK and (cg)MLST, but only by manually testing many values of the cutoff for number of allele changes.

I don’t have lots of experience using gene-by-gene methods or analysing surveillance datasets, but from these tests I ended up concluding that PopPUNK has the following advantages over gene-by-gene methods:

  • The clusters are likely to be real biological entities based on their separation in core-accessory space. With cgMLST/MLST it is unclear how many allele changes should be chosen as a cutoff, but PopPUNK optimises this.
  • The calculation of core distances allow trees to be drawn within clusters (e.g. with the --microreact option), giving further resolution and relationships within clusters.
  • PopPUNK is faster to run.
  • You don’t need curated database of genes (difficult to format/curate when it exists, more difficult to create when it doesn’t).

So we ended up concluding that PopPUNK also retains the advantages of gene-by-gene approaches, and meets the criteria of Nadon et al for a genomic surveillance scheme.

Challenging species – low diversity

The main place we found PopPUNK’s clusters to be worse than those from RhierBAPS was for populations with limited genetic diversity, for example within an identified strain. The calculation of core and accessory distances will in theory work to any resolution (but one may need to increase the sketch size to the genome length divided by the variants per genome). But if there is no clear within-strain versus between-strain separation in the distances and instead just a cloud of points, the spatial clustering methods are not likely to converge on a good solution. Network-based model refinement is needed in this case, though it is likely to split the strain into many substrains.

One example of this was Neisseria gonorrhoeae, which is essentially a strain of Neisseria meningitidis (which did work using default settings). Using refinement of core distances we did get a reasonable fit, and were able to use this to find accessory elements moving within strains (we also looked at this within a well studied strain of S. pneumoniae). In Mycobacterium tuberculosis diversity was even more limited, so while the core distance based phylogeny PopPUNK produced was consistent with the lineages estimated by the first level of hierBAPS, PopPUNK’s clustering split the population into many more substrains (comparable to spoligotyping).

See the supplementary text S1 and figure S11 for a full discussion of this point.

Recommendations for running PopPUNK

Some tips:

  • The most important part of the model fit is to uniquely select the cluster closest to the origin.
  • If you have high accessory distance points check for contamination in assemblies and remove them.
  • Run using one of the extra output options (e.g. --microreact) and you’ll get a lot more information out, and can make interactive visualisations.

See the quickstart guide for a walkthrough, the tutorial for all details on a more complex example and the troubleshooting for common issues.

Thanks

PopPUNK was the result of a collaboration between many people, but I’d particularly like to thank Nick Croucher who jointly worked on the method, code and paper with me.

Creating a conda package with compilation and dependencies

I’ve just finished, what was for me, a difficult compiler/packaging attempt – creating a working bioconda package for seer. You can look at the pull request to see how many times I failed: https://github.com/bioconda/bioconda-recipes/pull/11263

(I would note I have made this package for legacy users. I would direct anyone interested in the software itself to the reimplementation pyseer)

The reason this was difficult was due to my own inclusion of a number of packages, all of which also need compilation, further adding to the complexity. While most of them were already available in some form in anaconda it turned out there were some issues with using the defaults. I want to note some of the things I had to modify here.

gcc and glibc version

seer requires gcc >4.8 to compile, and glibc > 4.9 to run. The default compiler version in conda is 4.8. Simply add a ‘conda_build_config.yaml’ file containing:

c_compiler:
  - gcc # [linux]
cxx_compiler:
  - gxx # [linux]

Submodules

I had dlib and gzstream as submodules. If you use a git_url as the source these clone recursively, but not with a standard url in meta.yaml. I needed to do ‘git clone –recursive’ with repository and tarball it myself to include these directories in the git hub release.

gzstream

Is not available on the bioconda channels so I had to compile myself. I included this as a submodule, but rather than using the default Makefile I needed to add the conda defined compiler flags to ensure these were consistent with later steps (particularly -fPIC in CPPFLAGS).

make CXX="$CXX $CXXFLAGS" CPPFLAGS="$CPPFLAGS -I. -I$PREFIX/include" AR="$AR cr"

boost

I was attempting to link boost_program_options using either the boost or boost-cpp anaconda packages, which unlike most boost libraries requires compiling. This led to undefined symbols at the linking stage, which I think are due to incompatible compiler (options) used to make the dynamic libraries in the versions currently available on anaconda. This turned out to be the most difficult thing to fix, requiring me to compile boost as part of the recipe.

Rather than downloading and compiling everything, I followed the boost github examples and made a shallow clone, with a fully copy of the boost library I’m using:

git clone --depth 1 https://github.com/boostorg/boost.git boost
rmdir libs/program_options
cd boost
git clone --depth 50 https://github.com/boostorg/program_options.git libs/program_options
git submodule update -q --init tools/boostdep
git submodule update -q --init tools/build
git submodule update -q --init tools/inspect

I then included this in the release tarball. A better way may be to use submodules so this is done automatically using –recursive.

This library needed to be built, but I did so in a work directory to avoid installing unexpected packages with the recipe. Following the conda-forge build.sh for boost-cpp:


pushd boost
python2 tools/boostdep/depinst/depinst.py program_options --include example
CXXFLAGS="${CXXFLAGS} -fPIC"
INCLUDE_PATH="${PREFIX}/include"
LIBRARY_PATH="${PREFIX}/lib"
LINKFLAGS="${LINKFLAGS} -L${LIBRARY_PATH}"
cat < tools/build/src/site-config.jam
using gcc : custom : ${CXX} ;
EOF
BOOST_BUILT="${SRC_DIR}/boost_built"
mkdir $BOOST_BUILT
./bootstrap.sh --prefix="${BOOST_BUILT}" --with-libraries=program_options --with-toolset=gcc
./b2 -q \
variant=release \
address-model="${ARCH}" \
architecture=x86 \
debug-symbols=off \
threading=multi \
runtime-link=shared \
link=static,shared \
toolset=gcc-custom \
include="${INCLUDE_PATH}" \
cxxflags="${CXXFLAGS}" \
linkflags="${LINKFLAGS}" \
--layout=system \
-j"${CPU_COUNT}" \
install
popd

The python2 line sorts out the header libraries required to compile, not included in the shallow clone. The rest are standard methods to install boost, ensuring the same compiler flags as the other compiled code and using the conda compiler.

I then needed to link this boost library statically (leaving the rest dynamic), so modified the make line as follows:

make CXXFLAGS="$CXXFLAGS" \
  SEER_LDLIBS="-L../gzstream -L${BOOST_BUILT}/lib -L/usr/local/hdf5/lib \
  -lhdf5 -lgzstream -lz -larmadillo -Wl,-Bstatic -lboost_program_options \
  -Wl,-Bdynamic -lopenblas -llapack -lpthread"

armadillo

The final trick was linking armadillo correctly. Confusingly it built and linked ok, tested ok locally, but on the bioconda CI I got undefined symbols to lapack at runtime:


seer: symbol lookup error: seer: undefined symbol: wrapper_dgbsv_

This was due to armadillo’s wrapper around its include which links in the versions of blas/openblas and lapack defined at the time it was compiled, which I think must be slightly different from what is now included with the armadillo package dependencies on conda. Easy enough to fix, use a compiler flag to turn the wrapper off and link the libraries manually:

make CPPFLAGS="${CPPFLAGS} -DARMA_DONT_USE_WRAPPER" \
  LDFLAGS="${LDFLAGS} -larmadillo -lopenblas -llapack"

After all of that, it finally worked!

conda build: libarchive: cannot open shared object file: No such file or directory

I was getting the following error, attempting to run conda-build on a package, using a conda env:

Traceback (most recent call last):
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/bin/conda-build", line 7, in <module>
from conda_build.cli.main_build import main
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 18, in <module>
import conda_build.api as api
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/api.py", line 22, in <module>
from conda_build.config import Config, get_or_merge_config, DEFAULT_PREFIX_LENGTH as _prefix_length
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/config.py", line 17, in <module>
from .variants import get_default_variant
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/variants.py", line 15, in <module>
from conda_build.utils import ensure_list, trim_empty_keys, get_logger
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/utils.py", line 10, in <module>
import libarchive
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/__init__.py", line 1, in <module>
from .entry import ArchiveEntry
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/entry.py", line 6, in <module>
from . import ffi
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/ffi.py", line 27, in <module>
libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
return self._dlltype(name)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive: cannot open shared object file: No such file or directory

The /lib directory does contain libarchive, both as a dynamic (.so) and static (.a) library. There turned out to be two relevant environment variables:

LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive
_CONDA_SET_LIBARCHIVE=LIBARCHIVE

A workaround is then to run

export LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive.so

There is probably a proper reason why this has happened and a permanent solution to the issue, but this works for now.

Linear scaling of covariances

In our software PopPUNK we draw a plot of a Gaussian mixture model that uses both the implementation and the excellent example in the scikit-learn documentation:

GMM ellipse example
Gaussian mixture model with mixture components plotted as ellipses

My input is 2D distance, which I first use StandardScaler to normalise each axes between 0 and 1, which helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is generally trivial to convert back into the original co-ordinates:

# scale X
scale = np.amax(X, axis = 0)
scaled_X = X / scale
# plot scaled
plt.scatter([(scaled_X)[Y == i, 0]], [(scaled_X)[Y == i, 1]], .4, color=color)
# plot original
plt.scatter([(scaled_X*scale)[Y == i, 0]], [(scaled_X*scale)[Y == i, 1]], .4, color=color)

The only thing that wasn’t obvious was how to scale the covariances, which are used to draw the ellipses. I knew they needed to be multiplied by the scale twice as they are variances (squared), and had a vague memory of something like xAx^T from transforming ellipses/conic sections in one of my first year maths courses. Happily that was enough to do a search, turning up an explanation on stackoverflow which confirmed this vague memory: https://math.stackexchange.com/questions/1184523/scaling-a-covariance-matrix

Here is the code for making the plot and incorporating a linear scaling

color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold','darkorange'])

fig=plt.figure(figsize=(11, 8), dpi= 160, facecolor='w', edgecolor='k')
splot = plt.subplot(1, 1, 1)
for i, (mean, covar, color) in enumerate(zip(means, covariances, color_iter)):
    scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T)
    v, w = np.linalg.eigh(scaled_covar)
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    u = w[0] / np.linalg.norm(w[0])
    # as the DP will not use every component it has access to
    # unless it needs it, we shouldn't plot the redundant
    # components.
    if not np.any(Y == i):
        continue
    plt.scatter([(X)[Y == i, 0]], [(X)[Y == i, 1]], .4, color=color)

# Plot an ellipse to show the Gaussian component
    angle = np.arctan(u[1] / u[0])
    angle = 180. * angle / np.pi # convert to degrees
    ell = mpl.patches.Ellipse(mean*scale, v[0], v[1], 180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(0.5)
    splot.add_artist(ell)

plt.title(title)
plt.savefig("ellipses.png")
plt.close()

Expression modules in S. pneumoniae

I recently read a pre-print from the Veening lab where they had reconstructed various (22 total) physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested: https://doi.org/10.1101/283739

They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on zenodo.

I was interested in trying to replicate their expression module analysis, which was performed using WGCNA. I first downloaded the co-expression matrix, which is split across five CSV files in the zenodo archive. I then basically ran WGCNA exactly as suggested in their tutorial. I then used the annotations from the lab’s D39V genome to get final expression modules.

The expression modules look sensible to me, with the comE master regulator splitting the expression of the competence pathway and associated genes as expected. The modules I predicted can be found here.

Thanks to the authors for making their data so accessible! If you find this useful I would also note the authors also analysed their data with WGCNA, probably in a more optimal and error-checked way than I have, so most likely they have better clusters.

Here is a gist of the code I used:

Installing phyx without sudo

I saw this phylogenetics package today, phyx: https://github.com/FePhyFoFum/phyx

To install without admin rights/sudo I needed to do the following (my software is installed in my home ~/software, rather than e.g. /usr, /usr/local):

Compile armadillo as follows

cmake -DINSTALL_PREFIX=$(HOME)/software
make
make install

Compile nlopt as follows

./configure --with-cxx --without-octave --without-matlab --prefix=$(HOME)/software
make
make install

Compile phyx as follows (slightly hacky, maybe there’s a ‘proper’ way)

./configure --prefix=$(HOME)/software

change line 11 of the Makefile (CPP_LIBS) to add the library path:

CPP_LIBS = -L$(HOME)/software/lib -llapack -lblas -lpthread -lm

change line 23 of the Makefile (OPT_FLAGS) to add the include path/CPPFLAGS:

OPT_FLAGS := -O3 -std=c++11 -fopenmp -I$(HOME)/software/include

then you can run

make 
make install

 

 

 

setup.py not found using pip install

Trying to install PyVCF under a python (3) virtual environment gave me the following error:

(venv)johnlees@hpc:~$ pip install pyvcf
 Downloading/unpacking pyvcf
 Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
 Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
 Running setup.py egg_info for package pyvcf
 Traceback (most recent call last):
 File "", line 16, in 
 FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):

File "", line 16, in

FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'

The solution was to upgrade setuptools:

pip install --upgrade setuptools

I went from 0.6.34 to 36.8.0 (apparently!)

Firth regression in python

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates when data in logistic regression is separable or close to separable (when a chi-squared contingency table has small entries).

I found that although there is an R implementation logistf I couldn’t find an equivalent in another language, or python’s statsmodels. Here is a gist with my python functions and a skeleton of how to use them and calculate p-values, in case anyone would like to use this in future without having to write the optimiser themselves.

Running BSLMM in gemma

In GWAS the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the LMM, which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression which finds a few SNPs with non-zero effect sizes.

In their paper on this model Zhou et al show that this hybrid method can have better prediction accuracy than either individual model on its own (which are special cases in their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.

They implement this in their software gemma, which I (and many others) have used to perform GWAS using a LMM. It’s easy to use once you’ve got the data formatted, though memory usage is quite high (>50Gb for N =~ 5000; M =~ 600 000. Perhaps I should be LD-pruning?)

gemma -bfile data_in -bslmm 1 -o data_out_bslmm_1

However, when running this command I got one of two errors at either the Eigen-decomposition stage, or the UtX stage. ‘Segmentation fault’ or ‘Invalid instruction’. I’ve used the gemma static executable (the default download) before without any issues, so it looked to me like it was probably something with the linear algebra libraries.

Downloading the source, it turns out gemma relies on blas, lapack, gsl and atlas (and optionally arpack). When I tried to compile using the default Makefile I got the following errors at the linking stage:

/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
:
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `signbitq'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `determine_en
_precision':
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1213: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
:
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `isnanq'
collect2: error: ld returned 1 exit status
make: *** [bin/gemma] Error 1

Fortunately these scary errors were familiar to me from wrangling with statically compiling seer. To fix this, compile gemma from source and run the BSLMM without errors add -lquadmath to the following line in the Makefile

LIBS_LNX_S_LAPACK = /usr/lib/lapack/liblapack.a -lgfortran -lquadmath /usr/lib/atlas-base/libatlas.a /usr/lib/libblas/libblas.a -Wl,--allow-multiple-definition

Happy modelling

Tanglegrams can be misleading

Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the sample set. Tanglegrams work by connecting the matching tips of the trees, then rotating subtrees to minimise the number of crossings. The algorithm was published in 2011, and continues to be used in a range of publications (for example genomic epidemiology).

I recently ran into an issue using tanglegrams which I’d like to document here. I constructed phylogenies from the protein alignment of all the alleles of PspA identified in a carriage collection of Streptococcus pneumoniae using RAxML (a maximum likelihood method) and mrbayes (a Bayesian method which produces a posterior of trees). For the latter, one can use the R package treescape to find the median tree in the posterior distribution. I wanted to compare the shape of these two trees to see if there were any obvious areas of uncertainty in the topology (a perhaps more quantitative way to do this might be by looking at bootstrap values). I made a tanglegram, from which my initial interpretation was trees were very similar:

tanglegram_unlabelled-01
pspA allele trees. Tanglegram between maximum likelihood tree and median from the Bayesian posterior.

There is only one clade of swaps (near the bottom), so the trees must be similar? Not necessarily. A game I like to play is manually rotating subtrees to make metadata cluster. Though the topology remains identical, visually the clustering of metadata can change drastically. Even though we know vertical distance is meaningless in a phylogeny, the way this is usually visually displayed can lead us to draw erroneous conclusions.

combined_flips
Rotating subtrees. Left: nice well defined clades clustering together. Right: same tree topology and labels, but the clades no longer form single clusters. Clusters 1 and 4 are polyphyletic, and rotating subtrees can break up or combine these vertically.

Let’s look at the tanglegram above again in more detail:

tanglegram_phylogram
Misleading tanglegram? This emphasises topology changes in more recent parts of the phylogeny (blue) and may be misleading about similarity of ancestral branches (red).

The swap highlighted in blue is the only event where tip lines cross. This correctly leads us to identify that pspA-31 and pspA-37 are in slightly different places in this part of the phylogeny. However look in the red box. Those these strains totally align, the placement of them within the phylogeny is quite different (red arrows)! In this case I don’t think the line crossings by the tanglegram have identified what I wanted to see here, which is that the ancestral topology of these two trees is different.

While I am not claiming tanglegrams are not useful, without careful attention they may be misleading. A good alternative to tanglegrams, in my opinion, is phylo.io. This allows you to dynamically compare two tree topologies, and see how each branch flip affects the arrangement rather than providing just one static comparison. If you want to try this out on the trees above the two newick files are attached at the end of this post to copy in. Finally, I would also note that there are various metrics which can provide a quantitative comparison between (many) trees, my favourite of which is the Kendall-Colijin metric implemented in treescape.

Update: Michelle Kendall (an author of both the KC metric and treescape) pointed out on twitter that they have added the function plotTreeDiff() as another way to look at topology changes. I gave this a go:

plot_tree_diff
plotTreeDiff result. The two trees, with tips coloured based on how many differences in recent common ancestors there are with other tips.

From the manual:
A plot of the two trees side by side. Tips are coloured in the following way:

  • if each ancestor of a tip in tree 1 occurs in tree 2 with the same partition of tip descendants, then the tip is coloured grey (or supplied “baseCol”)
  • if not, the tip gets coloured pale orange to red on a scale according to how many differences there are amongst its most recent common ancestors with other tips. The colour spectrum can be changed according to preference.’

You can use tipDiff() to see the numeric results direcly.

Trees:

((((((pspA-15:0.7542002,(pspA-20:0.1224276,pspA-17:0.01459105):0.0136343):0.06696656,pspA-31:0.02844112):0.07110159,(pspA-38:0.0188037,((((pspA-1:0.07786275,pspA-33:0.08346416):0.03494853,pspA-5:0.1691264):0.02173634,pspA-37:0.0787912):0.009598778,pspA-25:0.07136204):0.03492139):0.02958377):0.2273015,(((pspA-7:0.09993183,pspA-21:0.1386982):0.05063578,(pspA-4:0.1303186,((pspA-27:0.2173186,pspA-8:0.006066717):0.07263402,pspA-36:0.07059364):0.119125):0.03385501):0.01120214,pspA-24:0.09906705):0.09046543):0.1139506,(pspA-19:0.04005775,(pspA-32:0.04931893,pspA-10:0.08485811):0.02024913):0.9919588):0.167318775,((((((pspA-3:0.1203388,pspA-0:0.09452877):0.04488296,pspA-30:0.1090477):0.03244832,((pspA-6:0.06727758,((pspA-29:0.04877083,pspA-14:0.07255919):0.1581595,pspA-2:0.1249573):0.0518224):0.05925152,pspA-18:0.05166722):0.06424128):0.01986727,(pspA-12:0.1279069,pspA-26:0.1232956):0.08339421):0.05060685,(pspA-22:0.04905987,(pspA-9:0.1901771,((pspA-16:0.04409937,(pspA-35:0.06477806,pspA-13:0.2830619):0.07727303):0.04340311,(pspA-11:0.09526805,pspA-28:0.2083244):0.07013518):0.07074651):0.05979277):0.02315286):0.03699984,(pspA-34:0.2528056,pspA-23:1.222536):0.1333402):0.044963025);

(((pspA-5:0.16557530729568833983,((pspA-1:0.04756995109125589788,pspA-33:0.10915366994507740006):0.04785917480955029918,(((pspA-7:0.10230120656769127463,pspA-21:0.14095901783884545733):0.02506804670603787408,(((pspA-36:0.06861655800623776835,(pspA-8:0.00442478384355664365,pspA-27:0.22142270295956195669):0.09444585738499523819):0.08879457852006654439,pspA-4:0.10604010012498631121):0.04672183194168183507,pspA-24:0.07323682699184425049):0.01199257548493264970):0.10529470578563034089,((pspA-34:0.38604601286039375019,(((pspA-12:0.12639557827350531016,pspA-26:0.12595340725708714658):0.11169806126353225284,((pspA-18:0.04075294523735660535,(pspA-6:0.05097275166368370192,((pspA-14:0.08245346709560481824,pspA-29:0.05470945144269170890):0.16281232925454419691,pspA-2:0.15070678273765572563):0.03938700493100827371):0.05203089575251463456):0.06302461915576125506,(pspA-30:0.09148296281857748458,(pspA-0:0.10463631384718696804,pspA-3:0.11382571714494262027):0.04249311451680449353):0.03340348219787352829):0.01489076329975322355):0.04853549722852742304,(pspA-22:0.04696709784392495701,((pspA-11:0.07535697579734597362,pspA-28:0.25243204747960418244):0.04403388452890973775,(pspA-9:0.18996034668835029557,(pspA-16:0.06144862376464669401,(pspA-13:0.27683177216876692084,pspA-35:0.07992819540786025301):0.05138818922505713344):0.04831192529255548540):0.03498701951880421601):0.09272730151798472265):0.04073871346767179297):0.01853343949682505903):0.15648667630311274834,((pspA-19:0.03232176246079743881,(pspA-10:0.09060060658061755423,pspA-32:0.05569822158278419505):0.02219077947076726620):0.99299057857468620014,pspA-23:1.51276427797469659176):0.15413589497708032883):0.15939898234062221949):0.23660076759918716172):0.00461861925303937333):0.03002044039766965655,pspA-31:0.09022645321942507346):0.02410560069003805581,(((pspA-15:0.72495676640385564582,(pspA-17:0.08537518563518509129,pspA-20:0.05450107613074649943):0.03312225101520550885):0.02823410020886808411,pspA-37:0.04303607734486421255):0.05520151224297501630,pspA-25:0.04043049259731745781):0.05152615294739129603,pspA-38:0.00917388377063698898):0.0;