John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Using unitigs for bacterial GWAS with pyseer

This post briefly explains how you can now use unitigs (nodes of sequence in a compressed de Bruijn graph, enumerated using DBGWAS) in the pyseer software. Broadly, this has the following advantages over k-mer based association: (1) Lower computational burden: fewer resources are used in counting the unitigs, and fewer unitigs need to have their association tested. (2) Lower multiple testing burden, as unitigs reduce the redundancy present in k-mer counting. (3) Easier interpretation: unitigs are usually longer than k-mers, and further context (surrounding sequence) can be analysed by using the graph structure they come from.
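To make the redundancy point concrete, here is a toy sketch of how overlapping k-mers collapse into unitigs. It is an illustration only, not how DBGWAS or pyseer work internally: it builds a simple directed de Bruijn graph, ignores reverse complements (which real tools handle by using canonical k-mers), and the function names and the two short sequences are made up for the example.

```python
def kmers_of(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compact_unitigs(kmers):
    """Merge k-mers along non-branching paths of a directed de Bruijn graph.

    Simplification: strands are ignored, so no reverse complements.
    """
    kmers = set(kmers)

    def successors(kmer):
        return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in kmers]

    def predecessors(kmer):
        return [c + kmer[:-1] for c in "ACGT" if c + kmer[:-1] in kmers]

    def starts_unitig(kmer):
        # A k-mer opens a unitig unless it has exactly one predecessor
        # and that predecessor has no other successor.
        preds = predecessors(kmer)
        return len(preds) != 1 or len(successors(preds[0])) != 1

    seen = set()

    def walk(start):
        # Extend the path while it neither branches nor merges.
        path, current = start, start
        seen.add(start)
        while True:
            nxt = successors(current)
            if len(nxt) != 1 or len(predecessors(nxt[0])) != 1 or nxt[0] in seen:
                return path
            current = nxt[0]
            seen.add(current)
            path += current[-1]

    unitigs = [walk(kmer) for kmer in sorted(kmers) if starts_unitig(kmer)]
    # Any k-mers left over sit on isolated cycles; emit each cycle once.
    unitigs += [walk(kmer) for kmer in sorted(kmers) if kmer not in seen]
    return unitigs


k = 4
sample_a = "ACGTTGCATC"
sample_b = "ACGTAGCATC"  # hypothetical sequence differing from sample_a by one SNP
all_kmers = kmers_of(sample_a, k) | kmers_of(sample_b, k)
unitigs = compact_unitigs(all_kmers)
print(len(all_kmers), "k-mers ->", len(unitigs), "unitigs")  # 11 k-mers -> 4 unitigs
```

In this toy case the 11 k-mers reduce to 4 unitigs: the shared flanks ACGT and GCATC, plus one unitig per allele (CGTTGCA and CGTAGCA). Only the two allele-specific unitigs vary between the samples, so there are fewer, longer sequences to test than with the raw k-mers.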

Paper summary – PopPUNK for bacterial epidemiology

A paper describing PopPUNK, our recent method for bacterial epidemiology, has just been published in Genome Research, which you can read here: https://dx.doi.org/10.1101/gr.241455.118 You can install our software by running conda install poppunk; full details and documentation can be found at https://poppunk.readthedocs.io In this blog post I will attempt to describe some of our key features and findings in a shorter format. Broadly, I think there are three main parts:

Creating a conda package with compilation and dependencies

I’ve just finished what was, for me, a difficult compiler/packaging attempt: creating a working bioconda package for seer. You can look at the pull request to see how many times I failed: https://github.com/bioconda/bioconda-recipes/pull/11263 (I would note that I made this package for legacy users. I would direct anyone interested in the software itself to the reimplementation, pyseer.) This was difficult because of my own inclusion of a number of packages, all of which also need compilation, further adding to the complexity.

conda build: libarchive: cannot open shared object file: No such file or directory

I was getting the following error, attempting to run conda-build on a package, using a conda env:

Traceback (most recent call last):
  File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/bin/conda-build", line 7, in <module>
    from conda_build.cli.main_build import main
  File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 18, in <module>
    import conda_build.api as api
  File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/api.py", line 22, in <module>
    from conda_build.config import Config, get_or_merge_config, DEFAULT_PREFIX_LENGTH as _prefix_length
  File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/config.py", line 17, in <module>
    from .variants import get_default_variant
  File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/variants.py", line 15, in <module>
    from conda_build.

Linear scaling of covariances

In our software PopPUNK we draw a plot of a Gaussian mixture model that uses both the implementation and the excellent example in the scikit-learn documentation.

[Figure: Gaussian mixture model with mixture components plotted as ellipses]

My input is 2D distances, which I first rescale with StandardScaler so that each axis is standardised (by default to zero mean and unit variance, rather than to a fixed 0 to 1 range). This helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is generally trivial to convert back into the original co-ordinates:
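As a minimal sketch of that conversion (not the PopPUNK code itself), assume each axis was scaled as x_scaled = (x - offset) / scale. A component mean then maps back as mean * scale + offset, and each covariance entry (i, j) is multiplied by scale[i] * scale[j]. The function name and the numbers below are hypothetical; with a fitted scikit-learn StandardScaler, the offset and scale would come from its mean_ and scale_ attributes.

```python
def unscale_gaussian(mean_scaled, cov_scaled, offset, scale):
    """Map a Gaussian fitted in scaled space back to original co-ordinates.

    Assumes x_scaled[i] = (x[i] - offset[i]) / scale[i] on every axis i.
    """
    n = len(mean_scaled)
    # Means transform like points: undo the scaling, then the shift.
    mean = [mean_scaled[i] * scale[i] + offset[i] for i in range(n)]
    # Covariances ignore the shift and pick up one scale factor per index.
    cov = [
        [cov_scaled[i][j] * scale[i] * scale[j] for j in range(n)]
        for i in range(n)
    ]
    return mean, cov


# Hypothetical values for a single 2D mixture component.
offset = [0.5, 0.25]
scale = [2.0, 0.5]
mean_scaled = [1.0, -1.0]
cov_scaled = [[1.0, 0.5], [0.5, 2.0]]

mean, cov = unscale_gaussian(mean_scaled, cov_scaled, offset, scale)
print(mean)  # [2.5, -0.25]
print(cov)   # [[4.0, 0.5], [0.5, 0.5]]
```

The off-diagonal terms are the part that is easy to get wrong: they scale by the product of the two axis scales, not by either one alone, so the orientation of the plotted ellipses changes when the axes are scaled by different amounts.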

Expression modules in S. pneumoniae

I recently read a pre-print from the Veening lab in which they reconstructed 22 physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested: https://doi.org/10.1101/283739 They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on Zenodo.