My research focuses on computational analyses of large microbial genome datasets, both to increase our understanding of infectious diseases and to develop the efficient, easy-to-use methods these analyses require. Much of my effort has focussed on Streptococcus pneumoniae, due to its importance as a global pathogen and the recent availability of large genomic datasets containing over 35,000 whole genomes.

I am particularly passionate about developing open source tools in bioinformatics, and widening access to genomic analysis of publicly funded data.

Bacterial statistical genetics methods and applications

I have developed methods to perform genome-wide association studies (GWAS) in bacterial populations, and regularly update the pyseer package to include newer methods where I can.

In previous work I have used these methods to investigate the genetic basis for antimicrobial resistance, carriage duration and virulence in Streptococcus pneumoniae, and separately for virulence in Streptococcus pyogenes and Listeria monocytogenes.

Newer developments include using genome-wide models rather than univariate approaches, and using machine learning to predict traits such as resistance and virulence.

Bacterial genomic epidemiology

Our main contribution to this area is the PopPUNK package, for bacterial sequence clustering and epidemiology. By removing the computationally complex requirements to produce these results from sequence data, we hope to engage with a wider range of users interested in disease surveillance.

The broad advantages of this approach are that it can define clusters reproducibly and quickly, with reference to larger populations. It retains the major advantages of MLST/cgMLST/wgMLST approaches (common cluster names, fast and easy to use, comparison against large databases of other samples, no need to completely recluster for every new genome) while improving on some of their drawbacks (no need for gene annotation, uses the entire genome, biologically motivated cutoff, no need to define a schema).
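As a rough illustration of the "no need to completely recluster for every new genome" idea, here is a minimal sketch in Python. It is not PopPUNK's implementation or API: the `ClusterIndex` class, the Jaccard distance on toy feature sets, and the fixed `cutoff` are all assumptions made for the example. Each new genome is compared only against genomes already stored, and clusters are the connected groups of genomes linked by a distance at or below the cutoff.

```python
class ClusterIndex:
    """Toy incremental clustering: genomes within `cutoff` of each other are linked."""

    def __init__(self, cutoff):
        self.cutoff = cutoff  # hypothetical fixed distance threshold
        self.parent = {}      # union-find parent pointers, keyed by genome name
        self.genomes = {}     # genome name -> feature set (stand-in for a sequence sketch)

    def _find(self, name):
        # Follow parent pointers to the cluster representative, shortening the path.
        while self.parent[name] != name:
            self.parent[name] = self.parent[self.parent[name]]
            name = self.parent[name]
        return name

    @staticmethod
    def distance(a, b):
        # Jaccard distance between two feature sets (an assumed, simplified measure).
        union = a | b
        if not union:
            return 0.0
        return 1.0 - len(a & b) / len(union)

    def add(self, name, features):
        """Add one genome, comparing it only against genomes already in the index."""
        self.parent[name] = name
        for other, other_features in self.genomes.items():
            if self.distance(features, other_features) <= self.cutoff:
                # Link the two clusters; this can also merge existing clusters.
                self.parent[self._find(name)] = self._find(other)
        self.genomes[name] = features

    def clusters(self):
        """Return the current clusters as sorted lists of genome names."""
        groups = {}
        for name in self.genomes:
            groups.setdefault(self._find(name), set()).add(name)
        return sorted(sorted(group) for group in groups.values())
```

Calling `add` for a new genome costs one comparison per stored genome, and existing assignments only change if the new genome bridges two clusters. Stable cluster naming, which the real method provides, is deliberately left out of this sketch.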

Ongoing work is further improving the speed, modularity, and breadth of application of this method.

We are looking to run this method on multiple pathogens to create ‘reference databases’ as we have done for Streptococcus pneumoniae and Streptococcus pyogenes. If you have some data that would be suitable for a new species, or know of some good publicly available data, please get in touch!

I also once built a tree of tree building methods, which I think is when I peaked.

Modelling bacterial evolution

Some of my favourite projects in this area have been where we were able to combine genomics and modelling approaches with in vivo experimental data. These have included a stochastic competition model fitted with likelihood-free inference to data from a mouse model of competition; showing immune exclusion of pilus using both sputum samples and genomic data; and the sequential acquisition of virulence elements in Staphylococcus aureus.
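For readers unfamiliar with likelihood-free inference, the general idea can be sketched with approximate Bayesian computation (ABC) by rejection sampling. This is a toy example, not the model or fitting procedure used in the work above: the two-strain model, the `advantage` parameter, the uniform prior, and the tolerance are all assumptions chosen to keep the example short. Parameters are drawn from the prior, the stochastic model is simulated, and a draw is kept only when the simulated data land close to the observed data.

```python
import random


def simulate_competition(advantage, n_hosts, rng):
    """Count hosts in which strain A outcompetes strain B.

    Toy model: each host is an independent trial that A wins with
    probability `advantage`.
    """
    return sum(rng.random() < advantage for _ in range(n_hosts))


def abc_rejection(observed_wins, n_hosts, n_draws=20000, tolerance=1, seed=1):
    """Approximate the posterior of `advantage` by ABC rejection sampling."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        candidate = rng.random()  # draw from a Uniform(0, 1) prior
        simulated = simulate_competition(candidate, n_hosts, rng)
        # Keep the draw only if the simulated summary is close to the data.
        if abs(simulated - observed_wins) <= tolerance:
            accepted.append(candidate)
    return accepted


# Hypothetical data: strain A won in 15 of 20 hosts.
posterior = abc_rejection(observed_wins=15, n_hosts=20)
posterior_mean = sum(posterior) / len(posterior)
```

No likelihood is ever written down; the simulator stands in for it. That is what makes the approach practical for stochastic models whose likelihood is hard or impossible to compute.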

Other than where this overlaps with the statistical genetics approaches above, I have also used purely genomic data to look at within-host variation during meningitis. This led to an unfortunately titled paper (!) which probably hid the most interesting result: signs that deactivation of dlt (responsible for D-alanylation and increased inflammation and immunity to antimicrobial peptides) is adaptive.

Ongoing work is looking at the genetic basis for variance in transmissibility, and whether this can be used to identify novel immunogens.

Software development and making access to genomic data more democratic

One of the things I most enjoy about my job is trying to make things that other people find useful too! It’s great that people find our software of use, and it’s still always exciting to get messages from users all around the world. I want to maintain and improve our software, and help people use it as much as I can. (After all, one of the great things about being funded by public and charity money is that everything I work on goes into the public domain, rather than being kept closed-source.)

I try my best to follow good software development practices, though I am always looking to discuss ways I can become better at this.

A new project aims to broaden access to genomic analysis, or at least offer a new approach (many great tools exist already!).