Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

This a summary of our paper on a joint pathogen and human GWAS that has just been published in Nature Communications:
//doi.org/10.1038/s41467-019-09976-3

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!

Overview

We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.

This allows us to look for the affect of any genetic variant across the species (without needing a prior hypothesis of an interesting gene), in the natural population (humans not mice, and natural variation not lab strains). I hope this generates some useful hypotheses for further testing in more controlled experiments, for example using isogenic mutants in a relevant model organism.

I think these were our key findings:

    1. Bacterial genetics did not seem to affect disease severity, but was a big factor in whether invasive disease would be caused after carriage.
    2. Pneumococcal serotype is important in determining invasiveness, but other genetic factors are also involved.
    3. We found some candidate genetic factors other than serotype, including some previously implicated proteins, and some new ones.
    4. Human genetics however affected both disease severity and susceptibility, and we found a couple of specific candidate loci.
    5. We performed an interaction (genome-to-genome) analysis, but were limited in power due to small number of samples (N = 460) versus large number of possible interactions (~1010).

If you’re interested in the biological ramifications of any of these points I’d encourage you to have a look through the paper. I was particularly interested by 1) which seems to rule out pathogen genotype as a diagnostic of disease severity, and 2) and 3) which suggest that targeting only serotype is maybe not the optimal way to a) define pneumococci or b) design vaccines which stop invasive disease.

Technical summary

I’ll also try and describe some of the more technical aspects here, which were interesting to me, but ended up being only briefly covered in the paper.

Bacterial GWAS (pGWAS)

We tried to get as much as we could out of the whole-genome sequence data here, calling as many genetic features as possible for later use in the association:

  • SNPs, genes and k-mers. As have been used in bacterial GWAS studies before.
  • INDELs. Can be less well called by some methods. Here I just looked for short INDELs which might result in frameshifts.
  • Antigen alleles. Noted in various papers before,(e.g. //dx.doi.org/10.1073/pnas.1613937114 and //dx.doi.org/10.1093/infdis/jiw628), these might not get well tagged by the SNPs. I built a new machine-learning classifier for three genes (pspC, pspA, zmpA), but it’s probably quite similar to other methods (and was relatively labour and computationally intensive).
  • Copy number variants (including large deletions).
  • A phase-variable inverting locus (ivr). This requires using the read pair information due to repeats breaking up assembly and mapping.

We then fed these into some association models:

  • SEER/pyseer. Used to try and find individual genetic associations with meningitis. We found the mixed effects model was required to get anything sensible, as phenotype was strongly correlated with genetic background.
  • limix. Used to obtain heritability estimates, including partitioning by genome segment.
  • A hierarchical Bayesian model was used for the ivr, as it contains variation within each sample.

Ultimately, only the use of SNPs, INDELs and k-mers led to significant findings. The former were also useful for the heritability analysis with limix. This was also the first time I tried to include association of rare variation (aggregating through gene and operon burden tests). This seemed to be one of the more successful approaches, perhaps due to lower confounding with background, and I would certainly recommend it to anyone trying a bacterial GWAS in future (and you can easily do it in pyseer!). It definitely benefited from the calling of short INDELs.

One further point is that some of the strongest associations we found were to transposons and BOX elements. While BOX elements have been associated with gene expression and competence, none of the transposons had any obvious biological effect (no cargo, nothing consistently interrupted, as far as we could tell). I am not sure whether these associations are a biological effect which we may have ignored, or if this is an artefact of the association model, where the p-value is elevated by these elements apparently occurring independently multiple times on different genetic backgrounds, with uneven case and control numbers. I’d like to investigate this further at some point.

Human GWAS (hGWAS)

What a pleasure it was to work with human genotypes and GWAS methods! It was really noticeable how smooth this analysis was, and how it got easier over the years we added cohorts. I think the main reasons for this compared to pGWAS are:

  • Many long standing methods and software packages which directly work with your data, no need for format conversion (e.g. plink, bolt-lmm, locuszoom, LDAK).
  • Excellent web-services and databases which help enrich the interpretation of your findings (e.g. the imputation servers, //www.targetvalidation.org/, GTEx).
  • It’s easier than bacterial GWAS. The well designed genotyping arrays with SNPs that cover most the common variation (after imputation) and lack of strong population structure/fast LD decay definitely made a big difference.

This didn’t make it into the final paper, but I also looked into some models of effect sizes with bayesR, which suggested as well as a few high effect loci (oligogenic), much of the heritability explained is from smaller effect loci (polygenic). So collecting more samples would certainly be useful, though this is hard due to the low incidence.

Have a look at the Data Availability section if you want to use our GWAS summary statistics in any future meta-analysis.

Thanks

It’s been a long time in the works mostly due to the time taken to accumulate all of this data: meningitis is a fairly rare disease, with only about 200 cases each year reported in the Netherlands. As new patients were enrolled and genotypes I ended up repeating the analysis (which got easier each time, as human GWAS methods became more and more impressive and easy to use). We also put a lot of effort into expanding the dataset – we were eventually able to gather three separate cohorts of human genotypes (from the Netherlands, Denmark and the UK Biobank) and two cohorts with bacterial sequences (from the Netherlands and South Africa). Consequently we have a nice long author list, with almost as many affiliations as names. I am particularly grateful to all of these authors who each contributed data, analysis and their time.

Particular thanks to Bart Ferwerda, Philip Kremer and Nicole Wheeler who worked with me on much of the analysis; Thomas Benfield, Anne von Gottberg, Arie van der Ende and Mattijs Brouwer who set up and ran the largest cohorts included; and Jeff Barrett, Stephen Bentley and Diederik van de Beek who supervised the work. Also to the reviewers, who provided lots of useful suggestions for improving the paper (you can see the reviews here).

Using unitigs for bacterial GWAS with pyseer

This post briefly explains how you can now use unitigs, nodes of sequence in a compressed de Bruijn graph enumerated using DBGWAS, in the pyseer software. Broadly, this has the following advantages over k-mer based association:

  • Computational burden: fewer resources used in counting the unitigs, and fewer unitigs that need to have their association tested.
  • Lower multiple testing burden, as unitigs reduce redundancy present in k-mer counting.
  • Easier to interpret: unitigs are usually longer than k-mers, and further context (surrounding sequence) can be analysed by using the graph structure they come from.

More details below.

Background

A recent, excellent, paper by Jaillard, Lima et al. showed how unitigs (explained below) can be used instead of k-mers in bacterial GWAS:
//dx.doi.org/10.1371/journal.pgen.1007758

The authors present their software DBGWAS which is an end-to-end bacterial GWAS solution to go from assemblies to associated graph elements (using bugwas/gemma to perform the association).

We are currently adding some new association models to pyseer, and I wanted to follow DBGWAS’ example and use unitigs in these new approaches. While we’re not ready to release the new models quite yet, it is now possible to use unitigs instead of k-mers with the existing association models implemented in pyseer (fixed effects, mixed effects, lineage effects). This has been achieved through minor modifications to the DBGWAS code, so I am indebted to these authors for making this possible.

Usage/implementation

I have updated the documentation to include these details.

Count the unitigs

  1. Install unitig-counter using bioconda (‘conda install -c bioconda unitig-counter’).
  2. Create a list of assemblies and their names as input (see //github.com/johnlees/unitig-counter for details).
  3. Run the counting step, using multiple cores if available.

Run the association

  1. Simply use the same options as for a k-mer association, but drop in output/unitigs.txt.gz produced above as the –kmers option.
  2. As the number of tests for significance threshold, use the number of unique patterns reported (or count the lines in output/unitigs.unique_rows.Rtab).

Interpret the results

  1. As you would for k-mers, use the included scripts to map to a reference and produce a Manhattan plot in Phandango, or annotate the significant sequences.
  2. To provide extra context, or lengthen short sequences, unitigs can be extended leftwards and rightwards following the graph using ‘cdbg-ops extend’.

Citations

If you use unitigs in your association, please cite the DBGWAS paper:
Jaillard, M., Lima L, et al. A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLoS Genet. 14, e1007758 (2018). doi:10.1371/journal.pgen.1007758.

If you find pyseer useful, citation would be appreciated:
Lees, John A., Galardini, M., et al. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34:4310–4312 (2018). doi:10.1093/bioinformatics/bty539.

Using ALF to simulate large, closely related populations of bacteria

I am currently trying to use ALF (the stand-alone version) to simulate data from a custom tree, and include realistic parameters for SNP rate, INDEL rate, gene loss and recombination rates. This is a little different to what I think the program was originally designed for – small numbers of divergent organisms – but is probably an easier problem.

ALF is good because it includes a lot of features of evolution more naive models don’t encompass, and gives good output useful for further simulation and testing work.

I’ve made the following notes and tweaks to fix issues as I’ve been going along, which I hope may be of use to anyone trying to use the software for this purpose

  • For custom INDEL distributions, they must be specified in the parameters file as (note the double bracket):
    IndelModel(0.02,'CUSTOM', [[0.5,0.25,0.2,0.05]], 20)

    (thanks to the author Daniel Dalquen for helping me with this)

  • Custom trees must have no labels on the internal nodes. To ignore these you can remove the InternalLabels argument on line 820 of lib/simulator/SE_DataOutput.drw
  • Make sure ‘unitIsPam’ is set to false for trees with substitutions per site, which is the default unit for e.g. Raxml trees
  • If you’re simulating a lot of lateral gene transfer events with multiple genes, you’ll run into a transLoc out of range error due to a bug in the code. This can be fixed by changing line 604 in lib/simulator/SE_Evolutionary_Events.drw to
    place := Rand(0..length(geneR[org]) - lgtSize);

I have also written some helper scripts, which can be found in //github.com/johnlees/bioinformatics/tree/master/sequence_evolution/ALF

  • gff2darwin.pl: Helps convert gff annotation files to custom input starting sequences
  • alf_db_to_fasta.pl: Converts the DB output formatting into a single fasta contig for an organism -> observed organism genome
  • alf_msa_concat.pl: Converts MSA output (which is by gene) into true alignments by organism -> true alignment
  • genes_to_contig.pl: Concatenates all contigs to create a whole genome alignment file (output from alf_msa_concat,pl) -> true alignment for population