Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

This a summary of our paper on a joint pathogen and human GWAS that has just been published in Nature Communications:

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!


We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.

This allows us to look for the affect of any genetic variant across the species (without needing a prior hypothesis of an interesting gene), in the natural population (humans not mice, and natural variation not lab strains). I hope this generates some useful hypotheses for further testing in more controlled experiments, for example using isogenic mutants in a relevant model organism.

I think these were our key findings:

    1. Bacterial genetics did not seem to affect disease severity, but was a big factor in whether invasive disease would be caused after carriage.
    2. Pneumococcal serotype is important in determining invasiveness, but other genetic factors are also involved.
    3. We found some candidate genetic factors other than serotype, including some previously implicated proteins, and some new ones.
    4. Human genetics however affected both disease severity and susceptibility, and we found a couple of specific candidate loci.
    5. We performed an interaction (genome-to-genome) analysis, but were limited in power due to small number of samples (N = 460) versus large number of possible interactions (~1010).

If you’re interested in the biological ramifications of any of these points I’d encourage you to have a look through the paper. I was particularly interested by 1) which seems to rule out pathogen genotype as a diagnostic of disease severity, and 2) and 3) which suggest that targeting only serotype is maybe not the optimal way to a) define pneumococci or b) design vaccines which stop invasive disease.

Technical summary

I’ll also try and describe some of the more technical aspects here, which were interesting to me, but ended up being only briefly covered in the paper.

Bacterial GWAS (pGWAS)

We tried to get as much as we could out of the whole-genome sequence data here, calling as many genetic features as possible for later use in the association:

  • SNPs, genes and k-mers. As have been used in bacterial GWAS studies before.
  • INDELs. Can be less well called by some methods. Here I just looked for short INDELs which might result in frameshifts.
  • Antigen alleles. Noted in various papers before,(e.g. // and //, these might not get well tagged by the SNPs. I built a new machine-learning classifier for three genes (pspC, pspA, zmpA), but it’s probably quite similar to other methods (and was relatively labour and computationally intensive).
  • Copy number variants (including large deletions).
  • A phase-variable inverting locus (ivr). This requires using the read pair information due to repeats breaking up assembly and mapping.

We then fed these into some association models:

  • SEER/pyseer. Used to try and find individual genetic associations with meningitis. We found the mixed effects model was required to get anything sensible, as phenotype was strongly correlated with genetic background.
  • limix. Used to obtain heritability estimates, including partitioning by genome segment.
  • A hierarchical Bayesian model was used for the ivr, as it contains variation within each sample.

Ultimately, only the use of SNPs, INDELs and k-mers led to significant findings. The former were also useful for the heritability analysis with limix. This was also the first time I tried to include association of rare variation (aggregating through gene and operon burden tests). This seemed to be one of the more successful approaches, perhaps due to lower confounding with background, and I would certainly recommend it to anyone trying a bacterial GWAS in future (and you can easily do it in pyseer!). It definitely benefited from the calling of short INDELs.

One further point is that some of the strongest associations we found were to transposons and BOX elements. While BOX elements have been associated with gene expression and competence, none of the transposons had any obvious biological effect (no cargo, nothing consistently interrupted, as far as we could tell). I am not sure whether these associations are a biological effect which we may have ignored, or if this is an artefact of the association model, where the p-value is elevated by these elements apparently occurring independently multiple times on different genetic backgrounds, with uneven case and control numbers. I’d like to investigate this further at some point.

Human GWAS (hGWAS)

What a pleasure it was to work with human genotypes and GWAS methods! It was really noticeable how smooth this analysis was, and how it got easier over the years we added cohorts. I think the main reasons for this compared to pGWAS are:

  • Many long standing methods and software packages which directly work with your data, no need for format conversion (e.g. plink, bolt-lmm, locuszoom, LDAK).
  • Excellent web-services and databases which help enrich the interpretation of your findings (e.g. the imputation servers, //, GTEx).
  • It’s easier than bacterial GWAS. The well designed genotyping arrays with SNPs that cover most the common variation (after imputation) and lack of strong population structure/fast LD decay definitely made a big difference.

This didn’t make it into the final paper, but I also looked into some models of effect sizes with bayesR, which suggested as well as a few high effect loci (oligogenic), much of the heritability explained is from smaller effect loci (polygenic). So collecting more samples would certainly be useful, though this is hard due to the low incidence.

Have a look at the Data Availability section if you want to use our GWAS summary statistics in any future meta-analysis.


It’s been a long time in the works mostly due to the time taken to accumulate all of this data: meningitis is a fairly rare disease, with only about 200 cases each year reported in the Netherlands. As new patients were enrolled and genotypes I ended up repeating the analysis (which got easier each time, as human GWAS methods became more and more impressive and easy to use). We also put a lot of effort into expanding the dataset – we were eventually able to gather three separate cohorts of human genotypes (from the Netherlands, Denmark and the UK Biobank) and two cohorts with bacterial sequences (from the Netherlands and South Africa). Consequently we have a nice long author list, with almost as many affiliations as names. I am particularly grateful to all of these authors who each contributed data, analysis and their time.

Particular thanks to Bart Ferwerda, Philip Kremer and Nicole Wheeler who worked with me on much of the analysis; Thomas Benfield, Anne von Gottberg, Arie van der Ende and Mattijs Brouwer who set up and ran the largest cohorts included; and Jeff Barrett, Stephen Bentley and Diederik van de Beek who supervised the work. Also to the reviewers, who provided lots of useful suggestions for improving the paper (you can see the reviews here).