Paper summary – Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis

This a summary of our paper on a joint pathogen and human GWAS that has just been published in Nature Communications:
//doi.org/10.1038/s41467-019-09976-3

This is the last bit of research from my PhD thesis. Also, this was the first thing I started working on back in 2014 (my first GWAS), and our collaborators have been collecting data since 2006 – so it’s good to see this one out!

Overview

We collected cases from pneumococcal meningitis patients enrolled in a nationwide Dutch cohort. We were also able to match these with bacterial isolates collected in the nationwide reference lab. For both the patients and the bacteria, we then collected population-matched control samples to perform a case-control genome-wide association study (GWAS), plus some other statistical genetics. We accumulated similar data from case-control cohorts in other populations, again in both patients and bacteria, to increase the number of samples, and perform meta-analysis.

This allows us to look for the affect of any genetic variant across the species (without needing a prior hypothesis of an interesting gene), in the natural population (humans not mice, and natural variation not lab strains). I hope this generates some useful hypotheses for further testing in more controlled experiments, for example using isogenic mutants in a relevant model organism.

I think these were our key findings:

    1. Bacterial genetics did not seem to affect disease severity, but was a big factor in whether invasive disease would be caused after carriage.
    2. Pneumococcal serotype is important in determining invasiveness, but other genetic factors are also involved.
    3. We found some candidate genetic factors other than serotype, including some previously implicated proteins, and some new ones.
    4. Human genetics however affected both disease severity and susceptibility, and we found a couple of specific candidate loci.
    5. We performed an interaction (genome-to-genome) analysis, but were limited in power due to small number of samples (N = 460) versus large number of possible interactions (~1010).

If you’re interested in the biological ramifications of any of these points I’d encourage you to have a look through the paper. I was particularly interested by 1) which seems to rule out pathogen genotype as a diagnostic of disease severity, and 2) and 3) which suggest that targeting only serotype is maybe not the optimal way to a) define pneumococci or b) design vaccines which stop invasive disease.

Technical summary

I’ll also try and describe some of the more technical aspects here, which were interesting to me, but ended up being only briefly covered in the paper.

Bacterial GWAS (pGWAS)

We tried to get as much as we could out of the whole-genome sequence data here, calling as many genetic features as possible for later use in the association:

  • SNPs, genes and k-mers. As have been used in bacterial GWAS studies before.
  • INDELs. Can be less well called by some methods. Here I just looked for short INDELs which might result in frameshifts.
  • Antigen alleles. Noted in various papers before,(e.g. //dx.doi.org/10.1073/pnas.1613937114 and //dx.doi.org/10.1093/infdis/jiw628), these might not get well tagged by the SNPs. I built a new machine-learning classifier for three genes (pspC, pspA, zmpA), but it’s probably quite similar to other methods (and was relatively labour and computationally intensive).
  • Copy number variants (including large deletions).
  • A phase-variable inverting locus (ivr). This requires using the read pair information due to repeats breaking up assembly and mapping.

We then fed these into some association models:

  • SEER/pyseer. Used to try and find individual genetic associations with meningitis. We found the mixed effects model was required to get anything sensible, as phenotype was strongly correlated with genetic background.
  • limix. Used to obtain heritability estimates, including partitioning by genome segment.
  • A hierarchical Bayesian model was used for the ivr, as it contains variation within each sample.

Ultimately, only the use of SNPs, INDELs and k-mers led to significant findings. The former were also useful for the heritability analysis with limix. This was also the first time I tried to include association of rare variation (aggregating through gene and operon burden tests). This seemed to be one of the more successful approaches, perhaps due to lower confounding with background, and I would certainly recommend it to anyone trying a bacterial GWAS in future (and you can easily do it in pyseer!). It definitely benefited from the calling of short INDELs.

One further point is that some of the strongest associations we found were to transposons and BOX elements. While BOX elements have been associated with gene expression and competence, none of the transposons had any obvious biological effect (no cargo, nothing consistently interrupted, as far as we could tell). I am not sure whether these associations are a biological effect which we may have ignored, or if this is an artefact of the association model, where the p-value is elevated by these elements apparently occurring independently multiple times on different genetic backgrounds, with uneven case and control numbers. I’d like to investigate this further at some point.

Human GWAS (hGWAS)

What a pleasure it was to work with human genotypes and GWAS methods! It was really noticeable how smooth this analysis was, and how it got easier over the years we added cohorts. I think the main reasons for this compared to pGWAS are:

  • Many long standing methods and software packages which directly work with your data, no need for format conversion (e.g. plink, bolt-lmm, locuszoom, LDAK).
  • Excellent web-services and databases which help enrich the interpretation of your findings (e.g. the imputation servers, //www.targetvalidation.org/, GTEx).
  • It’s easier than bacterial GWAS. The well designed genotyping arrays with SNPs that cover most the common variation (after imputation) and lack of strong population structure/fast LD decay definitely made a big difference.

This didn’t make it into the final paper, but I also looked into some models of effect sizes with bayesR, which suggested as well as a few high effect loci (oligogenic), much of the heritability explained is from smaller effect loci (polygenic). So collecting more samples would certainly be useful, though this is hard due to the low incidence.

Have a look at the Data Availability section if you want to use our GWAS summary statistics in any future meta-analysis.

Thanks

It’s been a long time in the works mostly due to the time taken to accumulate all of this data: meningitis is a fairly rare disease, with only about 200 cases each year reported in the Netherlands. As new patients were enrolled and genotypes I ended up repeating the analysis (which got easier each time, as human GWAS methods became more and more impressive and easy to use). We also put a lot of effort into expanding the dataset – we were eventually able to gather three separate cohorts of human genotypes (from the Netherlands, Denmark and the UK Biobank) and two cohorts with bacterial sequences (from the Netherlands and South Africa). Consequently we have a nice long author list, with almost as many affiliations as names. I am particularly grateful to all of these authors who each contributed data, analysis and their time.

Particular thanks to Bart Ferwerda, Philip Kremer and Nicole Wheeler who worked with me on much of the analysis; Thomas Benfield, Anne von Gottberg, Arie van der Ende and Mattijs Brouwer who set up and ran the largest cohorts included; and Jeff Barrett, Stephen Bentley and Diederik van de Beek who supervised the work. Also to the reviewers, who provided lots of useful suggestions for improving the paper (you can see the reviews here).

Conservation of core genes in S. pneumoniae

A question I am sometimes asked is whether a gene of interest, usually being studied in vitro or in vivo, is conserved. Although the availability of population genomic datasets allows this question to be answered, it can be hard to find this kind of analysis in the literature, and doing it yourself is not trivial. This post hopes to be an easy way to access this information for S. pneumoniae.

I think three useful measures of conservation are:

  • Is it a core gene? If not, what is the frequency in the population?
  • What is the omega value (dN/dS) of the gene, which is the ratio of non-synonymous to synonymous changes? This tells us something about how it appears to be evolving, and the selection pressures it is under.
  • What is the pi value of the gene, which is the average number of sequence differences if two sequences from the population are compared. This tells us about the amount of sequence diversity in the population, and whether multiple alleles are maintained. (The pis used here are for amino acid rather than nucleotide differences).

Tajima’s D can also be useful, but I haven’t included it here as I think its interpretation is less straightforward.

Background

The plots below are a re-presentation of some data first published in
Croucher, N. J. et al. Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proc. Natl. Acad. Sci. U. S. A. (2017). doi:10.1073/pnas.1613937114

In this paper, using 616 genomes isolated from carriage in children in Massachusetts, Croucher et al defined and annotated core gene clusters, produced MSAs and calculated many useful statistics about which gene. This information can all be found in Dataset_S01.xlsx from the paper. I’ve listed my methods at the end.

Methods

The data is all available in the supplementary information of the cited paper. I also re-analysed it using the following steps:

  1. Extract each core gene DNA sequence using the annotations from the paper.
  2. Translate sequences.
  3. Align protein sequences with MUSCLE.
  4. Use dendropy.calculate.popgenstat.nucleotide_diversity to calculate pi from 3).
  5. Use RevTrans to produce a nucleotide alignment from the output of 1) and 3).
  6. Use SLAC in HyPhy to calculate dN, dS and omega from 3).

See //github.com/johnlees/pneumo_core for the code.

Is it a core gene?

If it is in this table/plots below, yes.
If not, no, it is an accessory gene.

I don’t have accessory gene frequencies included here, as these were only partially analysed in the cited paper.

dN/dS and pi

Interactive scatter plot (full size here):

Table of data:

 

Expression modules in S. pneumoniae

I recently read a pre-print from the Veening lab where they had reconstructed various (22 total) physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested: //doi.org/10.1101/283739

They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on zenodo.

I was interested in trying to replicate their expression module analysis, which was performed using WGCNA. I first downloaded the co-expression matrix, which is split across five CSV files in the zenodo archive. I then basically ran WGCNA exactly as suggested in their tutorial. I then used the annotations from the lab’s D39V genome to get final expression modules.

The expression modules look sensible to me, with the comE master regulator splitting the expression of the competence pathway and associated genes as expected. The modules I predicted can be found here.

Thanks to the authors for making their data so accessible! If you find this useful I would also note the authors also analysed their data with WGCNA, probably in a more optimal and error-checked way than I have, so most likely they have better clusters.

Here is a gist of the code I used: