Conservation of core genes in S. pneumoniae

A question I am sometimes asked is whether a gene of interest, usually being studied in vitro or in vivo, is conserved. Although the availability of population genomic datasets allows this question to be answered, it can be hard to find this kind of analysis in the literature, and doing it yourself is not trivial. This post hopes to be an easy way to access this information for S. pneumoniae.

I think three useful measures of conservation are:

  • Is it a core gene? If not, what is the frequency in the population?
  • What is the omega value (dN/dS) of the gene, which is the ratio of non-synonymous to synonymous changes? This tells us something about how it appears to be evolving, and the selection pressures it is under.
  • What is the pi value of the gene, which is the average number of sequence differences if two sequences from the population are compared. This tells us about the amount of sequence diversity in the population, and whether multiple alleles are maintained. (The pis used here are for amino acid rather than nucleotide differences).

Tajima’s D can also be useful, but I haven’t included it here as I think its interpretation is less straightforward.


The plots below are a re-presentation of some data first published in
Croucher, N. J. et al. Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proc. Natl. Acad. Sci. U. S. A. (2017). doi:10.1073/pnas.1613937114

In this paper, using 616 genomes isolated from carriage in children in Massachusetts, Croucher et al defined and annotated core gene clusters, produced MSAs and calculated many useful statistics about which gene. This information can all be found in Dataset_S01.xlsx from the paper. I’ve listed my methods at the end.


The data is all available in the supplementary information of the cited paper. I also re-analysed it using the following steps:

  1. Extract each core gene DNA sequence using the annotations from the paper.
  2. Translate sequences.
  3. Align protein sequences with MUSCLE.
  4. Use dendropy.calculate.popgenstat.nucleotide_diversity to calculate pi from 3).
  5. Use RevTrans to produce a nucleotide alignment from the output of 1) and 3).
  6. Use SLAC in HyPhy to calculate dN, dS and omega from 3).

See for the code.

Is it a core gene?

If it is in this table/plots below, yes.
If not, no, it is an accessory gene.

I don’t have accessory gene frequencies included here, as these were only partially analysed in the cited paper.

dN/dS and pi

Interactive scatter plot (full size here):

Table of data:


Expression modules in S. pneumoniae

I recently read a pre-print from the Veening lab where they had reconstructed various (22 total) physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested:

They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on zenodo.

I was interested in trying to replicate their expression module analysis, which was performed using WGCNA. I first downloaded the co-expression matrix, which is split across five CSV files in the zenodo archive. I then basically ran WGCNA exactly as suggested in their tutorial. I then used the annotations from the lab’s D39V genome to get final expression modules.

The expression modules look sensible to me, with the comE master regulator splitting the expression of the competence pathway and associated genes as expected. The modules I predicted can be found here.

Thanks to the authors for making their data so accessible! If you find this useful I would also note the authors also analysed their data with WGCNA, probably in a more optimal and error-checked way than I have, so most likely they have better clusters.

Here is a gist of the code I used: