Conservation of core genes in S. pneumoniae

A question I am sometimes asked is whether a gene of interest, usually one being studied in vitro or in vivo, is conserved. Although population genomic datasets allow this question to be answered, this kind of analysis can be hard to find in the literature, and doing it yourself is not trivial. This post aims to make the information easy to access for S. pneumoniae.

I think three useful measures of conservation are:

  • Is it a core gene? If not, what is the frequency in the population?
  • What is the omega value (dN/dS) of the gene, i.e. the ratio of the non-synonymous to the synonymous substitution rate? This tells us something about how the gene appears to be evolving and the selection pressures it is under: values below 1 suggest purifying selection, values near 1 are consistent with neutral evolution, and values above 1 suggest positive selection.
  • What is the pi value of the gene, i.e. the average number of sequence differences when two sequences from the population are compared? This tells us about the amount of sequence diversity in the population, and whether multiple alleles are maintained. (The pi values used here are for amino acid rather than nucleotide differences.)
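
To make the pi measure concrete, here is a minimal sketch of the average number of pairwise differences on a toy amino acid alignment. The sequences and function names are made up for illustration, and skipping gapped positions is an assumption of this sketch; it is not the code used to produce the data in this post.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b, gap="-"):
    """Count aligned positions where two sequences differ, skipping gaps."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != gap and b != gap and a != b
    )


def average_pairwise_differences(aligned_seqs):
    """Mean number of differences over all pairs of aligned sequences."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


# Toy amino acid alignment (hypothetical sequences, for illustration only).
alignment = ["MKTAY", "MKTAY", "MRTAY", "MRTSY"]
pi = average_pairwise_differences(alignment)
print(pi)
```

Dividing the result by the alignment length would give a per-site value instead, which is how diversity is often reported when genes of different lengths are compared.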

Tajima’s D can also be useful, but I haven’t included it here as I think its interpretation is less straightforward.


The plots below are a re-presentation of some data first published in Croucher, N. J. et al. Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proc. Natl. Acad. Sci. U. S. A. (2017). doi:10.1073/pnas.1613937114

In this paper, using 616 genomes isolated from carriage in children in Massachusetts, Croucher et al. defined and annotated core gene clusters, produced MSAs and calculated many useful statistics about each gene. This information can all be found in Dataset_S01.xlsx from the paper. I’ve listed my methods below.


The data is all available in the supplementary information of the cited paper. I also re-analysed it using the following steps:

  1. Extract each core gene DNA sequence using the annotations from the paper.
  2. Translate sequences.
  3. Align protein sequences with MUSCLE.
  4. Use dendropy.calculate.popgenstat.nucleotide_diversity to calculate pi from 3).
  5. Use RevTrans to produce a nucleotide alignment from the output of 1) and 3).
  6. Use SLAC in HyPhy to calculate dN, dS and omega from the codon-aware nucleotide alignment produced in 5).
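
Steps 2 and 5 can be illustrated with a small pure-Python sketch: translate each DNA sequence, then thread the original codons back onto the aligned protein so that gaps fall on codon boundaries. This only illustrates the idea behind a RevTrans-style back-translation; it is not RevTrans itself, the function names are hypothetical, and the toy sequences are made up. It assumes in-frame sequences and the standard amino acid assignments.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA sequence; unrecognised codons become 'X'."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon.upper(), "X") for codon in codons)


def thread_codons(aligned_protein, dna, gap="-"):
    """Place each codon of `dna` under its residue in `aligned_protein`.

    Every protein gap becomes a three-character gap, so the result is a
    nucleotide alignment that keeps codons intact.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues > len(codons):
        raise ValueError("DNA sequence is too short for the aligned protein")
    threaded = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == gap:
            threaded.append(gap * 3)
        else:
            threaded.append(codons[next_codon])
            next_codon += 1
    return "".join(threaded)


# Toy example (hypothetical sequences): the second gene lacks one codon.
dna_seqs = {"seq1": "ATGAAAACC", "seq2": "ATGACC"}
proteins = {name: translate(dna) for name, dna in dna_seqs.items()}

# Protein alignment as an aligner might return it (written by hand here).
aligned_proteins = {"seq1": "MKT", "seq2": "M-T"}
codon_alignment = {
    name: thread_codons(aligned_proteins[name], dna_seqs[name])
    for name in dna_seqs
}
print(codon_alignment)
```

Aligning at the protein level and then restoring the codons keeps the reading frame intact, which is what a codon-based method for estimating dN and dS needs as input.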

See for the code.

Is it a core gene?

If it appears in the table or plots below, yes. If not, it is an accessory gene.

I don’t have accessory gene frequencies included here, as these were only partially analysed in the cited paper.
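
If you have your own presence/absence data, the frequency of a gene in the population is simple to compute. The sketch below uses a hypothetical dictionary of isolates and gene names, and treats a gene as core only if it is present in every genome; that threshold is an assumption of the sketch, and the cited paper's definition may differ.

```python
def gene_frequency(presence, gene):
    """Fraction of genomes in `presence` that carry `gene`."""
    if not presence:
        raise ValueError("no genomes supplied")
    carriers = sum(1 for genes in presence.values() if gene in genes)
    return carriers / len(presence)


def is_core(presence, gene, threshold=1.0):
    """Call a gene core if its frequency reaches `threshold` (assumed 100%)."""
    return gene_frequency(presence, gene) >= threshold


# Hypothetical isolates and gene names, for illustration only.
presence = {
    "isolate_A": {"geneX", "geneY"},
    "isolate_B": {"geneX"},
    "isolate_C": {"geneX", "geneY"},
    "isolate_D": {"geneX"},
}

for gene in ("geneX", "geneY"):
    print(gene, gene_frequency(presence, gene), is_core(presence, gene))
```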

dN/dS and pi

(sorry, plot lost on conversion of blog, but all data is still below)

Table of data: