Thoughts on ‘Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species’

I was happy to see that this paper, which originally appeared as a preprint back in April 2019 (!), was published earlier this month.I thought it was one of the most thought-provoking papers I’ve read recently, so suggested a journal club on the final version (it’s long paper — over 80 pages).

There were some parts that I liked a lot, and some parts I didn’t like, which I wanted to summarise here. Overall, I thought the paper brought an interesting ‘outsider’ approach to the problem of bacterial population genomics, and quantified some issues in new and useful ways. However, I was less keen on the presentation, which to me was overly confrontational, and failed to put the research within a proper modern context. I’ve summarised some of the discussion here, which I note is not meant to be a thorough review, and is subjective.

I think I remember some tweets when the preprint came out along the lines of ‘are we analysing bacterial genomes all wrong?’, to which I think we can safely say the answer is no. So, at the bottom of the page, I outline an alternative analysis approach to that which the authors critique, and show that it is likely to be compatible with their findings.

[here I will use ‘strain’ to mean a cluster of related samples, usually a clade on a tree. In the paper the authors use ‘strain’ to mean a single sample]

Update:  Erik van Nimwegen, the corresponding author on the paper, replied to some of these points. I have copied his replies at the end of the post, and added some final thoughts responding to these.

Liked

  • The ‘outsider’ perspective contributes novel perspective, analysis and plots to a tricky problem.
  • The paper is almost relentlessly quantitative, which is particularly appreciated on a problem which is often discussed fairly qualitatively.
  • I really liked the Poisson + negative bionomial fit to pairwise comparisons to quantify the proportion of recombined regions.
  • Datasets were small, but seemed well-chosen to study the problem. ‘Extremes’ on H. pylori and M. tuberculosis were useful to demonstrate these regimes.
  • The interpretation of whole genome phylogenies as rates of recombination, or population structure between strains, seemed convincing.

Ambivalent

  • The authors spent a while on phylogenetic entropy. This seemed interesting, but ultimately my opinion was that this may not be the right tool to talk about this problem. It took me a long time to follow the concept, and while I do agree that it measures something about these populations, it’s not clear exactly what. I think it is perhaps more important consider whether entropy is the right ‘language’ with which to discuss genome evolution, and would posit that molecular terms are easier to understand, even if they cannot explain every nuance in a single number.
  • I was not sure how much of a dependence n-SNPs would have on sample size, which I don’t think was investigated.
  • The model of core genome evolution with random SNP and recombination accrual seemed a little simple to test the questions at hand.
  • I did not buy the network analysis — the power law fits were not convincing, and lacked explanatory power.

Did not like

  • The three attacked models for whole populations, ‘clonal’, ‘freely recombining’ and ‘clonal + some randomly recombining’, are all straw men. I don’t think many researchers would expect bacterial genomes to evolve in these ways.
  • I think there’s a misunderstanding of how bacterial genomics/phylogenetics is typically performed. In my experience, few people would attempt to construct a whole-species tree and infer clonal ancestry from it. Most people would (and typically should) analyse strains/lineages, which are those clusters below the ~90% clonal threshold identified.
  • Recombination detection is misrepresented: gubbins and clonalframe work within a strain, in which there is a genuine clonal frame. They are not designed to be applied to an entire species’ diversity. Methods such as FastGear, which do attempt to find recombination across a population, weren’t mentioned in the discussion where this criticism was made.
  • Related to both of these points, I think the paper picked on some less good examples of phylogenetic analysis, and ignored the better ones. Previous papers have shown how to partition a population into strains to analyse closely related groups. Complex forward simulations, not assuming exchangeability, including selection and saltational recombination have been shown to give rise to genomes which look very similar to observed data.

So, how should we analyse bacterial populations?

Some analyses such as GWAS are of course valid on the whole population. But for those where we would like to find clusters or analyse clonal ancestry, here is a pipeline that I would suggest, which I think will work in many cases:

  1. Standard quality control steps.
  2. Split the population into strains. I would recommend PopPUNK, but cgMLST also works.
  3. Within each of these strains:
    1. Create an alignment by: mapping reads to a reference or assembly within the strain; or by using ska.
    2. Remove recombination from this alignment with gubbins or ClonalFrameML.
    3. Build a phylogeny with IQTree or RAxML. (Note that if using gubbins, this will have been run as part of the pipeline).
    4. Use fastbaps to find subclusters within your strain.
    5. If you wish, BactDating is a good way to time your tree.
  4. Pangenome analysis with panaroo or PEPPAN.

If this looks useful I have written some of these steps as a snakemake pipeline. It’s available here, though is a work-in-progress at the time of writing.

To motivate this pipeline within the context of this paper’s contributions, I would like to present a figure from our PopPUNK paper, showing a ‘within-strain’ boundary in core-accessory distance space:

PopPUNK fits to E. coli
PopPUNK fits to E. coli

You can see (or see figure 4 from the paper if it’s too small) that for all of these fits there is a discontinuous region we divide strains at a divergence of around 0.003. The authors identify a 𝝅-crit of 0.0032 for mostly clonal sites in this species — which aligns nicely. Looking at the rest of these fits, for all species this discontinuous space, and where we divide strains is below 0.014, which is the divergence the authors identify across species.

Thanks to the BEE group, and in particular Sam Horsfield and Nick Croucher for their comments on the paper and this post.

Update: Author reply

Erik van Nimwegen’s reply to some of these points (via twitter)

  • The three attacked models for whole populations, ‘clonal’, ‘freely recombining’ and ‘clonal + some randomly recombining’, are all straw men. I don’t think many researchers would expect bacterial genomes to evolve in these ways.

Yet these models are used in methods such as clonalframe and fitted to the data. If it’s obvious these models are wrong, than it is also wrong to make inferences by fitting them to genomic data.

  • I think there’s a misunderstanding of how bacterial genomics/phylogenetics is typically performed. In my experience, few people would attempt to construct a whole-species tree and infer clonal ancestry from it. Most people would (and typically should) analyse strains/lineages, which are those clusters below the ~90% clonal threshold identified.

For E. coli, clusters with >=90% clonal correspond to genomes that are all less than 0.001 divergence of each other. This would partition our 92 strains into many small clusters (singlets, pairs, triplets), i.e. the ‘phylogroups’ would be broken up into little clusters.

I don’t think it is accurate to say ‘most people’ would refuse to infer clonal ancestry beyond these small clusters. Also note that, even for pairs less than 0.001 diverged, one may have that >90% of the divergence comes from recombined regions. So one would have to be careful to reconstruct the clonal tree even for these tiny groups. I also know that, for our E. coli data, one would simply not have the SNPs to fully resolve the phylogenies of these small clusters.

  • Recombination detection is misrepresented: gubbins and clonalframe work within a strain, in which there is a genuine clonal frame. They are not designed to be applied to an entire species’ diversity. Methods such as FastGear, which do attempt to find recombination across a population, weren’t mentioned in the discussion where this criticism was made.

This is just not true. As we cite, the developers of clonalframe specifically used it to reconstruct a clonal phylogeny of E. coli (Didelot et al, BMC genomics, 2012). The well-known Touchon et al paper also claims a clonal tree for a diverse set of E. coli strains. So I don’t accept we have misrepresented here.

Finally, you propose a pipe-line for analysis. We made our E. coli core genome alignment available and I would be very interested to see what this pipe-line would end up giving.

My reply to these points

I agree with the authors that recombination detection using a method which assumes a clonal tree across a species is generally inappropriate. I also think it’s clear that there are good examples of analysis and less good examples of such analysis in previous literature — I don’t think either myself or Erik are ‘right’ in our view of how analysis is typically done, it’s just our (differing) opinion.

Some of the critiqued papers cited are somewhat older: it’s true that the cited paper from 2012 attempted this analysis, though the original clonalframe was ‘designed mostly to work on MLST data’. The updated version, from 2015, is applied mostly within STs (though there is a whole population analysis of S. aureus at the end). Gubbins, from the same year, was exclusively applied within STs.

My belief is that analysis has moved on, and there are numerous other examples of first partitioning a population into strains before trying to detect recombination.  I further think that the authors (Sakoparnig et al) have made a nice contribution formalising the rationale for these improved approaches.

One final point: the dataset used covers much of E. coli’s diversity in just 92 strains, so it’s not surprising that ‘strains’ would consist of primarily singletons and doubletons in this dataset. Datasets with larger sample sizes will likely work a lot better here. I hope to apply PopPUNK/PopPIPE to the 92 sample dataset analysed here in the future.