Using ALF to simulate large, closely related populations of bacteria

I am currently using ALF (the stand-alone version) to simulate data from a custom tree, with realistic parameters for SNP rate, INDEL rate, gene loss and recombination rate. This is a little different to what I think the program was originally designed for (small numbers of divergent organisms), but it is probably an easier problem.

ALF is a good fit because it models many features of evolution that more naive simulators leave out, and its output is well suited to further simulation and testing work.

I’ve made the following notes and tweaks to fix issues as I’ve been going along, which I hope may be of use to anyone trying to use the software for this purpose:

  • Custom INDEL distributions must be specified in the parameters file as follows (note the double brackets):

    IndelModel(0.02,'CUSTOM', [[0.5,0.25,0.2,0.05]], 20)
    

    (thanks to the author Daniel Dalquen for helping me with this)

  • Custom trees must have no labels on the internal nodes. To have ALF ignore such labels, you can remove the InternalLabels argument on line 820 of lib/simulator/SE_DataOutput.drw

  • Make sure ‘unitIsPam’ is set to false for trees whose branch lengths are in substitutions per site, which is the default unit for e.g. RAxML trees

  • If you’re simulating a lot of lateral gene transfer events with multiple genes, you’ll run into a transLoc out of range error due to a bug in the code. This can be fixed by changing line 604 in lib/simulator/SE_Evolutionary_Events.drw to

    place := Rand(0..length(geneR[org]) - lgtSize);
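
For the internal-node label issue noted in the list above, another option is to strip the labels from the tree file before giving it to ALF, rather than editing the ALF source. The following is only a minimal Python sketch of that idea, not part of ALF or of the helper scripts. The function name `strip_internal_labels` is hypothetical, and the code assumes a plain Newick string with unquoted labels and no bracketed comments:

```python
import re

# Matches any label that sits directly after a closing parenthesis, i.e. an
# internal-node label or support value such as ")95:0.05" or ")root;".
# Assumption: labels are unquoted and the tree contains no [comments].
_INTERNAL_LABEL = re.compile(r"\)[^():,;]+")


def strip_internal_labels(newick: str) -> str:
    """Return the Newick string with internal-node labels removed.

    Leaf names and branch lengths are left untouched, because leaf names
    never follow a closing parenthesis directly.
    """
    return _INTERNAL_LABEL.sub(")", newick)


if __name__ == "__main__":
    # Toy tree with a support value (95), a named node (n2) and a root label.
    example = "((A:0.1,B:0.2)95:0.05,(C:0.3,D:0.4)n2:0.06)root;"
    print(strip_internal_labels(example))
```

Trees with quoted labels or NHX-style annotations would need a proper Newick parser instead of a regular expression.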
    

I have also written some helper scripts, which can be found in https://github.com/johnlees/bioinformatics/tree/master/sequence_evolution/ALF

  • gff2darwin.pl: Helps convert gff annotation files to custom input starting sequences
  • alf_db_to_fasta.pl: Converts the DB output formatting into a single fasta contig for an organism -> observed organism genome
  • alf_msa_concat.pl: Converts MSA output (which is by gene) into true alignments by organism -> true alignment
  • genes_to_contig.pl: Concatenates all contigs to create a whole genome alignment file (output from alf_msa_concat.pl) -> true alignment for population