Using unitigs for bacterial GWAS with pyseer

This post briefly explains how you can now use unitigs, nodes of sequence in a compressed de Bruijn graph enumerated using DBGWAS, in the pyseer software. Broadly, this has the following advantages over k-mer based association:

  • Computational burden: fewer resources used in counting the unitigs, and fewer unitigs that need to have their association tested.
  • Lower multiple testing burden, as unitigs reduce redundancy present in k-mer counting.
  • Easier to interpret: unitigs are usually longer than k-mers, and further context (surrounding sequence) can be analysed by using the graph structure they come from.

More details below.
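To make the redundancy point concrete, here is a toy sketch of how unitigs arise from k-mers. It is not the DBGWAS implementation: it ignores reverse complements, uses a tiny k, and the function name `toy_unitigs` is my own invention for illustration. It builds a de Bruijn graph whose nodes are k-mers, then merges every unbranched run of nodes into a single sequence.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_unitigs(sequences, k):
    """Compact a (forward-strand only) de Bruijn graph into unitigs.

    Hypothetical helper for illustration; real tools also handle
    reverse complements and much larger graphs.
    """
    nodes = set()
    for seq in sequences:
        nodes |= kmers(seq, k)

    def successors(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in nodes]

    def predecessors(u):
        return [c + u[:-1] for c in "ACGT" if c + u[:-1] in nodes]

    def next_in_path(u):
        # u can be merged with its successor only if the link is unambiguous
        succ = successors(u)
        if len(succ) == 1 and len(predecessors(succ[0])) == 1:
            return succ[0]
        return None

    def starts_path(u):
        # u starts a unitig unless its single predecessor merges into it
        pred = predecessors(u)
        return not (len(pred) == 1 and len(successors(pred[0])) == 1)

    visited = set()

    def walk(start):
        path = [start]
        visited.add(start)
        nxt = next_in_path(start)
        while nxt is not None and nxt not in visited:
            path.append(nxt)
            visited.add(nxt)
            nxt = next_in_path(nxt)
        # Overlapping k-mers each contribute one new base
        return path[0] + "".join(node[-1] for node in path[1:])

    unitigs = [walk(n) for n in sorted(nodes) if starts_path(n)]
    # Any node still unvisited lies on an isolated cycle
    for n in sorted(nodes):
        if n not in visited:
            unitigs.append(walk(n))
    return nodes, sorted(unitigs)


# Two sequences differing by a single SNP (G vs T)
nodes, unitigs = toy_unitigs(["ATCGGACT", "ATCGTACT"], k=3)
print(len(nodes), "k-mers ->", len(unitigs), "unitigs:", unitigs)
# 9 k-mers -> 4 unitigs: ['ACT', 'ATCG', 'CGGAC', 'CGTAC']
```

In this example the SNP shows up as two alternative unitigs (`CGGAC` and `CGTAC`) flanked by shared sequence, rather than as several overlapping k-mers that all carry the same signal.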


A recent, excellent paper by Jaillard, Lima et al. showed how unitigs (explained below) can be used instead of k-mers in bacterial GWAS.

The authors present their software, DBGWAS, an end-to-end bacterial GWAS solution that goes from assemblies to associated graph elements (using bugwas/gemma to perform the association).

We are currently adding some new association models to pyseer, and I wanted to follow DBGWAS’ example and use unitigs in these new approaches. While we’re not ready to release the new models quite yet, it is now possible to use unitigs instead of k-mers with the existing association models implemented in pyseer (fixed effects, mixed effects, lineage effects). This has been achieved through minor modifications to the DBGWAS code, so I am indebted to these authors for making this possible.


I have updated the documentation to include these details.

Count the unitigs

  1. Install unitig-counter using bioconda (conda install -c bioconda unitig-counter).
  2. Create a list of assemblies and their names as input (see for details).
  3. Run the counting step, using multiple cores if available.
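For step 2, a small script can build the input list from a directory of assemblies. This is only a sketch: I am assuming the list is a tab-separated file with an `ID` and `Path` header, so check this against the unitig-counter documentation before relying on it. The function name `write_strain_list` and the accepted file suffixes are hypothetical choices for illustration.

```python
from pathlib import Path


def write_strain_list(assembly_dir, out_file, suffixes=(".fa", ".fasta", ".fna")):
    """Write one line per assembly: sample name, tab, path to the file.

    Assumes a tab-separated layout with an 'ID<TAB>Path' header row.
    Returns the number of assemblies listed.
    """
    assemblies = sorted(
        p for p in Path(assembly_dir).iterdir()
        if p.is_file() and p.suffix in suffixes
    )
    with open(out_file, "w") as out:
        out.write("ID\tPath\n")  # assumed header, see note above
        for path in assemblies:
            # The sample name is the file name without its final extension
            out.write(f"{path.stem}\t{path}\n")
    return len(assemblies)
```

Calling `write_strain_list("assemblies", "strain_list.txt")` would list every matching file in `assemblies/`; the sample names should match those used in your phenotype file.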

Run the association

  1. Simply use the same options as for a k-mer association, but drop in output/unitigs.txt.gz produced above as the --kmers option.
  2. For the number of tests used to set the significance threshold, use the number of unique patterns reported (or count the lines in output/unitigs.unique_rows.Rtab).
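The threshold in step 2 can be worked out with a few lines of Python. This sketch assumes a Bonferroni-style correction (a significance level divided by the number of tests), which is my choice of correction for the example. Whether the pattern file has a header row is not covered here, so the hypothetical `has_header` parameter lets you subtract it if needed.

```python
def significance_threshold(pattern_file, alpha=0.05, has_header=False):
    """Return (number of tests, alpha / number of tests).

    Counts non-empty lines in the file of unique patterns; set
    has_header=True if the first line is a header rather than a pattern.
    """
    with open(pattern_file) as handle:
        n_lines = sum(1 for line in handle if line.strip())
    n_tests = n_lines - 1 if has_header else n_lines
    if n_tests < 1:
        raise ValueError("no patterns found in " + str(pattern_file))
    return n_tests, alpha / n_tests
```

For example, `significance_threshold("output/unitigs.unique_rows.Rtab")` returns the pattern count and the p-value cut-off to use when filtering the association results.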

Interpret the results

  1. As you would for k-mers, use the included scripts to map to a reference and produce a Manhattan plot in Phandango, or annotate the significant sequences.
  2. To provide extra context, or lengthen short sequences, unitigs can be extended leftwards and rightwards following the graph using cdbg-ops extend.


If you use unitigs in your association, please cite the DBGWAS paper:

Jaillard, M., Lima, L., et al. A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLoS Genet. 14, e1007758 (2018). doi:10.1371/journal.pgen.1007758.

If you find pyseer useful, citation would be appreciated:

Lees, J. A., Galardini, M., et al. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34, 4310–4312 (2018). doi:10.1093/bioinformatics/bty539.