John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Installing phyx without sudo

I saw this phylogenetics package today, phyx: https://github.com/FePhyFoFum/phyx

To install it without admin rights/sudo I needed to do the following (my software is installed in my home directory under ~/software, rather than e.g. /usr or /usr/local):

Compile armadillo as follows:

    cmake -DCMAKE_INSTALL_PREFIX=$HOME/software .
    make
    make install

Compile nlopt as follows:

    ./configure --with-cxx --without-octave --without-matlab --prefix=$HOME/software
    make
    make install

Compile phyx as follows (slightly hacky, maybe there’s a ‘proper’ way):

    ./configure --prefix=$HOME/software

Then change line 11 of the Makefile (CPP_LIBS) to add the library path:

setup.py not found using pip install

Trying to install PyVCF under a python (3) virtual environment gave me the following error:

    (venv)johnlees@hpc:~$ pip install pyvcf
    Downloading/unpacking pyvcf
      Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
      Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
      Running setup.py egg_info for package pyvcf
        Traceback (most recent call last):
          File "<string>", line 16, in <module>
        FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 16, in <module>
        FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.

Firth regression in python

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates when the data in a logistic regression are separable or close to separable (for example, when the corresponding chi-squared contingency table has small entries). Although there is an R implementation, logistf, I couldn’t find an equivalent in another language, or in python’s statsmodels.
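To make the idea concrete, here is a minimal sketch of Firth's penalised logistic regression for a single covariate plus an intercept, written in pure python. It is not the SEER or logistf code: the function name firth_logistic and its arguments are made up for illustration, and it assumes the covariate takes at least two distinct values so that the 2x2 information matrix can be inverted. It iterates on Firth's modified score, in which each residual y - p is shifted by h(1/2 - p), where h is the diagonal of the hat matrix.

```python
import math


def firth_logistic(x, y, max_iter=100, tol=1e-8):
    """Fit y ~ intercept + x with Firth's penalised likelihood.

    Hypothetical helper for illustration: one covariate only.
    Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities and logistic weights p(1 - p)
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]

        # Fisher information X^T W X for the design matrix [1, x]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Inverse of the symmetric 2x2 information matrix
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det

        # Hat matrix diagonal: h_i = w_i * [1, x_i] I^-1 [1, x_i]^T
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi)
             for wi, xi in zip(w, x)]

        # Firth's modified score: residuals shifted by h_i * (1/2 - p_i)
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))

        # Scoring step: beta <- beta + I^-1 U*
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Perfectly separable data: ordinary maximum likelihood has no finite
# slope estimate here, but the penalised fit converges.
intercept, slope = firth_logistic([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1])
print(intercept, slope)
```

With a binary covariate like this the penalised estimate is equivalent to adding half a count to each cell of the 2x2 table, which is why the coefficients stay finite.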

Running BSLMM in gemma

In GWAS the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the LMM, which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression, which finds a few SNPs with non-zero effect sizes. In their paper on this model Zhou et al. show that this hybrid method can have better prediction accuracy than either individual model on its own (both are special cases of their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.
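As a rough illustration of what 'hybrid' means here, the sketch below simulates per-SNP effect sizes as a small polygenic effect on every SNP plus an additional larger effect on a random fraction of SNPs. This is a simplification rather than the exact parameterisation used in the paper or in gemma, and the names simulate_effects, pi, sd_small and sd_large are hypothetical.

```python
import random


def simulate_effects(n_snps, pi, sd_small, sd_large, seed=1):
    """Draw effect sizes as a polygenic part plus a sparse part.

    Hypothetical helper for illustration. Returns (effects, n_sparse),
    where n_sparse is the number of SNPs given an extra sparse effect.
    """
    rng = random.Random(seed)
    effects = []
    n_sparse = 0
    for _ in range(n_snps):
        # Polygenic part: every SNP gets a small normal effect (LMM-like)
        beta = rng.gauss(0.0, sd_small)
        # Sparse part: a fraction pi of SNPs get an additional larger effect
        if rng.random() < pi:
            beta += rng.gauss(0.0, sd_large)
            n_sparse += 1
        effects.append(beta)
    return effects, n_sparse


# pi = 0 leaves only the polygenic part (the LMM-like special case);
# sd_small = 0 leaves only the sparse part (the sparse-regression special case).
lmm_like, _ = simulate_effects(1000, pi=0.0, sd_small=0.05, sd_large=1.0)
sparse_only, n_sparse = simulate_effects(1000, pi=0.01, sd_small=0.0, sd_large=1.0)
print(n_sparse, sum(1 for b in sparse_only if b != 0.0))
```

Setting either component to zero recovers one of the two individual models, which is the sense in which they are special cases of the hybrid.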

Tanglegrams can be misleading

Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the sample set. Tanglegrams work by connecting the matching tips of the two trees, then rotating subtrees to minimise the number of crossings between the connecting lines. The algorithm was published in 2011, and continues to be used in a range of publications (for example, in genomic epidemiology).
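The quantity being minimised can be sketched in a few lines. Assuming the two trees are drawn facing each other with straight connecting lines, two lines cross exactly when the corresponding pair of tips appears in opposite order on the two sides. The function name count_crossings is made up for illustration, and this only scores a given pair of tip orderings; it does not search over subtree rotations as the published algorithm does.

```python
from itertools import combinations


def count_crossings(left_order, right_order):
    """Count crossing connector lines between two tip orderings.

    Hypothetical helper: both arguments list the same tip labels,
    top to bottom, as drawn on each side of the tanglegram.
    """
    # Position of each tip on the right-hand tree
    pos = {tip: i for i, tip in enumerate(right_order)}
    # Right-hand positions, read in left-hand order
    ranks = [pos[tip] for tip in left_order]
    # A pair of connectors crosses when the pair is inverted
    return sum(1 for a, b in combinations(ranks, 2) if a > b)


left = ["A", "B", "C", "D"]
print(count_crossings(left, ["A", "B", "C", "D"]))  # identical order
print(count_crossings(left, ["B", "A", "C", "D"]))  # one cherry rotated
print(count_crossings(left, ["D", "C", "B", "A"]))  # fully reversed
```

Rotating a subtree changes the tip order on one side without changing the tree itself, so the same pair of trees can be drawn with very different numbers of crossings.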

Likelihood ratio test in SEER

I have added the likelihood ratio test (LRT) for logistic regression to SEER, alongside the existing Wald test, as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes: The output contains both the p-value from the Wald test and the p-value from the new LRT. The LRT is expected to be a more powerful test in some situations. I would recommend its use over the Wald test.
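For readers who want to see how the two p-values differ, here is a minimal pure-python sketch for a logistic regression with an intercept and one covariate. It is not SEER's implementation: the function name wald_and_lrt and the returned keys are made up for illustration, and it assumes the data are not separable and that both outcomes are present. The Wald test compares the slope with its standard error, while the LRT compares the fitted log-likelihood with that of an intercept-only model; both use one degree of freedom.

```python
import math


def _loglik(b0, b1, x, y):
    """Bernoulli log-likelihood of y under logit(p) = b0 + b1 * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def wald_and_lrt(x, y, max_iter=50, tol=1e-10):
    """Fit y ~ intercept + x by Newton-Raphson; return slope, SE and p-values."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Information matrix X^T W X and score X^T (y - p)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        u0 = sum(yi - pi for yi, pi in zip(y, p))
        u1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break

    # Wald test: slope variance is the matching entry of the inverse
    # information, taken from the last iterate before the final tiny step
    se = math.sqrt(i00 / det)
    z = b1 / se
    wald_p = math.erfc(abs(z) / math.sqrt(2.0))

    # LRT: compare with the intercept-only model, whose fitted
    # probability is simply the proportion of cases
    pbar = sum(y) / len(y)
    ll_null = sum(math.log(pbar) if yi == 1 else math.log(1.0 - pbar) for yi in y)
    stat = max(0.0, 2.0 * (_loglik(b0, b1, x, y) - ll_null))
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    lrt_p = math.erfc(math.sqrt(stat / 2.0))
    return {"beta": b1, "se": se, "wald_p": wald_p, "lrt_p": lrt_p}


# Toy data: 2 of 10 cases without the variant, 7 of 10 cases with it
x = [0] * 10 + [1] * 10
y = [1] * 2 + [0] * 8 + [1] * 7 + [0] * 3
print(wald_and_lrt(x, y))
```

On small or unbalanced tables the standard error used by the Wald test can be inflated, which is one reason the two p-values can diverge.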