/images/jl11_lots.jpg

John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

setup.py not found using pip install

Trying to install PyVCF under a python (3) virtual environment gave me the following error:

(venv)johnlees@hpc:~$ pip install pyvcf
 Downloading/unpacking pyvcf
 Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
 Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
 Running setup.py egg_info for package pyvcf
 Traceback (most recent call last):
 File "", line 16, in 
 FileNotFoundError: \[Errno 2\] No such file or directory: '~/venv/build/pyvcf/setup.py'
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):

File "", line 16, in

FileNotFoundError: \[Errno 2\] No such file or directory: '~/venv/build/pyvcf/setup.py'

The solution was to upgrade setuptools:

Firth regression in python

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates when data in logistic regression is separable or close to separable (when a chi-squared contingency table has small entries).

I found that although there is an R implementation logistf I couldn’t find an equivalent in another language, or python’s statsmodels. Here is a gist with my python functions and a skeleton of how to use them and calculate p-values, in case anyone would like to use this in future without having to write the optimiser themselves.

Running BSLMM in gemma

In GWAS the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the LMM, which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression which finds a few SNPs with non-zero effect sizes.

In their paper on this model Zhou et al show that this hybrid method can have better prediction accuracy than either individual model on its own (which are special cases in their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.

Tanglegrams can be misleading

Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the sample set. Tanglegrams work by connecting the matching tips of the trees, then rotating subtrees to minimise the number of crossings. The algorithm was published in 2011, and continues to be used in a range of publications (for example genomic epidemiology).

Likelihood ratio test in SEER

I have added the likelihood ratio test (LRT) for logistic regression into seer, in addition to the existing Wald test as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes:

  • Both the p-value from the Wald test, and the p-value from the new LRT are in the output.
  • The LRT is expected to be a more powerful test in some situations. I would recommend its use over the Wald test.
  • Testing has shown some clear cases (e.g. when population structure is not a major effect) where the Wald test performs poorly, and the LRT recovers the power of a chi-squared test.
  • I have also put in a LRT for linear regression, but based on an estimate of the residual errors (which therefore gives slightly different results to R at small sample sizes). I don’t expect it to make much, if any, difference in this case.

There’s a nice article on the Wald, LRT and score tests here.

Sorting a massive file

I want to count the number of unique patterns in a vcf file. First I convert it to text with bcftools query:

bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt

The resulting patterns.txt is about 100Gb. The best way I found to count the unique patterns in this was with the following command:

LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l

This used 1063Mb RAM, took 1521s and used a maximum of around 75Gb tmp space on my home (as the /tmp drive on the cluster ran out of space).