John Lees' blog
Pathogens, informatics and modelling at EMBL-EBI
In GWAS, the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the linear mixed model (LMM), which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression, which finds a few SNPs with non-zero effect sizes.
In their paper on this model, Zhou et al. show that this hybrid method can have better prediction accuracy than either individual model on its own (both are special cases of their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.
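As a rough sketch of how the two models combine, the BSLMM can be written as below. This is my reading of the formulation in Zhou et al., so the notation ($\pi$, $\sigma_a$, $\sigma_b$, $\tau$, kinship matrix $K$) should be checked against the paper before relying on it:

```latex
\begin{align*}
y &= 1_n \mu + X\tilde{\beta} + u + \epsilon \\
\tilde{\beta}_i &\sim \pi \, \mathrm{N}(0, \sigma_a^2 \tau^{-1}) + (1 - \pi)\, \delta_0 \\
u &\sim \mathrm{MVN}_n(0, \sigma_b^2 \tau^{-1} K) \\
\epsilon &\sim \mathrm{MVN}_n(0, \tau^{-1} I_n)
\end{align*}
```

Here $\tilde{\beta}$ holds the sparse effects (a point mass at zero, $\delta_0$, with probability $1 - \pi$) and $u$ is the polygenic random effect. Setting $\pi = 0$ leaves only the LMM term, while setting $\sigma_b = 0$ leaves only the sparse regression.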
Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the same set of samples. Tanglegrams work by connecting the matching tips of the trees, then rotating subtrees to minimise the number of crossings. The algorithm was published in 2011, and continues to be used in a range of publications (for example, in genomic epidemiology).
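To make the "number of crossings" concrete, here is a minimal Python sketch that counts crossings between two tip orderings. It is not the published algorithm, which also searches over subtree rotations; the function name `count_crossings` and the toy tip labels are my own for illustration:

```python
def count_crossings(left_order, right_order):
    """Count crossing connector lines between two tip orderings."""
    position = {tip: i for i, tip in enumerate(right_order)}
    ranks = [position[tip] for tip in left_order]
    # Two connectors cross when their tips appear in opposite
    # relative order in the two trees (an inversion).
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )

left = ["A", "B", "C", "D"]
right = ["B", "A", "C", "D"]
before = count_crossings(left, right)

# Rotating the (A, B) subtree in the right-hand tree swaps its children.
rotated = ["A", "B", "C", "D"]
after = count_crossings(left, rotated)
```

A tanglegram layout method would try rotations like this one at each internal node and keep those that reduce the count.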
I have added the likelihood ratio test (LRT) for logistic regression into seer, in addition to the existing Wald test as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes:
Both the p-value from the Wald test and the p-value from the new LRT are in the output. The LRT is expected to be a more powerful test in some situations, so I would recommend its use over the Wald test.
I want to count the number of unique patterns in a VCF file. First I convert it to text with bcftools query:
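To illustrate the difference between the two tests, here is a small self-contained Python sketch for a single binary variant. This is not seer's implementation: the function names and the toy counts are made up, and the fit uses plain Newton-Raphson on an intercept plus one slope.

```python
import math

def _score_and_information(b0, b1, x, y):
    """Gradient and information matrix entries of the logistic log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def log_likelihood(b0, b1, x, y):
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi else math.log(1.0 - p)
    return total

def wald_and_lrt(x, y, max_iter=100, tol=1e-10):
    """Return (slope, Wald p-value, LRT p-value) for y ~ intercept + slope * x."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: beta += information^{-1} * gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Wald test: slope divided by its standard error, compared to a normal
    _, _, h00, h01, h11 = _score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    wald_p = math.erfc(abs(b1 / se_b1) / math.sqrt(2))
    # LRT: twice the log-likelihood gain over the intercept-only model,
    # compared to a chi-squared distribution with 1 degree of freedom
    p_null = sum(y) / len(y)
    b0_null = math.log(p_null / (1.0 - p_null))
    lrt_stat = 2.0 * (log_likelihood(b0, b1, x, y) - log_likelihood(b0_null, 0.0, x, y))
    lrt_p = math.erfc(math.sqrt(max(lrt_stat, 0.0) / 2.0))
    return b1, wald_p, lrt_p

# Toy data: variant present in 40 samples (30 cases), absent in 40 (15 cases)
x = [1] * 40 + [0] * 40
y = [1] * 30 + [0] * 10 + [1] * 15 + [0] * 25
slope, wald_p, lrt_p = wald_and_lrt(x, y)
```

On this toy table the LRT gives the smaller p-value of the two, but that is a property of this example rather than a general guarantee.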
bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt
The resulting patterns.txt is about 100GB. The best way I found to count the unique patterns in this was with the following command:
LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l
This used 1063MB RAM, took 1521s and used a maximum of around 75GB tmp space on my home (as the /tmp drive on the cluster ran out of space).
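For smaller files, the same count can be sketched in Python by keeping a fixed-size hash of each line in a set. This is an alternative I have not benchmarked against the sort approach. The function name and the 16-byte digest size are arbitrary choices, and with very many unique patterns the set itself may not fit in memory, whereas sort can spill to disk:

```python
import hashlib
import os
import tempfile

def count_unique_lines(path, digest_size=16):
    """Count distinct lines, storing a short digest per line instead of the line."""
    seen = set()
    with open(path, "rb") as handle:
        for line in handle:
            digest = hashlib.blake2b(line.rstrip(b"\n"), digest_size=digest_size).digest()
            seen.add(digest)
    return len(seen)

# Small demonstration on a temporary file with three distinct patterns
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0101\n1100\n0101\n0000\n1100\n")
    demo_path = tmp.name
demo_count = count_unique_lines(demo_path)
os.remove(demo_path)
```

Hashing means two different patterns could in principle collide and be counted once, though with a 16-byte digest this is extremely unlikely.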
I was working on an OS X system which kept getting annoying pop-ups about the system needing a clean-up, anti-virus software, etc. I was able to see that the window was titled ‘helperamc’.
It turns out this was a remnant from Advanced Mac Cleaner, the use of which I won’t comment on here. The user of the system had tried to remove it when upgrading the OS X version, but the annoying advertising component remained.
A recent paper by Earle et al. nicely showed the use of linear mixed models to determine genetic variants related to drug resistance. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you.
Here are some notes on how to get it to run, and how to correctly format the input files.
Getting gemma to work
You’ll need to use the author’s modified version of gemma, which can be downloaded here.