
John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Sorting a massive file

I want to count the number of unique patterns in a VCF file. First I convert it to text with bcftools query:

    bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt

The resulting patterns.txt is about 100 GB. The best way I found to count the unique patterns in it was with the following command:

    LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l

This used 1063 MB of RAM, took 1521 s, and used a maximum of around 75 GB of temporary space in my home directory (I pointed -T there because the /tmp drive on the cluster ran out of space).
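To make the sort command reusable, it can be wrapped in a small shell function. This is only a sketch: the name count_unique_lines and its argument order are my own invention for illustration, and it assumes GNU sort (the --parallel option is not available in every sort implementation).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original post): count distinct lines in a file.
# Usage: count_unique_lines FILE [TMPDIR] [MEMORY] [THREADS]
count_unique_lines() {
    local infile="$1"
    local tmpdir="${2:-${TMPDIR:-/tmp}}"   # where sort writes its spill files
    local memory="${3:-990M}"              # cap on sort's in-memory buffer
    local threads="${4:-4}"                # number of parallel sort threads

    # LC_ALL=C compares raw bytes, which avoids slow locale-aware collation.
    # -u keeps one copy of each distinct line; wc -l then counts them.
    LC_ALL=C sort -u --parallel="$threads" -S "$memory" -T "$tmpdir" "$infile" | wc -l
}
```

For example, count_unique_lines patterns.txt ~/tmp_sort_files would repeat the run described above, with the temporary files kept off /tmp.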

Removing helperamc (Advanced Mac Cleaner) OS X

I was working on an OS X system which kept showing annoying pop-ups about the system needing a clean-up, anti-virus software, etc. I was able to see that the window was titled ‘helperamc’. It turns out this was a remnant of Advanced Mac Cleaner, the use of which I won’t comment on here. The user of the system had tried to remove it when upgrading the OS X version, but the annoying advertising component remained.

Running bugwas + gen files

A recent paper by Earle et al. nicely showed the use of linear mixed models to determine drug-resistance-related genetic variants. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you. Here are some notes on how to get it to run, and how to correctly format the input files.

Getting gemma to work

You’ll need to use the authors’ modified version of gemma, which can be downloaded here.

R packages break after OS X upgrade

I recently upgraded from OS X 10.10 to 10.11. This upgraded the version of the gfortran dynamic library from 2 to 3 (in /Library/Frameworks/R.framework/Resources/lib), which in turn causes problems in various R packages (msm, ape). For those which give an error along the lines of ‘unable to load shared object’, the solution seems to be to use install.packages recursively. Use it on the package that failed. If a dependency fails, use it on that too.

Installing PEER executable peertool

PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping. The method was first discussed in a 2010 PLOS Comp Bio paper: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000770 and a guide to its applications and use is given in a 2012 Nature Protocols paper: http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2011.457.pdf To install a command-line version of the tool, you can clone it from the GitHub page