John Lees' blog

Pathogens, informatics and modelling at EMBL-EBI

Diagnosing results/status of lots of LSF jobs

Over the past few months I’ve found myself running large numbers of jobs on an LSF system, for example assembling and annotating thousands of bacterial genomes, or imputing thousands of human genomes in 5Mb chunks. Inevitably some of these jobs fail, often for several different reasons. I thought it might be helpful to share some of the commands I’ve found useful for diagnosing the jobs that have finished. The commands apply to IBM Platform LSF (bsub), but I imagine they have slightly wider applicability.
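As a rough illustration of the kind of check this involves, here is a minimal sketch that scans a directory of LSF output logs. It assumes each job wrote its report to a file ending in `.o` (for example via `bsub -o`), and that the report contains the usual `Successfully completed.` or `Exited with exit code N.` line. The function names and the `.o` naming are hypothetical, not taken from the post.

```shell
#!/bin/sh
# Sketch only: assumes one LSF report per job in <dir>/*.o (hypothetical naming).

# Print the names of logs that do NOT contain the success line.
failed_lsf_logs() {
    log_dir="$1"
    # grep -L lists files with no matching line
    grep -L "Successfully completed" "$log_dir"/*.o
}

# Tally the exit codes reported across all logs, most common first.
exit_code_summary() {
    log_dir="$1"
    grep -h "^Exited with exit code" "$log_dir"/*.o | sort | uniq -c | sort -rn
}

# Example usage (directory name is an assumption):
#   failed_lsf_logs logs
#   exit_code_summary logs
```

Listing the files that lack the success line, rather than those that contain an error line, also catches jobs that were killed before writing a normal exit message.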

Parallel MCMC

On GitHub: https://github.com/johnlees/pMCMC

Parallel implementation of MCMC using MPI, coded by Hákon Jónsson, John Lees and Tobias Madsen. Code is available as C++. Under testing, implementations in R and Perl do not provide speedups due to execution overheads, but are included as easier to read ‘pseudocode’ if required. Details can be found in this draft paper: pMCMC

Acknowledgements: This work was completed for the Oxford Summer School in Computational Biology 2012 (http://www.

Compiling and installing MaSuRCA/MSRCA assembler

After reading GAGE-B (dx.doi.org/10.1093/bioinformatics/btt273), an evaluation of the performance of various pieces of de novo assembly software, I was convinced to try to get MaSuRCA (http://www.genome.umd.edu/masurca.html) working even if it took a lot of effort, as the results looked very promising. The compilation didn’t work for me: the automatically generated Makefiles contained an error where the compiler name was missing from the executed statement. This proved too complex for me to fix quickly, and instead I went with the following solution: EDIT 4/8/14: This solution is unlikely to work.

Display env variable, tmux and zsh over ssh

I have been using zsh within tmux, and found that, upon reattaching tmux, X forwarding wasn’t working. For example, when trying to launch gvim I’d get the error: E233: cannot open display. The problem, as a quick Google search revealed, is that each time I ssh into my server a new $DISPLAY environment variable is set. When I run ‘tmux attach’ the new $DISPLAY variable is passed through (see http://stackoverflow.com/questions/8645053/how-do-i-start-tmux-with-my-current-environment) so any new windows within tmux will have the correct environment.
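Shells that were already open inside tmux before reattaching would still hold the old value. One possible way to refresh them is sketched below, using tmux's `show-environment` command to read the session's current value. The function name `refresh_display` is hypothetical, and this is not necessarily the fix described in the full post.

```shell
# Sketch only: refresh $DISPLAY in a shell that was already running inside tmux.
refresh_display() {
    # `tmux show-environment DISPLAY` prints "DISPLAY=value",
    # or "-DISPLAY" if the variable is marked for removal.
    new_display=$(tmux show-environment DISPLAY 2>/dev/null | sed -n 's/^DISPLAY=//p')
    # Only overwrite the current value if tmux reported one
    if [ -n "$new_display" ]; then
        export DISPLAY="$new_display"
    fi
}

# Example usage, run inside an existing tmux pane after reattaching:
#   refresh_display && gvim
```

The function works in both zsh and POSIX sh, so it could be placed in a shell startup file and called by hand when X forwarding stops working.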

Compiling Stampy v1.0.23 for use with cortex - error: unrecognized command line option ‘-Wl’

To assemble Illumina sequence data I am currently trialling assembly with cortex. Using their Perl script to automate the pipeline from reads to variant calls requires vcftools and stampy to be installed, with the installation paths provided as input to the script. However, when running make using the default downloaded stampy makefile I got the following error from g++ (v4.8.1): g++ `python2.7-config --ldflags` -pthread -shared -Wl build/linux-x86_64-2.

Impute your whole genome from 23andMe data

23andMe is a service which types 602352 sites on your chromosomal DNA and your mtDNA. It is possible, by comparing to a reference panel in which all sites have been typed, to impute (fill in statistically) the missing sites and thus get an ‘estimate’ of your whole genome. The software impute2, written by B. N. Howie, P. Donnelly, and J. Marchini, gives good accuracy when using the 1000 Genomes Project as a reference.