Mean coverage from a BAM one-liner

Calculates the mean depth of coverage over the whole genome from a mapped BAM (mapped.bam below), using bedtools:

bedtools genomecov -ibam mapped.bam -g bedtools_genome.txt \
 | awk '$1=="genome" {tot += $2*$3; bases += $3} END {print tot/bases}'
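This works because genomecov's histogram output has columns of chromosome, depth, number of bases at that depth, total size and fraction of bases at that depth, with the rows labelled 'genome' summarising over all chromosomes. For example (values made up for illustration):

genome  0       1000    4000000 0.00025
genome  1       5000    4000000 0.00125
genome  2       20000   4000000 0.005

The awk command sums depth × number of bases over these rows, then divides by the total number of bases to give the mean depth.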

To generate bedtools_genome.txt, use:

samtools faidx reference.fa
cut -f 1-2 reference.fa.fai > bedtools_genome.txt

Installing PEER executable peertool

PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.

The method was first described in a 2010 PLOS Comp Bio paper:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000770
with a guide to its applications and use in a 2012 Nature Protocols paper:
http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2011.457.pdf

To install a command line version of the tool, you can clone it from the GitHub page:

git clone https://github.com/PMBio/peer

When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of the documentation). To do both, use the following commands:

cd peer && mkdir build && cd build
cmake -D CMAKE_INSTALL_PREFIX=~/software -D BUILD_PEERTOOL=1 ./..
make
make install

This will install peertool to ~/software/bin, which you can then add to your PATH.
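For example, assuming a bash shell (adjust the path if you chose a different CMAKE_INSTALL_PREFIX):

echo 'export PATH=$PATH:~/software/bin' >> ~/.bashrc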


Simulate disease state from a given odds ratio, fraction exposed, fraction affected

Problem: I have genetic data at a single variant site, where the minor allele frequency (MAF) is set and the prevalence of disease (Sr) is known. The variant is truly associated with the phenotype, at an odds ratio (OR) I want to set. How do I simulate the phenotypes given these three parameters, and whether each sample has the variant (exposed) or not?

This is analogous to simulating data in a 2×2 contingency table, as discussed on Stack Exchange here: http://stats.stackexchange.com/questions/13193/generating-data-with-a-pre-specified-odds-ratio

Solution: As alluded to in one of the comments on the Stack Exchange post, but unfortunately not derived or written out in a formatted way, the number of disease cases satisfies a quadratic equation which can be rewritten in terms of the above parameters. I found the derivation from the table of p11 (or De, the number of disease cases that have the variant) in a book appendix.

This requires a little rewriting to get it in terms of these parameters. Writing $m_1 = Sr/(Sr + 1)$ for the fraction of samples affected (as in the code below), the odds ratio in terms of the 2×2 table cells is

$$OR = \frac{D_e\,(1 - \mathrm{MAF} - m_1 + D_e)}{(\mathrm{MAF} - D_e)(m_1 - D_e)}$$

which rearranges to the quadratic

$$(OR - 1)\,D_e^2 - \left[(m_1 + \mathrm{MAF})\,OR - m_1 - \mathrm{MAF} + 1\right] D_e + OR \cdot m_1 \cdot \mathrm{MAF} = 0$$

with De given by the smaller root of the quadratic formula.

I have implemented this in C++ as a function returning a tuple:

#include <cmath>
#include <tuple>

std::tuple<double, double> p_case(const double OR, const double MAF, const double Sr)
{
   // Convenient defines
   const double m1 = Sr/(Sr + 1);

   // Quadratic equation coefficients
   const double a = OR - 1;
   const double b = -((m1 + MAF)*OR - m1 + 1 - MAF);
   const double c = OR*m1*MAF;

   // Smaller root: De, the probability of being both exposed and diseased
   const double a1 = (-b - std::sqrt(b*b - 4*a*c)) / (2*a);

   // Probabilities of disease, given exposed and given not exposed
   const double p_e = a1 / MAF;
   const double p_ne = (m1 - a1)/(1 - MAF);

   return std::make_tuple(p_e, p_ne);
}
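As a rough illustration of how these probabilities might be used to simulate phenotypes, here is a minimal sketch (with arbitrary parameter values, compiled together with p_case above; the real simulation is in subsample_seer.cpp, linked below):

#include <iostream>
#include <random>

int main()
{
   // Example values only - not from the original post
   const double OR = 2.0, MAF = 0.1, Sr = 0.5;
   double p_e, p_ne;
   std::tie(p_e, p_ne) = p_case(OR, MAF, Sr);

   std::mt19937 rng(42);
   std::bernoulli_distribution draw_variant(MAF);
   for (int i = 0; i < 10; ++i)
   {
      // Exposed samples are cases with probability p_e, unexposed with p_ne
      const bool exposed = draw_variant(rng);
      std::bernoulli_distribution draw_case(exposed ? p_e : p_ne);
      std::cout << exposed << "\t" << draw_case(rng) << "\n";
   }
   return 0;
}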

The whole code for the simulation can be found in subsample_seer.cpp here:
https://github.com/johnlees/bioinformatics/tree/master/seer

Using ALF to simulate large, closely related populations of bacteria

I am currently trying to use ALF (the stand-alone version) to simulate data from a custom tree, and include realistic parameters for SNP rate, INDEL rate, gene loss and recombination rates. This is a little different to what I think the program was originally designed for – small numbers of divergent organisms – but is probably an easier problem.

ALF is good because it includes a lot of features of evolution that more naive models don’t encompass, and gives good output useful for further simulation and testing work.

I’ve made the following notes and tweaks to fix issues as I’ve been going along, which I hope may be of use to anyone trying to use the software for this purpose.

  • For custom INDEL distributions, they must be specified in the parameters file as (note the double bracket):
    IndelModel(0.02,'CUSTOM', [[0.5,0.25,0.2,0.05]], 20)

    (thanks to the author Daniel Dalquen for helping me with this)

  • Custom trees must have no labels on the internal nodes. To ignore these you can remove the InternalLabels argument on line 820 of lib/simulator/SE_DataOutput.drw.
  • Make sure ‘unitIsPam’ is set to false for trees in units of substitutions per site, which is the default for e.g. RAxML trees.
  • If you’re simulating a lot of lateral gene transfer events with multiple genes, you’ll run into a transLoc out of range error due to a bug in the code. This can be fixed by changing line 604 in lib/simulator/SE_Evolutionary_Events.drw to
    place := Rand(0..length(geneR[org]) - lgtSize);

I have also written some helper scripts, which can be found in https://github.com/johnlees/bioinformatics/tree/master/sequence_evolution/ALF

  • gff2darwin.pl: Helps convert gff annotation files to custom input starting sequences
  • alf_db_to_fasta.pl: Converts the DB output formatting into a single fasta contig for an organism -> observed organism genome
  • alf_msa_concat.pl: Converts MSA output (which is by gene) into true alignments by organism -> true alignment
  • genes_to_contig.pl: Concatenates all contigs to create a whole genome alignment file (output from alf_msa_concat.pl) -> true alignment for population

A hierarchical Bayesian model using multinomial and Dirichlet distributions in JAGS

I am currently trying to model the state of a genetic locus in bacteria (which may take one of six values) using a hierarchical Bayesian model. This allows me to account for heterogeneity within each sample, as well as heterogeneity within each tissue type.

This is also good because:

  • I am able to incorporate results from existing papers as priors.
  • From the output I can quantify all the uncertainties within samples and tissues, and check for significantly different distributions between condition types.

I think what I have ended up working with is probably something that could also be called a mixture model.

At any rate, I wrote some R code (available on my github) that uses rjags to interface with JAGS. I started off with code based on exercise 9.2 of ‘Doing Bayesian Data Analysis’ (Kruschke 2011, 1st ed.), which models the bias of multiple coins from multiple mints using data from the number of heads observed from each coin.

I then extended this from the two outcomes (success, failure), which used a binomial distribution for the data and a beta distribution for the priors, to six outcomes, using a multinomial distribution for the data and Dirichlet distributions for the priors.

However I ran into an array of error messages when implementing this:

RUNTIME ERROR:
Compilation error on line 10.
Dimension mismatch taking subset of pi

RUNTIME ERROR:
Compilation error on line 16.
Dimension mismatch taking subset of alpha

RUNTIME ERROR:
Dimension mismatch when setting bounds in distribution ddirch

The first two of these were solved by getting all the brackets correct and consistent when transitioning from scalars to vectors and vectors to matrices. However the last message, ‘Dimension mismatch when setting bounds in distribution ddirch’, proved more insidious.

As JAGS is pretty closely based on BUGS, I searched for the problem in that domain (which seems to be a big advantage of JAGS compared to e.g. Stan – BUGS has been around for so long that most problems have already arisen and been solved) and came across the following Stack Exchange/Overflow questions that allowed me to solve the problem:
http://stackoverflow.com/questions/24349692/dirichlet-multinomial-winbugs-code
http://stats.stackexchange.com/questions/76644/define-priors-for-dirichlet-distribution-parameters-in-jags

The problem is that JAGS doesn’t allow the parameters of ddirch to be inferred. However, you can use the observation that ‘if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements delta[k] / sum(delta[1:K]), k = 1, ..., K, is Dirichlet with parameters alpha[k], k = 1, ..., K’, and infer the gamma distribution parameters instead!
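A quick way to convince yourself of this identity in R (separate from the model code):

set.seed(1)
alpha <- c(2, 5, 3)
delta <- rgamma(length(alpha), shape = alpha, rate = 1)
pi <- delta / sum(delta)  # a single draw from a Dirichlet with parameters alpha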

While you can find the full code here, the final working JAGS model code is pasted below. Note the use of the above trick, while using nested indices to allow for the hierarchical structure.

# JAGS model specification
model {
  # For each sample, likelihood and prior
  for (sample_index in 1:num_samples)
  {
    # likelihood
    # Number of reads that map to each allele is multinomial.
    # y and pi are matrices with num_samples rows and num_alleles columns
    y[sample_index,1:num_alleles] ~ dmulti(pi[sample_index,1:num_alleles], N[sample_index])

    # Dirichlet prior for proportion of population with allele in each sample
    #
    # This would be written like this:
    # pi[sample_index,1:num_alleles] ~ ddirch(alpha[tissue[sample_index],1:num_alleles])T(0.0001,0.9999)
    # Except JAGS doesn't allow the parameters of ddirch to be inferred
    #
    # Instead use the observation in the BUGS manual, and infer from a dgamma instead:
    # 'The trick is to note that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements 
    # delta[k] / sum(delta[1:K]), k = 1, ...,   K, is Dirichlet with parameters alpha[k], k = 1, ..., K.'
    for (allele_index in 1:num_alleles)
    {
      # Normalised gamma draws are jointly Dirichlet distributed
      pi[sample_index, allele_index] <- delta[sample_index, allele_index] / sum(delta[sample_index, 1:num_alleles])
      delta[sample_index, allele_index] ~ dgamma(alpha[tissue[sample_index], allele_index], 1)
    }
    
  }

  # For each tissue (blood or csf) hyperpriors
  for (tissue_index in 1:num_tissues)
  {
    # Reparameterise alpha in terms of mu and kappa
    # mu = mean allele proportions for the tissue,
    # kappa = how closely does sequence represent tissue - constant across all tissue types
    for (allele_index in 1:num_alleles)
    {
      alpha[tissue_index, allele_index] <- mu[tissue_index, allele_index] * kappa
    }

    # hyperprior for mu (Dirichlet dist)
    mu[tissue_index, 1:num_alleles] ~ ddirch(AlphaMu[])
  }

  # Kappa hyperprior (gamma dist - shape and rate)
  kappa ~ dgamma(Skappa, Rkappa)

  # Top level constants for hyperpriors
  # This is a vector of length num_alleles
  AlphaMu <- alpha_priors

  # Gamma dist for kappa. First convert mean and sd to shape and rate
  Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
  Rkappa <- meanGamma/pow(sdGamma,2)

  meanGamma <- 10
  sdGamma <- 10
}
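For completeness, this is roughly how the model might be driven from R with rjags (a sketch: y_counts and tissue_of_sample are placeholder names for your own data, and the model code above is assumed to be saved as allele_model.jags):

library(rjags)

# Hypothetical input: num_samples x num_alleles matrix of read counts,
# plus an integer tissue index for each sample
data_list <- list(
  y = y_counts,
  N = rowSums(y_counts),
  tissue = tissue_of_sample,
  num_samples = nrow(y_counts),
  num_alleles = ncol(y_counts),
  num_tissues = length(unique(tissue_of_sample)),
  alpha_priors = rep(1, ncol(y_counts))  # flat top-level Dirichlet constants
)

model <- jags.model("allele_model.jags", data = data_list, n.chains = 4, n.adapt = 1000)
update(model, 1000)  # burn-in
samples <- coda.samples(model, variable.names = c("pi", "mu", "kappa"), n.iter = 10000)
summary(samples)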

Installing Cactus (progressiveCactus)

When installing Cactus (using the progressiveCactus repository) I encountered the following issues during compiling:

  1. easy_install not found
    Solution: Needed to remove my ~/.pydistutils.cfg
  2. Dependencies/includes not being found
    Solution: add CXXFLAGS=-I"<install_location>/include/" and CFLAGS=-I"<install_location>/include/" to <install_location>/share/config.site (see the example after this list)
  3. kyototycoon not compiling as kyotocabinet functions not found (as in issue 27)
    Solution: (as in my comment on the issue) enter the kyototycoon directory and run configure with different flags, then make:

    ./configure --prefix=~/software --with-kc=~/software
    
    make
    

    where ~/software is the prefix I am installing to (with subdirs bin, lib, include, man etc)

  4. Various cases of -ltokyocabinet not found, or other dependencies missing, in the UCSC submodules stage of compilation (these use makefiles but not configure, so don’t pick up config.site)
    Solution: add:

    cflags += -I"/nfs/users/nfs_j/jl11/software/include/" -L/nfs/users/nfs_j/jl11/software/lib

    to include.mk in submodules cactus, cactus2hal, hal, pinchesAndCacti and matchingAndOrdering
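For reference, the config.site additions from point 2 would look something like this (a sketch: config.site is a shell fragment read by configure, so use your own install prefix in place of $HOME/software):

CXXFLAGS=-I"$HOME/software/include/"
CFLAGS=-I"$HOME/software/include/"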

Hope that helps anyone who might have been struggling with similar compile issues!

Diagnosing results/status of lots of LSF jobs

Over the past few months I’ve found myself running large numbers of jobs over an LSF system, for example assembling and annotating thousands of bacterial genomes or imputing thousands of human genomes in 5Mb chunks.

Inevitable some of these jobs fail, and often for a number of reasons. I thought it might be helpful to share some of the commands I’ve found useful for diagnosing the jobs that have finished. The commands apply to IBM platform LSF (bsub), but I imagine have slightly wider applicability

bjobs -a -x

This command is useful if run just after jobs finish, so that they are still in the history (they are usually cleared after a couple of hours). It will show all jobs that have finished with a non-zero exit code, and also jobs which have underrun/overrun. This is especially useful if you’ve run something that has exited with an error early on, but still returns exit code 0 (e.g. wrong command line parameters).
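If you just want a count of the problem jobs, a trivial variant works (the tail strips the header line bjobs prints):

bjobs -a -x | tail -n +2 | wc -l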

find . -name "*.o" | xargs grep -L "Successfully completed"

Assuming all your job STDOUT files have the suffix .o (bsub -o), this will show any jobs (files at least) that have not finished with exit code 0.
  • find – returns all file names which end with .o, searching recursively
  • xargs – passes these file names one by one to grep
  • grep -L – returns the names of any files which do not contain the given phrase

find . -name "*.o" | xargs grep -l "MEMLIMIT"

Similar to the above command, except it returns all those jobs that exceeded their memory limit (grep -l returns the files with a match). This makes it easy to find jobs which just need to be resubmitted with higher memory limits.
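For example, to count how many jobs hit the memory limit (assuming the same .o suffix convention):

find . -name "*.o" | xargs grep -l "MEMLIMIT" | wc -l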

This and the above command can obviously be extended by grepping for different strings in the log files.

find . -name "*.e" | xargs cat

Useful for some tasks, this will display all the output to STDERR, assuming you wrote it to files with the suffix .e (bsub -e). Some software writes its logs to STDERR, but in some cases you might expect this command to return no text.
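A related variant, to list only the jobs that actually wrote something to STDERR (GNU find):

find . -name "*.e" ! -empty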

Compiling and installing MaSuRCA/MSRCA assembler

After reading GAGE-B (dx.doi.org/10.1093/bioinformatics/btt273), an evaluation of the performance of various pieces of de novo assembly software, I was convinced to try to get MaSuRCA (http://www.genome.umd.edu/masurca.html) working even if it took a lot of effort, as the results looked very promising.

The compilation didn’t work for me: the automatically generated Makefiles had an error in them, with the compiler name missing from the executed statement. This proved too complex for me to fix quickly, and instead I went with the following solution:
EDIT 4/8/14: This solution is unlikely to work. See bottom of article for why

  1. Download and compile the Celera Assembler separately (http://wgs-assembler.sourceforge.net/) following the instructions provided. You can probably even just download a pre-compiled binary
  2. Copy the contents of wgs-<version> over the CA directory in the MaSuRCA install. If you downloaded just the binary the important part is the Linux-amd64/bin/ directory (or whatever it is for your architecture)
  3. Comment out or delete line 29 of the install.sh file:
    (cd CA/src && make LD_RUN_PATH="$LIBDIR")
  4. Run the install.sh script

This should then work, as the problem with the Makefiles only exists for the CA/src directory.

When running the assembler, you’ll need the CA/bin files installed in bin/masurca/../CA/Linux-amd64/bin, where /bin/masurca is the directory the masurca binaries were installed to. This is the default provided by the script, so in most cases it shouldn’t be a problem.

EDIT: MaSuRCA v2.2.1 comes with CA v6.1, so download this binary/source rather than the newest one (v8 onwards). Some of the command line options are different and the assemble.sh script won’t work. I’m going to try to fix this, and if I have any success I will post the result here.

EDIT 4/8/14: After reading the MaSuRCA v2 paper (dx.doi.org/10.1093/bioinformatics/btt476) I see that the authors modify the CA code slightly as part of their assembler. Without using this version the assemble.sh script won’t run, as the overlaps are not built correctly using the super-reads.
The code therefore needs to be compiled by the provided method, but as mentioned the Makefiles don’t work. I did manage to find a pre-compiled binary on their ftp site that works on my system, which I would recommend as the best solution for now.

Compiling Stampy v1.0.23 for use with cortex – error: unrecognized command line option ‘-Wl’

To assemble Illumina sequence data I am currently trialling assembly with cortex. Using their Perl script to automate the pipeline from reads in to variant calls requires vcftools and stampy to be installed; you provide the installation paths as input to the script.

However when running make using the default downloaded stampy makefile I got the following error from g++ (v4.8.1):

g++ `python2.7-config --ldflags` -pthread -shared -Wl build/linux-x86_64-2.7-ucs4/pyx/maptools.o build/linux-x86_64-2.7-ucs4/c/maputils.o build/linux-x86_64-2.7-ucs4/c/alignutils.o build/linux-x86_64-2.7-ucs4/readalign.o build/linux-x86_64-2.7-ucs4/algebras.o build/linux-x86_64-2.7-ucs4/frontend.o -o maptools.so
g++: error: unrecognized command line option ‘-Wl’

The solution was straightforward to find, as ever thanks to stackoverflow: http://stackoverflow.com/questions/21305309/g-doesnt-recognize-the-option-wl
All you need to do is edit lines 44 and 46 in the makefile, replacing the space after -Wl with a comma:

 43 ifeq ($(platform),linux-x86_64)
 44    g++ `$(python)-config --ldflags` -pthread -shared -Wl,$(objs) -o maptools.so
 45 else
 46    g++ `$(python)-config --ldflags` -pthread -dynamiclib -Wl,$(objs) -o maptools.so
 47 endif

As you can see from the surrounding if statement, this is only an issue on 64-bit linux platforms.
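If you would rather script the edit than open the makefile, something like this one-liner should work (a sketch; check the makefile’s name and that lines 44 and 46 are the only lines matching ‘-Wl ’):

sed -i 's/-Wl /-Wl,/' makefile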

I also tried compiling cortex with icc, but the compilation failed with a lot of errors. Rather than pursuing this further, I used gcc, which compiled with only some warnings about unused variables.

Impute your whole genome from 23andme data

23andme is a service which types 602352 sites on your chromosomal DNA and your mtDNA. By comparing to a reference panel in which all sites have been typed, it is possible to impute (fill in statistically) the missing sites and thus get an ‘estimate’ of your whole genome.

The piece of software impute2, written by B. N. Howie, P. Donnelly and J. Marchini, gives good accuracy when using the 1000 Genomes Project as a reference. However, there is some difficulty in providing the data in the right input format, using all the correct options and interpreting the output from this piece of software.

EDIT: As pointed out by lassefolkersen in the comments, this has now been nicely implemented at impute.me

I have written a tool to allow people with a small amount of computational experience (but not necessarily any biological/bioinformatics knowledge) to run impute2 on their 23andme data and get their whole genome output. It can be found at my github: https://github.com/johnlees/23andme-impute

To use this tool, you will need to do the following steps:

  1. Download your ‘raw data’ from the 23andme site. This is a file named something like genome_name_full
  2. Download the impute2 software from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download and follow their instructions to install it
  3. Put impute2 on the path, i.e. run (with the correct path for where you extracted impute2):
    echo 'export PATH=$PATH:/path/to/impute2' >> ~/.bashrc
    (single quotes ensure $PATH is expanded when .bashrc is read, not when the line is written)
  4. Download the 1000 Genomes reference data, which can be found on the impute2 website here:
    https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
  5. Extract this data by running:
    gunzip ALL_1000G_phase1integrated_v3_impute.tgz
    tar xf ALL_1000G_phase1integrated_v3_impute.tar
    (you will then probably want to delete the original, unextracted archive file as it is quite large)
  6. Download my code by running:
    git clone https://github.com/johnlees/23andme-impute
  7. Run ./impute_genome.pl to impute your whole genome!

The options required as input for impute_genome.pl should be reasonably straightforward, run with -h to see them, or look at the README.md on github.

As the analysis will take a lot of resources, I recommend against using the run command. I think --print or --write will be best for most people, and you can then run each job one at a time or in parallel if you have access to a cluster.

If you have any problems with this, please leave a message in the comments and I’ll try my best to get back to you.