Sorting a massive file

I want to count the number of unique patterns in a vcf file. First I convert it to text with bcftools query:

bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt

The resulting patterns.txt is about 100 GB. The best way I found to count the unique patterns in this was with the following command:

LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l

This used 1063 MB of RAM, took 1521 s and used a maximum of around 75 GB of temporary space in my home directory (as the /tmp drive on the cluster ran out of space).
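The same idea can be sanity-checked on toy data; replacing `sort -u | wc -l` with `sort | uniq -c` is a variant that additionally gives the count of each pattern:

```shell
# Toy example of the same pipeline: three genotype patterns, two unique
printf '0/1\n1/1\n0/1\n' > toy_patterns.txt
LC_ALL=C sort -u toy_patterns.txt | wc -l
# Variant: also report how often each pattern occurs
LC_ALL=C sort toy_patterns.txt | uniq -c
rm toy_patterns.txt
```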

With thanks to http://unix.stackexchange.com/questions/120096/how-to-sort-big-files

Running bugwas + gen files

A recent paper by Earle et al. nicely showed the use of linear mixed models to determine genetic variants associated with drug resistance. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you.

Here are some notes on how to get it to run, and how to correctly format the input files.

Getting gemma to work

You’ll need to use the author’s modified version of gemma, which can be downloaded here. This may not run on your system, due to the blas and lapack libraries being in unexpected places. You can easily solve this by making some symlinks.

First run ldd on the gemma executable to check which libraries cannot be found

ldd /nfs/users/nfs_j/jl11/installations/gemma.0.93b
 linux-vdso.so.1 => (0x00007fff28000000)
 libgsl.so.0 => /usr/lib/libgsl.so.0 (0x00007f9dfbcc0000)
 libgslcblas.so.0 => /usr/lib/libgslcblas.so.0 (0x00007f9dfba78000)
 libblas.so.3 => not found
 liblapack.so.3 => not found
 libstdc++.so.6 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libstdc++.so.6 (0x00007f9dfa8d0000)
 libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9dfa5d0000)
 libgcc_s.so.1 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgcc_s.so.1 (0x00007f9dfa3b8000)
 libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9df9ff8000)
 libgfortran.so.3 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgfortran.so.3 (0x00007f9df9cd8000)
 /lib64/ld-linux-x86-64.so.2 (0x00007f9dfc120000)
 libquadmath.so.0 => /software/hgi/pkglocal/gcc-4.9.1/lib/../lib64/libquadmath.so.0 (0x00007f9df9a98000)

In my case this was libblas and liblapack. Find the libraries using ‘locate liblapack’ etc. Below, I create symlinks to them in a location in my LD_LIBRARY_PATH variable, ‘/nfs/users/nfs_j/jl11/software/lib’ (you can make a similar directory and export it using ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<new directory>’).

ln -s /usr/lib/lapack/liblapack.so /nfs/users/nfs_j/jl11/software/lib/liblapack.so.3
ln -s /usr/lib/libblas/libblas.so /nfs/users/nfs_j/jl11/software/lib/libblas.so.3
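The find-and-symlink steps can be sketched as a loop. This is a hypothetical helper, not part of gemma; it assumes `locate` is available, and BINARY and LIBDIR are placeholder paths for your own system:

```shell
# Sketch: parse `ldd` for missing libraries and symlink each into a
# directory already on LD_LIBRARY_PATH. BINARY and LIBDIR are placeholders.
BINARY=~/installations/gemma.0.93b
LIBDIR=~/software/lib
ldd "$BINARY" | awk '/not found/ {print $1}' | while read -r lib; do
    # find the real library elsewhere on the system (e.g. liblapack.so)
    target=$(locate -b "${lib%.*}" | head -1)
    [ -n "$target" ] && ln -sf "$target" "$LIBDIR/$lib"
done
```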

File formatting

The bugwas gen file format is as follows:
ps sample_1_name sample_2_name
23 A C
etc.

If you have files readable by plink (stored as snps.bed, snps.bim and snps.fam) and have bcftools available, this is an easy conversion.

plink --bfile snps --recode vcf-iid --geno 0.05 --out bugwas
bcftools plugin missing2ref bugwas.vcf > bugwas2.vcf
bcftools query -f '%ID\t[%TGT\t]\n' bugwas2.vcf > bugwas_in.gen
sed -i -e 's/coor_//' -e 's/\/.//g' -e 's/\t$//' bugwas_in.gen
head -7 bugwas.vcf | tail -1 | cut -f 10- | sed 's/#/_/g' | awk '{print "ps\t" $0}' | cat - bugwas_in.gen > tmp.gen
mv tmp.gen bugwas_in.gen

As bugwas requires all sites to be imputed, this code takes the major allele wherever a site is missing. An alternative would be to use the ancestral state, which the authors suggest determining with ClonalFrameML, though this will take longer to run.
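Before running bugwas it’s worth sanity-checking the resulting gen file. This is a hypothetical check of my own (not part of bugwas): it verifies the ‘ps’ header and that no row is ragged:

```shell
# Check the first field of line 1 is "ps" and all rows have equal columns
awk 'NR==1 && $1 != "ps" {print "missing ps header"; bad=1; exit 1}
     NR==1 {n = NF}
     NF != n {print "ragged line " NR; bad=1; exit 1}
     END {if (!bad) print "OK: " NR " lines, " n " columns"}' bugwas_in.gen
```

(The `bad` flag is needed because awk still runs the END block after `exit`.)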

If you’ve got a messy vcf rather than a bed as input, a script I wrote may help you with missing/multiallelic sites.

Other pieces

You’ll also need gemma to generate the kinship matrix for the random effects:

gemma -bfile snps -gk 1 -o gemma_snps

You’ll need a phylogeny, which you can draw by standard methods (I’d recommend fasttree on your alignment if you’ve got > 1000 samples). If you’ve already got a tree make sure the tip labels match the samples in the gen file.

Run bugwas

The command you’ll need to run in R is of the form:

lin_loc(gen="bugwas_in.gen", pheno="bugwas.pheno", phylo="fasttree.tr", prefix="gwas", gem.path="/nfs/users/nfs_j/jl11/installations/gemma.0.93b", relmatrix="output/gemma_snps.cXX.txt", output.dir="./")

I’ve found that for around 2,000 samples and 110,000 SNPs I needed around 15 GB of RAM and 4-5 hours to run.

Mean coverage from a BAM one-liner

This calculates the mean depth of mapped coverage over the whole genome from a mapped bam, using bedtools:

bedtools genomecov -ibam mapped.bam -g bedtools_genome.txt \
 | awk '$1=="genome" {tot += $2*$3; reads += $3} END {print tot/reads}'

To generate bedtools_genome.txt use

samtools faidx reference.fa
cut -f 1-2 reference.fa.fai > bedtools_genome.txt
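The awk step above is just a weighted mean: depth times the number of bases at that depth, summed and divided by the total bases. A toy genomecov-style histogram makes this concrete:

```shell
# 10 bases at depth 0 plus 10 bases at depth 2: mean depth is 1
printf 'genome\t0\t10\ngenome\t2\t10\n' \
 | awk '$1=="genome" {tot += $2*$3; reads += $3} END {print tot/reads}'
```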

Installing PEER executable peertool

PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.

The method is first discussed in a 2010 PLOS Comp Bio paper:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000770
and a guide to its applications and use in a 2012 Nature Protocols paper:
http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2011.457.pdf

To install a command line version of the tool, you can clone it from the github page

git clone https://github.com/PMBio/peer

When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of the documentation). To do both, use the following commands:

cd peer && mkdir build && cd build
cmake -D CMAKE_INSTALL_PREFIX=~/software -D BUILD_PEERTOOL=1 ./..
make
make install

This will install peertool to ~/software/bin, which you can put on your path.
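For example, assuming the CMAKE_INSTALL_PREFIX used above, you’d add the prefix’s bin directory to PATH in your shell profile:

```shell
# Put the install prefix's bin directory on PATH
# (assumes the CMAKE_INSTALL_PREFIX=~/software used above)
export PATH="$HOME/software/bin:$PATH"
```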


Simulate disease state from a given odds ratio, fraction exposed, fraction affected

Problem: I have genetic data at a single variant site, where the minor allele frequency (MAF) is set, and the prevalence of disease is known (Sr). The variant is truly associated with the phenotype, at an odds ratio (OR) I want to set. How do I simulate the phenotypes given these three parameters, and whether each sample has the variant (exposed) or not?

This is analogous to simulating data in a 2×2 contingency table, as discussed on Stack Exchange here: http://stats.stackexchange.com/questions/13193/generating-data-with-a-pre-specified-odds-ratio

Solution: As alluded to in one of the comments on the Stack Exchange post, but unfortunately not derived or written out in a formatted way, the number of disease cases is the root of a quadratic equation which can be rewritten in terms of the above parameters. I found the derivation from the table of p11 (or De, the number of disease cases that have the variant) in a book appendix.

This requires a little rewriting to get in terms of these parameters, but can be written down as:

The result is the root of a quadratic in a1 (the fraction of samples that are both diseased and exposed):

a = OR - 1
b = -((m1 + MAF)*OR - m1 + 1 - MAF)
c = OR*m1*MAF
a1 = (-b - sqrt(b^2 - 4*a*c)) / (2*a)

where m1 = Sr/(Sr + 1).

I have implemented this in C++ as a function returning a tuple:

#include <cmath>
#include <tuple>

// Returns P(disease | variant), P(disease | no variant)
std::tuple<double, double> p_case(const double OR, const double MAF, const double Sr)
{
   // Convenient defines
   const double m1 = Sr/(Sr + 1);

   // Quadratic equation coefficients
   const double a = OR - 1;
   const double b = -((m1 + MAF)*OR - m1 + 1 - MAF);
   const double c = OR*m1*MAF;
   const double a1 = (-b - std::sqrt(b*b - 4*a*c)) / (2*a);

   // Probabilities to return
   const double p_e = a1 / MAF;
   const double p_ne = (m1 - a1)/(1 - MAF);

   return std::make_tuple(p_e, p_ne);
}

The whole code for the simulation can be found in subsample_seer.cpp here:
https://github.com/johnlees/bioinformatics/tree/master/seer
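As a quick numerical sanity check of the quadratic (my own sketch, with OR, MAF and m1 chosen arbitrarily, not values from the paper): solving for a1 and then recomputing the odds ratio from the two conditional probabilities should recover the input OR:

```shell
# With OR=2, MAF=0.3, m1=0.2: solve the quadratic for a1, then
# rebuild the odds ratio from p_e and p_ne; it should come back as ~2
awk 'BEGIN {
    OR = 2; MAF = 0.3; m1 = 0.2
    a = OR - 1
    b = -((m1 + MAF)*OR - m1 + 1 - MAF)
    c = OR*m1*MAF
    a1 = (-b - sqrt(b^2 - 4*a*c)) / (2*a)
    p_e  = a1 / MAF             # P(disease | variant)
    p_ne = (m1 - a1)/(1 - MAF)  # P(disease | no variant)
    printf "%.3f\n", (p_e/(1 - p_e)) / (p_ne/(1 - p_ne))
}'
```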

Using ALF to simulate large, closely related populations of bacteria

I am currently trying to use ALF (the stand-alone version) to simulate data from a custom tree, and include realistic parameters for SNP rate, INDEL rate, gene loss and recombination rates. This is a little different to what I think the program was originally designed for – small numbers of divergent organisms – but is probably an easier problem.

ALF is good because it includes a lot of features of evolution more naive models don’t encompass, and gives good output useful for further simulation and testing work.

I’ve made the following notes and tweaks to fix issues as I’ve been going along, which I hope may be of use to anyone trying to use the software for this purpose.

  • For custom INDEL distributions, they must be specified in the parameters file as (note the double bracket):
    IndelModel(0.02,'CUSTOM', [[0.5,0.25,0.2,0.05]], 20)

    (thanks to the author Daniel Dalquen for helping me with this)

  • Custom trees must have no labels on the internal nodes. To ignore these you can remove the InternalLabels argument on line 820 of lib/simulator/SE_DataOutput.drw
  • Make sure ‘unitIsPam’ is set to false for trees in units of substitutions per site (the default unit for e.g. RAxML trees)
  • If you’re simulating a lot of lateral gene transfer events with multiple genes, you’ll run into a transLoc out of range error due to a bug in the code. This can be fixed by changing line 604 in lib/simulator/SE_Evolutionary_Events.drw to
    place := Rand(0..length(geneR[org]) - lgtSize);

I have also written some helper scripts, which can be found in https://github.com/johnlees/bioinformatics/tree/master/sequence_evolution/ALF

  • gff2darwin.pl: Helps convert gff annotation files to custom input starting sequences
  • alf_db_to_fasta.pl: Converts the DB output formatting into a single fasta contig for an organism -> observed organism genome
  • alf_msa_concat.pl: Converts MSA output (which is by gene) into true alignments by organism -> true alignment
  • genes_to_contig.pl: Concatenates all contigs to create a whole genome alignment file (output from alf_msa_concat.pl) -> true alignment for population

A hierarchical Bayesian model using multinomial and Dirichlet distributions in JAGS

I am currently trying to model the state of a genetic locus in bacteria (which may be one of six values) using a hierarchical Bayesian model. This allows me to account for the fact that within a sample there is heterogeneity, as well as there being heterogeneity within a tissue type.

This is also good because:

  • I am able to incorporate results from existing papers as priors.
  • From the output I can quantify all the uncertainties within samples and tissues, and check for significantly different distributions between condition types.

I think what I have ended up working with is probably something that could also be called a mixture model.

At any rate, I wrote some R code (available on my github) that uses rjags to interface with JAGS. I started off with code based on exercise 9.2 of ‘Doing Bayesian Data Analysis’ (Kruschke 2011, 1st ed.), which models the bias of multiple coins from multiple mints using data from the number of heads observed from each coin.

I then extended this to more than two outcomes (success, fail), which use a binomial distribution for the data and a beta distribution for the priors, to six outcomes. I now use a multinomial distribution for the data and Dirichlet distributions for the priors.

However I ran into an array of error messages when implementing this:

  RUNTIME ERROR:
Compilation error on line 10.
Dimension mismatch taking subset of pi  

 RUNTIME ERROR:
Compilation error on line 16.
Dimension mismatch taking subset of alpha

 RUNTIME ERROR:
Dimension mismatch when setting bounds in distribution ddirch

The first two of these were solved by getting all the brackets correct and consistent when transitioning from scalars to vectors and vectors to matrices. However the last message, ‘Dimension mismatch when setting bounds in distribution ddirch’, proved more insidious.

As JAGS is pretty closely based on BUGS, I searched for the problem in that domain (which seems to be a big advantage of JAGS compared to e.g. Stan – BUGS has been around for so long that most problems have already arisen and been solved) and came across the following Stack Exchange/Overflow questions that allowed me to solve the problem:
http://stackoverflow.com/questions/24349692/dirichlet-multinomial-winbugs-code
http://stats.stackexchange.com/questions/76644/define-priors-for-dirichlet-distribution-parameters-in-jags

The problem is that JAGS doesn’t allow the parameters of ddirch to be inferred. However, you can use the observation that ‘if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements delta[k] / sum(delta[1:K]), k = 1, …, K, is Dirichlet with parameters alpha[k], k = 1, …, K’, and infer the gamma distribution instead!

While you can find the full code here, the final working JAGS model code is pasted below. Note the use of the above trick, while using nested indices to allow for the hierarchical structure.

# JAGS model specification
model {
  # For each sample, likelihood and prior
  for (sample_index in 1:num_samples)
  {
    # likelihood
    # Number of reads that map to each allele is multinomial.
    # y and pi are matrices with num_samples rows and num_alleles columns
    y[sample_index,1:num_alleles] ~ dmulti(pi[sample_index,1:num_alleles], N[sample_index])

    # Dirichlet prior for proportion of population with allele in each sample
    #
    # This would be written like this:
    # pi[sample_index,1:num_alleles] ~ ddirch(alpha[tissue[sample_index],1:num_alleles])T(0.0001,0.9999)
    # Except JAGS doesn't allow the parameters of ddirch to be inferred
    #
    # Instead use the observation in the BUGS manual, and infer from a dgamma instead:
    # 'The trick is to note that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements 
    # delta[k] / sum(delta[1:K]), k = 1, ...,   K, is Dirichlet with parameters alpha[k], k = 1, ..., K.'
    for (allele_index in 1:num_alleles)
    {
      pi[sample_index, allele_index] <- delta[sample_index, allele_index] / sum(delta[sample_index, 1:num_alleles])
      delta[sample_index, allele_index] ~ dgamma(alpha[tissue[sample_index], allele_index], 1)
    }
    
  }

  # For each tissue (blood or csf) hyperpriors
  for (tissue_index in 1:num_tissues)
  {
    # Convert a and b in beta prior, to mu and kappa
    # mu = mean allele for tissue,
    # kappa = how closely does sequence represent tissue - constant across all tissue types
    for (allele_index in 1:num_alleles)
    {
      alpha[tissue_index,allele_index] <- mu[tissue_index,allele_index] * kappa
    }
 # hyperpriors for mu (beta dist)
 mu[tissue_index,1:num_alleles] ~ ddirch(AlphaMu[])
 }

 # Kappa hyperprior (gamma dist - shape and rate)
 kappa ~ dgamma(Skappa, Rkappa)

 # Top level constants for hyperpriors
 # This is a vector of length num_alleles
 AlphaMu <- alpha_priors

 # Gamma dist for kappa. First convert mean and sd to shape and rate
 Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
 Rkappa <- meanGamma/pow(sdGamma,2)

 meanGamma <- 10
 sdGamma <- 10
}

Installing Cactus (progressiveCactus)

When installing Cactus (using the progressiveCactus repository) I encountered the following issues during compiling:

  1. easy_install not found
    Solution: Needed to remove my ~/.pydistutils.cfg
  2. Dependencies/includes not being found
    Solution: add 'CXXFLAGS=-I"<install_location>/include/"' and 'CFLAGS=-I"<install_location>/include/"' to <install_location>/share/config.site
  3. kyototycoon not compiling as kyotocabinet functions not found (as in issue 27)
    Solution (as in my comment on the issue): enter the kyototycoon directory and run configure with different flags, then make:

    ./configure --prefix=~/software --with-kc=~/software
    make

    where ~/software is the prefix I am installing to (with subdirs bin, lib, include, man etc)

  4. Various cases of -ltokyocabinet not found, or other dependencies missing, in the UCSC submodule stages of compilation (which use makefiles but not configure, so don’t pick up config.site)
    Solution: add:

    cflags += -I"/nfs/users/nfs_j/jl11/software/include/" -L/nfs/users/nfs_j/jl11/software/lib

    to include.mk in submodules cactus, cactus2hal, hal, pinchesAndCacti and matchingAndOrdering

Hope that helps anyone who might have been struggling with similar compile issues!

Diagnosing results/status of lots of LSF jobs

Over the past few months I’ve found myself running large numbers of jobs over an LSF system, for example assembling and annotating thousands of bacterial genomes or imputing thousands of human genomes in 5Mb chunks.

Inevitably some of these jobs fail, often for a number of reasons. I thought it might be helpful to share some of the commands I’ve found useful for diagnosing the jobs that have finished. The commands apply to IBM Platform LSF (bsub), but I imagine they have slightly wider applicability.

bjobs -a -x

This command is useful if run just after jobs finish, so that they are still in the history (they are usually cleared after a couple of hours). It will show all jobs that have finished with a non-zero exit code, and also jobs which have underrun/overrun. This is especially useful if you’ve run something that has exited with an error early on, but still returns exit code 0 (e.g. wrong command line parameters).
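Relatedly, you can filter full `bjobs -a` output down to jobs in the EXIT state with awk (STAT is the third field in default bjobs output, as far as I recall; check yours):

```shell
# Keep the header plus any job whose STAT column reads EXIT
bjobs -a | awk 'NR==1 || $3=="EXIT"'
```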

find . -name "*.o" | xargs grep -L "Successfully completed"

Assuming all your job STDOUT files have the suffix .o (bsub -o), this will show any jobs (files at least) that have not finished with exit code 0.
  • find – returns all file names ending in .o, searching recursively
  • xargs – passes these file names one by one to grep
  • grep -L – returns the names of files which do not contain the given phrase

find . -name "*.o" | xargs grep -l "MEMLIMIT"

Similar to the above command, except returns all those jobs that exceeded their memory limit. grep -l returns files with the match.
Makes it easy to find jobs which just need to be resubmitted with higher memory limits.

This and the above command can obviously be extended by grepping for different strings in the log files.
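Putting the pieces together, a hypothetical resubmission helper (the .o/.sh suffix pairing and the -M value are assumptions about your setup, not LSF requirements) might print the bsub commands needed to rerun MEMLIMIT casualties with more memory:

```shell
# For each log that hit MEMLIMIT, print a resubmission command for the
# matching job script (assumes job.sh produced job.o; adjust -M to taste)
find . -name "*.o" | xargs grep -l "MEMLIMIT" \
 | sed 's/\.o$/.sh/' \
 | while read -r job; do echo "bsub -M 8000 -o ${job%.sh}.o < $job"; done
```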

find . -name "*.e" | xargs cat

Useful for some tasks, this will display all the output to STDERR, assuming you wrote it to files with the suffix .e (bsub -e). Some software writes logs to STDERR, but in some cases you might expect this command to return no text.

Compiling and installing MaSuRCA/MSRCA assembler

After reading GAGE-B (dx.doi.org/10.1093/bioinformatics/btt273), an evaluation of the performance of various pieces of de novo assembly software, I was convinced to try to get MaSuRCA (http://www.genome.umd.edu/masurca.html) working even if it took a lot of effort, as the results looked very promising.

The compilation didn’t work for me: the automatically generated Makefiles had an error in them where the compiler name was missing from the executed statement. This proved too complex for me to fix quickly, and instead I went with the following solution:
EDIT 4/8/14: This solution is unlikely to work. See the bottom of the article for why.

  1. Download and compile the Celera Assembler separately (http://wgs-assembler.sourceforge.net/) following the instructions provided. You can probably even just download a pre-compiled binary
  2. Copy the contents of wgs-<version> over the CA directory in the MaSuRCA install. If you downloaded just the binary the important part is the Linux-amd64/bin/ directory (or whatever it is for your architecture)
  3. Comment out or delete line 29 of the install.sh file:
    (cd CA/src && make LD_RUN_PATH="$LIBDIR")
  4. Run the install.sh script

This should then work, as the problem with the Makefiles only exists for the CA/src directory.

When running the assembler, you’ll need the CA/bin files installed in bin/masurca/../CA/Linux-amd64/bin, where /bin/masurca is the directory the masurca binaries were installed to. This is the default provided by the script, so in most cases shouldn’t be a problem.

EDIT: MaSuRCA v2.2.1 comes with CA v6.1, so download this binary/source rather than the newest one (v8 onwards).
Some of the command line options are different and the assemble.sh script won’t work. I’m going to try to fix this, and if I have any success will post the result here.

EDIT 4/8/14: After reading the MaSuRCA v2 paper (dx.doi.org/10.1093/bioinformatics/btt476) I see that the authors modify the CA code slightly as part of their assembler. Without this version the assemble.sh script won’t run, as the overlaps are not built correctly from the super-reads.
Therefore the code needs to be compiled by the provided method, but as mentioned the Makefiles don’t work. I did manage to find a pre-compiled binary that works on my system on their FTP site, which I would recommend as the best solution for now.