Impute your whole genome from 23andme data

23andme is a service which genotypes 602352 sites on your chromosomal DNA and your mtDNA. By comparing to a reference panel in which all sites have been typed, it is possible to impute (fill in statistically) the missing sites and thus get an ‘estimate’ of your whole genome.

The impute2 software, written by B. N. Howie, P. Donnelly and J. Marchini, gives good accuracy when using the 1000 Genomes Project as a reference. However, it takes some work to provide the data in the right input format, use all the correct options and interpret the output.

EDIT: As pointed out by lassefolkersen in the comments, this has now been nicely implemented at impute.me

I have written a tool to allow people with a small amount of computational experience (but not necessarily any biological/bioinformatics knowledge) to run impute2 on their 23andme data and get their whole genome as output. It can be found at my github: https://github.com/johnlees/23andme-impute

To use this tool, you will need to do the following steps:

  1. Download your ‘raw data’ from the 23andme site. This is a file named something like genome_name_full
  2. Download the impute2 software from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download and follow their instructions to install it
  3. Put impute2 on the path, i.e. run (with the correct path for where you extracted impute2):
    echo 'export PATH=$PATH:/path/to/impute2' >> ~/.bashrc
  4. Download the 1000 Genomes reference data, which can be found on the impute2 website here:
    https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
  5. Extract this data by running:
    gunzip ALL_1000G_phase1integrated_v3_impute.tgz
    tar xf ALL_1000G_phase1integrated_v3_impute.tar
    (you will then probably want to delete the original, unextracted archive file as it is quite large)
  6. Download my code by running:
    git clone https://github.com/johnlees/23andme-impute
  7. Run ./impute_genome.pl to impute your whole genome!

The options required as input for impute_genome.pl should be reasonably straightforward; run with -h to see them, or look at the README.md on github.

As the analysis will take a lot of computational resources, I recommend against using the run mode. I think --print or --write will be best for most people; you can then run each job one at a time, or in parallel if you have access to a cluster.
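For example, if the --write option leaves you with one shell script per job, they could be run in sequence with a loop like the one below (the file name pattern is purely an illustrative guess; use whatever impute_genome.pl actually produces):

# Run each written job script in turn
# (impute_job_*.sh is a hypothetical pattern, not necessarily what the script writes)
for script in impute_job_*.sh; do
    bash "$script"
done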

If you have any problems with this, please leave a message in the comments and I’ll try my best to get back to you.

A new direction for leesjohn

Since October 2013 I have stopped using Fedora, and instead use machines running Ubuntu 12.04/13.10, Windows 8 and OS X 10.8.5. As these OSs have a larger user base than Fedora, many of the issues I encounter are well documented and easy to fix (i.e. there is a stackexchange post as one of the top three google results), hence there haven’t been many things for me to post under the original remit of this blog.

Of course, when I do encounter an undocumented OS-based issue as I go about my business I’ll still try to post it on leesjohn. However, I expect this to be much less common than previously, and the new computing-based issues I find myself having to deal with are:

  • Interactions and differences between OS X and Ubuntu when working with them simultaneously
  • Working with Ubuntu without a sudo account (e.g. installing software, using custom libraries)
  • Use of radio software (e.g. Rivendell, Cuedex, Jack)

I have now changed field from physics to bioinformatics, and think there is scope to share many of the scripts and programs I write for this, as well as solutions to issues I encounter in the area. So I have finally gotten round to setting up a github account (https://github.com/johnlees) to share as much of the code I write as possible.

From now on leesjohn will primarily be used to document the scripts in these repositories, and to share some original tools. I’ve already committed some things, which you can see at:

  • https://github.com/johnlees/bioinformatics (for bioinformatics tools)
  • https://github.com/johnlees/config (for software-related configuration)

Hopefully this will make it very easy if someone ever does want to use some of the stuff I’ve written.

In the next few weeks I am hoping to write some posts about some of the more useful/general things in these repositories. I am also planning on making a wrapper script that lets you use impute2 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html) to infer your whole genome from the ‘raw data’ you get if you have had a 23andme test done (https://www.23andme.com/). As far as I can tell this is not yet publicly available, but it is something I think many of 23andme’s clients could be interested in.

A latex bibliography style I like (Nature style in biblatex)

This isn’t really a specific question, but I needed to make a bibliography in latex with the following requirements:

  • The citations take up as little space as possible, so should probably be superscript
  • The citations should be correctly grouped (i.e. 1-3, 6 not 6, 2, 3, 1)
  • The bibliography can take up any amount of space
  • The citations should be linked to their bibliography entry (i.e. hyperref compatible)
  • bib entries contain unicode characters
  • I want the entries to look like the Elsevier standard, though Nature is also fine
  • I want DOIs, properly displayed and hyperlinked, not monospaced

The style=nature option supplied to biblatex in the preamble (see http://ctan.org/pkg/biblatex-nature) achieves most of this, but you don’t get DOIs, there seemed to be some problems with unicode characters (particularly Polish names, see http://www.terminally-incoherent.com/blog/reference/latex-reference/), and there were some problems displaying URLs well.

Rather than trying to hack together a biblatex.cfg based on the nature style (which I didn’t understand/couldn’t be bothered to read through), I was instead able to use a standard biblatex style with some options when loading the package:

\usepackage[style=numeric-comp,
maxcitenames=2,
maxnames = 5,
firstinits=true,
uniquename=init,
sorting=none,
url=false,
isbn=false,
eprint=false,
texencoding=utf8,
bibencoding=utf8,
autocite=superscript,
backend=biber
]{biblatex}

This gets pretty close, but I also needed to use the following biblatex.cfg (create this file in the same directory as the .tex file):

% Number in parenthesis
\renewbibmacro*{volume+number+eid}{%
%  \setunit*{\addcomma\space}% NEW
  \printfield{volume}%
%  \setunit*{\adddot}% DELETED
%  \setunit*{\addcomma\space}% NEW
  \iffieldundef{number}
    {}
    {\bibopenparen
     \printfield{number}%
     \bibcloseparen}
  \setunit{\addcomma\space}%
  \printfield{eid}}

% Field formats for the bibliography environment (get rid of square brackets)
\DeclareFieldFormat{labelnumberwidth}{#1\adddot}

%Get rid of in:
\renewbibmacro{in:}{}

%Get rid of pp.
\DeclareFieldFormat[article,inproceedings,incollection]{pages}{#1}

%Make volume number emboldened
\DeclareFieldFormat[article,inproceedings,incollection]{volume}{\textbf{#1}}

%Journal name in non-italics
%\DeclareFieldFormat[article,inbook,incollection,inproceedings,patent,thesis,unpublished]{journaltitle}{#1}

%No quotes around article name
\DeclareFieldFormat
  [article,inbook,incollection,inproceedings,patent,thesis,unpublished]
  {title}{#1\isdot}

%Bibliography in smaller font size, and unjustified
\renewcommand{\bibfont}{\normalfont\small\raggedright}

%Hyperlinks in serif font
\def\UrlFont{\normalfont}

%DOI lower case, normal font
\renewcommand*{\mkbibacro}[1]{%
  \ifcsundef{\f@encoding/\f@family/\f@series/sc}
    {#1}
    {\MakeLowercase{#1}}}

%Colon after author names
\renewcommand{\labelnamepunct}{\addcolon\space}

Which got me what I wanted:

[Image: the resulting bibliography]
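For completeness, a minimal document skeleton using this setup might look something like the sketch below (references.bib and the citation keys are placeholders, and the option list is abbreviated to the essentials from above):

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
% abbreviated option list; use the full set given above
\usepackage[style=numeric-comp,sorting=none,autocite=superscript,backend=biber]{biblatex}
\addbibresource{references.bib} % placeholder .bib file

\begin{document}
Some claim\autocite{key1,key2}. % superscript citation, compressed by numeric-comp
\printbibliography
\end{document}

The biblatex.cfg is picked up automatically as long as it sits in the same directory as the .tex file, and the document is compiled with pdflatex, then biber, then pdflatex again.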

DPPC (Dipalmitoylphosphatidylcholine), DSPC and DMPC in Latex

Using chemfig I was able to represent DPPC (Dipalmitoylphosphatidylcholine) and other lipids in Latex with the following code:

\newcommand\setpolymerdelim[2]{\def\delimleft{#1}\def\delimright{#2}}
\def\makebraces[#1,#2]#3#4#5{%
\edef\delimhalfdim{\the\dimexpr(#1+#2)/2}%
\edef\delimvshift{\the\dimexpr(#1-#2)/2}%
\chemmove{%
\node[at=(#4),yshift=(\delimvshift)]
{$\left\delimleft\vrule height\delimhalfdim depth\delimhalfdim
width0pt\right.$};%
\node[at=(#5),yshift=(\delimvshift)]
{$\left.\vrule height\delimhalfdim depth\delimhalfdim
width0pt\right\delimright_{\rlap{$\scriptstyle#3$}}$};}}
\setpolymerdelim[]


\begin{figure}
\small
\setatomsep{1.5em}
\chemfig{N^+(-[:180,1.1]H_3C)(-[:90,1.3]CH_3)(-[:270,1.3]CH_3)(-[:-30]-[:30]-[:-30]O-[:30,1.3]P^+(<[:50,1.5]O\rlap{${}^-$})(<:[:130,1.5]O\rlap{${}^-$})(-#(1pt,)[:330,1.3]O-[:30]-[:-30](-[:270]O-[:-30](=[:270]O)(-[@{downleft,0.8}:30]CH_2-#(1pt,1pt)[@{downright,0.3}:-30,1.2]CH_3))(-[:30]-[:-30]O-[:30](=[:90]O)(-[@{upleft,0.8}:-30]CH_2-#(1pt,1pt)[@{upright,0.3}:30,1.2]CH_3))))}
\makebraces[10pt,13pt]{n}{downleft}{downright}
\makebraces[6pt,15pt]{n}{upleft}{upright}
\label{fig:lipids}
\end{figure}

The crucial line is:
\chemfig{N^+(-[:180,1.1]H_3C)(-[:90,1.3]CH_3)(-[:270,1.3]CH_3)(-[:-30]-[:30]-[:-30]O-[:30,1.3]P^+(<[:50,1.5]O\rlap{${}^-$})(<:[:130,1.5]O\rlap{${}^-$})(-#(1pt,)[:330,1.3]O-[:30]-[:-30](-[:270]O-[:-30](=[:270]O)(-[@{downleft,0.8}:30]CH_2-#(1pt,1pt)[@{downright,0.3}:-30,1.2]CH_3))(-[:30]-[:-30]O-[:30](=[:90]O)(-[@{upleft,0.8}:-30]CH_2-#(1pt,1pt)[@{upright,0.3}:30,1.2]CH_3))))}
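If the chemfig syntax is unfamiliar: each bond may be followed by options in square brackets, where :angle gives the absolute bond angle in degrees (anticlockwise from horizontal) and an optional second number scales the bond length relative to \setatomsep. A trivial example, unrelated to the lipid above:

\chemfig{H_3C-[:30]CH_2-[:-30]OH}  % bonds drawn at +30 and -30 degrees
\chemfig{A-[:90,1.5]B}             % vertical bond, 1.5 times the default length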

You’ll also need to include the following in the preamble:

\usepackage{chemfig}

Which produces something that looks like this:

[Image: the rendered lipid structures]

Samsung Galaxy S3 with Fedora

The new Android phones no longer work as USB mass storage devices, and instead use MTP (Media Transfer Protocol). Not that I really know much about it, or its advantages over the previous system.

Fortunately a very helpful blog post at http://tacticalvim.wordpress.com/2012/12/08/mounting-nexus-4-via-mtp-in-fedora-17/ guided me most of the way (the Nexus 4 and Galaxy S3 are very similar). I had to make a couple of changes as the device IDs were different, but the instructions are essentially the same, so have a look there first.

Firstly install simple-mtpfs:

sudo yum -y install fuse fuse-libs libmtp simple-mtpfs

Check it’s worked with:

ls -l /dev/libmtp*

This should list a symlink between a libmtp device and somewhere in bus/usb. Then create /etc/udev/rules.d/99-galaxyS3.rules with the following content:

ACTION!="add", GOTO="galaxyS3_rules_end"
ENV{MAJOR}!="?*", GOTO="galaxyS3_rules_end"
SUBSYSTEM=="usb", GOTO="galaxyS3_usb_rules"
GOTO="galaxyS3_rules_end"

LABEL="galaxyS3_usb_rules"

# Galaxy SIII I-9300
ATTR{idVendor}=="04e8", ATTR{idProduct}=="6860", SYMLINK+="libmtp-%k", ENV{ID_MTP_DEVICE}="1", ENV{ID_MEDIA_PLAYER}="1"

LABEL="galaxyS3_rules_end"

Get the correct idVendor (VID) and idProduct (PID) by running simple-mtpfs -l

You can then set commands to mount and unmount by adding the following to your ~/.bashrc, where ~/mnt/galaxyS3 is the directory your phone’s storage will be mounted to:

alias S3mount="simple-mtpfs ~/mnt/galaxyS3"
alias S3umount="fusermount -u ~/mnt/galaxyS3"

The commands on the right-hand side can, of course, be used directly to mount and unmount.
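After opening a new shell (or running source ~/.bashrc), a typical session might then look like this:

mkdir -p ~/mnt/galaxyS3   # one-off: create the mount point
S3mount                   # mount the phone's storage over MTP
ls ~/mnt/galaxyS3         # browse/copy files as normal
S3umount                  # unmount before unplugging the phone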

You’ll need to reboot to get it to work; I had to unplug and replug the phone too. troyengel reports on his tacticalvim blog that he had to run the S3mount command 2-3 times to get it to work.

If you’re having trouble I’d recommend looking at the simple-mtpfs documentation.

Installing Times New Roman in Fedora

I imported a pdf I had made in gnuplot into inkscape; the pdf used Times New Roman as a font, but Times wasn’t installed, so the font was substituted with sans.
I have a pretty shaky knowledge of how fonts work, and of why Times worked in gnuplot but wasn’t available to other programs, but my solution was as follows:

Follow the instructions to use the script to install all msttcorefonts at:
http://blog.andreas-haerter.com/2011/07/01/install-msttcorefonts-fedora.sh

wget "http://blog.andreas-haerter.com/_export/code/2011/07/01/install-msttcorefonts-fedora.sh?codeblock=1" -O "/tmp/install-msttcorefonts-fedora.sh"
chmod a+rx "/tmp/install-msttcorefonts-fedora.sh"
su -c "/tmp/install-msttcorefonts-fedora.sh"

After rebooting this worked, but some of the fonts used by webpages were now screwed up (I believe Trebuchet and Verdana).

I fixed this by (as root) navigating to /usr/share/fonts, ensuring that, of the newly installed fonts, only the Times ones remain available, and refreshing the font cache:

cd /usr/share/fonts
mkdir mstt-times
cp msttcorefonts/times* mstt-times
mv msttcorefonts .msttcorefonts
fc-cache -v

This initially left a lot of text unrendered in the browser, but after rebooting everything worked as I wanted, and Times was used correctly and automatically in inkscape.

I also found the following page useful: http://www.pwsdb.com/pgm/?p=172

Using custom text when using \ref in latex

I couldn’t find how to do this easily, but perhaps this is because I used rubbish search terms.

I eventually found my answer on http://en.wikibooks.org/wiki/LaTeX/Labels_and_Cross-referencing (which ended up telling me lots of useful things about the hyperref package I didn’t know)

First load the hyperref package in the preamble:

\usepackage{hyperref}

You’ll probably want to provide some options to make it look nicer. See the manual linked from the ctan page: http://www.ctan.org/pkg/hyperref

You can then add references, choosing the link text yourself, with a command of the form:

\hyperref[label-name]{link-text}

It helps to illustrate this with an example. In my case I have a figure 4, composed of 3 sub-figures 4a, 4b and 4c (though these are simply part of the same image, not specified as separate figures in latex). My figure is labelled ‘SEM’ and I want to reference figure 4c including a hyperlink to the figure it appears in. I can do this using:

\hyperref[fig:SEM]{\ref*{fig:SEM}c}

This sends the link to the SEM figure, and puts ‘4c’ as the hyperlinked text. Using \ref* in the curly brackets ensures the figure number is updated if it changes from 4, which is the usual behaviour we desire.

Another thing I came across on the wikibooks page was the \autoref command provided by hyperref. This looks like a better idea than using \ref and constantly typing ‘figure’, and could straightforwardly be used in the above example by changing \ref* to \autoref*.
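As a quick illustration of the difference, using the same fig:SEM label (with hyperref’s defaults, \autoref supplies the word ‘Figure’ itself):

See figure~\ref{fig:SEM}.  % you type the word 'figure' yourself
See \autoref{fig:SEM}.     % produces e.g. 'Figure 4', hyperlinked, automatically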

Converting .ndv files into .csv (for the UV/Vis nanodrop spectrometers)

The Thermo Scientific NanoDrop range of UV/Vis spectrophotometers (http://www.nanodrop.com/Absorbance.aspx) seems pretty good to me, apart from the terrible software.

I used the NanoDrop 1000 Spectrophotometer, and found the software to be so unintuitive that I had to read the entirety of the manual for the sections I used (which can be found at http://goo.gl/smXCT). The two main points I would note are:

  • The data can be exported from the report: select ‘Show report’ -> ‘Save report’ -> ‘Full report’, which will save a .ndv file of all the spectra taken in the current session.
  • The absorbance data at 0.2mm (NB on the report it is shown for 0.1mm, so you can visually confirm it is a tenth of the 1mm absorbance; see the wikipedia page on the Beer-Lambert law for an explanation of this linearity) is stored in C:\NanoDrop Data\User name\HiAbs; there is no other way to get to it.

The .ndv files saved are tab-delimited files which you can load into spreadsheet software to manipulate and plot. As I wanted to do further spectral analysis this was a bit useless to me, so I wrote a quick perl script to convert these files into a set of .csv files (one for each spectrum listed in the .ndv file).

This is available at my other site (for now; this link will expire in July 2013) at http://users.ox.ac.uk/~ball3126/ndv-converter.pl and is also listed below. Run ‘ndv-converter.pl --help’ for a usage guide. The default behaviour is to convert all .ndv files in the current working directory.

[Attachment: ndv-converter code listing (pdf)]

How to install texlive (full) on Fedora 17 – and why

Fedora provides a texlive package (http://fedoraproject.org/wiki/Features/TeXLive); however, it is incomplete, usually out of date, and I haven’t been able to easily install new latex packages through it. In theory new packages can be installed by issuing the command:

yum install 'tex(epsfig.sty)'

However this never worked for me, and despite some searching I couldn’t work out what was going wrong.

Personally, as someone with plenty of free disk space, I’ve found the best solution is to install the full version of texlive. Certainly, ever since doing so I’ve never had any problems compiling my latex files and haven’t had to think about the install ever since.

This excellent post on the tex StackExchange describes in detail how to do this with Ubuntu:
http://tex.stackexchange.com/a/95373
I would recommend reading it before following any of the advice here.

For fedora it may be slightly different (especially in faking packages, see step 1 below), but in summary what I did was as follows:

  1. Install the official package from fedora using ‘yum install texlive’ (so that software with tex as a dependency can be installed)
  2. Download the installer for the full texlive from http://www.tug.org/texlive/acquire-netinstall.html
  3. Run the install-tl script
  4. Make sure the install path is /opt (or /opt/texlive if you’d like)
  5. Add /opt/texlive/2012/bin/x86_64-linux (with the correct year) to the path (see e.g. http://askubuntu.com/questions/60218/how-to-add-a-directory-to-my-path if unsure how to do this), making sure it’s added before /usr/bin so the correct latex programs are used rather than the ones from fedora’s texlive package (an example line is given just after this list)
  6. If using some software such as texmaker to edit and compile your latex, make sure it is correctly configured to run pdflatex, biber etc. from /opt/texlive/2012/bin/x86_64-linux (e.g. for texmaker follow the instructions at http://www.xm1math.net/texmaker/doc.html#SECTION02)
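As an example of step 5, a single line in ~/.bashrc is enough; this assumes the 2012 release on a 64-bit machine, so adjust the year and architecture to match your install. Prepending the directory means it takes priority over /usr/bin:

export PATH=/opt/texlive/2012/bin/x86_64-linux:$PATH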

Advantages

  • A lot more packages are included
  • tlmgr is included, which allows incredibly easy installation of new packages from ctan (tlmgr install package-name)

Disadvantages

  • Not integrated into fedora’s package management
    • You’ll now have to manually update using
      tlmgr update --self
      tlmgr update --all
      rather than it simply working through yum (though there may be a way around this, I haven’t looked into it)
  • Uses a lot of space (something like 3-4GiB)