Impute your whole genome from 23andme data

2014-03-18

23andme is a service which types 602352 sites on your chromosomal DNA and your mtDNA. By comparing these to a reference panel in which all sites have been typed, it is possible to impute (statistically fill in) the missing sites and thus get an estimate of your whole genome.
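To get a feel for what the typed sites look like, the sketch below counts them per chromosome in a downloaded raw data file. It assumes the usual 23andme layout of tab-separated columns (rsid, chromosome, position, genotype) with comment lines starting with '#'; check the header of your own file, as the layout is an assumption here. The function name is made up for illustration.

```shell
# Count typed sites per chromosome in a 23andme raw data file.
# ASSUMPTION: tab-separated columns rsid, chromosome, position, genotype,
# with header/comment lines starting with '#'.
count_sites_per_chromosome() {
    awk -F '\t' '
        !/^#/ && NF >= 4 { n[$2]++ }                  # skip comments, tally by chromosome
        END { for (c in n) print c "\t" n[c] }
    ' "$1" | LC_ALL=C sort                            # fixed sort order across locales
}
```

Called with the path to your raw data file, it prints one line per chromosome (including MT for the mtDNA) followed by the number of typed sites.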

The impute2 software, written by B. N. Howie, P. Donnelly, and J. Marchini, gives good accuracy when using the 1000 Genomes Project as a reference. However, it can be difficult to provide the data in the right input format, choose the correct options and interpret the output.

EDIT: As pointed out by lassefolkersen in the comments, this has now been nicely implemented at impute.me

I have written a tool to allow people with a small amount of computational experience (but not necessarily any biological/bioinformatics knowledge) to run impute2 on their 23andme data and get their whole genome as output. It can be found on my github: https://github.com/johnlees/23andme-impute

To use this tool, you will need to do the following steps:

Download your ‘raw data’ from the 23andme site. This is a file named something like genome_name_full
Download the impute2 software from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download and follow their instructions to install it
Put impute2 on the path, i.e. run (with the correct path for where you extracted impute2): echo 'export PATH=$PATH:/path/to/impute2' >> ~/.bashrc
Download the 1000 Genomes reference data, which can be found on the impute2 website here: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
Extract this data by running: gunzip ALL_1000G_phase1integrated_v3_impute.tgz and then: tar xf ALL_1000G_phase1integrated_v3_impute.tar (you will then probably want to delete the original, unextracted archive file as it is quite large)
Download my code by running: git clone https://github.com/johnlees/23andme-impute
Run ./impute_genome.pl to impute your whole genome!
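
Before starting a long run, it may be worth checking that the pieces from the steps above are in place. The following is only a sketch: check_setup is a hypothetical helper (not part of 23andme-impute), and the two paths you pass it are whatever locations you chose for the extracted reference data and your raw data file.

```shell
# Hypothetical pre-flight check; not part of impute2 or 23andme-impute.
# $1: directory holding the extracted 1000 Genomes reference data
# $2: 23andme raw data file
check_setup() {
    ref_dir=$1
    raw_data=$2
    status=0
    # impute2 must be findable through PATH (open a new shell after editing ~/.bashrc)
    command -v impute2 >/dev/null 2>&1 || { echo "impute2 not found on PATH" >&2; status=1; }
    [ -d "$ref_dir" ] || { echo "reference directory missing: $ref_dir" >&2; status=1; }
    [ -r "$raw_data" ] || { echo "raw data file not readable: $raw_data" >&2; status=1; }
    return "$status"
}
```

It prints a message for each missing piece and returns a non-zero status if anything is wrong, so it can be chained in front of other commands with &&.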

The options required as input for impute_genome.pl should be reasonably straightforward; run it with -h to see them, or look at the README.md on github.

As the analysis will take a lot of resources, I recommend against using the run command. I think --print or --write will be best for most people; you can then run each job one at a time, or in parallel if you have access to a cluster.
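
As a rough sketch of running the jobs yourself: assuming you end up with a plain text file containing one complete command per line (an assumption, so check what the tool actually produces and adapt accordingly), something like the following runs them with a cap on how many execute at once. The function name run_jobs and the file name used below are hypothetical.

```shell
# Run the commands listed in a file, at most $2 at a time (default: 1).
# ASSUMPTION: the job file holds one complete shell command per line.
# Uses xargs -0 and -P, which GNU and BSD xargs both provide but POSIX does not.
run_jobs() {
    job_file=$1
    max_parallel=${2:-1}
    [ -s "$job_file" ] || { echo "no jobs found in $job_file" >&2; return 1; }
    # Drop blank lines, then hand each remaining line to its own 'sh -c'.
    # NUL separators stop xargs from reinterpreting quotes inside a command.
    grep -v '^[[:space:]]*$' "$job_file" |
        tr '\n' '\0' |
        xargs -0 -n 1 -P "$max_parallel" sh -c
}
```

For example, run_jobs impute_jobs.txt runs the jobs one after another, and run_jobs impute_jobs.txt 4 runs up to four at a time. Since each job takes a lot of resources, a small number is probably the sensible choice on a single machine.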

If you have any problems with this, please leave a message in the comments and I’ll try my best to get back to you.