Contents

Simulate disease state from a given odds ratio, fraction exposed, fraction affected

Contents

Problem: I have genetic data at a single variant site, where the minor allele frequency (MAF) is set, and the prevalence of disease is known (Sr). The variant is truly associated with the phenotype, at an odds ratio (OR) I want to set. How do I simulate the phenotypes given these three parameters, and whether each sample has the variant (exposed) or not?

This is analogous to simulating data in a 2x2 contigency table, as discussed in stack exchange here: http://stats.stackexchange.com/questions/13193/generating-data-with-a-pre-specified-odds-ratio

Solution: As alluded to in one of the comments of the stack exchange post, but unfortunately not derived or written out in a formatted way, the number of disease cases is a quadratic equation which can be rewritten in terms of the above parameters. I found the derivation from the table of p11 (or De, number of disease cases that have the variant) in a book appendix.

This requires a little rewriting to get in terms of these parameters, but can be written down as:

/images/screen-shot-2015-10-09-at-17-18-13.png?w=300

I have implemented this in c++ as a function returning a tuple

std::tuple<double, double> p_case(const double OR, const double MAF, const double Sr) {
   // Convenient defines
   const double m1 = Sr/(Sr + 1);

   // Quadratic equation
   const double a = OR - 1;
   const double b = -((m1 + MAF)*OR - m1 + 1 - MAF);
   const double c = OR*m1*MAF;
   double a1 = (-b - pow(pow(b, 2) - 4*a*c, 0.5)) / (2*a);

   // Probabilities to return
   double p_e = a1 / MAF;
   double p_ne = (m1 - a1)/(1 - MAF);

   return std::make_tuple(p_e, p_ne);
}

The whole code for the simulation can be found in subsample_seer.cpp here: https://github.com/johnlees/bioinformatics/tree/master/seer