Dogs and the central limit theorem

At school/university I had vaguely heard of the Central Limit Theorem (CLT) but never properly understood it or looked it up. My understanding was pretty much along the lines of ‘most measurements are normally distributed’.

This isn’t quite right (although, as we’ll see, not totally wrong). The CLT actually states that sample means taken from a distribution of measurements are approximately normally distributed, irrespective of the shape of the full distribution of all the measurements (as long as it has a finite variance). So if we repeatedly take large enough batches of samples of a measurement and calculate their average, those averages will be normally distributed, with the same mean as the full distribution and a variance equal to the full distribution’s variance divided by the batch size.
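For reference, here is the usual textbook form of the theorem for independent, identically distributed draws. The symbols are notation introduced here rather than taken from the post: n is the batch size, X_i are the individual measurements, and μ and σ² are the mean and variance of the full distribution.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \xrightarrow{\;d\;} \mathcal{N}\!\left(0,\ \sigma^2\right)
\quad \text{as } n \to \infty ,
\qquad \text{so for large } n:\quad
\bar{X}_n \approx \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right).
```

The σ²/n term is what makes the averages more tightly clustered than the individual measurements.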

So why is this useful? Are there any good examples?

Earlier this year I was taken dog sledding, and as I watched our sleds separate out through the beautiful Swedish forest scenery I naturally started thinking about the CLT.

The setup for the sledding was that, from a pack of around 40 dogs (Siberian Huskies), groups of six were selected[1] to pull each sled. The dogs have different power/speed depending on their size, sex, age, how much they have run recently, and their general temperament that day. Here’s a hypothetical distribution for the power of the dogs, and some example draws for the sled I was on:

Non-normally distributed dogs
Hypothetical power distribution for huskies, and some example draws from that distribution. (All draws are good boys and girls, but not all good boys and girls were draws.)

Each sled traveled at a different speed due to the average[2] of the power μ of the selected dogs. The distribution of these means, here the distribution of the sled speeds, is what the CLT can tell us about.

Normally distributed sleds
Each sled has six dogs, with a speed determined by the mean of the selected dogs. The sled speeds are normally distributed, due to the CLT.

As these are means of draws from an underlying distribution, their distribution must be normal[3]. It doesn’t matter what shape the dog power distribution actually is: the distribution of the sleds is always normal, with a variance equal to the variance of the underlying distribution divided by the number of dogs per sled.

CLT relating dogs to sleds
Whatever shape the power distribution of the dogs, the distribution of sled speeds is always normal.

Practically, what does this mean?[4] Well, we can always say something about the sled speeds due to properties of normal distributions. The speeds will be spread symmetrically, with thin tails. So no skew of more fast ones out front than slower ones in the rear, no bimodal groups, no ‘fat tails’ of very fast or very slow sleds. We can also expect the spread of the sleds to be narrower than the spread of the dogs: with six dogs per sled, the standard deviation of the sled speeds is that of the dogs divided by √6. All very useful, I’m sure you’ll agree!
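To make this concrete, here is a minimal simulation sketch using only the Python standard library. The bimodal `dog_power` distribution, its parameters, and the number of simulated sleds are all made up for illustration; only the six dogs per sled comes from the setup above. It compares the spread and skew of individual dogs against six-dog sled averages.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

DOGS_PER_SLED = 6   # from the post
N_SLEDS = 10_000    # hypothetical: far more sleds than any real trip

def dog_power():
    """Hypothetical bimodal power: a steadier group and a stronger group."""
    if random.random() < 0.6:
        return random.gauss(1.0, 0.15)
    return random.gauss(2.0, 0.25)

def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Draw every dog independently, then average consecutive groups of six.
dogs = [dog_power() for _ in range(N_SLEDS * DOGS_PER_SLED)]
sleds = [
    statistics.fmean(dogs[i:i + DOGS_PER_SLED])
    for i in range(0, len(dogs), DOGS_PER_SLED)
]

dog_sd = statistics.pstdev(dogs)
sled_sd = statistics.pstdev(sleds)

print(f"dogs : sd = {dog_sd:.3f}, skew = {skewness(dogs):+.2f}")
print(f"sleds: sd = {sled_sd:.3f}, skew = {skewness(sleds):+.2f}")
print(f"sd ratio = {sled_sd / dog_sd:.3f}, 1/sqrt(6) = {DOGS_PER_SLED ** -0.5:.3f}")
```

The sled standard deviation should come out close to the dog standard deviation divided by √6, and the sled skew should be noticeably closer to zero than the dog skew. With only six dogs per sled it does not vanish entirely, so the normality is approximate.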

This might all seem quite specific, but we can apply the CLT to understand why normal distributions are so ubiquitous in quantitative trait genetics.

For polygenic traits such as height, there are many different underlying genetic effects. Each of these genetic variants contributes an individually small effect, and when summed across the genome they give the overall phenotype for the individual, which results in measurable variation.

So we have genetic effect sizes (with some unknown distribution), and each person’s genome is effectively a sample of these variants[5]. So the sums (or equivalently the means) of these samples, e.g. heights of people, will be normally distributed.
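As a rough illustration, the sketch below simulates a purely additive trait under strong simplifying assumptions: independent variants, 0 to 2 copies per person, and no environment or population structure. The number of variants, the skewed (exponential) effect sizes, and the allele frequencies are all invented for the example. It then checks how much of the simulated population falls within one and two standard deviations of the mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

N_VARIANTS = 200   # hypothetical number of contributing variants
N_PEOPLE = 2_000   # hypothetical population size

# Hypothetical, deliberately skewed effect sizes and per-variant allele frequencies.
effects = [random.expovariate(1.0) for _ in range(N_VARIANTS)]
freqs = [random.uniform(0.1, 0.9) for _ in range(N_VARIANTS)]

def trait():
    """Sum the effects of the variant copies (0, 1 or 2) one simulated person carries."""
    total = 0.0
    for effect, freq in zip(effects, freqs):
        copies = (random.random() < freq) + (random.random() < freq)
        total += copies * effect
    return total

traits = [trait() for _ in range(N_PEOPLE)]
mean = statistics.fmean(traits)
sd = statistics.pstdev(traits)

def within(k):
    """Fraction of simulated people within k standard deviations of the mean."""
    return sum(abs(t - mean) <= k * sd for t in traits) / len(traits)

print(f"within 1 sd: {within(1):.3f} (normal: 0.683)")
print(f"within 2 sd: {within(2):.3f} (normal: 0.954)")
```

Even though the effect sizes are far from normal, the fractions should land near the values expected for a normal distribution, because each simulated trait is a sum of many small independent contributions.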

Another example of a polygenic quantitative trait would be the power of a Siberian Husky[6]. So dogs probably have a normally distributed power anyway! (Of course, still leading to a normally distributed set of sled speeds.)

3blue1brown has a couple of good videos on the CLT.


  1. Not random samples, as compatible dogs are chosen for each pair and only some dogs are suitable leaders. This is a deviation from the CLT’s assumptions, but let’s pretend it’s roughly random, which isn’t too far off as there’s no strong selection on power.

  2. Sum, but this is equivalent to the mean (up to a constant factor) because all sleds had the same number of dogs N.

  3. Strictly, this only holds in the limit of infinitely large samples, and here we only have six dogs per sled, but what’s a difference of infinity between friends.

  4. We’re probably more interested in the spread of distances after some time. The sleds go at a pretty constant speed so (I think) these are still normal, but with the standard deviation scaling in proportion to time (and so the variance with time squared).

  5. Ignoring any complexity in populations. Also see https://www.jstor.org/stable/3087302

  6. I reckon