Building intuition for p-values and statistical significance

Marton Trencseni - Sun 25 April 2021 - Data

Introduction

This is the transcript of a talk I did on experimentation and A/B testing to give the audience an intuitive understanding of p-values and statistical significance.

The format of the talk was a short introduction, followed by live-coding an ipython notebook. I typed out the initial, short snippets live to keep it interesting; for the later parts I switched to an existing notebook to keep the talk's momentum going.

I opened the talk by saying that in experimentation, Data Scientists have 3 jobs: to (1) design, (2) run, and (3) evaluate experiments. This talk is about the evaluation phase, and can be summed up as "don’t get fooled by randomness".

Coin flips

Let's build our intuitive understanding of p-values and statistical significance by considering coin flips. Whenever I think about A/B testing, in my head I'm secretly thinking about coin flips.

Coin toss

Suppose somebody gives us a coin, and we suspect it's biased. What can we do? As experimentalists, we start flipping it and recording the outcomes.

Suppose we flip this coin 10 times, and we get 7 heads. What can we say about the coin?

To see how usual or unusual this is, let's conduct many coin-flipping experiments and count the outcomes. However, instead of doing it for real, let's use the computer's built-in random number generator:

from random import random

random()

> 0.6316696366339524

So if we call random(), it returns a random float number between 0 and 1. Using this, we can simulate a fair coin:

# let's simulate fair coin flips
def coinflip():
    if random() < 0.5:
        return 'H'
    else:
        return 'T'

Let's try it. If we call it repeatedly, it returns H or T, fifty-fifty:

coinflip()

> 'H'

Okay, now let's change one little thing. Let's agree to use 1 instead of H and 0 instead of T:

# let's simulate fair coin flips
def coinflip():
    if random() < 0.5:
        return 1
    else:
        return 0

This will make our life easier in subsequent steps. Now, let's flip our virtual coin 10 times:

# 10 coinflips
[coinflip() for _ in range(10)]

> [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

Since we're using 1 for heads, we can just add up the results to get how many heads we had:

sum([coinflip() for _ in range(10)])

> 5

Let's create a shorthand for this:

def coinflips(num_flips=10):
    return sum([coinflip() for _ in range(num_flips)])

coinflips(10)

> 5

Okay, now it gets interesting. Let's conduct this 10-flip experiment 10 times:

[coinflips(10) for _ in range(10)]

> [6, 5, 2, 5, 6, 4, 4, 3, 6, 5]

So, the first time we got 6 heads out of 10, the second time 5, then just 2 out of 10, and so on.

Presentation note: this is where I switch from typing to running an existing notebook.

Let's count how many times we get 1 head out of 10, 2 heads out of 10, 3 heads out of 10, and so on, if we repeat the 10-flip experiment many times:

from collections import defaultdict

NUM_FLIPS = 10
NUM_EXPERIMENTS = 10
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we flipped {} heads {} times'.format(
        NUM_EXPERIMENTS, i, count[i]))

> Out of 10 experiments, we flipped 0 heads  0 times
> Out of 10 experiments, we flipped 1 heads  0 times
> Out of 10 experiments, we flipped 2 heads  1 times
> Out of 10 experiments, we flipped 3 heads  2 times
> Out of 10 experiments, we flipped 4 heads  3 times
> Out of 10 experiments, we flipped 5 heads  2 times
> Out of 10 experiments, we flipped 6 heads  1 times
> Out of 10 experiments, we flipped 7 heads  0 times
> Out of 10 experiments, we flipped 8 heads  1 times
> Out of 10 experiments, we flipped 9 heads  0 times
> Out of 10 experiments, we flipped 10 heads 0 times

Let's increase the number of experiments to get more counts and more representative statistics. Since we're using our computer's random number generator, we can do it a million times in about a second:

NUM_FLIPS = 10
NUM_EXPERIMENTS = 1000*1000  # <--- a million experiments instead of 10
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we flipped {} heads {} times'.format(
        NUM_EXPERIMENTS, i, count[i]))

> Out of 1000000 experiments, we flipped 0 heads  1004 times
> Out of 1000000 experiments, we flipped 1 heads  9719 times
> Out of 1000000 experiments, we flipped 2 heads  44091 times
> Out of 1000000 experiments, we flipped 3 heads  116981 times
> Out of 1000000 experiments, we flipped 4 heads  206072 times
> Out of 1000000 experiments, we flipped 5 heads  245371 times
> Out of 1000000 experiments, we flipped 6 heads  205059 times
> Out of 1000000 experiments, we flipped 7 heads  117282 times  <---
> Out of 1000000 experiments, we flipped 8 heads  43764 times
> Out of 1000000 experiments, we flipped 9 heads  9675 times
> Out of 1000000 experiments, we flipped 10 heads 982 times

Going back to our initial question, what can we say about a coin if we get 7 heads out of 10? From the above, we can see that if we repeat this 10-flip experiment with a fair coin a million times, we actually get 7 heads out of 10 about 117,000 times. Let's make one more change: instead of showing raw counts, let's divide them by NUM_EXPERIMENTS = 1000*1000 to get percentages:

NUM_FLIPS = 10
NUM_EXPERIMENTS = 1000*1000
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we flipped {} heads {:.2f}% of the time'.format(
        NUM_EXPERIMENTS, i, 100*count[i]/NUM_EXPERIMENTS))

> Out of 1000000 experiments, we flipped 0 heads  0.10% of the time
> Out of 1000000 experiments, we flipped 1 heads  0.99% of the time
> Out of 1000000 experiments, we flipped 2 heads  4.40% of the time
> Out of 1000000 experiments, we flipped 3 heads  11.71% of the time
> Out of 1000000 experiments, we flipped 4 heads  20.53% of the time
> Out of 1000000 experiments, we flipped 5 heads  24.59% of the time
> Out of 1000000 experiments, we flipped 6 heads  20.51% of the time
> Out of 1000000 experiments, we flipped 7 heads  11.70% of the time  <---
> Out of 1000000 experiments, we flipped 8 heads  4.39% of the time
> Out of 1000000 experiments, we flipped 9 heads  0.97% of the time
> Out of 1000000 experiments, we flipped 10 heads 0.10% of the time

Just as with the counts, we see that if we repeat this 10-flip experiment with a fair coin many times, we actually get 7 heads out of 10 about 11.7% of the time!

What is the p-value

  1. Given an experimental outcome: "we flip 7 heads out of 10", assume the "boring", "non-action" case (the statistical term is "null hypothesis") = "the coin is fair, it's not biased".
  2. Compute the probability of the experimental outcome (7 heads) in the "boring" case (= 11.70%)
  3. .. and also add to it the probabilities of even-more-extreme outcomes (8, 9, 10 heads, which adds 4.39% + 0.97% + 0.10%) = 17.16%

$ p \approx 0.17 $
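
Expressed in code, here is a minimal sketch reusing the count dictionary and constants from the million-experiment, 10-flip simulation above (the exact value fluctuates slightly from run to run):

# p-value: fraction of simulated fair-coin experiments with 7 or more heads out of 10
p_value = sum(count[i] for i in range(7, NUM_FLIPS + 1)) / NUM_EXPERIMENTS
print('p = {:.4f}'.format(p_value))  # roughly 0.17, per the percentages above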

In summary: in real life, we don't know if the coin is fair or not. We just know we flipped 7 heads out of 10. What we know from the above is, if we assume it's fair, we would see this outcome (7 heads), or more extreme outcomes (8, 9, 10 heads out of 10), a combined 17% of the time. That is not very unlikely, about 1 in 6 odds.

Comment: I've skipped 1-sided vs 2-sided testing, as it's not critical for building the core intuition.

What is statistical significance

Is the above statistically significant? What does "statistically significant" mean?

Statistical significance is a way to make a decision based on an experiment. It's a pre-agreed, set-in-protocol threshold $p_{critical}$ for the p-value: if $p < p_{critical}$, we declare the result "statistically significant", reject the "boring" null hypothesis, and accept the alternative, "action" hypothesis.

In our example, the "boring" hypothesis is that the "coin is fair"; rejecting it would mean we conclude the "coin is biased". Since we calculated $p \approx 0.17$, assuming we are working with $p_{critical} = 0.05$, we would not reject the "boring" null hypothesis. We would not conclude that the coin is biased.

So statistical significance is just a human agreement for decision-making. Usually $p_{critical} = 0.05$ or 5% or 1 in 20 odds, but this is completely arbitrary. There is nothing special or "right" about 5%. We can agree that we will use 1% or 10% or 20% in our experiments.

From the above we can see that for 10 coin flips, if our $p_{critical}$ is 5%, we'd need to get 0, 1, 9, or 10 heads to conclude that the coin is biased.
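
As a quick check of that claim, here is a sketch that reuses the count dictionary from the million-run, 10-flip simulation above and computes the tail probability in the direction the outcome deviates from the expected 5 heads (the same one-sided calculation we did for 7 heads):

P_CRITICAL = 0.05

def tail_p_value(k):
    # tail probability in the direction k deviates from the expected 5 heads
    if k >= NUM_FLIPS / 2:
        tail = range(k, NUM_FLIPS + 1)
    else:
        tail = range(0, k + 1)
    return sum(count[i] for i in tail) / NUM_EXPERIMENTS

for k in range(NUM_FLIPS + 1):
    p = tail_p_value(k)
    verdict = 'statistically significant' if p < P_CRITICAL else 'not significant'
    print('{:2d} heads: p = {:.3f} -> {}'.format(k, p, verdict))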

Notes:

  1. At this point I explained that if we pick $p_{critical}$ = 5%, that means that in cases where the null hypothesis is actually true (the coin really is fair), we will wrongly reject it on average 1 in 20 times. Depending on the audience, this may be too much and confuse them.
  2. When I presented I had a question about how to pick $p_{critical}$. Here's a dedicated blog post I wrote about how to pick $p_{critical}$ and how to balance experimentation velocity vs being sure: A/B tests: Moving Fast vs Being Sure.

Sample size

We saw that getting 7 heads out of 10 flips, so 70% heads, is not that unusual. Is 70 heads out of 100 flips also not unusual? Let's check:

NUM_FLIPS = 100  # <--- 100 flips instead of 10
NUM_EXPERIMENTS = 100*1000
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we flipped {} heads {:.3f}% of the time'.format(
        NUM_EXPERIMENTS, i, 100*count[i]/NUM_EXPERIMENTS))

> ...
> Out of 100000 experiments, we flipped 70 heads 0.001% of the time
> ...

What we see here is that 70% only happens 0.001% of the time, very rarely! And if we add the percentages for 71, 72, .. 100 heads, it's still a very small number. So with 100 coin flips, a 70% heads (or more) outcome yields a much, much smaller p-value than the 17% we had for 10 coin flips.
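
To put a rough number on it, here is a sketch that sums the simulated tail from the 100-flip run above; note that so few of the 100,000 experiments land this far out that the estimate is noisy, and many more simulated experiments would be needed for a precise value:

# p-value estimate: fraction of simulated fair-coin experiments with 70 or more heads out of 100
p_value = sum(count[i] for i in range(70, NUM_FLIPS + 1)) / NUM_EXPERIMENTS
print('p = {:.5f}'.format(p_value))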

This is the effect of sample size! N=100 coin flips is a lot more than N=10 coin flips. Telling apart a biased and a fair coin is much easier if you flip more and more. If we flip a coin 100 times and get 70 heads, then it is very unlikely that this coin is fair.

Note: the notebook also plots the bell curves, but I don't think it helped the audience.

A/B testing

Okay, now that we have built our intuition with coin flips, let's talk about a marketing A/B test. Suppose we have a control (A) and a treatment (B) version of an email. Maybe A is text-only, while B also has some colorful images. Let's say we send it out to N=1000 customers each. We wait one week, look at the logs, and the measurement outcome is that A converts at 10% and B converts at 12%.

What can we say? It's actually the same thing as coin flips! Except here, the "fair" base coin is not a 50-50 coin, it's the 10% conversion we see in our control group A.

Note: here I'm decreasing rigor to make the explanation flow better. I gloss over the fact that the control's 10% conversion is also a sampled result, so it's not the same as the hypothetical fair coin, which has no uncertainty.

Let's build a "control (A)" coin:

def coinflip():
    if random() < 0.10:  # <--- 0.10 instead of 0.50
        return 1
    else:
        return 0

coinflips(100)

> 11

Now, let's see our experiment:

NUM_FLIPS = 1000
NUM_EXPERIMENTS = 10*1000
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we flipped {} heads {:.2f}% of the time'.format(
        NUM_EXPERIMENTS, i, 100*count[i]/NUM_EXPERIMENTS))

> ...
> Out of 10000 experiments, we flipped 120 heads 0.52% of the time
> Out of 10000 experiments, we flipped 121 heads 0.38% of the time
> Out of 10000 experiments, we flipped 122 heads 0.26% of the time
> Out of 10000 experiments, we flipped 123 heads 0.18% of the time
> Out of 10000 experiments, we flipped 124 heads 0.20% of the time
> Out of 10000 experiments, we flipped 125 heads 0.11% of the time
> Out of 10000 experiments, we flipped 126 heads 0.13% of the time
> Out of 10000 experiments, we flipped 127 heads 0.11% of the time
> Out of 10000 experiments, we flipped 128 heads 0.09% of the time
> Out of 10000 experiments, we flipped 129 heads 0.04% of the time
> Out of 10000 experiments, we flipped 130 heads 0.03% of the time
> ...

Let's change one thing: instead of saying "120 heads", let's change our vocabulary, divide by 1000, and talk about conversion, like 120/1000 = 12%:

NUM_FLIPS = 1000
NUM_EXPERIMENTS = 10*1000
outcomes = [coinflips(NUM_FLIPS) for _ in range(NUM_EXPERIMENTS)]
count = defaultdict(lambda: 0)
for outcome in outcomes:
    count[outcome] += 1
for i in range(NUM_FLIPS+1):
    print('Out of {} experiments, we had {:.2f}% conversion {:.2f}% of the time'.format(
        NUM_EXPERIMENTS, 100*i/NUM_FLIPS, 100*count[i]/NUM_EXPERIMENTS))

> ...
> Out of 10000 experiments, we had 12.00% conversion 0.48% of the time
> Out of 10000 experiments, we had 12.10% conversion 0.27% of the time
> Out of 10000 experiments, we had 12.20% conversion 0.41% of the time
> Out of 10000 experiments, we had 12.30% conversion 0.22% of the time
> Out of 10000 experiments, we had 12.40% conversion 0.18% of the time
> Out of 10000 experiments, we had 12.50% conversion 0.14% of the time
> Out of 10000 experiments, we had 12.60% conversion 0.11% of the time
> Out of 10000 experiments, we had 12.70% conversion 0.08% of the time
> Out of 10000 experiments, we had 12.80% conversion 0.07% of the time
> Out of 10000 experiments, we had 12.90% conversion 0.04% of the time
> Out of 10000 experiments, we had 13.00% conversion 0.03% of the time
> ...

Let's add up these "12% or more conversion cases":

print('p_value = ' + str(100*sum([count[i] for i in range(int(NUM_FLIPS*0.12), NUM_FLIPS+1)])/NUM_EXPERIMENTS) + '%')

> p_value = 2.13%

So if we assume that treatment's (B's) true, intrinsic conversion is also 10%, just like control's (A's) (= this is the "boring" null hypothesis), then we'd expect to see 12% or more conversion, just by chance, 2.13% of the time. If we were using $p_{critical} = 5\%$, we'd conclude that this is "statistically significant": we'd reject the "boring" hypothesis that A and B are the same, and we'd go with the "action" hypothesis. In the future, we will send the B variant to all users --- hence the name "action" hypothesis: it means we take some action.
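
As a sanity check on the Monte Carlo estimate (not part of the talk, and just a sketch assuming scipy is available), the same tail probability can also be computed from the binomial distribution directly:

from scipy.stats import binom

# P(X >= 120) for X ~ Binomial(n=1000, p=0.10); sf(k) is P(X > k), so pass k=119
p_value = binom.sf(119, 1000, 0.10)
print('p_value = {:.2f}%'.format(100 * p_value))  # in the ballpark of the simulated 2.13%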

Conclusion

I close the talk by briefly discussing the importance of having a culture of experimentation: the most important thing is to actually run experiments, repeatedly and rigorously. I make the point that this is more important than statistical rigor itself. First, run a lot of experiments; then worry about p-values. I quote Jeff Bezos:

“Our success at Amazon is a function of how many experiments we run per year, per month, per week, per day.” - Jeff Bezos

A really good 5-minute pitch on experimentation culture is this YouTube video. Depending on how much time is left, some points can be lifted from it, like:

  • first, organizations have to learn how to experiment
  • most experiments don't yield wins
  • eventual big wins pay for the many losers

Note: stressing experimentation culture may or may not be a good investment of time in a talk. For example, in big established companies, culture is usually top-down, so it only makes sense to spend significant time on this if the audience includes top management.

Happy experimenting!