Reducing variance in conversion A/B testing with CUPED

Marton Trencseni - Sat 07 August 2021 - ab-testing

Introduction

In the previous post, Reducing variance in A/B testing with CUPED, I ran Monte Carlo simulations to demonstrate CUPED, a variance reduction technique for A/B testing, on continuous data such as $ spend per customer. Here I will repeat the same Monte Carlo simulations, but with binary 0/1 conversion data.

The experiment setup is almost entirely the same as in the previous post. The only difference is in the get_AB_samples() and get_AB_samples_nocorr() functions, since these now have to generate 0/1 conversion data. The Jupyter notebook for this post is up on GitHub.

Generating correlated conversion data

The approach is to assume a base conversion before_p, say 10%. For each user, do a random 0/1 draw with this probability. This "before" outcome is either 0 or 1. Then, generate the "after" data:

  • if the "before" outcome was 0, do a random 0/1 draw with conversion 0 + offset_p, say 10%. This way, most 0s remain 0s.
  • if the "before" outcome was 1, do a random 0/1 draw with conversion 1 - offset_p, say 90%. This way, most 1s remain 1s.
  • in the B variant, add treatment_lift (say 1%) to the above probabilities.

This is just one possible scheme, but it results in "before" and "after" data being correlated. In code:

from random import random

# draw a 0/1 outcome with probability p of being 1
def draw(p):
    if random() < p:
        return 1
    else:
        return 0

# map a list of probabilities to a list of 0/1 draws
def lmd(x):
    return list(map(draw, x))

def get_AB_samples(before_p, offset_p, treatment_lift, N):
    # "before" outcomes: independent draws at the base conversion rate
    A_before = lmd([before_p] * N)
    B_before = lmd([before_p] * N)
    # the "after" conversion probability depends on the "before" outcome,
    # which makes "before" and "after" correlated; B also gets the treatment lift
    A_after  = lmd([abs(x - offset_p)                  for x in A_before])
    B_after  = lmd([abs(x - offset_p) + treatment_lift for x in B_before])
    return A_before, B_before, A_after, B_after

We can validate by computing conditional probabilities:

from collections import defaultdict

# count joint and marginal outcomes to estimate the conditional probabilities
counts = defaultdict(int)
for b, a in zip(A_before, A_after):
    counts[(b, a)] += 1
    counts[b] += 1

print('P(after = 1 | before = 0) = %.2f' % (counts[(0, 1)]/counts[0]))
print('P(after = 0 | before = 0) = %.2f' % (counts[(0, 0)]/counts[0]))
print('P(after = 1 | before = 1) = %.2f' % (counts[(1, 1)]/counts[1]))
print('P(after = 0 | before = 1) = %.2f' % (counts[(1, 0)]/counts[1]))

Prints:

P(after = 1 | before = 0) = 0.10
P(after = 0 | before = 0) = 0.90 # if it was 0, it's likely to be 0 again
P(after = 1 | before = 1) = 0.90 # if it was 1, it's likely to be 1 again
P(after = 0 | before = 1) = 0.10

Simulating one experiment
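
The lift() and p_value() helpers carry over from the previous post. As a reminder, here is a minimal sketch of the two, assuming the lift is the simple difference of group means and the p-value comes from a two-sample t-test (the exact code is in the notebook):

from statistics import mean
from scipy import stats

def lift(A, B):
    # measured lift: difference of the group means
    return mean(B) - mean(A)

def p_value(A, B):
    # two-sided p-value from a two-sample t-test
    return stats.ttest_ind(A, B)[1]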

The driver code is, apart from parametrization, the same as before:

N = 10*1000
before_p = 0.1
offset_p = 0.1
treatment_lift = 0.01

A_before, B_before, A_after, B_after = get_AB_samples(before_p, offset_p, treatment_lift, N)
A_after_adjusted, B_after_adjusted = get_cuped_adjusted(A_before, B_before, A_after, B_after)

print('A mean before = %.3f, A mean after = %.3f, A mean after adjusted = %.3f' % (mean(A_before), mean(A_after), mean(A_after_adjusted)))
print('B mean before = %.3f, B mean after = %.3f, B mean after adjusted = %.3f' % (mean(B_before), mean(B_after), mean(B_after_adjusted)))
print('Traditional    A/B test evaluation, lift = %.3f, p-value = %.3f' % (lift(A_after, B_after), p_value(A_after, B_after)))
print('CUPED adjusted A/B test evaluation, lift = %.3f, p-value = %.3f' % (lift(A_after_adjusted, B_after_adjusted), p_value(A_after_adjusted, B_after_adjusted)))

Prints (not deterministic):

A mean before = 0.097, A mean after = 0.179, A mean after adjusted = 0.180
B mean before = 0.099, B mean after = 0.191, B mean after adjusted = 0.190
Traditional    A/B test evaluation, lift = 0.012, p-value = 0.028
CUPED adjusted A/B test evaluation, lift = 0.010, p-value = 0.021

In this particular run, CUPED approximated the true lift better and produced a lower p-value. However, as we saw before, this is not always the case. Although on average CUPED is a better estimator with lower variance, there are experiment runs where CUPED makes the lift measurement worse. For example, after a few re-runs:

A mean before = 0.098, A mean after = 0.178, A mean after adjusted = 0.180
B mean before = 0.103, B mean after = 0.189, B mean after adjusted = 0.187
Traditional    A/B test evaluation, lift = 0.011, p-value = 0.053
CUPED adjusted A/B test evaluation, lift = 0.006, p-value = 0.133

Next, it's interesting to print out the CUPED-transformed variables:

print('Possible mappings:')
mappings = set(['(before=%d, after=%d) -> adjusted=%.3f' % (b, a, p)
                for b, a, p in zip(A_before + B_before, A_after + B_after, A_after_adjusted + B_after_adjusted)])
for m in mappings:
    print(m)

Prints:

Possible mappings:
(before=1, after=0) -> adjusted=-0.722
(before=0, after=1) -> adjusted=1.081
(before=0, after=0) -> adjusted=0.081
(before=1, after=1) -> adjusted=0.278

Remember the CUPED transformation equation was $ Y'_i := Y_i - (X_i - \mu_X) \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} $. In this equation $\mu_X$, $\mathrm{cov}(X, Y)$, and $\mathrm{var}(X)$ are constants computed from the experiment results. Both $X_i$ and $Y_i$ can take on two values, 0 or 1, so there are 4 possible combinations, hence $Y'_i$ can take on 4 values. In the above run, these values were -0.722, 1.081, 0.081 and 0.278. This is a bit counter-intuitive: the conversion experiment now has these strange, even negative, values instead of just 0 and 1.
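
For reference, here is a minimal sketch of what the get_cuped_adjusted() helper computes, following this formula; the exact implementation is in the notebook, and estimating $\theta = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$ from the pooled A and B samples is an assumption of this sketch:

import numpy as np

def get_cuped_adjusted(A_before, B_before, A_after, B_after):
    # theta = cov(X, Y) / var(X), estimated on the pooled A and B samples
    before = np.array(A_before + B_before)
    after  = np.array(A_after + B_after)
    theta = np.cov(after, before)[0, 1] / np.var(before, ddof=1)
    mean_before = np.mean(before)
    # Y'_i = Y_i - (X_i - mu_X) * theta
    A_after_adjusted = [y - (x - mean_before) * theta for x, y in zip(A_before, A_after)]
    B_after_adjusted = [y - (x - mean_before) * theta for x, y in zip(B_before, B_after)]
    return A_after_adjusted, B_after_adjusted

The four adjusted values above are consistent with this: in that run $\mu_X \approx 0.10$ and $\theta \approx 0.80$, so for example (before=1, after=0) maps to $0 - (1 - 0.10) \cdot 0.80 \approx -0.72$, and (before=0, after=0) maps to $0 - (0 - 0.10) \cdot 0.80 \approx 0.08$.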

Simulating many experiments
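
The outer simulation loop is unchanged from the previous post; a minimal sketch of it, assuming the helpers defined above (the actual driver, including the histogram plotting, is in the notebook):

from statistics import mean, variance

num_experiments = 10 * 1000
lifts_traditional, lifts_cuped = [], []
for _ in range(num_experiments):
    A_before, B_before, A_after, B_after = get_AB_samples(before_p, offset_p, treatment_lift, N)
    A_after_adjusted, B_after_adjusted = get_cuped_adjusted(A_before, B_before, A_after, B_after)
    lifts_traditional.append(lift(A_after, B_after))
    lifts_cuped.append(lift(A_after_adjusted, B_after_adjusted))

print('Traditional    A/B testing, mean lift = %.3f, variance of lift = %.5f'
      % (mean(lifts_traditional), variance(lifts_traditional)))
print('CUPED adjusted A/B testing, mean lift = %.3f, variance of lift = %.5f'
      % (mean(lifts_cuped), variance(lifts_cuped)))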

As with continuous variables, CUPED measures the same lift (here, the lift in conversion), but with lower variance:

Simulating 10,000 A/B tests, true treatment lift is 0.010...
Traditional    A/B testing, mean lift = 0.010, variance of lift = 0.00030
CUPED adjusted A/B testing, mean lift = 0.010, variance of lift = 0.00019
CUPED lift variance / traditional lift variance = 0.626 (expected = 0.668)

We can observe the tighter lifts on a histogram:

[Figure: histogram of measured lifts, traditional vs. CUPED-adjusted]

As with continuous variables, the p-values decrease:

[Figure: histogram of p-values, traditional vs. CUPED-adjusted]

As illustrated above, CUPED lifts and p-values are a better estimate with respect to variance, but not in all cases:

[Figure: traditional vs. CUPED lift, per experiment]

[Figure: traditional vs. CUPED p-value, per experiment]

No correlation

The simplest way to generate uncorrelated conversion data is to make the "after" draws independent of the "before" outcomes. In code:

def get_AB_samples_nocorr(before_p, treatment_lift, N):
    # the "after" conversion probability does not depend on the "before"
    # outcome, so "before" and "after" are independent; B gets the treatment lift
    A_before = lmd([before_p] * N)
    B_before = lmd([before_p] * N)
    A_after  = lmd([before_p] * N)
    B_after  = lmd([before_p + treatment_lift] * N)
    return A_before, B_before, A_after, B_after

Checking conditional probabilities:

P(after = 1 | before = 0) = 0.10
P(after = 0 | before = 0) = 0.90
P(after = 1 | before = 1) = 0.10
P(after = 0 | before = 1) = 0.90

It's uncorrelated because P(after = 1 | before = 0) = P(after = 1 | before = 1) = 0.10, so "after" is independent of "before" (and similarly for P(after = 0)).
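
As a quick numeric check (a sketch, not from the notebook), the sample correlation between "before" and "after" should come out close to zero:

import numpy as np

A_before, B_before, A_after, B_after = get_AB_samples_nocorr(before_p, treatment_lift, N)
# with independent draws, the sample correlation is close to 0
print('corr(before, after) = %.3f' % np.corrcoef(A_before + B_before, A_after + B_after)[0, 1])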

With this generator function, running num_experiments=10,000, we can observe no variance reduction (since "before" and "after" are uncorrelated) on the lift and p-value histograms:

[Figure: histogram of measured lifts, no-correlation case]

[Figure: histogram of p-values, no-correlation case]

Conclusion

CUPED works for both continuous and binary experiment outcomes, and reduces variance as long as the "before" and "after" data are correlated.