Reducing variance in conversion A/B testing with CUPED
Marton Trencseni - Sat 07 August 2021 - ab-testing
Introduction
In the previous post, Reducing variance in A/B testing with CUPED, I ran Monte Carlo simulations to demonstrate CUPED, a variance reduction technique in A/B testing on continuous data, like $ spend per customer. Here I will repeat the same Monte Carlo simulations, but with binary 0/1 conversion data.
The experiment setup is almost entirely the same as in the last post. The only difference is in the get_AB_samples() and get_AB_samples_nocorr() functions, since these now have to generate 0/1 conversion data. The Jupyter notebook for this post is up on GitHub.
Generating correlated conversion data
The approach is to assume a base conversion before_p
, say 10%. For each user, do a random 0/1 draw with this probability. This "before" outcome is either 0 or 1. Then, generate the "after" data:
- if the "before" outcome was 0, do a random 0/1 draw with conversion
0 + offset_p
, say 10%. This way, most 0s remain 0s. - if the "before" outcome was 1, do a random 0/1 draw with conversion
1 - offset_p
, say 90%. This way, most 1s remain 1s. - in the B variant, add
treatment_lift
(say 1%) to the above probabilities.
This is just one possible scheme, but it results in "before" and "after" data being correlated. In code:
from random import random

def draw(p):
    # Bernoulli draw: 1 with probability p, 0 otherwise
    if random() < p:
        return 1
    else:
        return 0

def lmd(x):
    # map a list of probabilities to a list of 0/1 draws
    return list(map(draw, x))

def get_AB_samples(before_p, offset_p, treatment_lift, N):
    # "before" conversions: each user converts with probability before_p
    A_before = lmd([before_p] * N)
    B_before = lmd([before_p] * N)
    # "after" conversion probability is offset_p if before=0, (1 - offset_p) if before=1,
    # i.e. abs(before - offset_p); the B variant additionally gets the treatment lift
    A_after  = lmd([abs(x - offset_p) for x in A_before])
    B_after  = lmd([abs(x - offset_p) + treatment_lift for x in B_before])
    return A_before, B_before, A_after, B_after
We can validate by computing conditional probabilities:
from collections import defaultdict

counts = defaultdict(int)
for b, a in zip(A_before, A_after):
    counts[(b, a)] += 1  # joint counts of (before, after)
    counts[b] += 1       # marginal counts of before
print('P(after = 1 | before = 0) = %.2f' % (counts[(0, 1)]/counts[0]))
print('P(after = 0 | before = 0) = %.2f' % (counts[(0, 0)]/counts[0]))
print('P(after = 1 | before = 1) = %.2f' % (counts[(1, 1)]/counts[1]))
print('P(after = 0 | before = 1) = %.2f' % (counts[(1, 0)]/counts[1]))
Prints:
P(after = 1 | before = 0) = 0.10
P(after = 0 | before = 0) = 0.90 # if it was 0, it's likely to be 0 again
P(after = 1 | before = 1) = 0.90 # if it was 1, it's likely to be 1 again
P(after = 0 | before = 1) = 0.10
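Note that with these parameters, the marginal "after" conversion rate is no longer 10%. For the A variant it works out to roughly 18%, since $P(\textrm{after}=1) = P(\textrm{before}=1) \cdot 0.9 + P(\textrm{before}=0) \cdot 0.1 = 0.1 \cdot 0.9 + 0.9 \cdot 0.1 = 0.18$, and the B variant additionally gets the 1% treatment lift, so roughly 19%. This is why the "after" means in the runs below come out around 0.18 and 0.19, not around 0.10.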
Simulating one experiment
The driver code is, apart from parametrization, the same as before:
N = 10*1000
before_p = 0.1
offset_p = 0.1
treatment_lift = 0.01
A_before, B_before, A_after, B_after = get_AB_samples(before_p, offset_p, treatment_lift, N)
A_after_adjusted, B_after_adjusted = get_cuped_adjusted(A_before, B_before, A_after, B_after)
print('A mean before = %.3f, A mean after = %.3f, A mean after adjusted = %.3f' % (mean(A_before), mean(A_after), mean(A_after_adjusted)))
print('B mean before = %.3f, B mean after = %.3f, B mean after adjusted = %.3f' % (mean(B_before), mean(B_after), mean(B_after_adjusted)))
print('Traditional A/B test evaluation, lift = %.3f, p-value = %.3f' % (lift(A_after, B_after), p_value(A_after, B_after)))
print('CUPED adjusted A/B test evaluation, lift = %.3f, p-value = %.3f' % (lift(A_after_adjusted, B_after_adjusted), p_value(A_after_adjusted, B_after_adjusted)))
Prints (not deterministic):
A mean before = 0.097, A mean after = 0.179, A mean after adjusted = 0.180
B mean before = 0.099, B mean after = 0.191, B mean after adjusted = 0.190
Traditional A/B test evaluation, lift = 0.012, p-value = 0.028
CUPED adjusted A/B test evaluation, lift = 0.010, p-value = 0.021
In this particular run, CUPED did a better job approximating the lift with a lower p-value. However, as we saw before, this is not always the case. Although on average CUPED is a better estimator with lower variance, there are experiment runs where CUPED makes the lift measurement worse. For example, after a few re-runs:
A mean before = 0.098, A mean after = 0.178, A mean after adjusted = 0.180
B mean before = 0.103, B mean after = 0.189, B mean after adjusted = 0.187
Traditional A/B test evaluation, lift = 0.011, p-value = 0.053
CUPED adjusted A/B test evaluation, lift = 0.006, p-value = 0.133
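The get_cuped_adjusted(), lift() and p_value() helpers used above are unchanged from the previous post. For reference, here is a minimal sketch of what they do, assuming the pooled covariance estimate of $\theta = \frac{cov(X, Y)}{var(X)}$ described there (see the previous post for the exact code):

from numpy import cov, mean
from scipy import stats

def lift(A, B):
    # measured lift: difference of the B and A means
    return mean(B) - mean(A)

def p_value(A, B):
    # two-sample t-test p-value
    return stats.ttest_ind(A, B)[1]

def get_cuped_adjusted(A_before, B_before, A_after, B_after):
    # theta = cov(X, Y) / var(X), estimated on the pooled A and B samples
    cv = cov([A_after + B_after, A_before + B_before])
    theta = cv[0, 1] / cv[1, 1]
    mean_before = mean(A_before + B_before)
    # Y'_i = Y_i - (X_i - mu_X) * theta
    A_adjusted = [after - (before - mean_before) * theta for after, before in zip(A_after, A_before)]
    B_adjusted = [after - (before - mean_before) * theta for after, before in zip(B_after, B_before)]
    return A_adjusted, B_adjusted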
Next, it's interesting to print out the CUPED-transformed variables:
print('Possible mappings:')
mappings = set(['(before=%d, after=%d) -> adjusted=%.3f' % (b, a, p)
    for b, a, p in zip(A_before + B_before, A_after + B_after, A_after_adjusted + B_after_adjusted)])
for m in mappings:
    print(m)
Prints:
Possible mappings:
(before=1, after=0) -> adjusted=-0.722
(before=0, after=1) -> adjusted=1.081
(before=0, after=0) -> adjusted=0.081
(before=1, after=1) -> adjusted=0.278
Remember the CUPED transformation equation was $ Y'_i := Y_i - (X_i - \mu_X) \frac{cov(X, Y)}{var(X)} $. In this equation $\mu_X$, $cov(X, Y)$, and $var(X)$ are constants computed from the experiment results. Both $X_i$ and $Y_i$ can take on two values, 0 or 1, so there are 4 possible combinations, hence $Y'_i$ can take on 4 values. In the above run, the values were -0.722, 1.081, 0.081 and 0.278. It's a bit counter-intuitive, since now the conversion experiment has these weird, even negative values, instead of just 0 and 1.
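As a sanity check, these four values can be reproduced approximately by plugging round numbers into the formula. In the sketch below, $\mu_X \approx 0.1$ (the base conversion) and $\theta = \frac{cov(X, Y)}{var(X)} \approx 0.8$ are hypothetical round values close to what this run estimates:

# hypothetical round values, close to this run's estimates of mu_X and theta
mu_X, theta = 0.1, 0.8
for before in (0, 1):
    for after in (0, 1):
        adjusted = after - (before - mu_X) * theta
        print('(before=%d, after=%d) -> adjusted=%.2f' % (before, after, adjusted))

This prints approximately 0.08, 1.08, -0.72 and 0.28, matching the values above up to the run-to-run noise in $\mu_X$ and $\theta$.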
Simulating many experiments
As with continuous variables, CUPED measures the same conversion lift, but with lower variance:
Simulating 10,000 A/B tests, true treatment lift is 0.010...
Traditional A/B testing, mean lift = 0.010, variance of lift = 0.00030
CUPED adjusted A/B testing, mean lift = 0.010, variance of lift = 0.00019
CUPED lift variance / traditional lift variance = 0.626 (expected = 0.668)
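Theoretically, CUPED reduces the variance by a factor of $1 - \textrm{corr}(X, Y)^2$, which is presumably where the expected ratio above comes from. The many-experiments driver itself is the same as in the previous post; below is a minimal sketch of what such a loop could look like, reusing the functions above (simulate_many and num_experiments are names assumed here, not taken from the notebook):

from numpy import mean, var

def simulate_many(num_experiments, N, before_p, offset_p, treatment_lift):
    lifts_traditional, lifts_cuped = [], []
    for _ in range(num_experiments):
        A_before, B_before, A_after, B_after = get_AB_samples(before_p, offset_p, treatment_lift, N)
        A_adj, B_adj = get_cuped_adjusted(A_before, B_before, A_after, B_after)
        lifts_traditional.append(lift(A_after, B_after))
        lifts_cuped.append(lift(A_adj, B_adj))
    # compare the spread of the measured lifts across experiments
    print('Traditional A/B testing, mean lift = %.3f, variance of lift = %.5f' % (mean(lifts_traditional), var(lifts_traditional)))
    print('CUPED adjusted A/B testing, mean lift = %.3f, variance of lift = %.5f' % (mean(lifts_cuped), var(lifts_cuped)))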
We can observe the tighter lifts on a histogram:
As with continuous variables, the p-values decrease:
As illustrated above, CUPED lifts and p-values are a better estimate with respect to variance, but not in all cases:
No correlation
The simplest way to generate uncorrelated conversion data is to use random draws independently. In code:
def get_AB_samples_nocorr(before_p, treatment_lift, N):
    A_before = [before_p] * N
    B_before = [before_p] * N
    A_after  = [before_p] * N
    B_after  = [before_p + treatment_lift] * N
    return lmd(A_before), lmd(B_before), lmd(A_after), lmd(B_after)
Checking conditional probabilities:
P(after = 1 | before = 0) = 0.10
P(after = 0 | before = 0) = 0.90
P(after = 1 | before = 1) = 0.10
P(after = 0 | before = 1) = 0.90
It's uncorrelated, because P(after = 1) = 0.10 in both the before = 0 and before = 1 cases, so "after" is independent of "before" (and the same holds for P(after = 0)).
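A quick, direct way to see the difference between the two generators is to compare the sample correlation between "before" and "after"; this small check is not in the original notebook, but only reuses the functions defined above:

from numpy import corrcoef

A_before, _, A_after, _ = get_AB_samples(before_p, offset_p, treatment_lift, N)
print('correlated generator:   corr = %.3f' % corrcoef(A_before, A_after)[0, 1])

A_before, _, A_after, _ = get_AB_samples_nocorr(before_p, treatment_lift, N)
print('uncorrelated generator: corr = %.3f' % corrcoef(A_before, A_after)[0, 1])

With the parameters used here, the first prints a correlation of roughly 0.6, while the second prints a value close to zero.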
With this generator function, running num_experiments=10,000, we can observe no variance reduction (since "before" and "after" are uncorrelated) on the lift and p-value histograms:
Conclusion
CUPED works for both continuous and binary (conversion) experiment outcomes, and reduces variance if the "before" and "after" measurements are correlated.