Techtalk: Why are A/B tests the gold standard of causal inference?

Marton Trencseni - Sun 29 September 2024 - Data

Introduction

Recently, I delivered a techtalk on A/B testing to an audience of non-technical Product Managers and experienced Data Scientists who actively conduct A/B tests. Given the varied backgrounds of the attendees, I designed the presentation to ensure that both groups found the content engaging and valuable.

Why even do a techtalk on A/B testing? A/B testing isn’t just a tool—it’s a way of thinking. The idea is simple: test every change, no matter how small, to see its actual effect. Even minor tweaks can lead to significant improvements. For example, at Bing, a small change to ad headlines led to a 12% increase in revenue. This kind of result shows the power of systematic testing.

The primary goal was to create a session that offered something for everyone. For the Product Managers, many of whom may not have a deep technical background, I provided a general, gentle introduction to A/B testing. This segment was inspired by the insightful papers of Ron Kohavi, a renowned expert in the field. It covered the foundational aspects of A/B testing, including:

  • Basic Concepts: What A/B testing is, why it's essential, and how it relates to Lean Product Development and the Build-Measure-Learn loop.
  • Experiment Design: How to set up a controlled experiment with clear objectives and identify the right metrics to evaluate test outcomes (see the sketch after this list).
  • Alternatives: the wild-wild-west approach of shipping changes without measurement, HiPPO decisions (the Highest Paid Person's Opinion), and flawed A/B testing.
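
To make the experiment-design and metrics points concrete, here is a minimal sketch of how a simple conversion-rate A/B test is typically evaluated with a pooled two-proportion z-test. The numbers are made up for illustration and are not from the talk:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions out of users in each variant (illustrative numbers).
conversions_a, users_a = 1_000, 20_000   # A: control
conversions_b, users_b = 1_100, 20_000   # B: treatment

p_a = conversions_a / users_a
p_b = conversions_b / users_b

# Pooled two-proportion z-test for the difference in conversion rates.
p_pooled = (conversions_a + conversions_b) / (users_a + users_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / users_a + 1 / users_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

print(f"control conversion rate   = {p_a:.4f}")
print(f"treatment conversion rate = {p_b:.4f}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these illustrative numbers, the 0.5 percentage point lift comes out statistically significant at the usual 5% level, which is exactly the kind of call a well-designed experiment with a pre-registered metric lets you make.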

For the Data Scientists in the room, who are already familiar with running A/B tests, the talk delved deeper into more sophisticated topics. The second half focused on the empirical findings from Gordon and Zettelmeyer’s paper A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. This part of the presentation emphasized why A/B testing remains the gold standard for causal inference, particularly in complex environments like online advertising.

Their study involved 15 large-scale advertising campaigns at Facebook, comparing the effectiveness of randomized experiments against various observational techniques like Exact Matching (EM), Propensity Score Matching (PSM), Stratification, and Regression Adjustment (RA). The findings were clear:

  • Significant Bias: Observational methods often misestimated the true treatment effect by a factor of three or more.
  • Reliability Issues: Even with large sample sizes and high-quality data, observational methods failed to provide accurate estimates.
  • Causal Inference: The study underscored that without randomization, it’s challenging to eliminate biases and confounding variables, making A/B testing indispensable for reliable causal inference; the small simulation below illustrates this.
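
To make the contrast between randomized and observational estimates concrete, here is a small simulation sketch (it does not reproduce the paper's data or methods): an unobserved confounder such as purchase intent drives both ad exposure and conversion in the observational setting, so the naive difference in conversion rates is badly biased, while a randomized A/B assignment recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_effect = 0.05   # true conversion lift caused by seeing the ad

# Unobserved confounder (e.g. purchase intent): it drives both ad exposure
# in the observational setting and the conversion outcome itself.
intent = rng.normal(size=n)

def naive_estimate(randomized: bool) -> float:
    if randomized:
        # A/B test: exposure assigned by a coin flip, independent of intent.
        exposed = rng.random(n) < 0.5
    else:
        # Observational data: high-intent users are more likely to see the ad.
        exposed = rng.random(n) < 1 / (1 + np.exp(-2 * intent))
    # Conversion probability depends on both intent and exposure.
    p = np.clip(0.10 + 0.05 * intent + true_effect * exposed, 0, 1)
    converted = rng.random(n) < p
    # Naive estimate: difference in conversion rates between the two groups.
    return converted[exposed].mean() - converted[~exposed].mean()

print(f"true effect:            {true_effect:.3f}")
print(f"randomized estimate:    {naive_estimate(randomized=True):.3f}")   # close to 0.05
print(f"observational estimate: {naive_estimate(randomized=False):.3f}")  # heavily inflated
```

Matching or regression adjustment only helps if the observed covariates capture the confounder; in this toy example intent is unobserved, which is the kind of situation the paper argues is common in online advertising, and why randomization remains the gold standard.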

Links:

Slides