Why is A/B testing the golden standard of causal inference?

Marton Trencseni - Mon 31 July 2023 - Data


Why is A/B testing widely regarded as the golden standard of causal inference? What else could a Data Scientist do, and why is it inferior to A/B testing?

In the previous post, I discussed a Data Scientist faced with the challenge of attributing long-term sales to marketing channels and modeling their effectiveness and response curves. In the industry, this problem is solved using multivariate regression methods commonly referred to as Marketing Mix Modeling (MMM).

Now, let's look at another problem: imagine a Data Scientist who is faced with evaluating the effectiveness of an online marketing campaign, but for some reason A/B testing is not possible: either the company does not believe in A/B testing, or it is afraid that holding out a control group would mean missing out on sales, or the campaign has already concluded, so a post-facto measurement is needed. In such a case, what techniques are available to the Data Scientist, and what kind of accuracy can she expect compared to an A/B test?

Facebook paper

First, let's look at the motivation from the introduction of the paper "A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook" by Gordon, Zettelmeyer et al.:

In practice, few online ad campaigns rely on Randomized Controlled Trials (RCT). Reasons range from the technical difficulty of implementing experimentation in ad-targeting engines to the commonly held view that such experimentation is expensive and often unnecessary relative to alternative methods. Thus, many advertisers and leading ad-measurement companies rely on observational methods to estimate advertising’s causal effect.

In the paper we assess empirically whether the variation in data typically available in the advertising industry enables observational methods to recover the causal effects of online advertising. To do so, we use a collection of 15 large-scale advertising campaigns conducted on Facebook as RCTs in 2015. We use this dataset to implement a variety of matching and regression-based methods and compare their results with those obtained from the RCTs.
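To make the comparison concrete, here is a minimal simulated sketch of why an observational estimate can disagree with an RCT. It is not the paper's method or data: the confounder (`activity`), the logistic response shapes, and the `true_lift` value are all assumptions chosen for illustration. In the observational arm, more active users are both more likely to see the ad and more likely to buy anyway; in the RCT arm, exposure is a coin flip.

```python
import numpy as np


def simulate(n=200_000, true_lift=0.02, seed=0):
    """Return (naive observational estimate, RCT estimate) of ad lift.

    Everything here is hypothetical: a single confounder, logistic
    response curves, and an additive lift of `true_lift` on purchase
    probability.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical confounder: how active a user is on the platform.
    activity = rng.normal(size=n)

    # Baseline purchase probability rises with activity, ad or no ad.
    base_rate = 1.0 / (1.0 + np.exp(-(-3.0 + 1.0 * activity)))

    # Observational setting: active users are more likely to be exposed.
    p_exposed = 1.0 / (1.0 + np.exp(-1.5 * activity))
    exposed_obs = rng.random(n) < p_exposed
    p_buy_obs = np.clip(base_rate + true_lift * exposed_obs, 0.0, 1.0)
    bought_obs = rng.random(n) < p_buy_obs
    naive = bought_obs[exposed_obs].mean() - bought_obs[~exposed_obs].mean()

    # RCT setting: exposure is randomized, independent of activity.
    exposed_rct = rng.random(n) < 0.5
    p_buy_rct = np.clip(base_rate + true_lift * exposed_rct, 0.0, 1.0)
    bought_rct = rng.random(n) < p_buy_rct
    rct = bought_rct[exposed_rct].mean() - bought_rct[~exposed_rct].mean()

    return naive, rct


if __name__ == "__main__":
    naive, rct = simulate()
    print(f"true lift:           {0.02:.4f}")
    print(f"naive observational: {naive:.4f}")
    print(f"RCT estimate:        {rct:.4f}")
```

Under these assumptions, the RCT difference in means lands close to the true lift, while the naive exposed-vs-unexposed comparison overstates it, because it also picks up the difference in baseline purchase rates between more and less active users. Observational methods such as matching and regression try to remove that gap by adjusting for covariates, which only works to the extent that the relevant confounders are actually observed.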

Facebook dataset

The data possesses several key attributes that should facilitate the performance of observational methods. First, we observe an unusually rich set of user-level, user-time-level, and user-time-campaign-level covariates. Second, our campaigns have large sample sizes (from 2 million to 140 million users), giving us both statistical power and means to achieve covariate balance. Third, whereas most advertising data are collected at the level of a web browser cookie, our data are captured at the user level, regardless of the user’s device or browser, ensuring our covariates are measured at the same unit of observation as the treatment and outcome. Although our data do not correspond exactly to what an advertiser would be able to observe (either directly or through a third-party measurement vendor), our intention is to approximate the data many advertisers have available to them, with the hope that our data are in fact better.

Observational methods


Accuracy of methods vs A/B testing