Understanding Facebook’s Planout A/B testing framework

Marton Trencseni - Fri 22 May 2020 - Data

Introduction

In previous A/B testing posts I focused on the mathematical aspects. This time, I will explain Facebook’s Planout framework for A/B testing. Planout can be used to declare and configure A/B tests and to assign users into buckets (A, B, etc.) in production. Planout was released in 2014; its main author is Eytan Bakshy. This post is based on a PyData talk Eytan gave in 2014.

Unit of A/B testing

Let’s pretend we’re running A/B tests on our website, built with Python/Django. Using Planout, we subclass SimpleExperiment and define the assign() method:

from planout.experiment import SimpleExperiment
from planout.ops.random import *

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)

In case A the button size is 100, in case B it is 120, and we want to see whether B lifts the click-through rate (CTR) of the button.

Once the class is declared, we can create an instance by passing in the unit (in this case, user_id), and then find out which experimental bucket the user is in by retrieving the parameter button_size:

for i in range(10):
    e = MyExperiment(user_id=i)
    print(i, e.get('button_size'))

It will print:

0 120
1 120
2 100
3 120
4 120
5 120
6 100
7 120
8 100
9 100

Exposure logging and power

Continuing the above example, a file MyExperiment.log will be generated, which looks like:

{"inputs": {"user_id": 0}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 1}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 2}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 3}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 4}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 5}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 6}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 7}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 8}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 9}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}

This is called the exposure log of the experiment. In a production environment this would be funneled into an event stream processing system and eventually stored in a data warehouse, where the experimental results can be evaluated. Note that Planout only deals with running the experiment; it does not deal with evaluating the results, such as hypothesis testing.
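
To make the evaluation step concrete, here is a minimal sketch of how the exposure log could be aggregated offline. The clicks dictionary is a hypothetical stand-in for click events collected by separate instrumentation; it is not something Planout produces.

import json
from collections import defaultdict

# Hypothetical click data keyed by user_id, collected by separate instrumentation.
clicks = {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 1}

exposed = defaultdict(set)  # bucket (button_size) -> set of exposed user_ids

with open('MyExperiment.log') as f:
    for line in f:
        event = json.loads(line)
        if event['event'] != 'exposure':
            continue
        exposed[event['params']['button_size']].add(event['inputs']['user_id'])

for bucket, users in sorted(exposed.items()):
    n = len(users)
    c = sum(clicks.get(u, 0) for u in users)
    print('button_size=%d  exposed=%d  clicks=%d  CTR=%.2f' % (bucket, n, c, c / n))

In a real pipeline the same aggregation would happen in the data warehouse, for example in SQL, rather than by reading the log file directly.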

It is important that the exposure logs are only emitted when the experimental parameter button_size is retrieved, not when the experiment object is created in the e = MyExperiment(user_id=i) line. This is what gives us maximum statistical power. When we compare the CTR for A and B, it's clear how many users in A and B clicked through; let's say it's 990 and 1051. But what was the sample size for A and B? It's important to only count users who were actually exposed to the experiment, in this case users who were actually shown the button. Why? Because a difference of 1051 - 990 = 61 clicks is much more significant if 5,000 users were exposed to the experiment than if 500,000 were.

The best way to achieve this programmatically is to emit the exposure logs as late as possible, when the parameter (button_size) is retrieved. There's no guarantee this is enough: a novice programmer could still write code that retrieves these parameters up front and saves them in a database for later use, but it's the best an A/B testing framework can do.
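
To sketch what this looks like in application code, the hypothetical Django-style view below only calls e.get('button_size') inside the branch that actually renders the button, so the exposure event is logged only for users who see it. The view, template and show_buy_button() helper are made up for illustration.

from django.shortcuts import render

def product_page(request):
    context = {}
    if show_buy_button(request.user):  # hypothetical business logic deciding whether the button is shown
        # The exposure event is logged here, at get() time, only for users
        # who will actually see the button.
        e = MyExperiment(user_id=request.user.id)
        context['button_size'] = e.get('button_size')
    return render(request, 'product_page.html', context)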

Pseudo-random but deterministic through hashing

If you run the above code on your computer, you will notice that you get the same results as shown above! In other words, if you create a class called MyExperiment like above and pass in user_id=0, you will also get button_size=120. And if you re-run the code, you will always get the same results.

This is because Planout is deterministic. It doesn't actually use a random number generator to decide whether to put a user into bucket A or B. Instead, it uses hashing: it takes the salt of the experiment (by default, the salt is the name of the class, MyExperiment), combines it with the parameter name (button_size) and the unit value (the unit of the experiment is the user_id, e.g. 0), and computes a SHA1 hash, e.g. SHA1('MyExperiment.button_size.0'). Since the experiment is set up to be a uniform choice between 2 cases, the result of the SHA1() is taken modulo 2.
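
Here is a rough sketch of the idea. This is a simplification, not Planout's exact implementation (Planout combines the experiment salt, parameter salt and unit with a separator and only uses part of the SHA1 digest), so the concrete assignments will not necessarily match the output above.

import hashlib

def toy_assign(experiment_salt, param_name, unit, choices):
    # Combine experiment salt, parameter name and unit value into one string,
    # hash it, and map the hash onto one of the choices.
    key = '%s.%s.%s' % (experiment_salt, param_name, unit)
    digest = hashlib.sha1(key.encode('utf-8')).hexdigest()
    return choices[int(digest, 16) % len(choices)]

for user_id in range(10):
    print(user_id, toy_assign('MyExperiment', 'button_size', user_id, [100, 120]))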

Determinism is important for a number of reasons:

  • if a user comes back a 2nd time, we want her to be in the same experimental bucket (A or B) as the 1st time to keep the experiment consistent
  • as long as we know which users got exposed in the experiment (the user_id), we can reconstruct which bucket they were in, even if logs are lost; in other words, in the case above, even if the logs did not contain the assigned button_size, we could re-compute it, as sketched below
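
For example, if all we salvaged from the logs were the exposed user_ids (the list below is hypothetical), re-instantiating the experiment is enough to recover the assignments, assuming the experiment class, salt and parameter definitions are unchanged:

# user_ids recovered from whatever logs survived (hypothetical values)
salvaged_user_ids = [0, 3, 7]

for user_id in salvaged_user_ids:
    # Re-instantiating MyExperiment re-computes the same deterministic hash,
    # so we get back the exact bucket the user was in. Note that calling
    # get() will emit fresh exposure log lines as a side effect.
    e = MyExperiment(user_id=user_id)
    print(user_id, e.get('button_size'))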

See the documentation for more.

Changing salts

We can override the experiment name and salt, so they are not the (default) class name:

class MyExperiment(SimpleExperiment):
    def setup(self):
        self.name = 'My awesome experiment'
        self.salt = '91551c2a7e9117429e43c59ec1e4e8035c19ae15'
        # salt is the result of:
        # date | awk '{print $0"My awesome experiment"}' | shasum

This is useful for a number of reasons:

  • in a large organization, different users of the A/B testing system could accidentally use the same names in their experiments (e.g. MyExperiment and button_size). By explicitly setting the salt, experimental results will never get mixed up.
  • this way, if the class is renamed to a more descriptive name like BigButtonExperiment during refactoring, the experimental results don't change: users will continue to get hashed into the same buckets. If the salt is not set explicitly, renaming the class or the parameter will change the hashing!

The salt can also be explicitly set for the parameters, like:

params.button_size = UniformChoice(choices=[100, 120], unit=user_id, salt='27830a83e56b62d9f7cc03868a80f3a67cb69201')

In a sophisticated environment, the salts can be set automatically the first time the code is checked into source control.
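
As a sketch of what that automation could look like, the hypothetical helper below generates a salt roughly the same way as the shell one-liner in the comment above (hashing the current date plus the experiment name); it could be wired into a pre-commit hook or a code review bot.

import hashlib
import time

def generate_salt(experiment_name):
    # Hash the current date plus the experiment name, so the salt is
    # unique, and stable once it is recorded in source control.
    payload = time.strftime('%c') + experiment_name
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

print(generate_salt('My awesome experiment'))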

Multiple parameters

Suppose we want to also experiment with the color of the button:

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)
        params.button_color = UniformChoice(choices=['blue', 'green'], unit=user_id)

If we do it like this, it's an A/B/C/D test, because we will get all 2x2 combinations of sizes and colors. But what if we just want an A/B test, with only the combinations (button_size, button_color) = (100, blue) and (120, green)? This can be accomplished by setting the parameter-level salts to be the same:

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id, salt='x')
        params.button_color = UniformChoice(choices=['blue', 'green'], unit=user_id, salt='x')
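
Since both parameters now hash with the same salt, they always land on the same choice index. A quick, hypothetical sanity check is to assert that only the two intended combinations ever occur (note that this also emits exposure log lines):

for i in range(100):
    e = MyExperiment(user_id=i)
    combination = (e.get('button_size'), e.get('button_color'))
    # With identical parameter salts, both UniformChoices pick the same index,
    # so only these two pairs can occur.
    assert combination in [(100, 'blue'), (120, 'green')]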

Chaining

Suppose we have a baseline recommendation engine v100, and we are experimenting with a new version, but we're not sure how to tune the new engine. For each user, we want to pick one of the tuned new engines (v200, v201 or v202), and in each session, we want to use either the baseline or the new engine (but always the same new engine for the same user).

class RecommendationEngineExperiment(SimpleExperiment):
    def assign(self, params, user_id, session_id):
        params.new_model = UniformChoice(choices=['v200', 'v201', 'v202'], unit=user_id)
        params.session_model = UniformChoice(choices=['v100', params.new_model], unit=[user_id, session_id])

for i in range(10):
    for j in range(10):
        e = RecommendationEngineExperiment(user_id=i, session_id=j)
        print(i, j, e.get('session_model'))

What we accomplish here is that in each session the user gets either v100 or one of the new engines, but for a given user the new engine never changes; e.g. user_id=0 gets either v100 or v202 in every session.
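
A hypothetical sanity check for this property: collect the models served to each user across sessions and assert that at most one non-baseline version ever shows up per user.

from collections import defaultdict

models_per_user = defaultdict(set)
for i in range(10):
    for j in range(20):
        e = RecommendationEngineExperiment(user_id=i, session_id=j)
        models_per_user[i].add(e.get('session_model'))

for user_id, models in sorted(models_per_user.items()):
    # Each user should only ever see v100 plus (at most) one of the new versions.
    assert len(models - {'v100'}) <= 1
    print(user_id, sorted(models))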

Conclusion

Planout's design principles are solid and still apply today, so it's good practice to either study Planout when designing your own A/B testing framework, or just use Planout as-is.