Understanding Facebook’s PlanOut A/B testing system

Marton Trencseni - Fri 22 May 2020 - Data

Introduction

PlanOut is a framework for online field experiments. It was created to make it easy to run and iterate on sophisticated experiments in a statistically sound manner.

In previous A/B testing posts I focused on the mathematical aspects. This time, I will explain Facebook’s PlanOut product for A/B testing. PlanOut can be used to declare and configure A/B tests and to assign users to buckets (A, B, etc.) in production. PlanOut was released in 2014; its main author is Eytan Bakshy. This post is based on a PyData talk Eytan gave in 2014.

Unit of A/B testing

Let’s pretend we’re running A/B tests on our website, and we use Python/Django. Using PlanOut, we subclass SimpleExperiment and define the assign() function, like:

from planout.experiment import SimpleExperiment
from planout.ops.random import *

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)

In case A the button size is 100, in case B the button size is 120, and we want to see whether B lifts the click-through rate (CTR) of the button.

Once the class is declared, we can create an instance by passing in the unit (in this case, user_id), and then we can retrieve which experimental bucket the user is in by retrieving the params, like:

for i in xrange(10):
    e = MyExperiment(user_id=i)
    print(i, e.get('button_size'))

It will print:

(0, 120)
(1, 120)
(2, 100)
(3, 120)
(4, 120)
(5, 120)
(6, 100)
(7, 120)
(8, 100)
(9, 100)

Exposure logging and power

Continuing the above example, a file MyExperiment.log will be generated, which looks like:

{"inputs": {"user_id": 0}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 1}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 2}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 3}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 4}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 5}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 6}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 7}, "name": "MyExperiment", "params": {"button_size": 120}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 8}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}
{"inputs": {"user_id": 9}, "name": "MyExperiment", "params": {"button_size": 100}, "time": 1590153690, "salt": "MyExperiment", "event": "exposure"}

This is called the exposure log of the experiment. In a production environment this would be funneled into an event stream processing system and eventually stored in a data warehouse, where the experimental results can be evaluated. Note that PlanOut only runs the experiment; evaluation is up to the user.

It is important that the exposure log is only emitted when the experimental parameter button_size is retrieved, not when the experiment object is created in the e = MyExperiment(user_id=i) line. This matters for maximum statistical power. When we compare the CTR for A and B, it's clear how many users in A and B clicked through; let's say it's 99 and 105. But what was the sample size for A and B? It's important to only count users who were actually exposed to the experiment, in this case users who were actually shown the button. The best way to achieve this programmatically is to emit the exposure log as late as possible, when the parameter (button_size) is retrieved. There's no guarantee this is enough: a novice programmer could still write code that retrieves and saves these parameters in a database for later use, but it's the best an A/B testing framework can do.
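To make the evaluation step concrete, here is a minimal sketch of how the exposure log could be joined with a click log to compute per-bucket sample sizes and CTR. The clicks.log file and its format are assumptions for illustration, not part of PlanOut:

import json
from collections import defaultdict

# Users exposed to each bucket, taken from the exposure log.
exposed = defaultdict(set)
with open('MyExperiment.log') as f:
    for line in f:
        event = json.loads(line)
        if event['event'] == 'exposure':
            bucket = event['params']['button_size']
            exposed[bucket].add(event['inputs']['user_id'])

# Hypothetical click log: one JSON object per line, like {"user_id": 3}.
clicked = set()
with open('clicks.log') as f:
    for line in f:
        clicked.add(json.loads(line)['user_id'])

for bucket, users in exposed.items():
    n = len(users)                     # sample size = users actually exposed
    k = len(users & clicked)           # exposed users who clicked
    print(bucket, n, k, k / float(n))  # bucket, exposures, clicks, CTR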

Pseudo-random but deterministic through hashing

If you run the above code on your computer, you will notice that you get the same results! In other words, if you create a class called MyExperiment like the one above and pass in user_id=0, you will also get button_size=120, every time. Re-running the code always produces the same results.

This is because PlanOut is deterministic. It doesn't actually use a random number generator to decide whether to put a user into bucket A or B. Instead, it uses hashing: it takes the salt of the experiment (by default, the salt is the name of the class, MyExperiment), combines it with the parameter name (button_size) and the unit value (the unit of the experiment is the user_id, e.g. 0), and computes the SHA1 hash, e.g. SHA1('MyExperiment.button_size.0'). Since the experiment is set up as a uniform choice between 2 cases, the result of SHA1() is taken mod 2.
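The idea can be illustrated in a few lines of Python. This is a simplified sketch of the scheme, not PlanOut's exact implementation (PlanOut only uses part of the SHA1 digest, but the principle is the same):

import hashlib

def assign_bucket(experiment_salt, param_salt, unit, num_buckets):
    # Hash "<experiment salt>.<parameter salt>.<unit value>" and map it
    # deterministically to one of num_buckets buckets.
    key = '%s.%s.%s' % (experiment_salt, param_salt, unit)
    digest = hashlib.sha1(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

# Same inputs always produce the same bucket:
print(assign_bucket('MyExperiment', 'button_size', 0, 2))
print(assign_bucket('MyExperiment', 'button_size', 0, 2))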

Determinism is important for a number of reasons:

  • if a user comes back a 2nd time, we want her to be in the same experimental bucket (A or B) as the 1st time to keep the experiment consistent
  • this way, as long as we know which users were exposed to the experiment (in this case, their user_ids), we can reconstruct which bucket they were in, even if logs are lost; in other words, in the case above, even if the logs did not contain the assigned button_size, we could recompute it when evaluating the experiment at the end, as the sketch below shows.
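A sketch of such offline reconstruction, simply re-instantiating the experiment class with the logged unit. The list of exposed user ids is assumed to come from the exposure log; note that retrieving the parameter triggers exposure logging again, which you would want to suppress in a real evaluation:

# Hypothetical list of exposed users, recovered from the exposure log.
exposed_user_ids = [0, 1, 2, 3, 4]
for uid in exposed_user_ids:
    e = MyExperiment(user_id=uid)
    # deterministic hashing guarantees this is the same value
    # the user was assigned at exposure time
    print(uid, e.get('button_size'))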

See the documentation for more.

Changing salts

We can override the experiment name and salt, so it's not the (default) class name:

class MyExperiment(SimpleExperiment):
  def setup(self):
    self.name = 'My awesome experiment'
    self.salt = '91551c2a7e9117429e43c59ec1e4e8035c19ae15'
    # salt is the result of:
    # date | awk '{print $0"My awesome experiment"}' | shasum

This is useful for a number of reasons:

  • in a large organization, different users of the A/B testing system could accidentally use the same names in their experiments (e.g. BigButtonExperiment and button_size). By explicitly setting the salt, the experiments' assignments and logs will never get mixed up.
  • this way, if the class is renamed to a more descriptive name like BigButtonExperiment during refactoring, the experimental results don't change and users continue to get hashed into the same buckets; if the salt is not set explicitly, renaming the class or the parameter will change the hashing! The sketch below demonstrates this.
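To convince yourself of this, here is a small sketch, reusing the imports from above: two differently named experiment classes that pin the same salt and parameter name assign every user identically.

class MyExperiment(SimpleExperiment):
    def setup(self):
        self.salt = '91551c2a7e9117429e43c59ec1e4e8035c19ae15'

    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)

class BigButtonExperiment(SimpleExperiment):
    def setup(self):
        self.salt = '91551c2a7e9117429e43c59ec1e4e8035c19ae15'  # same salt as above

    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)

for i in xrange(10):
    # the class names differ, but the salt and parameter name are the same,
    # so the hashing (and thus the assignment) is identical
    assert MyExperiment(user_id=i).get('button_size') == \
           BigButtonExperiment(user_id=i).get('button_size')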

The salt can also be explicitly set for the parameters, like:

params.button_size = UniformChoice(choices=[100, 120], unit=user_id, salt='27830a83e56b62d9f7cc03868a80f3a67cb69201')

In a sophisticated environment, the salts can be set automatically the first time the code is checked into source control.
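As a sketch of what such automation could do (this is not part of PlanOut), the salt can be generated once from the current time and the experiment name, mirroring the shell one-liner in the comment above, and then frozen into the experiment's setup():

import hashlib
import time

def generate_salt(experiment_name):
    # run once when the experiment is created; paste the result into setup()
    return hashlib.sha1(('%s%s' % (time.ctime(), experiment_name)).encode('utf-8')).hexdigest()

print(generate_salt('My awesome experiment'))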

Multiple parameters

Suppose we want to also experiment with the color of the button:

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id)
        params.button_color = UniformChoice(choices=['blue', 'green'], unit=user_id)

If we do it like this, it's an A/B/C/D test, because we will get all 2x2 combinations of sizes and colors. But what if we just want an A/B test, with the combinations (button_size, button_color) = (100, blue) and (120, green)? This can be accomplished by setting the parameter-level salts to be the same:

class MyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_size = UniformChoice(choices=[100, 120], unit=user_id, salt='x')
        params.button_color = UniformChoice(choices=['blue', 'green'], unit=user_id, salt='x')
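A quick check that this behaves as intended: with identical salts both parameters hash to the same index, so only the (100, 'blue') and (120, 'green') combinations ever appear.

for i in xrange(10):
    e = MyExperiment(user_id=i)
    # only (100, 'blue') and (120, 'green') will be printed
    print(i, e.get('button_size'), e.get('button_color'))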

Chaining

Suppose we have a baseline recommendation engine, and we are experimenting with a new version, but we're not sure how to tune some variables in the new engine. For each user, we want to pick a variable setting, and in each user session, we want to use either the baseline or the new engine with the appropriate variables.

class RecommendationEngineExperiment(SimpleExperiment):
  def assign(self, params, user_id, session_id):
    params.new_model = UniformChoice(choices=['v200', 'v201', 'v202'], unit=user_id)
    params.session_model = UniformChoice(choices=['v100', params.new_model], unit=[user_id, session_id])

for i in xrange(10):
    for j in xrange(10):
        e = RecommendationEngineExperiment(user_id=i, session_id=j)
        print(i, j, e.get('session_model'))

What we accomplish here is that in each session, the user either gets v100 or one of the new ones, but for a given user the new one never changes; e.g. user_id=0 gets either v100 or v202 in every session.
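A short sketch to verify this property: for each user, collect the session models seen across sessions; the set is always a subset of the baseline plus that user's single new-engine variant.

for i in xrange(10):
    new_model = RecommendationEngineExperiment(user_id=i, session_id=0).get('new_model')
    seen = set()
    for j in xrange(10):
        e = RecommendationEngineExperiment(user_id=i, session_id=j)
        seen.add(e.get('session_model'))
    # each user only ever sees the baseline or their one assigned new variant
    assert seen <= set(['v100', new_model])
    print(i, new_model, sorted(seen))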

Conclusion

PlanOut was released in 2014 and never took off. But its design principles are solid, so it's good practice to either study PlanOut when designing an A/B testing framework, or just use PlanOut as-is.