Exploring prior beliefs with MCMC

Marton Trencseni - Sat 06 July 2019 - Math

Introduction

In the previous post Food deliveries, Bayes and Computational Statistics a first principles simulation was used to calculate the Bayes posterior probability of UberEats being the most popular food delivery service. After I finished writing the post, I was left feeling unsatisfied: I wrote too much simulation code. I realized I could've used a Markov Chain Monte Carlo (MCMC) framework to get the same result with less code. The notebook is up on Github.

So let's do that. Let's use PyMC3, the standard MCMC library for Python. The code is so simple that, without further ado, here it is:

import pymc3 as pm
import numpy as np
%matplotlib inline

data = {
    'Carriage':  2,
    'Talabat':   2,
    'UberEats':  16,
    'Deliveroo': 12,
}

def uniform_prior(k):
    # k independent Uniform(0, 1) RVs, normalized so the probabilities sum to 1
    rs = [pm.Uniform('raw%s' % i, 0, 1) for i in range(k)]
    s = sum(rs)
    return [pm.Deterministic('p%s' % i, rs[i]/s) for i in range(k)]

def run_model(prior, data):
    with pm.Model() as model:
        vals = list(data.values())
        k = len(vals)
        ps = prior(k)
        # the observed counts "fix" the multinomial RV
        multi = pm.Multinomial('multi', n=sum(vals), p=ps, observed=vals)
        trace = pm.sample(draws=10*1000, progressbar=True)
        return pm, trace

pm, trace = run_model(uniform_prior, data)

Okay, what's happening here? PyMC3 is an MCMC library: it computes representative samples of random variables based on some data. We specify a model, which is essentially a recipe for computing a desired random variable (RV), i.e. a probability distribution function (pdf), from other random variables. In this case, we say that multi is a multinomial RV, with the parameters of multi themselves being RVs; these are ps in the code. When we specify multi, we pass in our observed values, so we "fix" this RV. The ps RVs are constructed in the prior helper function, named prior because this is where we encode our prior belief about the food delivery business. To match the previous post, the prior doesn't assume any courier is more popular than the others, and it is constructed so that the probabilities sum to 1.

When we call sample() on the model, a Markov Chain Monte Carlo simulation is performed; the details of this are beyond the scope of this article. The important thing is that after the simulation is complete, trace contains a sample of all free (non-fixed) RVs, updated according to the observations passed to the fixed RVs. I.e. when we specified these RVs, we gave our prior belief pdfs, $ p(H) $, whereas trace now contains samples from the posterior pdf, $ p(H|D) $. That's exactly what we worked for in the previous post!
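For example, each named RV can be pulled out of the trace as a plain NumPy array of posterior samples. Just a quick sketch (the exact number of draws depends on how many chains PyMC3 ran):

p2_samples = trace['p2']   # posterior samples for UberEats' probability parameter
len(p2_samples)            # total number of draws, across all chains
p2_samples.mean()          # posterior mean, ~0.44 for this data (see the summary() output below)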

Given this sample that PyMC3 gives us, it's easy to calculate the probability of our hypothesis. We simply count the fraction of draws in which the probability parameter of UberEats is the maximum in the trace:

k = len(data)
hypothesis = [trace['p2'][i] == max([trace['p%s' % j][i] for j in range(k)]) for i in range(len(trace['p0']))]
sum(hypothesis) / len(hypothesis)

The program outputs a probability of about 72%, like in the original post.
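An equivalent way to compute the same number, shown here just as a sketch, is to stack the per-courier samples into a matrix and take a row-wise argmax (UberEats is index 2 in the data dict above):

import numpy as np

k = len(data)
samples = np.column_stack([trace['p%s' % j] for j in range(k)])  # shape: (draws, k)
(np.argmax(samples, axis=1) == 2).mean()                         # fraction of draws where UberEats wins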

PyMC3 has a few useful helper functions. One is pm.summary(trace), which outputs some statistics about the RVs:

[Figure: pm.summary(trace) output]

The other is pm.traceplot(trace), which outputs the pdf and the random walk for each RV:

[Figure: pm.traceplot(trace) output]

There is something weird here. Looking at the original observation data, half of the deliveries are from UberEats (16/32). The corresponding RV in the summary() output is p2. So why is the posterior mean for p2 0.44 and not closer to 0.50? Intuitively, after looking at the data, shouldn't that be the maximum likelihood (ML) prediction? Let's think this through:

  • the mean is not the ML estimate, but looking at the posterior curve for p2, the maximum of the curve (= the ML estimate) is also not at 0.5
  • it could be that the simulation has not reached a stationary state (in Markov chain language); but looking at the random walk output, it has (and to double-check, I ran it longer; this is not the problem)
  • numerical error: this is not the reason either

The reason is simple: it's our prior belief. Our prior belief was a uniform distribution. We observed some data and showed it to the model, and based on this our posterior belief was updated, but the prior belief is still baked in. The way to think about this is that the prior pdf slowly morphs into the posterior pdf as data is fed to the model.
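To put a rough number on this pull, consider the conjugate case: with the flat Dirichlet prior that appears at the end of this post ($ \alpha = 1 $ for each of the $ k = 4 $ couriers), the posterior is again a Dirichlet, and its mean for UberEats has a closed form: $ E[p_2|D] = \frac{n_2 + \alpha}{N + k\alpha} = \frac{16 + 1}{32 + 4} = \frac{17}{36} \approx 0.47 $. So even a flat prior drags the posterior mean below the observed ratio of 0.50; the hand-rolled uniform prior above pulls it down further, to the 0.44 seen in the summary.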

One good way to get a feel for this is to imagine feeding a low observation count to the model. Suppose we feed it observations like (1, 0, 2, 1). Should we now believe that Talabat never gets orders? Clearly not.
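As a quick sketch of this using the helpers defined above (the mapping of the counts (1, 0, 2, 1) onto the couriers is my own choice here), we can re-run the model on the tiny dataset and look at the posterior for Talabat:

small_data = {
    'Carriage':  1,
    'Talabat':   0,
    'UberEats':  2,
    'Deliveroo': 1,
}
pm, small_trace = run_model(uniform_prior, small_data)
small_trace['p1'].mean()   # posterior mean for Talabat: well above 0, despite 0 observed orders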

Another good way to get a feel for this is to feed the model more data, but with the same ratios, i.e. multiplying each number by 10, like (20, 20, 160, 120). At such high observation counts, the posteriors will be very narrow and close to the actual ratios. This is called flooding the priors: as the model observes a lot of data, the prior belief matters less and less.

[Figure: pm.summary(trace) output for the 10x data]

Also, the hypothesis probability of UberEats being the most popular rises to 98.7%, i.e. it is almost certain.
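As a sketch, the flooded run only requires scaling the data; everything else stays the same:

flooded_data = {name: 10 * count for name, count in data.items()}   # (20, 20, 160, 120)
pm, flooded_trace = run_model(uniform_prior, flooded_data)

k = len(flooded_data)
hypothesis = [flooded_trace['p2'][i] == max([flooded_trace['p%s' % j][i] for j in range(k)])
              for i in range(len(flooded_trace['p0']))]
sum(hypothesis) / len(hypothesis)   # ~0.987 with this much data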

PyMC3 also allows us to easily explore other prior beliefs. For example, we could specify a prior centered around the observed ratios. BAKING OBSERVATIONS INTO THE PRIOR IS A MODELLING MISTAKE, DON'T DO THIS. I only used it as a debugging exercise, to make sure that in such a case the posteriors are what I expect them to be: they remain centered on the observed ratios (yes, they do). The evilness also shows up in the function signature; passing the observed values to the prior shouldn't happen. But in any case, it shows how easy it is to change the priors with PyMC3:

def beta_prior(vals, k):
    # Beta(count, total-count) puts each raw RV's mean at the observed ratio -- don't do this in real models
    rs = [pm.Beta('raw%s' % i, vals[i], sum(vals)-vals[i]) for i in range(k)]
    s = sum(rs)
    return [pm.Deterministic('p%s' % i, rs[i]/s) for i in range(k)]

# run_model() expects a prior that takes only k, so we bind the observed values here
pm, trace = run_model(lambda k: beta_prior(list(data.values()), k), data)

Going back to the original problem, we can make our code shorter by using PyMC3's built-in Dirichlet distribution. The Dirichlet distribution can be parameterized so that it's the uniform distribution in n dimensions:

def dirichlet_uniform_prior(k):
    # a=(1, ..., 1) is the flat (uniform) Dirichlet over the k-dimensional simplex
    return pm.Dirichlet('ps', a=1*np.ones(k))

pm, trace = run_model(dirichlet_uniform_prior, data)
hypothesis = [ps[2] == max(ps) for ps in trace['ps']]
sum(hypothesis) / len(hypothesis)

Running this reveals something interesting. The probability of the hypothesis that UberEats is the most popular comes out to ~77%, which is about 5% more than before. What's going on here? It turns out that the prior we baked in before, although it didn't favor any courier over the others, did have some bias built in. We sampled each individual raw RV uniformly between 0 and 1, and then divided by the sum. Each raw's mean is 0.5 and the sum's mean is 2, so each ps component's mean is 0.25. So by constructing the prior like this, we biased it towards (0.25, 0.25, 0.25, 0.25), i.e. we said this is a more likely prior than e.g. (0.7, 0.1, 0.1, 0.1). This can be seen by using MCMC to just sample the prior, without any observed data:

with pm.Model() as model:
    k = 4
    # the hand-rolled prior: k uniforms, normalized by their sum -- no observed data this time
    rs = [pm.Uniform('raw%s' % i, 0, 1) for i in range(k)]
    s = pm.Deterministic('sum', sum(rs))
    [pm.Deterministic('p%s' % i, rs[i]/s) for i in range(k)]
    trace = pm.sample(draws=10*1000, progressbar=True)
    pm.traceplot(trace)

[Figure: pm.traceplot(trace) output for the normalized-uniform prior]

Looking at the sum:

[Figure: pm.traceplot output for the sum RV]

Versus the Dirichlet distribution (see this blog post for more):

with pm.Model() as model:
    k = 4
    # the flat Dirichlet prior over the 4-dimensional simplex, again sampled with no observed data
    pm.Dirichlet('ps', a=1*np.ones(k))
    trace = pm.sample(draws=10*1000, progressbar=True)
    pm.traceplot(trace)

[Figure: pm.traceplot(trace) output for the Dirichlet prior]

I would say the original post is buggy, because I intended to make no prior assumption (i.e. use the "uninformative prior"), so the Dirichlet distribution (at α=1) is the right prior.