Evaluating Media Mix Model Accuracy: The CRPS Score in Backtesting

Can you trust your media mix model? It’s an obvious question, but one that isn’t asked often enough. Everyone using MMM should consider it, whether the model is built in-house or by a vendor.

As you try to answer it, you’ll run into overfitting, one of the biggest problems in MMM.

Overfitting happens because MMM models are so powerful that they can capture the noise along with the signal in the historical data they’re trained on. A model can look like it has picked up true causal relationships on the training data, but then fail when you feed it new, unseen data.

Backtesting is one of the ways we check for this.

Backtesting in Bayesian MMM:

With backtesting, we want to evaluate how well our MMM can predict the future. To do this, we train the model using only data up to some point in the past (say, 3 months ago) and then ask it to “predict” the next 3 months.

The model hasn’t seen those 3 months of data, but we have, so we can evaluate the model’s forecast accuracy. It’s called “holdout forecast accuracy” since we have “held out” the last 3 months of data.
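To make the setup concrete, here’s a rough sketch in plain Python. The data and the “model” are synthetic stand-ins, not Recast’s actual pipeline; the point is just the train/holdout split and the shape of the object we end up scoring.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for ~3 years of weekly revenue; in practice this is real historical data.
weeks = 156
weekly_revenue = 100 + 10 * np.sin(np.arange(weeks) / 8) + rng.normal(0, 5, weeks)

holdout_weeks = 13  # roughly the last 3 months
train = weekly_revenue[:-holdout_weeks]    # the only data the model is allowed to see
holdout = weekly_revenue[-holdout_weeks:]  # held out: what actually happened

# Stand-in for a fitted Bayesian MMM's forecast: 500 simulated paths over the holdout window,
# one row per draw from the model, one column per holdout week.
forecast_draws = rng.normal(train.mean(), train.std(), size=(500, holdout_weeks))

print(forecast_draws.shape, holdout.shape)  # (500, 13) vs. (13,)
```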

If the model can consistently forecast the future, with no information leakage and in the face of exogenous budget changes, then it has likely picked up the true causal relationships.

But it’s not so straightforward: 

Recast uses a Bayesian model, which doesn’t produce a single estimate. It produces 500 simulations, each corresponding to a different draw from the model.

If you’ve ever seen a hurricane forecast plot showing the different possible paths the storm could take, it’s a little bit like that.

Now we want to take the 500 simulations we produced, compare them against what actually happened, and boil that comparison down to a single number. How do we do that?

What is the CRPS score and why do we use it in backtesting?

One option would be to compare the mean of our forecast against the actuals, but that wouldn’t let us penalize very wide forecasts relative to narrower ones (even if both were “right” and the actuals went straight down the middle).

You could also average the error across all of the draws, but then you wouldn’t be able to privilege a forecast that brackets the true value over one that is systematically off in one direction.

The solution we found is the Continuous Ranked Probability Score (CRPS), which takes the concept of mean absolute error and applies it to forecasts that come in the form of a distribution.

It’s a statistical measure used to evaluate the accuracy of probabilistic forecasts, comparing predicted distributions to observed outcomes. 
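In textbook notation (nothing Recast-specific), for a forecast with cumulative distribution function F and an observed value y, the CRPS is

```latex
\mathrm{CRPS}(F, y)
  = \int_{-\infty}^{\infty} \bigl( F(x) - \mathbf{1}\{x \ge y\} \bigr)^2 \, dx
  = \mathbb{E}_F \lvert X - y \rvert \;-\; \tfrac{1}{2}\, \mathbb{E}_F \lvert X - X' \rvert,
```

where X and X′ are independent draws from the forecast distribution. The second form is what makes it easy to estimate from a set of simulations, and when the forecast collapses to a single point it reduces to the plain absolute error, which is why CRPS is the natural distributional analogue of mean absolute error.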

If you have a really wide forecast and a really narrow forecast that are both dead on, it’s going to give a better score to the really narrow forecast.

If you have two forecasts with the same average error, but one is systematically over-predicting while the other brackets the actual value, it’s going to privilege the one that brackets it.

Because it penalizes forecasts that are overly broad or uncertain and rewards more precise predictions, even when average errors are similar, CRPS is a great way to assess the quality of our backtests.
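To make that concrete, here’s a minimal sample-based sketch in plain NumPy (not Recast’s production scoring code). It estimates the CRPS from a set of draws using the second form above, then scores a few toy 500-draw forecasts against a known actual to show exactly the behavior described: narrow beats wide when both are centered, and a bracketing forecast beats a systematically biased one even when their average per-draw errors match.

```python
import numpy as np

def crps_ensemble(draws, observed):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over the forecast draws."""
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - observed))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
actual = 100.0

narrow = rng.normal(100, 5, 500)   # tight forecast centered on the actual
wide = rng.normal(100, 50, 500)    # very wide forecast, also centered on the actual
print(crps_ensemble(narrow, actual))  # lowest (best) score
print(crps_ensemble(wide, actual))    # roughly 10x worse: penalized for being overly broad

# Two forecasts whose draws have the same average absolute error of 10:
bracketing = np.array([actual - 10, actual + 10] * 250)  # straddles the actual
one_sided = np.full(500, actual + 10)                    # systematically 10 too high
print(crps_ensemble(bracketing, actual))  # 5.0: rewarded for bracketing the actual
print(crps_ensemble(one_sided, actual))   # 10.0: penalized for one-sided bias
```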

Unlike simple difference measures, CRPS also takes into account the ranked probability of outcomes, making it more robust against outliers and skewed predictions.

There are a lot of technical details we could get into with what the CRPS is under the hood, but it has a number of great properties, and that’s why we use it to assess the quality of our backtests.
