How we do holdout testing at Recast

The central modeling challenge for an MMM is the following: given any set of marketing inputs (budget decisions, promotion dates, new product launches, etc.), accurately predict what our customer’s performance will be.

But, how do we judge our model’s ability to meet this challenge? How do we know if it’s doing well? How do we know if it’s improving or degrading? 

This turns out to be complicated — more complicated than most folks, even data scientists, will assume.

Predictive Accuracy

A good place to start, and one that many data scientists are familiar with, is checking the model’s predictive accuracy.

Models are trained on data they have seen, but when put into production, they are used on new observations: data they haven’t seen. To see how well a model is likely to perform when actually used, its developers need to test it on data that has been held out from training.

For time-series models, new data is simply “the future”, at least relative to the last date in the training data. The prediction problem is forecasting, but when we are validating the model — using data that we know but the model does not — it is generally called backtesting.

An MMM is a time-series model. Its inputs are marketing spend and the key dates of promotions and holidays (and a few other things), and its output is a KPI, typically revenue or conversions. You might think at this point that we’re basically done: simply train the model up to a certain date in the past, and see how well it predicts the rest of the data.
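To make that naive recipe concrete, here is a minimal sketch of a time-based holdout on toy data; a plain linear regression stands in for a real MMM, and the column names, dates, and coefficients are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative weekly data: spend by channel plus the KPI (revenue).
# Column names and coefficients are made up for the sketch.
rng = np.random.default_rng(0)
weeks = pd.date_range("2022-01-03", periods=104, freq="W-MON")
df = pd.DataFrame({
    "week": weeks,
    "tv_spend": rng.uniform(10_000, 50_000, len(weeks)),
    "search_spend": rng.uniform(5_000, 20_000, len(weeks)),
})
df["revenue"] = (
    1.8 * df["tv_spend"]
    + 3.0 * df["search_spend"]
    + rng.normal(0, 10_000, len(weeks))
)

# Naive backtest: train only on weeks before the cutoff, predict the rest.
cutoff = pd.Timestamp("2023-07-03")
train, test = df[df["week"] < cutoff], df[df["week"] >= cutoff]

features = ["tv_spend", "search_spend"]
model = LinearRegression().fit(train[features], train["revenue"])
pred = model.predict(test[features])

mape = np.mean(np.abs(pred - test["revenue"]) / test["revenue"])
print(f"Out-of-sample MAPE on the held-out future: {mape:.1%}")
```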

As warned, it’s not that simple; two factors introduce serious complications: causality and information leakage. 

At the beginning, we noted that the goal was to predict the KPI for any given set of marketing inputs. We didn’t call it out at the time, but embedded in that phrase, “any given set,” is the idea that we need a causal model, not simply a predictive one.

To explain the difference, we can use a classic example: the relationship between studying and grades. Let’s say you measure how much each student studies and use that to predict what grade they receive, and the predictions stay accurate every time you compare predicted grades with those actually given. This is a highly performant predictive model. However, we don’t yet know if it’s causal.

Now, let’s say that you induce some of the students to study more by running an experiment. What happens? It is likely that the students induced to study more will not improve their grades by as much as your model would have predicted. Why? Because there is — at least from the perspective of the model — a confounding third variable, socioeconomic status, that leads students both to study more and to get higher grades through other mechanisms unrelated to the amount of studying. The observed relationship between studying and grades is strong, but the actual effect of studying on grades is weaker. 

Purely predictive models need to predict y from some x. In this regime the inputs — x — come from somewhere else. They are handed to you as-is, and from them you predict the likely range of y. Causal models do not treat x as a given, they treat x as something that can be changed. In the language of causality, this is the difference between predicting y given x, and predicting y after you make a change to x. Rather than being told how much a student studied, and then predicting their grade, this is deciding how much a student will study, and then predicting their grade. The latter is a much harder problem.
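A small simulation, with entirely made-up numbers, shows the gap between the two: socioeconomic status drives both study hours and grades, so the slope you observe in the data overstates what forcing more studying would actually deliver.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data-generating process: socioeconomic status (the confounder)
# raises both study hours and grades; studying itself is worth 2.0 points/hour.
ses = rng.normal(0, 1, n)
study_hours = 5 + 2.0 * ses + rng.normal(0, 1, n)
grades = 60 + 2.0 * study_hours + 6.0 * ses + rng.normal(0, 5, n)

# Observational model: regress grades on study hours alone.
observed_slope = np.polyfit(study_hours, grades, 1)[0]

# Interventional check: make everyone study one extra hour and regenerate
# grades from the same process (SES does not change).
grades_after = 60 + 2.0 * (study_hours + 1) + 6.0 * ses + rng.normal(0, 5, n)
causal_effect = np.mean(grades_after - grades)

print(f"Observed slope (prediction):  {observed_slope:.2f} points per extra hour")
print(f"Effect of intervention:       {causal_effect:.2f} points per extra hour")
```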

Causality and MMMs

In the world of media mix modeling, the variables we are interested in include things like marketing plans and promotions. From the perspective of a data science team that is not directly involved with setting budgets, or of an outside consultant, it can seem like the marketing plan is simply given, set outside the model, and that the usual tools for checking accuracy are enough. They are not.

An MMM needs to be able to provide answers to questions like:

  • What if we cut TV spend in half? Or doubled it? 
  • What if we turned off all of our small digital channels simultaneously?
  • What if we ran another buy-one-share-one promotion?

That is, it needs to be able to forecast under any set of possible inputs, not just those that we happen to observe in the existing data. 

How can we validate that we are able to do this? At Recast we backtest every model we deploy and track its accuracy over time. However, this alone is not enough to demonstrate causality. In order to test our models’ ability to identify causal effects and to forecast accurately under a wide range of possible scenarios, we weight the forecasting ability of our models by the “difficulty” of the forecast. “Difficulty” is a measure of how much the marketing mix is changing over time relative to just keeping everything constant: how much intervention was done on the marketing plan, measured by how different it was from what would have been expected given past patterns.
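The exact metric is not spelled out here, but the idea can be sketched: compare holdout-period spend against a naive “keep everything constant” baseline, and use that deviation to weight backtest errors. The functions below are an illustrative sketch under those assumptions, not Recast’s implementation.

```python
import numpy as np
import pandas as pd

def forecast_difficulty(spend: pd.DataFrame, holdout_start: pd.Timestamp) -> float:
    """Rough proxy for 'difficulty': how far holdout-period spend deviates, per
    channel, from a naive 'keep everything constant' baseline (the trailing mean).
    Purely illustrative; the layout is assumed to be one channel per column."""
    history = spend[spend.index < holdout_start]
    holdout = spend[spend.index >= holdout_start]
    expected = history.tail(8).mean()                  # per-channel recent average
    relative_deviation = (holdout - expected).abs() / expected
    return float(relative_deviation.mean().mean())

def difficulty_weighted_error(errors: np.ndarray, difficulties: np.ndarray) -> float:
    """Weight each backtest's error by how hard its forecast was, so accuracy on
    easy, steady-state periods doesn't dominate the evaluation."""
    weights = difficulties / difficulties.sum()
    return float(np.sum(weights * errors))
```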

Information Leakage

The second problem that we have to address is called information leakage, or look-ahead bias. This is the problem of including predictors that you could not have known at the time you made the prediction. For example, a hedge fund backtesting a trading strategy would make this mistake if its algorithm used each day’s closing price to determine that day’s trades, since traders obviously wouldn’t know the closing price at the beginning of the day, when the trades are actually placed.
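A toy version of that trap, with made-up prices rather than a real trading system: a signal built from today’s close leaks information that does not yet exist when the trade is placed, while a signal built from yesterday’s close does not.

```python
import pandas as pd

# Toy daily prices; the numbers are made up.
prices = pd.DataFrame(
    {"open": [100, 102, 101, 105], "close": [102, 101, 105, 107]},
    index=pd.date_range("2024-01-01", periods=4),
)

# Leaky signal: uses today's close to decide today's trade, information that
# does not exist yet when the trade is placed.
prices["signal_leaky"] = prices["close"] > prices["open"]

# Honest signal: only uses yesterday's close, which is known this morning.
# (The first day has no prior close, so the comparison is simply False there.)
prices["signal_honest"] = prices["close"].shift(1) > prices["open"]

print(prices)
```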

Another version of this issue is including predictors that are themselves outputs of the thing you are trying to predict.

In the context of MMMs, we’ve seen people make this mistake in a few different ways:

  • Including marketing spend from commission-based affiliates as an input to the forecast
  • Including web traffic or sessions as an input to the forecast
  • Including branded search spend as an input to the forecast

Spending on commission-based affiliates is marketing spend, and it’s probably included in the marketing plan. It may seem unobjectionable to include it in your model, but doing so in backtests would be a grave error.

The reason is that commissions paid are a function of conversions; for each conversion from an affiliate, a commission is paid out. Including this spend in the model is cheating because it is pretending you can know in advance the very thing you are trying to predict. 
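To see the mechanics, consider a toy example with made-up numbers: if affiliate spend is a fixed commission per conversion, then feeding it to the model hands it a rescaled copy of the very KPI it is supposed to predict.

```python
import numpy as np

rng = np.random.default_rng(7)
weekly_conversions = rng.poisson(200, size=52)   # the KPI we want to predict
commission_per_conversion = 12.50                # hypothetical affiliate payout

# "Affiliate spend" is computed directly from conversions, so as a model
# input it is effectively the target in disguise.
affiliate_spend = commission_per_conversion * weekly_conversions

corr = np.corrcoef(affiliate_spend, weekly_conversions)[0, 1]
print(f"Correlation between affiliate spend and the KPI: {corr:.2f}")  # 1.00
```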

Affiliates are a straightforward example, but other channels, like branded search and (in some cases) retargeting, have the same issue. Branded search spend is not a function of conversions directly, but it is a function of marketing effectiveness: the more effective your marketing, the more people search for your brand, and the more you end up spending. Including branded search spend gives the model information about the very thing it is trying to predict. The same is obviously true of including something like on-site sessions or web traffic in the model.

The Recast Approach

At Recast, we take model validation incredibly seriously and so we have built a number of tools that help us evaluate modeled results and accuracy. In particular we:

  1. Constantly run backtests to check out-of-sample forecast accuracy
  2. Evaluate forecast accuracy against forecast difficulty by confirming that we can predict the future even when marketing spend is changing
  3. Make the problem as hard as possible for ourselves by preventing any information leakage and not including data from branded search, web sessions, or affiliate performance as “inputs” to the prediction problem.

Folks are often surprised when we say that making the most accurate model is not our goal at Recast. That is because accuracy alone is not enough: what we are after is accuracy even when the inputs are changed, combined with great care to make true forecasts with no information leakage.
