Most media mix models get judged on the wrong things. We see this often: a media mix model returns a high R², and that gets blindly taken as a sign the model can be trusted. But in-sample goodness-of-fit metrics are not useful for evaluating MMMs.
They can be interesting to look at – sometimes we check in-sample MAPE or R² as a quick sanity check – but they are not a meaningful signal of model validity. A high in-sample R² often just tells you that the model has learned to replicate the noise in your historical data.
The problem here is that MMMs are highly flexible. They often have more parameters than data points, so in-sample fit should always be quite good, and a high R² achieved by overfitting is meaningless. If you let people keep iterating – add a dummy for every weird spike, carve out special events, tweak the functional forms – you will always be able to drive in-sample error down.
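This failure mode is easy to demonstrate. Below is a minimal sketch (not an MMM, just ordinary least squares) where a model with more parameters than training rows fits pure noise perfectly in-sample and falls apart out-of-sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# 52 weeks of "revenue": a gentle trend plus pure noise.
n = 52
t = np.arange(n, dtype=float)
y = 100 + 0.5 * t + rng.normal(0, 10, n)

# Deliberately over-flexible design matrix: an intercept, the one real
# driver (time), and 45 random "dummy" columns (think: a dummy per weird spike).
X = np.column_stack([np.ones(n), t] + [rng.normal(size=n) for _ in range(45)])

# 47 parameters, 40 training rows: least squares can fit the training
# data exactly (lstsq returns the minimum-norm solution).
train, test = slice(0, 40), slice(40, 52)
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def r2(X_, y_):
    resid = y_ - X_ @ beta
    return 1 - (resid ** 2).sum() / ((y_ - y_.mean()) ** 2).sum()

r2_in, r2_out = r2(X[train], y[train]), r2(X[test], y[test])
print(f"in-sample R2: {r2_in:.3f}  out-of-sample R2: {r2_out:.3f}")
```

The in-sample R² comes out at essentially 1.0 while the out-of-sample R² collapses, because the random columns that absorbed the training noise are useless for the held-out weeks.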
On paper, everything looks great. In practice, the recommendations are not stable enough to trust with your budget. That’s why you see models where:
- Channel ROAS flips sign when you nudge a prior or lag window
- Optimizations jump around from run to run
- “Winning” channels change when you re-fit on a slightly different date range
To actually audit whether a model is trustworthy, you need to check its out-of-sample performance. That’s why we put so much emphasis on backtesting and holdout forecast checks.
What a Proper Backtest Actually Is
A backtest (or holdout forecast accuracy check) is a very specific exercise: you train the model using data up to some time period in the past, and then you ask it to predict a period of data it has never seen.
For MMM, that looks like:
- You pretend it’s three months ago.
- You fit the model using only data available as of that date.
- You then ask the model to predict what happened over the next three months.
- You compare those predictions to the actuals you already know.
If you do this correctly, the exercise tells you how well the model might be able to forecast the next three months into the future, when those months are actually unknown.
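The steps above can be sketched as follows. The `trend_forecast` function here is a stand-in for whatever model you actually use (a real MMM goes in its place); the point is the structure of the split, where the model only ever sees data before the cutoff:

```python
import numpy as np

def backtest(y, cutoff, horizon, fit_and_forecast):
    """Train on y[:cutoff], forecast the next `horizon` points,
    and score the forecast against actuals we already know."""
    train, actuals = y[:cutoff], y[cutoff:cutoff + horizon]
    preds = fit_and_forecast(train, horizon)
    mape = np.mean(np.abs((actuals - preds) / actuals))
    return preds, mape

# Stand-in model: extrapolate a fitted linear trend into the future.
def trend_forecast(train, horizon):
    t = np.arange(len(train))
    slope, intercept = np.polyfit(t, train, 1)
    future_t = np.arange(len(train), len(train) + horizon)
    return intercept + slope * future_t

rng = np.random.default_rng(1)
daily_revenue = 1000 + 2 * np.arange(365) + rng.normal(0, 30, 365)

# Pretend it's ~3 months ago: train up to day 274, forecast the next 90.
preds, mape = backtest(daily_revenue, cutoff=274, horizon=90,
                       fit_and_forecast=trend_forecast)
print(f"90-day holdout MAPE: {mape:.1%}")
```

Nothing after the cutoff touches the fitting step, which is exactly the discipline that makes the resulting error estimate meaningful.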
Two key elements of running backtests rigorously:
- Time comes first.
Media mix modeling is a time series problem. That means you can’t treat days like independent rows and randomly hold some out. Backtests must hold out a contiguous block of future time: you train up to a cutoff date T, then forecast T+1 onward. Dropping random days turns the task into filling in gaps between known points, which is much easier than predicting the future and will overstate your model’s quality.
- It’s not a one-time event.
Holdout forecast accuracy is not a one-and-done check. It should be applied at multiple time points during the initial build and then in an ongoing way as the model is refreshed. At minimum, you want to know: if we had stopped the data 30 days ago, how well would we have predicted the last 30 days? What about 60? 90?
Those different horizons test different things:
- Short horizons (7–30 days) tell you whether the model can handle near-term dynamics and respond to recent changes in spend and mix.
- Longer horizons (60–90 days) stress-test it against seasonality, macro noise, promotions, and slower structural changes in the business.
In a production setting, this should all be automated. Every time you deploy a model, you snapshot it. When new data arrives, you go back to that snapshot and ask: how well did you predict the last 7, 30, 60, or 90 days? You keep those results and track them over time.
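A minimal sketch of scoring one snapshot at several horizons (again using a toy trend model as a stand-in for the real thing; the horizon loop is the point):

```python
import numpy as np

# Stand-in model: extrapolate a fitted linear trend.
def trend_forecast(train, horizon):
    t = np.arange(len(train))
    slope, intercept = np.polyfit(t, train, 1)
    return intercept + slope * np.arange(len(train), len(train) + horizon)

rng = np.random.default_rng(2)
y = 1000 + 2 * np.arange(400) + rng.normal(0, 30, 400)

# For one snapshot, ask at each horizon h: if we had stopped the data
# h days ago, how well would we have predicted the last h days?
results = {}
for h in (7, 30, 60, 90):
    cutoff = len(y) - h
    preds = trend_forecast(y[:cutoff], h)
    actuals = y[cutoff:]
    results[h] = np.mean(np.abs((actuals - preds) / actuals))

for h, mape in results.items():
    print(f"{h:>2}-day horizon MAPE: {mape:.1%}")
```

In production you would persist `results` per model snapshot and chart the horizons over time, so a degradation at 60–90 days shows up before it costs you a budget cycle.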
Stop Letting Your Model Cheat: The Information Leakage Checklist
The fastest way to misdesign a backtest and break its rigor is information leakage. If your forecast is allowed to “peek” at the answer – even indirectly – this all stops working.
In MMM, leakage usually doesn’t look like giving the model future revenue directly. It happens through variables that are tightly coupled to revenue:
- Branded search.
You only spend on branded search when people are searching for your brand. They search for your brand when they’ve learned about you and are interested in buying – exactly the thing your marketing is supposed to cause. If you feed actual branded search spend into a forecast, you’re giving the model a noisy but very strong hint about sales in that period. It will look much more accurate than it should.
- Website / sessions / foot traffic.
If the conversion rate is roughly stable, telling the model your sessions is almost equivalent to telling it your conversions. Again, the model will cheat and look very accurate.
- Affiliate and coupon spend.
Commission-based affiliate programs pay after a sale, so spend is literally a function of orders. If you give the model affiliate spend in the holdout period, you are telling it how many sales you got from that channel.
For every input in your backtest, ask, “would we really know this at the moment we’re making the forecast?” If not, it doesn’t belong in a holdout.
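One way to make that question operational is to tag every input explicitly. The registry below is hypothetical (the feature names and the `known_at_forecast_time` flag are illustrative, not a real schema), but the pattern forces the leakage question to be answered once per input instead of rediscovered per backtest:

```python
# Hypothetical feature registry: for each input, record whether its value
# would genuinely be known at the moment the forecast is made.
FEATURES = {
    "tv_spend_planned":     {"known_at_forecast_time": True},
    "paid_social_planned":  {"known_at_forecast_time": True},
    "branded_search_spend": {"known_at_forecast_time": False},  # driven by demand
    "site_sessions":        {"known_at_forecast_time": False},  # ~= conversions
    "affiliate_spend":      {"known_at_forecast_time": False},  # function of orders
}

def holdout_safe_features(registry):
    """Keep only inputs that pass the 'would we really know this?' test."""
    return [name for name, meta in registry.items()
            if meta["known_at_forecast_time"]]

safe = holdout_safe_features(FEATURES)
print(safe)  # only the planned-spend inputs survive
```

Anything flagged `False` either gets dropped from the holdout or replaced with a value that was genuinely available at the cutoff (e.g. a planned rather than actual spend figure).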
How to Read a Backtest: Metrics, Uncertainty, and Benchmarks
Once you’ve designed a proper backtest and removed leakage, the next mistake we tend to see is in how teams interpret the results. “What’s your MAPE?” is the wrong first question.
A media mix model is not just a point forecast engine; it should tell you what’s likely and how uncertain that answer is. That’s why, at Recast, the primary metric we use is a proper scoring rule.
A proper scoring rule rewards two things at once:
- Getting the answer right – actual results sitting near the middle of what the model said was possible.
- Committing to a narrow range – not hiding behind huge uncertainty bands that technically always contain reality.
Because Recast is Bayesian, the model produces many samples of the future, not a single line. A good metric needs to evaluate the entire distribution: if you predict a tight range and reality lands in the middle, you get a high score. If you predict a tight range and reality lands outside, you’re punished harder than if you had admitted you were unsure.
Underneath this is calibration. If you say there’s an 80% interval for the next 30 days of revenue, reality should land inside that band about 80% of the time across many backtests. If your 50% intervals contain the truth 95% of the time, you’re underconfident; if they contain it 10% of the time, you’re wildly overconfident. Both are problems.
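One widely used proper scoring rule for sample-based forecasts is the continuous ranked probability score (CRPS). The sketch below is a generic sample-based CRPS estimator plus a toy calibration check, not Recast’s implementation; the three synthetic forecasts illustrate the reward structure described above:

```python
import numpy as np

def crps_from_samples(samples, actual):
    """Sample-based CRPS estimate: rewards forecasts that are both close
    to the actual (first term) and sharp (second term). Lower is better."""
    samples = np.asarray(samples, dtype=float)
    accuracy = np.mean(np.abs(samples - actual))
    spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return accuracy - 0.5 * spread

rng = np.random.default_rng(3)
actual = 100.0

sharp_and_right = rng.normal(100, 5, 1000)  # tight band, truth in the middle
sharp_and_wrong = rng.normal(130, 5, 1000)  # tight band, truth outside
vague = rng.normal(100, 50, 1000)           # huge band that "contains" truth

crps_right = crps_from_samples(sharp_and_right, actual)
crps_wrong = crps_from_samples(sharp_and_wrong, actual)
crps_vague = crps_from_samples(vague, actual)
print(f"sharp & right: {crps_right:.1f}  vague: {crps_vague:.1f}  "
      f"sharp & wrong: {crps_wrong:.1f}")

# Calibration: across many backtests, 80% intervals from a calibrated
# forecast should contain the actual about 80% of the time.
n_backtests = 500
covered = 0
for _ in range(n_backtests):
    truth = rng.normal(100, 5)          # the realized actual
    samples = rng.normal(100, 5, 1000)  # a well-calibrated forecast
    lo, hi = np.quantile(samples, [0.1, 0.9])
    covered += lo <= truth <= hi
print(f"80% interval coverage: {covered / n_backtests:.0%}")
```

The ordering that comes out (sharp-and-right beats vague, and confidently wrong scores worst of all) is exactly the incentive you want: the metric punishes a tight-but-wrong forecast harder than an honest admission of uncertainty.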
TL;DR
- Most MMM “validation” is theater. High in-sample R² and pretty fitted lines tell you nothing about whether the model will hold up when you actually change spend. Random-row cross-validation is just interpolation in a time series problem.
- A proper backtest is: train the model only on data up to a point in the past (30/60/90 days ago), forecast a contiguous future block (7/30/60/90 days), and compare those forecasts to actuals. Do this continuously by snapshotting every deployed model and re-scoring it as new data arrives.
- Information leakage will completely fake your results. Branded search, sessions, affiliate/coupon spend, and multi-stage setups where one stage sees the full series all let future information sneak into the “holdout.”
- Reading a backtest means looking at distributions, not just MAPE: use proper scoring rules and calibration checks to see if your intervals are both accurate and honest.