Why Forecast Accuracy Is the Real Test of MMM Quality

Most MMMs look great… as long as you don’t change anything. 

But changing things is the whole point of doing media mix modeling. You need to be able to double your TV spend or cut your branded search and have your MMM accurately forecast what will happen. You don’t need prettier curves or a higher R² – you need falsifiable forecasts that survive intervention.

In this piece, we strip away the theater around MMM and show you the North Star metric you should actually care about.

What isn’t a real test of MMM quality (and why teams still get fooled)

If you’re interviewing MMM vendors or thinking about doing it internally, you might hear different answers to the question – “do we have a model that actually works and that we can trust?” 

Here are three tests that sound like proof but don’t actually tell you if the model works:

1. Beware “out-of-sample” tests that aren’t run under an intervention. Out-of-sample tests are great and we recommend them, but if spend patterns stayed the same, even a bad model can look fine. Holding out the last few weeks and showing a nice line match looks impressive – but you have to test whether the model’s structure can survive a budget shift.

2. Point-error metrics such as MAE, MAPE, and SMAPE. These were built for settings with many independent predictions (e.g., classifying customers), but a quarterly revenue forecast is a single prediction that must communicate both accuracy and confidence – and point errors don’t tell you whether the uncertainty was calibrated. A model that says “$20M ± $12M” can’t be used for planning; a model that says “$20M ± $0.5M” and misses by $3M is dangerous. The sketch after this list makes this concrete.

3. Non-falsifiable vendor claims. You’ve heard versions of: “The true incrementality of Facebook is 5.7x.” Fantastic… but there’s no way to prove it wrong against ground truth. You can argue experiment design, time lags, and hidden variables forever – the claim is worthless until it’s falsifiable.
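To make the point-error problem concrete, here’s a minimal sketch using the numbers from the example above (the two forecasts themselves are hypothetical, invented purely for illustration): a point metric like MAPE scores both forecasts identically, because it never looks at the stated uncertainty.

```python
# Two hypothetical quarterly revenue forecasts with the same central estimate
# but very different stated uncertainty (numbers from the example above, in $M).
actual = 23.0  # what the quarter actually delivered

forecasts = {
    "vague: $20M +/- $12M":          {"point": 20.0, "low": 8.0,  "high": 32.0},
    "overconfident: $20M +/- $0.5M": {"point": 20.0, "low": 19.5, "high": 20.5},
}

for name, f in forecasts.items():
    # MAPE only sees the point estimate, so both forecasts get the same score...
    mape = abs(actual - f["point"]) / actual * 100
    # ...even though one interval contains the outcome and the other misses it entirely.
    covered = f["low"] <= actual <= f["high"]
    print(f"{name}: MAPE = {mape:.1f}%, outcome inside stated interval: {covered}")
```

Both forecasts print MAPE = 13.0%; only the coverage check tells you which one you could actually have planned around.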

The only test that matters: forecast accuracy under intervention

When you hear our co-founders Michael or Tom talk about what really matters for assessing model performance, you’ll hear the same thing over and over: an MMM “needs to be able to forecast well when you actually make an intervention.”

Models that can predict well under intervention are the models that capture the true causal relationships under the hood. If your model’s structure is right, its forecasts survive plan changes; if it isn’t, they don’t.

Why intervention forecasting reveals causality (not just correlation)

To predict well under intervention – when you actually change something – the underlying causal model has to be right.

Think about a channel like TV that tends to move with everything else. Historically, whenever you’ve spent more on TV, you’ve also run big promos. A naive model will happily conclude: “TV is incredibly incremental.”

Now imagine you actually make a change: you double TV next quarter, but you don’t change your promo strategy. If TV wasn’t the true driver, the model’s forecast will miss badly.

In other words: the model looked great while all the historical patterns held, but it was only leaning on correlations. The moment you intervene and break those patterns, the forecasts fall apart. Only models that actually capture the true causal relationships can predict well when you do something different.
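Here’s a tiny synthetic simulation of that story (all numbers invented purely for illustration): promos drive most of the revenue, TV happens to move with promos, and a naive regression on TV alone looks great historically but misses badly the moment TV is doubled while the promo plan stays fixed.

```python
import numpy as np

rng = np.random.default_rng(42)
n_weeks = 104

# Historical world: promos drive most of the revenue, and TV spend happens
# to move together with promos (synthetic numbers, purely illustrative).
promo = rng.uniform(0, 1, n_weeks)
tv_spend = 50 + 100 * promo + rng.normal(0, 5, n_weeks)   # TV tracks promos
revenue = 200 + 400 * promo + 0.2 * tv_spend + rng.normal(0, 10, n_weeks)

# A naive model regresses revenue on TV alone and "discovers" a huge TV effect,
# because TV soaks up the credit that actually belongs to promos.
slope, intercept = np.polyfit(tv_spend, revenue, 1)
print(f"naive TV coefficient: {slope:.2f} (true causal effect here is 0.20)")

# Intervention: double TV next period, keep the promo plan unchanged.
tv_doubled = 2 * tv_spend.mean()
promo_unchanged = promo.mean()
naive_forecast = intercept + slope * tv_doubled
true_outcome = 200 + 400 * promo_unchanged + 0.2 * tv_doubled
print(f"naive forecast under intervention: {naive_forecast:.0f}")
print(f"what actually happens:             {true_outcome:.0f}")
```

The naive forecast roughly doubles along with TV; the true outcome barely moves, because the promos – the real driver – didn’t change.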

And this is why we anchor on falsifiability. Forecasting is falsifiable – you put dollars in, and get revenue out. There’s a clear right and wrong: “the model said we’d hit $4.2M in new-customer revenue next quarter? Let’s see what actually happens.” If the result lands outside the model’s uncertainty, the hypothesis fails, and the model must update.

To score those claims properly, we use CRPS (Continuous Ranked Probability Score). It rewards calibrated uncertainty cones and penalizes both overconfidence and uselessly vague intervals. Basically, it evaluates two things at once (see the sketch after this list):

  1. Did the forecast get close to reality, and
  2. Was it appropriately confident (not too wide, not too narrow)?
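Here is a minimal sketch of that behaviour using the standard sample-based CRPS estimator (the three forecast distributions are synthetic, invented purely to show the effect; lower CRPS is better):

```python
import numpy as np

def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.
    Lower is better: it rewards being close AND being honestly calibrated."""
    samples = np.asarray(samples, dtype=float)
    accuracy_term = np.mean(np.abs(samples - observed))
    spread_term = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return accuracy_term - spread_term

rng = np.random.default_rng(0)
observed = 4.2  # realized new-customer revenue for the quarter, $M

# Three synthetic forecast distributions for the same quarter (mean +/- std, $M).
candidates = {
    "calibrated (4.0 +/- 0.3)":     rng.normal(4.0, 0.3, 2000),
    "overconfident (3.5 +/- 0.05)": rng.normal(3.5, 0.05, 2000),
    "uselessly wide (4.0 +/- 3.0)": rng.normal(4.0, 3.0, 2000),
}

for name, draws in candidates.items():
    print(f"{name}: CRPS = {crps_from_samples(draws, observed):.3f}")
```

The calibrated forecast scores best; both the confident miss and the vague interval get penalized, which is exactly the behaviour you want from a planning metric.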

Operationally, this becomes your North Star: we take model versions from 30/60/90 days ago – before they saw the recent data – ask them to forecast, then compare those forecasts to the outcomes that actually materialized. We do this across hundreds of model refreshes and publish the rolling median accuracy (around 95% using CRPS) so everyone can see whether the system actually works out-of-sample, under interventions.
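A sketch of that validation loop is below. Everything in it is synthetic and the data layout is an assumption made for illustration (the post doesn’t specify the pipeline, nor how CRPS gets normalized into the ~95% accuracy figure), but the shape of the loop is the same: score each old snapshot’s forecast against what actually happened, then track the rolling median across refreshes.

```python
import numpy as np
import pandas as pd

def crps_from_samples(samples, observed):
    # Same sample-based CRPS estimator as in the previous sketch.
    samples = np.asarray(samples, dtype=float)
    return (np.mean(np.abs(samples - observed))
            - 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :])))

# Synthetic stand-ins: for each model refresh, the forecast draws a
# 30/60/90-day-old snapshot produced for the holdout window, plus the revenue
# actually realized over that window (all values in $M, all made up).
rng = np.random.default_rng(1)
refresh_dates = pd.date_range("2024-01-01", periods=12, freq="MS")

rows = []
for date in refresh_dates:
    realized = rng.normal(4.0, 0.5)                        # what actually happened
    snapshot_mean = realized + rng.normal(0.0, 0.3)        # the old snapshot's view
    forecast_draws = rng.normal(snapshot_mean, 0.4, 1000)  # its posterior draws
    rows.append({"refresh": date, "crps": crps_from_samples(forecast_draws, realized)})

scores = pd.DataFrame(rows).set_index("refresh")
# The rolling median across refreshes is the headline number to watch over time.
scores["rolling_median_crps"] = scores["crps"].rolling(window=6, min_periods=3).median()
print(scores.round(3))
```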

If you’re evaluating MMM vendors, this is what you should ask: not just backtests on historical data where nothing changed, but evidence that the model can forecast accurately when budgets shift. Ask to see their validation methodology. Ask for their accuracy scores. And be skeptical if they resist.

TLDR:

  • R², pretty curves, and point-error metrics aren’t real tests for MMM quality. Validation needs to include a budget change to see if the model has actually captured causation.
  • The only meaningful standard is forecast accuracy under intervention: dollars in scored against outcomes out. 
  • Use CRPS to judge both accuracy and calibration; intervals that are too narrow or too wide are both dangerous.