The Importance of Out-of-Sample Goodness of Fit Metrics in Marketing Mix Modeling (MMM)

The goal of MMM is not just prediction—it’s understanding causality. When marketers ask how much each channel contributed to sales, they’re asking about the true causal impact of their marketing efforts. 

But how do you know if your model has actually picked up true causation?

This article will cover 3 quick but very important ideas:

  1. Why in-sample validation methods let overfitting go undetected and why they won’t prove your model’s robustness.
  2. Why traditional metrics like R-squared often lead to overconfidence and misleading results – it’s a mistake to rely on them. 
  3. How and when to use out-of-sample validation methods such as holdout forecasting.

Let’s jump in. 

1. The Dangers of Overfitting in MMM

Overfitting happens when a model becomes so complex that it tailors itself too closely to the training data. It is especially likely when the model has more parameters than data points, which is not uncommon in MMMs.

Media mix models tend to have many variables—spend across multiple channels, brand metrics, external factors like seasonality, and even more granular details like creative variation. As the model becomes more complex, it captures not only the real signal (true marketing impact) but also the noise (random fluctuations in the data).

The result? 

A model that looks fantastic when measured against in-sample metrics but fails to generalize beyond the data it was trained on. 

Without out-of-sample validation, you run the risk of making decisions based on a model that doesn’t reflect reality. In practice, a model that overfits might tell you that certain channels are far more effective than they actually are, and that could lead you to allocate your budget to channels you think are incremental but really aren’t.  
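To make this concrete, here’s a minimal sketch in Python using synthetic data: sales are generated as pure noise around a baseline, yet a regression with more features than weeks of data still “explains” them almost perfectly in-sample. The setup and the use of scikit-learn here are purely illustrative, not a recipe for building an MMM.

```python
# Toy illustration: with more parameters than data points, a model can
# "explain" pure noise almost perfectly in-sample.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

n_weeks = 30       # a short history, as is common in MMM
n_features = 40    # channel spends, controls, seasonality dummies, ...

# By construction, none of these "marketing" features has any real effect:
# sales are just a baseline plus random noise.
X = rng.normal(size=(n_weeks, n_features))
sales = 1000 + rng.normal(scale=50, size=n_weeks)

model = LinearRegression().fit(X, sales)
print(f"In-sample R-squared: {model.score(X, sales):.3f}")  # ~1.0 despite zero true signal
```

A near-perfect fit on data with no signal in it is exactly the trap described above: the model has memorized noise, and nothing in the in-sample numbers warns you.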

An example of an in-sample method that is often used for validation but can easily backfire is in-sample R-squared. Let’s look into it:

2. Why In-Sample R-Squared Isn’t Enough

At first glance, R-squared can seem like a useful metric—it measures how much of the variance in your dependent variable is explained by your model.

However, when evaluating MMM, R-squared can be misleading because it focuses on how well the model fits the training data (the historical data used to build the model). 

Because of their complexity, models can overfit and “memorize” the historical data – which, sure, produces high R-squared values, but leads to poor performance when the model faces new, unseen data.

To make this clear: a high R-squared doesn’t necessarily mean the model is useful for predicting future outcomes or providing actionable insights for budget allocation. You can have a terrible model that has a really high R-squared, and you can have a great model that has a really low R-squared. 

In-sample R-squared might give an illusion of accuracy, but it doesn’t tell you if the model has captured the true causal relationships between marketing channels and sales. Instead, it simply shows how well the model can replicate the patterns in the data it has already seen. 
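For reference, R-squared is simply one minus the ratio of the residual sum of squares to the total sum of squares, and the number it returns is only as meaningful as the data you plug in. A quick sketch (the function name here is my own):

```python
import numpy as np

def r_squared(actual, predicted):
    """R-squared = 1 - SS_residual / SS_total."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Computed on the training data, this rewards memorization;
# computed on a holdout, it rewards generalization.
```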

For a robust marketing mix model, it’s critical to test the model’s ability to predict future outcomes—this is where out-of-sample testing comes in.

3. Why Out-of-Sample Testing is Essential

Out-of-sample testing offers a way to combat overfitting. This involves splitting your data into two sets: a training set and a holdout set. The model is trained on the former and then asked to predict the outcomes in the latter. 

To put this into a practical scenario: you could give the model data up to April 1st, withholding April and May, and then ask it to predict the sales you’ll get in April and May based on data it hasn’t yet seen.

Since the model hasn’t seen the holdout data during training, its performance on this dataset provides a more realistic measure of its predictive accuracy. This approach mirrors how the model will actually be used in practice – when we act on our MMM to make decisions about next quarter’s budget, we’re implicitly asking the model to predict performance on data it hasn’t seen before.
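Here’s what that April/May scenario might look like in code. The file name, column names, dates, and the plain linear regression are stand-ins; a real MMM would bring its own model and transformations (adstock, saturation, and so on), but the split-then-forecast pattern is the same.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical weekly dataset with a date column, channel spend columns,
# and a sales column. All names here are illustrative.
df = pd.read_csv("weekly_mmm_data.csv", parse_dates=["week"])
features = ["tv_spend", "search_spend", "social_spend", "seasonality_index"]

# Time-based split: train on everything before April 1st,
# hold out April and May as unseen data.
train = df[df["week"] < "2024-04-01"]
holdout = df[(df["week"] >= "2024-04-01") & (df["week"] < "2024-06-01")]

model = LinearRegression().fit(train[features], train["sales"])
pred = model.predict(holdout[features])

# Out-of-sample error over the held-out weeks.
mape = np.mean(np.abs(holdout["sales"] - pred) / holdout["sales"]) * 100
print(f"Holdout MAPE over April-May: {mape:.1f}%")
```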

This is important – you always want to be validating the model against data it hasn’t seen yet. It helps you avoid overfitting, build trust in your model, and identify where the model is weakest (for example, during holiday periods, or when a specific channel is heavily used).
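One way to do that last part is to slice the holdout error by period or regime. A rough sketch that builds on the holdout and predictions from the previous snippet; the is_holiday_week flag is an assumed column:

```python
import pandas as pd

def error_by_segment(actual: pd.Series, predicted, segment: pd.Series) -> pd.Series:
    """Mean absolute percentage error within each segment,
    e.g. holiday vs. non-holiday weeks."""
    ape = (actual - predicted).abs() / actual * 100
    return ape.groupby(segment).mean()

# Continuing the previous sketch (column names are assumptions):
# error_by_segment(holdout["sales"], pred, holdout["is_holiday_week"])
```

If the error is concentrated in a particular regime, that tells you where the model needs more work before you lean on it for budget decisions there.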

Before you go, 2 final notes:

  • Out-of-sample forecasting alone isn’t enough to prove causality. 

While a model that performs well in out-of-sample testing can provide confidence in its predictive power, other validation methods, such as lift tests, are also needed to keep building trust.

  • If you’re working with an external vendor to build your MMM, make sure they are willing to implement out-of-sample validation. 

Many vendors are hesitant because it can reveal weaknesses in the model, but it’s essential for ensuring the model can be trusted. Holdout accuracy checks like this should be part of any vendor’s process; if a vendor is reluctant to run them, it’s a sign they may not have confidence in their own model.

About The Author