At Recast we spend a lot of time thinking about how to evaluate whether or not a model is “good”. People new to Recast are often surprised when we tell them that while we care a lot about checking to see if a model is good, we generally don’t use the most common measures of goodness of fit that many people are familiar with.
We made this decision because we think the most common measures of goodness of fit are generally too limited for use in the context of complex models and can actually be incredibly misleading.
So what are these measures of goodness of fit that are problematic?
- Any in-sample goodness-of-fit measure that doesn't include uncertainty, such as:
    - R-squared
    - MAPE (mean absolute percentage error)
    - MAE (mean absolute error)
- Any measure of "statistical significance"
    - Measures of statistical significance assume that the model is correct, so they aren't useful for evaluating whether the model itself is good
The most important thing to know about these measures of goodness of fit is that they cannot differentiate between a good model and a bad model when it comes to understanding causal relationships. In fact, a "true" causal model will often have *worse* goodness-of-fit performance than a misspecified one, so none of these measures, on their own, can tell us much about whether our model is good.
What’s the problem with measures of in-sample fit?
In very complex models (like media mix models), we can always fit the data very well. Because we have lots of variables and lots of "free" parameters, we can always achieve a very high R-squared or a very low MAPE / MAE on data that the model can see.
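To make this concrete, here is a minimal sketch (the dimensions are invented for illustration: roughly a year of weekly data and dozens of "channels") showing that a model with many free parameters achieves a high in-sample R-squared even when the predictors have no relationship to the outcome at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 60 observations, 50 purely random predictors.
# None of these predictors has any real relationship with the outcome.
n, p = 60, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # the outcome is pure noise

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared = 1 - resid.var() / y.var()

print(f"In-sample R-squared on pure noise: {r_squared:.2f}")
```

With 50 free parameters and only 60 observations, the expected in-sample R-squared is roughly p/n ≈ 0.83 even though there is nothing real to find, which is exactly why a high in-sample R-squared tells you so little on its own.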
The real problem we need to be concerned about is over-fitting: because the model is so complex, over-fitting is a much bigger risk than under-fitting. Instead of looking at in-sample measures of fit, we need to look at how well the model fits data it hasn't seen before: we need to look at out-of-sample fit. Out-of-sample fit, when structured correctly, tells us how well the model generalizes to new data, and that is a much better measure of "model goodness" than in-sample fit, since in-sample fit is easy to improve in ways that actually make the model worse.
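A minimal sketch of the idea, assuming a simple time-ordered holdout (the data, dimensions, and coefficients are all invented): fit on the earlier portion of the series, then compare the in-sample error to the error on the held-out tail:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: only the first 3 of 60 predictors actually drive
# the outcome; the other 57 are noise the model is free to overfit.
n, p = 150, 60
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=2.0, size=n)

# Time-based holdout: fit on the first 100 periods, evaluate on the rest.
cut = 100
beta, *_ = np.linalg.lstsq(X[:cut], y[:cut], rcond=None)

mae_in = np.abs(y[:cut] - X[:cut] @ beta).mean()
mae_out = np.abs(y[cut:] - X[cut:] @ beta).mean()

print(f"In-sample MAE:     {mae_in:.2f}")
print(f"Out-of-sample MAE: {mae_out:.2f}")
```

The out-of-sample MAE comes out substantially worse than the in-sample MAE here, and the gap between the two is exactly the over-fitting that in-sample metrics hide.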
What’s the problem with MAPE, MAE, and R-squared?
When it comes to making good statistical models, we care a lot about uncertainty. We want to evaluate how well our model forecasts while taking the uncertainty of the forecast into account. We prefer forecasts where the actuals fall within the uncertainty range, but we also prefer forecasts where that range is narrower. In the ideal case, the forecast is both very certain and accurate.
The problem with the most common measures of goodness of fit is that they don't take uncertainty into account. They only look at the midpoint of the forecast, so they can't give credit to a forecast whose midpoint misses but whose uncertainty range still covers the actual values.
So instead of using those metrics, we prefer to use the Continuous Ranked Probability Score (CRPS) to measure goodness of fit when accounting for uncertainty. This score reduces to the MAE when there is no uncertainty in the forecast.
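One common way to estimate CRPS from forecast draws (e.g. posterior samples) is the sample-based form CRPS ≈ E|X − y| − ½·E|X − X′|, where X and X′ are independent draws from the forecast distribution and y is the actual. A sketch with invented numbers, which also shows the reduction to absolute error for a point forecast:

```python
import numpy as np

def crps_from_samples(samples, actual):
    """Sample-based CRPS estimate for one observation:
    CRPS ~ E|X - y| - 0.5 * E|X - X'| over forecast draws X."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - actual).mean()
    term2 = 0.5 * np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - term2

actual = 10.0
rng = np.random.default_rng(2)

# A probabilistic forecast: 1,000 draws centred slightly off the actual
draws = rng.normal(loc=10.5, scale=1.0, size=1000)
print(f"CRPS of probabilistic forecast: {crps_from_samples(draws, actual):.2f}")

# A degenerate "point forecast": every draw is identical, so the spread
# term vanishes and CRPS equals the plain absolute error of 0.5.
point = np.full(1000, 10.5)
print(f"CRPS of point forecast:         {crps_from_samples(point, actual):.2f}")
```

Note that the probabilistic forecast scores better (lower CRPS) than the point forecast with the same midpoint, because its uncertainty range covers the actual — exactly the behavior MAPE, MAE, and R-squared cannot reward.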
What about statistical significance?
Measures of statistical significance assume that the model is correct: they simply tell you how likely or unlikely you would be to observe the data you observed if the model were true. That's not very useful when we don't know whether the model is correct and we want metrics that help us differentiate between candidate models.
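A small simulation makes the point (the setup and numbers are invented): an unobserved confounder drives both a predictor and the outcome, the predictor has zero causal effect, and yet a misspecified regression reports a wildly "significant" coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical confounding: unobserved brand demand drives both ad
# spend and sales. Spend has NO causal effect on sales by construction.
n = 500
demand = rng.normal(size=n)
spend = demand + rng.normal(scale=0.5, size=n)
sales = demand + rng.normal(scale=0.5, size=n)

# OLS slope of sales on spend (with intercept), and its t-statistic
x = spend - spend.mean()
slope = (x * sales).sum() / (x * x).sum()
resid = sales - sales.mean() - slope * x
se = np.sqrt((resid @ resid) / (n - 2) / (x * x).sum())
t_stat = slope / se

print(f"slope = {slope:.2f}, t-statistic = {t_stat:.1f}")
```

The t-statistic is enormous (far past any conventional significance threshold), even though the true causal effect of spend on sales in this simulation is exactly zero: significance evaluated under the wrong model says nothing about whether the model is right.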