It is a paradox of statistics that experiments with small sample sizes tend to find larger effects of an intervention than well-powered ones do, when they find anything at all.
This happens because smaller samples produce noisier estimates. For a result to clear the significance bar in a noisy environment, the observed effect has to be large. So the significance test ends up acting as a filter that systematically lets through the most extreme estimates: precisely the ones most likely inflated by random variation.
This is what we call the winner’s curse: the point estimate that clears the bar in a small sample tends to grossly overestimate the real impact.
In other words, if you only look at “statistically significant” results from small samples, you’re likely to be extra wrong. Here’s a quick visual explainer:

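The filtering effect is easy to reproduce in a few lines of simulation. The sketch below is illustrative only: it assumes a true effect of 2 units, a standard error of 5 (i.e. a badly underpowered design), and a two-sided test at the 90% level; none of these numbers come from a real experiment.

```python
import numpy as np

rng = np.random.default_rng(42)

true_effect = 2.0        # the real effect of the intervention (illustrative)
se = 5.0                 # standard error of each estimate: a badly underpowered design
n_experiments = 100_000  # many hypothetical replications of the same small experiment

# Each replication produces one noisy estimate of the true effect.
estimates = rng.normal(true_effect, se, n_experiments)

# Two-sided significance test at the 90% level: |estimate / se| > 1.645.
significant = np.abs(estimates / se) > 1.645

print(f"Experiments reaching significance: {significant.mean():.1%}")
print(f"True effect:                       {true_effect:.2f}")
print(f"Mean estimate (all experiments):   {estimates.mean():.2f}")
print(f"Mean estimate (significant only):  {estimates[significant].mean():.2f}")
```

Only the replications that land in the extreme tails clear the bar, so the average of the “significant” estimates comes out well above the true effect of 2, while the average across all replications sits right at it.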
Now, let’s examine a real situation:
Your GeoLift experiment estimated an incremental ROI of 11x, statistically significant at the 90% level. But the in-platform conversion lift study, done at the user level and with far more data, found only ~2.3x. And even the in-platform ROAS you were already skeptical of was lower than 11x.
What’s going on here? Is this actually your best-performing channel?
Likely not. The 11x is exactly the kind of inflated estimate the winner’s curse predicts. The experiment was underpowered and noisy, so the only way to get a “significant” result in that setting was to observe something extreme.
This is fine as long as you interpret “statistically significant” correctly. All it means is that a result this large would be unlikely if the true effect were zero. That’s it. You’ve ruled out zero, but nothing more.
So the proper way to read these results is not to fixate on the point estimate (the 11x) but to look at the full uncertainty interval. In this case, it might range from 0.5x to 22x. That means: while the experiment lets you reject zero, it does not let you reject the 2.3x from the user-level study or even the in-platform ROAS. This experiment, it turns out, doesn’t tell you much at all.
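Here is a minimal sketch of that reading, assuming a normal approximation and a standard error (6.4) chosen only so the interval roughly matches the 0.5x–22x quoted above:

```python
estimate = 11.0  # GeoLift point estimate (incremental ROI)
se = 6.4         # assumed standard error, chosen to roughly reproduce the 0.5x-22x interval
z90 = 1.645      # two-sided 90% critical value under a normal approximation

lo, hi = estimate - z90 * se, estimate + z90 * se
print(f"90% CI: [{lo:.1f}x, {hi:.1f}x]")

for label, benchmark in [("zero effect", 0.0), ("user-level lift study", 2.3)]:
    verdict = "rejected" if not (lo <= benchmark <= hi) else "not rejected"
    print(f"{label} ({benchmark:.1f}x): {verdict}")
```

Zero falls outside the interval, but 2.3x sits comfortably inside it, which is the whole point: the experiment can reject a zero effect, not distinguish 11x from 2.3x.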
But it doesn’t tell you nothing either; you just have to read it carefully. Since this filtering process systematically inflates effect sizes, treat the bound of the confidence interval closest to zero as your conservative estimate. The true impact is more likely to lie near that bound than near your headline number. And if you can combine this with results from better-powered experiments to narrow the uncertainty, even better.
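One standard way to do that combination is inverse-variance (precision) weighting, sketched below under two loud assumptions: both experiments estimate the same incremental ROI, and the standard errors (the 6.4 from above for GeoLift, a much tighter 0.8 for the user-level study) are illustrative rather than reported numbers.

```python
# Precision-weighted (inverse-variance) pooling of two estimates of the same quantity.
# Standard errors here are illustrative assumptions, not reported figures.
studies = {
    "geolift":         (11.0, 6.4),  # (estimate, standard error)
    "user_level_lift": (2.3, 0.8),
}

weights = {name: 1.0 / se ** 2 for name, (_, se) in studies.items()}
total_weight = sum(weights.values())

pooled = sum(weights[name] * est for name, (est, _) in studies.items()) / total_weight
pooled_se = (1.0 / total_weight) ** 0.5

print(f"Pooled estimate: {pooled:.2f}x")
print(f"Pooled 90% CI:   [{pooled - 1.645 * pooled_se:.2f}x, {pooled + 1.645 * pooled_se:.2f}x]")
```

Because the weights scale with 1/SE², the better-powered study dominates: the pooled estimate lands near 2.3x with a far narrower interval than the GeoLift experiment gives on its own.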



