Is statistical significance the golden rule of marketing experiments? Is 5% the line where an experiment becomes (in)conclusive?
It’s an easy standard to reach for, which is why so many teams over-rely on it. But marketing experiments are noisy, expensive, and tied to real budget decisions that just can’t wait for perfect evidence. So what should you look for instead?
This article will dive into why statistical significance is often overused in marketing, what p-values actually tell you, and what marketers should be considering when they evaluate experimental results.
“Not statistically significant” is often the wrong question
A lot of marketers were trained on product or marketing A/B tests: does variant A beat variant B, and is the difference statistically significant? That mindset works reasonably well when the decision is discrete and the unit of analysis is clean. But it maps much less well to incrementality testing, where the question is not whether there is any lift at all. Most marketing generates at least some lift. The real question is whether this activity is the best use of your budget.
You can absolutely have a statistically significant result that says “this lift is different from zero” and still have something that is almost useless for planning. Let’s say your test shows a 5% lift that translates to a midpoint estimate of 4.5x ROI. And yes, that sounds good.
But if the plausible range of outcomes goes from roughly break-even to very strong, you have not actually answered the question your CFO cares about: should we scale, hold, or cut spend?
The inverse is also true. “Not statistically significant” is not the same thing as “we learned nothing.” A noisy experiment can still narrow the plausible range enough to affect a decision. If the supported ROI range is mostly below your profitability threshold, that is meaningful, even if the result did not cross some binary threshold.
And sometimes wide confidence bounds do not really change the right next step anyway. What actually matters is whether, and how, the result changes the go-forward business decision. In marketing, the point of an experiment is not to produce a clean label. It is to improve the next budget decision under uncertainty.
What p-values actually tell you – and why marketers keep over-reading them
Even if you agree that significance is not the right decision framework, it is still worth being precise about what a p-value actually says because most teams give it much more meaning than it deserves.
A p-value does not tell you how likely your result is to be true. It does not tell you the probability that the channel worked. And it does not tell you whether the result is useful for planning. What it does tell you is this: if there were truly no effect at all, how surprising would data this extreme be?
That is a very different question from the one marketers usually think they are answering. Most executives hear a low p-value and infer something like, “there’s a high probability this result is real.” But that is not what the statistic means. A p-value is a statement about the data under a null hypothesis. It is not a probability statement about the business conclusion you want to draw.
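To make that concrete, here is a minimal simulation in Python, with made-up conversion counts, of the only question a p-value answers: in a world with no effect at all, how often would chance alone produce a gap at least this large?

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical readout: 50k users per cell, 2.00% vs 2.16% conversion.
n_control, n_treatment = 50_000, 50_000
conv_control, conv_treatment = 1_000, 1_080
observed_diff = conv_treatment / n_treatment - conv_control / n_control

# Null world: no effect at all. Simulate both cells from the pooled
# rate and count how often chance alone produces a gap this large.
pooled_rate = (conv_control + conv_treatment) / (n_control + n_treatment)
sims = 100_000
sim_control = rng.binomial(n_control, pooled_rate, sims) / n_control
sim_treatment = rng.binomial(n_treatment, pooled_rate, sims) / n_treatment
p_value = np.mean(np.abs(sim_treatment - sim_control) >= abs(observed_diff))

print(f"Observed lift in conversion rate: {observed_diff:.2%}")
print(f"Simulated two-sided p-value: {p_value:.3f}")
# What this number is silent about: the probability the channel works,
# the size of the effect in ROI terms, and whether to scale the budget.
```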
The problem gets worse because the 5% threshold is treated like some kind of natural law. But it’s just a convention we’ve all agreed on. Below 5%, we act; above 5%, we wait. That is already shaky within statistics itself (look up the replication crisis), but in marketing, it is even worse. Teams run many tests, slice results by segment, or keep looking until something “works.” Once you do that, false positives are basically inevitable.
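A quick sketch shows how fast that happens. Assuming 20 independent segment cuts of a test with no true effect anywhere, and the usual 5% threshold:

```python
import numpy as np

rng = np.random.default_rng(7)

alpha, n_slices = 0.05, 20  # e.g. 20 segment cuts of the same test

# Analytically: if every true effect is exactly zero, the chance of at
# least one slice crossing the 5% threshold is already ~64%.
print(f"P(>=1 false positive): {1 - (1 - alpha) ** n_slices:.0%}")

# Same point by simulation: under the null, p-values are uniform.
experiments = 100_000
p_values = rng.uniform(size=(experiments, n_slices))
hit_rate = np.mean((p_values < alpha).any(axis=1))
print(f"Simulated rate of >=1 'significant' slice: {hit_rate:.0%}")
```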
What marketers should look at instead: effect size, ROI/CPA ranges, and uncertainty
If statistical significance is not the right way to interpret experiments, what should marketers look for instead?
The answer is not a single number. The output of an experiment is not one “true” ROI. It’s a range of estimates that are compatible with the data. That means the most important part of the readout is usually not whether the lift was above zero. It is the effect size, what that translates to in ROI or CPA terms, and how wide the uncertainty bounds are around it.
That is why you need to bring business context into the analysis. Say your test shows a 5% lift and the midpoint estimate converts to 4.5x ROI. Great, but if the interval runs from 1x to 7x, and your profitability threshold is 3x, then the experiment hasn’t really told you very much. The channel might be wildly profitable… or it might be lighting money on fire. You just don’t know yet.
That is why the range of profitability matters more than the point estimate when measuring incrementality.
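One way to operationalize that, as a minimal sketch (the helper below is hypothetical, not any particular tool’s API): judge the entire supported ROI range against the profitability bar, using the numbers from the example above.

```python
# A minimal sketch of a decision readout: compare the supported ROI
# range, not the midpoint, against the profitability threshold.

def roi_readout(roi_low: float, roi_high: float, threshold: float) -> str:
    if roi_low >= threshold:
        return "entire range clears the bar -> scale"
    if roi_high <= threshold:
        return "entire range sits below the bar -> cut, even without 'significance'"
    return "range straddles the bar -> inconclusive for this decision; tighten the test"

# The example from above: a 4.5x midpoint looks great, but the range does not decide.
print(roi_readout(roi_low=1.0, roi_high=7.0, threshold=3.0))
# And the flip side from earlier: a 'non-significant' read can still be actionable.
print(roi_readout(roi_low=0.4, roi_high=2.6, threshold=3.0))
```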
And sometimes the default confidence interval does not tell the full story either. One useful approach is bootstrapping the test to understand the shape of the distribution. By effectively running repeated trials within the treatment and control subsets, you can start to see where the higher-density region actually is. In some cases, you may have a skewed distribution where the plausible values cluster much more tightly than the raw interval suggests.
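Here is a minimal bootstrap sketch on synthetic per-user revenue; the distributions and sample sizes are stand-ins for whatever the experiment actually produced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-user revenue in each cell; in practice
# this would be the experiment's unit-level data.
control = rng.exponential(scale=1.00, size=5_000)
treatment = rng.exponential(scale=1.05, size=5_000)

# Bootstrap: resample each cell with replacement and recompute the
# lift, yielding an empirical distribution instead of one interval.
boot = 10_000
lifts = np.empty(boot)
for i in range(boot):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    lifts[i] = t.mean() / c.mean() - 1.0

lo, mid, hi = np.percentile(lifts, [2.5, 50, 97.5])
print(f"Bootstrapped lift: median {mid:.1%}, 95% range [{lo:.1%}, {hi:.1%}]")
# A finer percentile grid (or a histogram) reveals where the density
# actually clusters, which a single symmetric interval can hide.
print(np.percentile(lifts, [10, 25, 75, 90]).round(3))
```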
And, on the business side, always ask: what would we need to believe for this result to be true? If a Google Shopping test comes back at 0.5 IROI versus the 1.6 you were expecting, and that would require believing the overall business accelerated 10% during the test window, that should change how much confidence you place in the read.
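A hedged back-of-envelope, with hypothetical spend and baseline figures chosen to reproduce that 10%, shows how to turn the gut check into a number:

```python
# All figures are hypothetical, chosen to mirror the 0.5-vs-1.6 example.
spend = 100_000                 # test spend
counterfactual_rev = 1_100_000  # baseline revenue implied by the control
iroi_measured = 0.5
iroi_expected = 1.6

# For the low read to be an artifact while the true IROI is still 1.6,
# the counterfactual would have to be off by the revenue gap between
# the two readings. As a share of baseline, that is the business-wide
# shift you would have to believe happened during the test window.
rev_gap = (iroi_expected - iroi_measured) * spend
implied_drift = rev_gap / counterfactual_rev

print(f"Required counterfactual error: ${rev_gap:,.0f}")
print(f"Implied business-wide shift during the window: {implied_drift:.0%}")
# If a ~10% swing in the whole business during the window is not
# believable, the 0.5 read deserves more weight than the 1.6 prior.
```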
TLDR:
- Waiting for statistical significance is often the wrong decision rule in marketing experiments, because the real question is not “is there any lift?” but “is this the best use of budget?”
- P-values are narrower than most marketers assume: they tell you how unusual the data would be if there were no effect, not how likely your result is to be true or useful for planning.
- The most decision-relevant output of an experiment is usually the range of ROI or CPA outcomes the data supports, not the midpoint estimate or whether the result crossed a 5% threshold.
- Strong teams do not treat one noisy experiment as a verdict; they combine business context, sensitivity analysis, and cumulative evidence to make better decisions under uncertainty.