“I ran an experiment showing positive lift but didn’t see the results in the bottom line.”
Sound familiar?
I think we’ve all had this experience: we set up a nice, clean A/B test to check the value of a feature or a creative. We get the results back: 5% lift, statistically significant. Nice! Champagne bottle pops, etc., etc.
Since we got the win, we bake the 5% lift into next quarter's forecast, when the feature will roll out to the entire customer base, and we sit back to watch the money roll in.
But then, shockingly, we do not actually see that lift.
When we look at our overall metrics, we may see a very slight lift around the time the feature rolled out, but then it fades away, and it looks like it could have just been noise anyway.
Since we had baked our 5% lift into our forecast, and we definitely don’t have the 5% lift, we’re in trouble. What happened?
The big issue here is that we didn’t consider uncertainty. When interpreting the results of our A/B test, we said “It’s a 5% lift, statistically significant,” which implies something like “It’s definitely a 5% lift.” Unfortunately, this is not the right interpretation. The right interpretation is: “There was a statistically significant positive (i.e., >0) lift, with a mean estimate of 5%, but the experiment is consistent with anything from a 0.001% lift to a 9.5% lift.”
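To make this concrete, here’s a minimal sketch (with made-up conversion counts, not numbers from any real experiment) of computing that kind of interval, using a simple normal approximation for the difference in conversion rates:

```python
# A minimal sketch: an uncertainty interval for the relative lift of B over A
# from raw counts, via a normal approximation for the difference in rates.
from scipy.stats import norm

def lift_interval(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (low, mean, high) relative lift of B over A, as fractions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    # Express the interval relative to the control rate (a rough approximation
    # that ignores the uncertainty in p_a itself).
    return (diff - z * se) / p_a, diff / p_a, (diff + z * se) / p_a

# Made-up numbers chosen so the output resembles the scenario above.
low, mean, high = lift_interval(conv_a=5000, n_a=100_000, conv_b=5250, n_b=100_000)
print(f"lift: {mean:.1%} (95% interval: {low:.1%} to {high:.1%})")
# -> lift: 5.0% (95% interval: 1.1% to 8.9%) -- "statistically significant",
#    but consistent with anything from a tiny bump to a huge win.
```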
This is a consequence of the significance filter, sometimes called the winner’s curse: when a test is underpowered, the estimates that happen to clear the significance threshold are systematically inflated. So the most likely reality is that the true effect was some very small positive lift, and our test just didn’t have enough statistical power to narrow the uncertainty bounds very much.
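If you want to see that bias in action, a toy simulation (entirely my own illustration, with invented parameters) makes it stark: give the test a tiny true lift and too little traffic, and the runs that come back “significant” report lifts far above the truth.

```python
# Toy simulation of the significance filter: an underpowered test with a small
# true lift. The runs that happen to reach significance overestimate the effect.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_a, true_lift, n = 0.05, 0.01, 20_000        # 1% true relative lift, small sample
p_b = p_a * (1 + true_lift)
z_crit = norm.ppf(0.975)

significant_lifts = []
for _ in range(20_000):
    conv_a = rng.binomial(n, p_a)
    conv_b = rng.binomial(n, p_b)
    pa_hat, pb_hat = conv_a / n, conv_b / n
    se = np.sqrt(pa_hat * (1 - pa_hat) / n + pb_hat * (1 - pb_hat) / n)
    if (pb_hat - pa_hat) / se > z_crit:       # call "B wins" at the usual threshold
        significant_lifts.append((pb_hat - pa_hat) / pa_hat)

print(f"true lift: {true_lift:.1%}")
print(f"runs declared significant: {len(significant_lifts) / 20_000:.1%}")
print(f"mean estimated lift among those runs: {np.mean(significant_lifts):.1%}")
# That last number lands far above 1%: conditioning on significance inflates the estimate.
```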
So, what does this mean?
When you’re doing any type of experimentation, you need to be looking at the uncertainty intervals from the test. You should never just report out the mean estimate from the test and say that’s “statistically significant”. Instead, you should always report out the range of metrics that are compatible with the experiment.
When actually interpreting those results in a business context, you generally want to be conservative and assume the true impact will come in on the low end of the interval. If the decision is mission-critical, design a follow-up test with more statistical power to tighten the estimate before you bet on it.
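“More statistical power” usually just means more traffic or a longer run. A back-of-the-envelope sketch (my own, with made-up baseline numbers) using the standard two-proportion sample-size formula shows how quickly the required traffic grows as the lift you need to pin down gets smaller:

```python
# Rough sample-size sketch using the standard two-proportion formula.
import math
from scipy.stats import norm

def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Per-arm sample size to detect `relative_lift` over `baseline_rate`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2)

# Detecting a 2% relative lift on a 5% baseline takes far more traffic than
# detecting a 5% lift -- tighter questions need much bigger tests.
print(samples_per_arm(0.05, 0.05))   # roughly 120k users per arm
print(samples_per_arm(0.05, 0.02))   # roughly 750k users per arm
```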
If you just look at the mean results from your test, you are highly likely to be led astray! You should always be looking first at the range of the uncertainty interval and only checking the mean last.