We know this sounds obvious in theory, but it’s still relevant: randomized does not automatically mean fair. In practice, we see plenty of geo experiments that still rely on a simple 50/50 split and treat the result as if the design validated itself. That is a problem.
When the number of geographies is small and the markets are uneven, a randomized test can still give you a distorted answer to a very expensive question. And if you use that answer to plan national spend, you can end up scaling the wrong channel for the wrong reason.
This article will expand on this issue – we will cover what randomized geo tests are, how they work, and where their limitations show up – and then give you alternative designs so you can run more rigorous tests.
What random sampling in geo experiments actually is and why marketers use it in the first place
A geo experiment should be pretty straightforward. You take a pool of geographies, assign some to treatment and some to control, make a marketing change in treatment, and then measure the incremental effect. That change might be a heavy-up test where you spend more into that channel, or it might be cutting back on spend in a set of geos to estimate how much demand was truly incremental.
In a 50/50 random split, you take your pool of geos and divide it randomly into two halves: one half gets the treatment, the other gets the control. If you were testing Meta, for example, you might increase spend in 75 DMAs and leave the other 75 alone. If you were trying to understand TV before a major seasonal push, you might hold spend constant in one group, increase it in the other, and then compare the difference in outcomes.
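To make the mechanics concrete, here is a minimal sketch of a 50/50 split, assuming a hypothetical pool of 150 DMAs (the names and helper function are illustrative, not any particular tool's API):

```python
import random

def random_split(geos, seed=None):
    """Randomly split a pool of geos into equal treatment and control halves."""
    rng = random.Random(seed)
    pool = list(geos)
    rng.shuffle(pool)
    mid = len(pool) // 2
    return pool[:mid], pool[mid:]  # (treatment, control)

# Hypothetical pool of 150 DMAs
dmas = [f"DMA_{i:03d}" for i in range(150)]
treatment, control = random_split(dmas, seed=42)
print(len(treatment), len(control))  # 75 75
```

This is the entire "design" in the default approach: a shuffle and a cut. Everything that follows is about why that is often not enough.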
It’s easy to see why this became common. A geo test is often one of the better options when user-level experimentation is not possible, when the channel operates at a market level, or when platform reporting is clearly not enough. It gives teams a way to move beyond correlation and ask what actually changed because we spent more or less. Random sampling experiments are simple to explain internally, and they feel objective.
But usefulness and robustness are not the same thing, and the default version of geo-randomization is weaker than many teams think. Let’s talk about why:
The problem: random splits break down when the number of geographies is small
The core issue with simple random sampling is not that it is inherently wrong – it’s that marketers often ask it to do more than it can actually guarantee.
Random sampling gives you nice performance under repeated trials. If you could run a hundred different trials, some splits would skew the heavy markets toward treatment and some toward control, and averaged over all of those trials the imbalances would wash out. That averaging is where the benefit of random sampling comes from.
But that is generally not the situation in marketing experimentation. You are usually running one trial, once, with a fixed number of DMAs (say 150). In that setting, you do not get strong guarantees that treatment and control are actually going to be comparable.
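You can see this in a quick simulation. The sketch below, using made-up skewed market sizes, shows that the treatment arm's revenue share averages roughly 50% across many hypothetical splits, but any single split can land noticeably off-balance:

```python
import random

def revenue_share(revenues, seed):
    """Share of total revenue landing in the treatment arm for one random 50/50 split."""
    rng = random.Random(seed)
    idx = list(range(len(revenues)))
    rng.shuffle(idx)
    treated = idx[: len(idx) // 2]
    return sum(revenues[i] for i in treated) / sum(revenues)

# Skewed market sizes: a handful of DMAs dominate, as in real geo pools
rng = random.Random(0)
revenues = [rng.lognormvariate(0, 1.5) for _ in range(150)]

shares = [revenue_share(revenues, seed=s) for s in range(1000)]
avg = sum(shares) / len(shares)
print(f"average treatment revenue share: {avg:.3f}")  # ~0.50 across trials
print(f"single-split extremes: {min(shares):.3f} to {max(shares):.3f}")
```

The average behaves exactly as the textbook promises. The problem is that you only get to observe one draw from that distribution, and the extremes are a real possibility in your one trial.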
So random sampling has a few limitations:
1 – Comparability. With user-level experiments, one unit often looks a lot like another – but not with geo pools. Treatment and control are supposed to statistically represent one another well, but with a small number of heterogeneous markets, you could very easily end up with your 75 treatment DMAs all being tilted toward the East Coast and your 75 control DMAs tilted toward the West Coast.
Or you might have most of your high-revenue markets in control. In either case, these are randomized groups, but they are not especially comparable counterfactuals. Small idiosyncratic changes in those markets can produce outsized swings in the measured effect.
2 – Market concentration. When large markets are overrepresented on one side, measured lift can be driven by a handful of places rather than by the channel change itself. That can make a treatment group look unusually strong or unusually weak for reasons that do not generalize well to the rest of the market once you roll that test out.
The issue is not just an imbalance in total revenue. There can also be an imbalance in seasonality, baseline sales, competitive intensity, channel responsiveness, and regional demand shocks.
3 – Representativeness. The whole point of doing a geo lift type experiment is to learn how a marketing program works when you run it on a business-as-usual basis, which generally means some kind of national campaign. So implicitly, what you want as a feature of the experiment is that the results are going to generalize to that national campaign you care about for planning purposes.
If the only treatment geographies were Houston and Nashville, for example, there is not a lot of guarantee that those results are nationally representative.
You can get a clean read that is directionally true for that exact split of markets and still make the wrong national decision from it.
What better geo test design looks like: matched markets, stratification, and validation
The answer is obviously not to give up on experimentation – just to be more deliberate about how treatment and control are constructed.
A better starting point is matched markets. Instead of taking your pool of geos and splitting them randomly in half, you build two subgroups of geos – treatment and control – that are designed to statistically represent one another well. You are looking for a credible counterfactual.
In practice, that usually means looking for groups that are highly correlated in the pre-test period, so they move together before any intervention. That correlation directly influences the precision you will get from the test: higher correlation between the two groups in the pre-period generally means higher precision when you actually go to analyze the experiment.
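A minimal sketch of that matching idea: score candidate control geos by how well their pre-period series track the treated series, and keep the best match. The geo names and weekly numbers here are hypothetical, and real matching tools use richer criteria than a single Pearson correlation:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length pre-period series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_match(treated_series, candidates):
    """Return the candidate geo whose pre-period series tracks treatment best."""
    return max(candidates, key=lambda g: pearson(treated_series, candidates[g]))

# Hypothetical weekly pre-period sales
treated = [10, 12, 11, 13, 15]
candidates = {
    "Houston":   [20, 24, 22, 26, 30],  # moves in lockstep with treatment
    "Nashville": [5, 5, 5, 6, 5],       # mostly flat, weak relationship
}
print(best_match(treated, candidates))  # → Houston
```

In a real design you would do this at the group level rather than geo by geo, and under constraints (minimum group size, budget feasibility), but the selection criterion is the same: pre-period co-movement.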
You also want a battery of validation tests around that relationship. Are there structural breaks in one of the series? Does the relationship look stable over time? If you run an out-of-scope test – essentially a placebo test during a period where nothing should be happening – do you get an effect size of zero?
What you really want to avoid is a spurious correlation world where “solar power generated” and “cheddar cheese consumption” move together for no useful reason. If a proposed market pairing produces a false effect in a no-treatment period, that is a warning sign that it is just not robust enough to trust.
A second option is stratified sampling or matched pairs. Group similar markets into buckets first, then assign treatment and control within those buckets. That reduces the risk that all the high-revenue or otherwise unusual markets cluster in one arm. Sometimes teams also use a trimming process or exclude problematic geos altogether to improve stability.
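A minimal sketch of that stratified assignment, with invented market names and revenues: sort geos by size, bucket them, and randomize only within each bucket, so the largest markets are forced to spread across both arms.

```python
import math
import random

def stratified_split(revenue_by_geo, n_strata=3, seed=0):
    """Bucket geos into revenue strata, then randomize within each bucket,
    so large markets cannot all cluster in one arm."""
    rng = random.Random(seed)
    geos = sorted(revenue_by_geo, key=revenue_by_geo.get, reverse=True)
    size = math.ceil(len(geos) / n_strata)
    treatment, control = [], []
    for i in range(0, len(geos), size):
        stratum = geos[i:i + size]
        rng.shuffle(stratum)
        half = len(stratum) // 2
        treatment += stratum[:half]
        control += stratum[half:]
    return treatment, control

# Hypothetical annual revenue by market ($M)
revenue = {"NYC": 120, "LA": 95, "Chicago": 40, "Dallas": 35, "Tulsa": 8, "Boise": 5}
treatment, control = stratified_split(revenue, n_strata=3, seed=7)
```

With three strata of two markets each, the two biggest markets are guaranteed to land in different arms – exactly the guarantee a plain shuffle cannot give you. In practice you would stratify on more than revenue (seasonality, region, baseline trend), but the principle is the same.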
Assigning markets under statistical and business constraints is absolutely a hard optimization problem. But you should be able to design realistic and rigorous geo experiments with these options.
TLDR
- A 50/50 random geo split can still produce treatment and control groups that are badly imbalanced when the number of markets is small and the geographies are highly uneven.
- That imbalance shows up in three ways that matter for budget decisions: poor comparability between groups, overconcentration of large markets, and weak national representativeness.
- Better geo experiments are built around credible counterfactuals by using matched markets, stratified sampling, placebo checks, and pre-period validation rather than assuming randomization solved the design problem.