“Incrementality testing” is extremely hot right now. And while I am personally very enthusiastic about this type of experimentation and glad to see more organizations running more experiments, I also see a lot of confusion around important concepts regarding these experiments that I think is worth clearing up.
Myth #1: Incrementality = Experimental Results
The truth is that incrementality is an abstract value that we do not have access to here on earth. While we know that there is some capital-t True Incremental relationship between streaming TV investment and a business’s revenue, that Truth is, from our perspective, unknown and unknowable.
Experiments (and other methods) can attempt to measure that true incrementality, but they do not tell us what the true incrementality actually is. Two things to keep in mind:
- Even a perfect experiment comes with uncertainty. When we analyze the data from the experiment, there will be a range of incrementality values that are consistent with the data. The experiment can indicate that the true value likely lies within that range (depending on what assumptions you’re willing to make), but it does not tell us the True Incrementality.
- Experiments in the real world are not perfect. As you may recall from your high school chemistry lab, there are many ways that experiments can go wrong and actually mislead us about true incrementality: an imperfect experimental setup, the influence of outside factors other than the treatment of interest, and many other things, all of which could distort the results or invalidate the entire experiment. It’s generally impossible to be 100% confident that the experimental setup matches all of our assumptions.
So when running experiments, it’s important to keep these limitations in mind: you need to understand the uncertainty in the results and always be thinking about how the experiment might not be telling the full story.
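To make the uncertainty point concrete, here is a minimal sketch in Python with entirely made-up numbers: we simulate a perfectly randomized two-arm test where we, as the simulators, know the true lift, and the analysis still only hands back a point estimate plus a confidence interval rather than the Truth itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "perfect" experiment: we only know the true incremental lift
# because we are simulating it ourselves.
n = 5_000                 # users per arm (made up)
true_lift = 2.0           # true incremental revenue per user (made up)
control = rng.normal(loc=20.0, scale=15.0, size=n)
treated = rng.normal(loc=20.0 + true_lift, scale=15.0, size=n)

# Point estimate and a normal-approximation 95% confidence interval.
estimate = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se

print(f"estimated lift: {estimate:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})  vs. true lift = {true_lift}")
# Even with perfect randomization, the experiment returns a range of
# plausible values, not the True Incrementality itself.
```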
Myth #2: Geographic experiments are equivalent to Individual-Level RCTs
People throw around the term “gold standard” a lot and unfortunately I think there’s some confusion about what exactly the gold standard in causal inference actually is.
In health sciences, we often refer to individual-level randomized controlled trials as the “gold standard” for causal inference. Each of these pieces is important, so let’s break it down:
- Individual-level: we track individual subjects (trees, animals, patients, etc.) and their outcomes as part of the trial.
- Randomized: there is an element of randomization in the way the treatment is allocated within the sample. So we take a sample of patients that are eligible for some treatment, and then we randomly select a subset of them to receive the treatment.
- Controlled: there is a control group that doesn’t receive the treatment (to compare against).
- Trial: there is an intentional intervention being made. We aren’t just observing what treatments people take and then comparing different groups, instead we are exogenously varying the treatment (i.e., we’re running an intentional experiment).
In the field of marketing science, for various reasons, individual-level RCTs are often impossible. For example:
- In the walled gardens (closed platforms like Meta, Google, or TikTok) we don’t know which users are eligible to receive an ad, so we aren’t able to randomize delivery within the eligible population.
- Some media like linear TV or terrestrial radio aren’t addressable at the individual level, so there can be no individualized delivery of ad units.
- Brick and mortar or non-trackable distribution channels don’t allow us to connect purchases to individual people.
To deal with this, many people have turned to different types of geographic-based experimentation. The idea is similar to what we described above: you run a randomized controlled trial where the unit of treatment is a geography (zip code, DMA, commuting zone, state, etc.) that you do have the ability to target, and then you can draw similar causal inferences from the results.
This generally works! However, it has some important drawbacks and limitations when compared with individual-level experiments:
- Less visibility into who is actually treated and who is not. When you’re serving ads to an entire state or an entire zip code, you have to worry about people crossing between states or zip codes. You might think your New Jersey resident is in the holdout group but what if they cross over to New York for work?
- Smaller n-size and more noise: with individual-level RCTs you can often run experiments with many thousands or tens of thousands of users (at least when doing online marketing), but at the geographic level you’re often much more limited and are running experiments across at most a few hundred geographies (if that). That means the resulting estimates of the efficacy of the intervention are much noisier and less precise than what can be expected from an individual-level RCT, as the sketch below illustrates.
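Here is a toy sketch of that precision penalty, with made-up numbers and the simplifying assumption that per-unit noise is the same for users and geographies (in reality geo-level aggregates have their own variance structure). The same difference-in-means estimator scatters far more widely when there are only a few dozen units per arm.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_lift_estimates(n_units, sims=2_000, true_lift=1.0, noise=10.0):
    """Run many simulated experiments with n_units per arm and return
    the distribution of estimated lifts (difference in means)."""
    control = rng.normal(0.0, noise, size=(sims, n_units))
    treated = rng.normal(true_lift, noise, size=(sims, n_units))
    return treated.mean(axis=1) - control.mean(axis=1)

user_level = simulated_lift_estimates(n_units=10_000)  # individual-level RCT
geo_level = simulated_lift_estimates(n_units=25)       # ~50 geos split into two arms

print(f"spread of estimates, 10,000 users per arm: {user_level.std():.3f}")
print(f"spread of estimates, 25 geos per arm:      {geo_level.std():.3f}")
# The geo-level estimates scatter far more widely around the true lift,
# which is the noise/precision penalty described above.
```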
So if individual-level RCTs are the gold-standard, what does that make geographic-level lift tests? The bronze standard I think. Maybe the silver standard at best.
But I think we need to reserve the “gold standard” name for individual-level trials, not geographic-based experimentation.
Myth #3: Experimental methods with synthetic controls are not experiments
I was talking to a practitioner recently who was up in arms that people would even consider using synthetic control methods in the context of geographic-based controlled experiments. “It’s pseudo-science!” he said.
This person was definitely wrong, but his confusion was understandable because synthetic controls are a flexible method: they can be used in both experimental and non-experimental contexts. In experimental contexts they can be very powerful for increasing the precision and power of an experiment, whereas in non-experimental contexts they are much more dangerous and potentially misleading.
So how are synthetic controls used in the context of experimental trials? As I noted above, when you are experimenting with small sample sizes, the results of even well-run experiments are often very uncertain and “noisy”. One reason for this is that with small samples your treatment and control groups end up not being very similar: smaller samples are more likely than large ones to produce extreme sample statistics, even when perfectly randomized.
If you are running a clinical trial on a very rare disease, you may only be able to recruit 10 people for your trial, and after randomization there’s no guarantee that your group of 5 patients receiving the treatment will be similar to the group of 5 patients not receiving it. In fact, there may be no grouping you could construct that makes the groups very similar on average!
So the idea behind synthetic controls is straightforward: you run the individual-level randomized controlled trial exactly as you normally would, but instead of doing a simple calculation of the difference in outcomes between the group that received the treatment and the group that didn’t, you can actually do better. You can create a “synthetic control” group that better matches your treatment group (prior to receiving the treatment!) so that you can get more precise estimates of the difference. It might be the case that by **weighting** the members of the control group in some way, you can get a “synthetic” control group that better matches the treatment group prior to the intervention.
So your control group of five patients could receive “statistical weightings” like (0.3, 0.2, 0.4, 0.05, 0.05) instead of even weightings like (0.2, 0.2, 0.2, 0.2, 0.2), and these different weightings could account for inherent differences in the patients (maybe one had high blood pressure unrelated to the disease of interest, maybe we need to put more weight on the female patients to make them better match the treatment group, etc.), as the sketch below illustrates.
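To show what that weighting might look like mechanically, here is a minimal sketch, not a full synthetic control estimator (which would typically match on pre-treatment outcome trajectories and richer covariates). The covariates, patient counts, and numbers are all invented; we simply search for non-negative weights summing to one that make the weighted control group’s pre-treatment profile resemble the treatment group’s.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical pre-treatment covariates (e.g., age, blood pressure, baseline score)
# for 5 treated and 5 control patients -- all values are made up.
treated = rng.normal([50, 120, 10], [8, 12, 2], size=(5, 3))
control = rng.normal([55, 130, 9], [8, 12, 2], size=(5, 3))

target = treated.mean(axis=0)  # pre-treatment profile we want the controls to match

def mismatch(w):
    # Squared distance between the weighted control profile and the treated profile.
    return np.sum((control.T @ w - target) ** 2)

# Weights must be non-negative and sum to 1 (a weighted "synthetic" control group).
n = len(control)
result = minimize(
    mismatch,
    x0=np.full(n, 1 / n),                       # start from even weights (0.2 each)
    bounds=[(0, 1)] * n,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
    method="SLSQP",
)

print("even-weights mismatch:     ", round(mismatch(np.full(n, 1 / n)), 3))
print("synthetic control weights: ", np.round(result.x, 2))
print("weighted mismatch:         ", round(mismatch(result.x), 3))
```

The same uneven weights would then be applied to the control group’s post-treatment outcomes before taking the difference against the treatment group.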
This use of synthetic controls has been well-studied in the causal inference literature for a very long time, and it is an accepted practice for generating more precise statistical reads from small sample sizes. In fact, the FDA has approved drugs for human use based on clinical trials that used a synthetic control methodology!
Now, that isn’t to say that synthetic controls don’t have any issues. They absolutely do: they require more assumptions and can lead to bad statistics and bad causal estimates (especially via data “mining” to choose a set of controls that gives the answer you want). But under well-run conditions they are a very powerful tool in the causal inference toolbox for working with small sample sizes.
Myth #4: Observational methods with synthetic controls are equivalent to experiments
In non-experimental contexts, the use of synthetic control methods can be much more dubious. In the example above, the researchers were still running a trial where there was exogenous intervention (the treatment) on some random subset of patients from a given population. This makes it an experiment and allows us to do pretty clean causal inference with (relatively) few assumptions.
However, and this is what confuses a lot of people, synthetic controls can also be used in non-experimental contexts, where many more assumptions need to be made to make claims of causality (i.e., incrementality) and therefore where there are many more ways for things to go wrong.
The idea once again is pretty straightforward: we might look at our population and think:
Well some households we can see listen to Joe Rogan and hear our podcast ads, what if instead of actually running an experiment or a controlled-trial (painful! hard! slow!) we just look at our population of Joe Rogan-listening households and create a synthetic control that matches them to estimate our treatment effect? That way we can estimate incrementality without having to run an experiment!
This honestly sounds very reasonable, and while it sounds like it should be almost as good as the method I described above, where synthetic controls are used in the context of a trial, it is definitely not as good and can be extremely misleading.
There are a few ways this can go wrong, but the most important is the problem of selection effects: the idea that there might be unseen forces impacting both selection into the treatment (in this case, listening to Joe Rogan) and the outcome of interest (buying some protein powder online, I guess).
The magic of a randomized controlled trial is that the act of randomization gets rid of selection effects — since we are exogenously varying the treatment in some random fashion, we know that there aren’t selection effects impacting both the likelihood of treatment and the outcome of interest.
In the Joe Rogan example, we don’t know that, and that causes us some real problems. Sure, you might have developed a “synthetic control” of households that match Joe Rogan-listening households according to the data that you have access to, but how do we know they actually match along the dimensions that matter? How do we know that the Joe Rogan households weren’t already more likely to purchase whatever the product is simply because of their natural affinities or their other media consumption habits? The problem is that we don’t.
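A toy simulation (with made-up numbers) shows the danger: an unobserved “affinity” drives both listening and purchasing, so the naive listener-vs-non-listener comparison, which is essentially what an observational synthetic control gives you when the confounder isn’t in your data, overstates the ad effect, while a randomized version recovers it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Hidden confounder: an unobserved "fitness affinity" that drives both
# podcast listening and protein powder purchases. Entirely made-up numbers.
affinity = rng.normal(size=n)

true_ad_effect = 0.02  # true incremental purchase probability from hearing the ad

# Observational world: affinity raises the chance of listening to the show.
listens = rng.random(n) < 1 / (1 + np.exp(-(affinity - 1)))
purchase_prob = 0.05 + 0.03 * affinity + true_ad_effect * listens
purchases = rng.random(n) < np.clip(purchase_prob, 0, 1)

naive_lift = purchases[listens].mean() - purchases[~listens].mean()

# Experimental world: exposure is assigned at random, breaking the link
# between affinity and treatment.
exposed = rng.random(n) < 0.5
purchase_prob_rct = 0.05 + 0.03 * affinity + true_ad_effect * exposed
purchases_rct = rng.random(n) < np.clip(purchase_prob_rct, 0, 1)

rct_lift = purchases_rct[exposed].mean() - purchases_rct[~exposed].mean()

print(f"true ad effect:      {true_ad_effect:.3f}")
print(f"naive listener lift: {naive_lift:.3f}")   # inflated by selection
print(f"randomized lift:     {rct_lift:.3f}")     # close to the truth
```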
Because of the problem of selection effects, non-experimental estimates of incrementality using synthetic controls deserve much more scrutiny and skepticism when interpreting and acting on the results. Non-experimental uses of synthetic controls in many situations end up yielding very similar estimates to a regression or other type of complex statistical model (e.g., an MMM) and therefore need a strict process of external validation before the results can be trusted and used.
Once again, that’s not to say that these methods can’t be valuable, but it is important to make sure that we are interpreting these different types of analyses correctly and not confusing one for the other!