Geo-testing Won’t Save You

Ever since Apple gave users the right to turn off tracking in the iOS14 update, marketers have been scrambling for a measurement solution. In our Reforge piece on “The Attribution Stack”, we wrote in depth about all of the methods you can use to evaluate the effectiveness of your ad dollars, and we think it’s a good thing that marketers are taking a more holistic approach to attribution. 

One method often paired with Marketing Mix Modeling (MMM) as the solution to the ‘iOS14 problem’ is geo-testing, also known as Geo Lift studies. It’s an incredibly intuitive approach that works across a wide variety of channels. It’s also appealing because it’s privacy friendly and platform independent, using only aggregated data (just like MMM), so it’s robust to whatever happens to digital tracking and privacy legislation in the future.

It works like this: 

  1. Select a subset of the regions you would normally advertise in
  2. Turn advertising off in those regions for a period of time
  3. Measure the difference in sales in the control vs test regions

There’s more to it than that: you have to be careful with how you define regions, what regions you select, and how long you run the test (more on that later). However, this is the sort of thing any executive can understand and trust. If sales were 50% higher in regions A, B, & C (the test group with ads) than in regions D, E, & F (the control group with no ads), and everything else stayed the same, then advertising must have driven those incremental sales.
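To make the comparison in step 3 concrete, here is a minimal sketch in Python. All of the sales figures and region names are made up for illustration, and the calculation is a plain difference-in-differences (the change in the test group minus the change in the control group), which is our simplifying assumption and not the more sophisticated modeling used in practice.

```python
from statistics import mean

# Hypothetical weekly sales per region (made-up numbers for illustration).
# "test" regions keep advertising; "control" regions have ads switched off.
pre_period = {
    "test": {"A": [100, 104], "B": [80, 82], "C": [120, 118]},
    "control": {"D": [95, 97], "E": [85, 83], "F": [110, 112]},
}
test_period = {
    "test": {"A": [110, 112], "B": [88, 90], "C": [130, 132]},
    "control": {"D": [90, 88], "E": [80, 78], "F": [104, 106]},
}

def group_mean(period, group):
    """Average weekly sales across all regions and weeks in a group."""
    return mean(v for weeks in period[group].values() for v in weeks)

def diff_in_diff(pre, post):
    """Change in the test group minus change in the control group."""
    test_change = group_mean(post, "test") - group_mean(pre, "test")
    control_change = group_mean(post, "control") - group_mean(pre, "control")
    return test_change - control_change

lift = diff_in_diff(pre_period, test_period)
baseline = group_mean(test_period, "control")
print(f"Estimated incremental sales per region-week: {lift:.1f}")
print(f"Relative to control-group sales: {lift / baseline:.1%}")
```

Comparing changes, not raw levels, is what stops a pre-existing gap between the two groups from being mistaken for an advertising effect.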

Much like MMM, geo-testing is actually a decades-old technique being revitalized with modern Bayesian methods, and both practices have attracted renewed interest thanks to investments from Meta and Google in powerful open-source libraries. As we’ve seen in MMM, many vendors have come into the market to make it easier for brands to run Geo Experiments without needing a data scientist. The emerging best practice amongst clients burned by iOS14 is to do both MMM and geo-testing, with the results of your experiments being used to calibrate your model. 

Is geo-testing the one true measurement method that saves us from iOS14, and tells us the truth about how our marketing is performing? I’m sorry to disappoint you: like all attribution methods, geo-testing has its issues and limitations.

Geo-testing has been in use for decades, and is essential for testing some channels like billboards, where there’s no easy way to track which people saw the ad in order to do a proper A/B test. Yet today it’s not in common use: why is that? Well, it carries a reputation for being “overly complex, expensive and broad – compared to the cheap, easy scale of the web”, admits Measured, one of the leading modern vendors in the space. The problem historically was that setting up an experiment of this scale required a significant amount of work. You had to take care to design an empirically valid test, set up your campaigns in a way that was targeted by geo region, and then wait for 3 to 5 weeks to collect enough data (as well as a clean 4 to 8 week pretest period for comparison). This is too slow for all but the most important strategic decisions.

With only a limited number of ways to break the country into geo-regions you also can’t get very granular with your results. Geo-testing may be fantastic at telling you whether your TV campaign drove sales, but you can’t expect to use it to split test 100 different ad variations. As a result geo-testing has claimed the high ground of testing the incrementality of individual channels (or teasing out interactions between channels), but it can’t be used for determining what works on a tactical level. You can of course break each country into smaller and smaller regions, for example at the DMA or zip code level, but that increases noise in the data, which requires a stronger effect or longer test runtime to validate.

Geo-testing is often costly since it’s both complex and broad. Data scientists are expensive, and you often need them to set up the test and interpret the results. Additionally, there’s often a lot of manual overhead to actually running the test (you have to buy media in all of the different geographies!). Finally, there’s the actual cost of running the test itself: turning off an ad channel for several weeks means you lose sales in those regions you otherwise would have made. Good luck making your quarter! There has been innovation in the area, and modern vendors like Measured are starting to automate and abstract away some of these complexities for certain channels, but these costs are still very real and still need to be considered.

Common Issues with Geo-testing 

Geo-testing has more limitations than many people realize. It’s called geo-testing, and therefore gets a favorable comparison to A/B Testing, or Randomized Controlled Trials, which is the gold standard in the hierarchy of evidence. People know about “double blind” experiments from the field of medicine, and are usually aware that randomized control and test groups are the only way to truly establish causality. 

However geo-testing is not a Randomized Controlled Trial: it’s an aggregate experiment analyzed using statistics, closer to Causal Inference and Marketing Mix Modeling than the A/B test you run on your website. It could be called geo-modeling instead of geo-testing! There is a ‘treatment’, but individuals are not being split cleanly into buckets: you can’t guarantee that the treatment (spending money on advertising) only had an effect in your Test regions, or that your results would be the same if you scaled the campaigns nationally. There have been advances in geo-testing in recent years, but they’re advances in modeling: for example Meta’s GeoLift library’s ability to create an “artificial lookalike” of regions that have incomparable demographics, like in their example of the State of California.

These are valid techniques, and they are the closest thing you can get to a real randomized controlled test for some channels. Causal Inference techniques such as those used by geo-testing even won the Nobel Prize in recent years: they’re a powerful tool that is being applied by scientists across many disciplines, not just marketing. However, it’s important to distinguish this from individual-level split-testing: we’re estimating the likelihood that one thing (advertising) caused another (revenue), not actually proving it without a doubt. These techniques certainly sit higher in the hierarchy of evidence than most others, but they’re not foolproof.

In the following sections we’ll explore a few of these limitations.

Cross-Group Contamination

People don’t stay in one place. At any given time a significant number of the people who see your advertising are on vacation, traveling for work, visiting family, or away from home for any number of other reasons. In some geographies this is endemic. For example, say you were running a job promotion campaign in New Jersey, where many thousands of people commute across the bridges and tunnels to New York City. If the State of New York was in your Control group and New Jersey was in your Test group, you’d have a real problem. Someone driving past your billboard in New Jersey might search for your brand when they get into work in NYC. Instead the campaign should be run in the entire New York metropolitan area, which would decrease the effect of movement across borders, but it will never be perfect.

In this digital age can we really guarantee that our advertising messages will be seen only by the people in the region we target? If I live in England and I tell my friend in New York about a new brand of shoes I like, we’ve cross-contaminated any potential Geo Experiment that was running between the two countries. Quite often ad campaigns go viral, as they are usually posted to video platforms like YouTube and TikTok, and shared organically from there. That’s great for the performance of your campaign, but muddies the water for proving the incrementality of your performance.
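A back-of-the-envelope sketch shows why contamination matters. It assumes (our simplification, not a measured result) that the test and control groups are the same size and that ad exposure leaking into control regions is just as effective as exposure inside the test regions. Under those assumptions, every bit of leaked exposure both lowers the lift in the test group and raises sales in the control group, so the measured difference shrinks twice as fast as the leak grows.

```python
def measured_lift(true_lift, spillover_share):
    """Lift observed when a share of ad exposure leaks into control regions.

    Assumes equally sized groups and that leaked exposure is as effective
    as exposure inside the test regions (both are simplifying assumptions).
    """
    test_gain = true_lift * (1 - spillover_share)
    control_gain = true_lift * spillover_share
    return test_gain - control_gain

# Hypothetical true lift of 100 incremental sales.
for share in (0.0, 0.1, 0.25):
    print(f"spillover {share:.0%}: measured lift {measured_lift(100, share):.0f}")
```

In this toy model a 25% leak halves the measured effect, which is why border-straddling regions like the New York metropolitan area are usually kept together in one group.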

Targeting & Placement

In many channels, the performance profile of your campaigns will be different than it would be at the national level. For example, when you buy TV placements regionally you’re buying different networks than you would nationally! If you’re buying an ad that airs on the local news affiliate in 25 of the 50 states, you’re testing how well TV works on those local channels. When you scale the campaign up nationally and advertise on CNN, you are bound to get very different results. To muddy the water further, the distribution of people who watch local TV is non-homogeneous: in some Geo Regions like New York City, more people might watch national TV channels, or stream on demand, rather than watch local channels live.

Even for the channels that support it, geo-targeting capability and restrictions vary across countries. Some channels support State / Province level only, others allow you to drill down to the DMA or even Zip / Postal code. Some countries like Japan don’t have street addresses like you may be used to if you live in North America or Europe! The country of France only has 12 provinces or regions (and 1 territorial collectivity, Corsica), and therefore any geo-test is going to be extremely limited in terms of reaching a high enough statistical power. 

Even when you do geo splits within a channel, it’s still not a clean test: you’re typically inflating the cost of your ads compared to running a broad audience for the whole country! This is because Facebook, Instagram, TikTok, or whatever channel you’re advertising on, now has less data to work with for each individual campaign. Their algorithms don’t have the freedom to hoover up a pocket of cheap to reach people in one location; they have to stick within imagined State Borders. Before, they could move the budget freely to match whatever opportunity came up; now they’re forced to throw good money after bad simply because that region is part of the test.

Structural Limitations

While geo-testing is reasonably well supported across most major ad channels, it’s often structurally impossible in the most important organic channels. For example you can’t delist your website from Google just in California, so it’s impossible to get a sense of how many incremental sales your investment in SEO for Google drives. You could of course delist your entire website, like Groupon famously did in a bold experiment, but good luck getting that past your manager. Even then it can’t give you much granularity, for example you couldn’t easily split out your blog from your product pages to test whether your investment in building links for blog posts has paid off (you have no guarantee Google will respect your robots.txt).

Another example of a structural challenge is the Podcasting space. Of course when people download podcasts, you could track their IP address, but there’s no way to ‘switch off’ the download button in Spotify for specific regions. You also have no control over when or where people listen to them once they’re downloaded on their phone. Many podcasts also syndicate over to YouTube, making analysis even harder, and to make matters worse fans often take clips of popular podcasts and post them on social media channels like TikTok. There’s just no way to untangle that mess.

Nobody controls the 24 hour news cycle, not even the news reporters themselves. If something important is happening, all relevant news outlets must report it, or risk losing the audience. So once your news story is out in the world, you have essentially no control over it. Sure you could pitch just to local news outlets, but they’re often syndicates of national chains, or act as sources for national reporters with wider distribution. As well as inconsistency across geos, PR also suffers from inconsistency in campaign results: a great story pitched on the wrong day might fade away quickly, whereas a weak story on a slow news day might be widely reported. Nobody knows what kind of day it’s going to be until it happens: that’s why it’s called ‘news’!

Saturation

One common but often under-appreciated aspect of running geo-tests is that it’s very easy to accidentally over-saturate the market as part of the test. Here’s an example that we’ve seen before:

  • You run advertising on the radio nationally
  • You want to run a geo-level test without turning off your national campaign
  • You spend a ton of money into certain geographies to measure lift from that additional radio
  • Within those geographies, users are hearing your ads constantly: you’re on nationally syndicated shows as well as local shows. You’re saturating the airwaves
  • Your test analysis tells you that the additional radio spend in those geographies wasn’t very effective.

The issue here should be obvious: with the additional spend you put into each of the geographies, you over-saturated the market. Ideally, when you run a geo-test you’re delivering the ads with the same reach and frequency as you would nationally. Unfortunately this is hard to achieve, and over-spending on a per-audience-member basis is easy, especially if your power analysis (see below) doesn’t take the problem of saturation into account.

What we’ve found is that for some channels it’s impossible to design a geo-test that has sufficient power. This is because the channel’s saturation point is low enough that, even if the channel is very effective at current levels of spend, that level of spend (per geo) isn’t sufficient to drive enough signal to consistently avoid false negatives. Increasing spend, however, would cause the channel to saturate, so there’s no way for the advertiser to get enough signal from that channel.
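The radio example can be illustrated with a toy diminishing-returns curve. The curve shape (a simple saturating function) and every number below are hypothetical assumptions chosen for illustration, not estimates from any real channel.

```python
def response(spend, max_sales=1000.0, half_saturation=50.0):
    """Hypothetical diminishing-returns curve: sales driven by ad spend."""
    return max_sales * spend / (spend + half_saturation)

national_level = 40.0   # assumed spend per geo at the usual national rate
test_top_up = 120.0     # assumed extra spend per geo added for the test

baseline_sales = response(national_level)
test_sales = response(national_level + test_top_up)

# Return on the spend you already run vs. return on the extra test spend.
avg_return_baseline = baseline_sales / national_level
marginal_return_test = (test_sales - baseline_sales) / test_top_up

print(f"Sales per dollar at national level: {avg_return_baseline:.1f}")
print(f"Sales per extra dollar in the test:  {marginal_return_test:.1f}")
```

In this sketch the extra test spend looks far less effective than the existing spend, even though the channel is working well at its normal level. The test is measuring the flat part of the curve.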

Delayed Effects

Your experiment has been running for several weeks now, and you can’t wait for it to finish so you can declare a winner and move onto the next test. Not so fast. Marketing campaigns don’t exert their entire effect on day one. Some channels like TV or online video still have an incremental impact days or weeks after the campaign finished running, called the ‘adstock’. If you stop your test too early, and don’t include a ‘cool down’ period in your analysis, you risk throwing out this causal effect, seriously undervaluing the return on investment you get from upper funnel channels.
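One common way to model this carry-over is a geometric adstock, where a fixed fraction of each week’s advertising effect persists into the next week. The sketch below uses a hypothetical decay rate of 0.5 and made-up spend figures; real decay rates vary by channel and have to be estimated.

```python
def geometric_adstock(spend, decay):
    """Carry over a fraction `decay` of last period's effect into this one."""
    carried = 0.0
    out = []
    for s in spend:
        carried = s + decay * carried
        out.append(carried)
    return out

# 4 weeks of campaign followed by 4 weeks with no spend (hypothetical).
weekly_spend = [100, 100, 100, 100, 0, 0, 0, 0]
effect = geometric_adstock(weekly_spend, decay=0.5)

during = sum(effect[:4])   # effect felt while the campaign was live
after = sum(effect[4:])    # effect felt during the 'cool down' period
share_after = after / (during + after)

print([round(e, 1) for e in effect])
print(f"Share of total effect arriving after the campaign ended: {share_after:.0%}")
```

With these assumed numbers, more than a fifth of the total effect lands after the campaign stops, which is exactly the portion you would miss by ending the analysis on the last day of spend.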

This of course means tests take even longer than the typical 4 to 8 week pre test + 3 to 5 week test periods, limiting their application to only long term, strategically important decisions. It’s not just the delayed effect of the test you have to account for, but the value of information over time. If an experiment takes 3 months to run, it likely won’t be practical to run every quarter, especially when you have other experiments you want to try. That means you could be trading on old information for several years, before you get a chance to test again. With the rate of change and innovation in modern marketing, what might have been true only a few months ago might already be out of date by the time you get your test results.

Outliers & Missing Variables

One particular problem that always comes up with geo-testing is when a shock occurs – an unexpected spike or dip – either for a known or unknown reason. For example if there’s adverse weather in one part of the country, that can affect the control and treatment groups differently, which can bias the results of the test. Of course the longer you need to run your geo-test, the more likely you are to encounter this type of outlier event. Sometimes there’s a spike in the data that we can’t explain, because we rarely have all of the information we’d like about what’s driving sales. For example many national retailers don’t even know which of their stores are running promotions at any given time, and that sort of difference in geographies could distort the results.

In the event of an outlier it’s important to investigate what could have caused it, because if you know the variable you’re missing you could account for it in the analysis, or make a judgment call on its relative importance. Where root causes can’t be established for whatever reason, you can account for outliers through a ‘leave-one-out’ analysis: repeat the analysis as many times as there are geo regions, each time dropping one geo from the analysis, then look at a histogram of the estimates to see which geos have a clear effect on the analysis.
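A leave-one-out analysis is straightforward to sketch. The example below uses made-up lift figures per geo and a simple difference in group means as the estimator (an assumption for brevity; in practice you would re-run whatever model you use for the main analysis). It prints the estimates instead of plotting a histogram.

```python
from statistics import mean

# Hypothetical change in sales per geo during the test (in %).
test_geos = {"A": 12.0, "B": 9.0, "C": 11.0, "D": 48.0}   # D looks suspicious
control_geos = {"E": 2.0, "F": 1.0, "G": 3.0, "H": 2.0}

def estimate(test, control):
    """Simple estimator: difference between the group means."""
    return mean(test.values()) - mean(control.values())

def leave_one_out(test, control):
    """Re-estimate the effect once per geo, dropping that geo each time."""
    results = {}
    for geo in list(test) + list(control):
        t = {g: v for g, v in test.items() if g != geo}
        c = {g: v for g, v in control.items() if g != geo}
        results[geo] = estimate(t, c)
    return results

full = estimate(test_geos, control_geos)
loo = leave_one_out(test_geos, control_geos)
most_influential = max(loo, key=lambda g: abs(loo[g] - full))

print(f"Estimate with all geos: {full:.1f}")
for geo, value in loo.items():
    print(f"  without {geo}: {value:.1f}")
print(f"Most influential geo: {most_influential}")
```

Here dropping geo D roughly halves the estimated effect, a sign that one region is driving the result and deserves a closer look before you trust the headline number.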

Location Data

How do you split users by their location in the first place? Primarily the IP address is used, but it turns out this isn’t that reliable an indicator of what region to bucket someone in. IP-based geolocation is 95 percent to 99 percent accurate at identifying a user’s country. However, it is only 55 percent to 80 percent accurate for a user’s region or state, and 50 percent to 75 percent accurate for a user’s city. In practice, accuracy varies from provider to provider and depends on the location of the device. 

There’s a growing consumer backlash against ‘surveillance capitalism’: the idea that companies are invading your privacy and selling your data to the highest bidder. It is estimated that 31% of internet users have used a VPN, which hides your true IP address. Governments are starting to legislate against tracking users’ locations: for example, the EU considers IP addresses to be personal data, protected under the GDPR. In addition, Apple’s privacy moves haven’t been limited to the fallout from iOS14: for example, Private Relay masks your IP address and the email app blocks tracking pixels from loading (which is how email service providers track location).

Things to Consider When Designing a Geo-test

In the previous section we discussed some common issues that people run into when doing geo-tests. In this section we’ll cover some important strategies to use when designing your own geo-test.

Matched Markets

You might think it’s as simple as randomly assigning geo regions to test and control groups, and measuring the differences, and that is a common approach, but there’s a problem. Different geographic regions have different properties, and if too many of one type of region randomly fall into the same group, you could be biasing your results. For example New York, Texas, and California are amongst the largest and wealthiest states – they have economies big enough to be their own countries – so if all three end up in your Test group, the performance of the group will be heavily affected by what goes on there. As a result splitting by geo region tends to have more variance and bias than splitting by traffic, as demonstrated by Wayfair below:

https://www.aboutwayfair.com/tech-innovation/using-geographic-splitting-optimization-techniques-to-measure-marketing-performance

Thankfully statistical approaches like Stratified Sampling and Market Matching have been developed that can automatically select more valid matches in terms of regions based on your data, so for example similar sized states are split to be more equal across groups and there are no differences in performance pre-test. Papers out of the University of Texas and Journal of Economic Literature, among others, have advanced our ability to select synthetic controls. In addition Google and Meta’s open-source libraries have made doing a proper statistical analysis more accessible to the average marketing team. However this requires knowing random assignment is a trap in the first place, and having someone technical on your team who can run and understand the results of these libraries.
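To give a flavor of the idea, here is a minimal sketch of one simple matching scheme: rank regions by pre-test sales, pair up neighbors in the ranking, and randomly send one of each pair to the test group. This pairing approach is our illustrative assumption and is much cruder than the synthetic control methods in the open-source libraries; the sales figures are made up.

```python
import random

# Hypothetical average pre-test weekly sales per state (made-up numbers).
pre_test_sales = {
    "CA": 900, "TX": 850, "NY": 800, "FL": 600,
    "OH": 300, "GA": 280, "WA": 200, "CO": 190,
}

def paired_assignment(sales, seed=0):
    """Pair similar-sized geos, then randomize within each pair."""
    rng = random.Random(seed)
    ranked = sorted(sales, key=sales.get, reverse=True)
    test, control = [], []
    # Step through the ranking two at a time; with an odd number of geos
    # the smallest one is left unassigned.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        test.append(pair[0])
        control.append(pair[1])
    return test, control

test, control = paired_assignment(pre_test_sales)
test_total = sum(pre_test_sales[g] for g in test)
control_total = sum(pre_test_sales[g] for g in control)

print("Test:   ", test, test_total)
print("Control:", control, control_total)
```

Because the two largest states are always split across groups, no single random draw can put California, Texas, and New York on the same side of the experiment.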

Power Analysis

As with any experiment, it is essential that a geo experiment is designed to have a high probability of being successful (that is, of yielding statistically significant results). Before we run the test, we need to estimate the power of the experiment: the probability of detecting an effect of a certain magnitude if there truly is one. Of course, we can’t know what the effect of our advertising is before we run the test; that’s why we’re running it! So instead the question we tend to ask is something more like “assuming there’s a 20% incremental impact from ads, how long should I run this experiment to prove that?”. 

The power of an experiment is a tradeoff between effect size, number of geos, and run time. More data lets you fine tune your experiment to detect smaller effects. Some countries like the U.S. have plenty of well defined geo regions, for example 50 states, 210 DMAs, and 41,683 zip codes. However other countries don’t have clearly defined geo regions, or even postal codes, so a cluster analysis must be done to synthesize these regions: potentially introducing another source of bias. 

An example of using a clustering algorithm to generate unofficial geo regions in France, a country with a limited number of official regions. Source.

For a given country, with a fixed number of geo regions, at some point the effect size will be too small to be worth running the experiment, because the time taken or ad spend needed are impractical, and you have a backlog of important things to test. In these cases you can run a shorter test to look for a larger effect size, then move on if you don’t find it, risking a false negative (recording a negative result, when in truth there was a positive effect that was too small to detect with your test).

The expected precision of our test can be computed by simulating possible experimental outcomes. We also check that the false positive rate (i.e. the probability of obtaining a statistically significant result if the true ROI is zero) is within an acceptable range (5% – 10%). This power analysis is absolutely essential in the design phase because it tells us how much we need to spend on ads and over what time period in order to have a valid test. In practice though there are budget limitations to consider, the effect of diminishing returns (higher spend hurts efficiency as all the low hanging fruit is taken), as well as structural reasons we might not be able to spend more money to reach significance.
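The simulation approach can be sketched with nothing more than the Python standard library. Everything here is a simplifying assumption for illustration: geo-level outcomes are drawn from a normal distribution, the noise level and effect sizes are made up, and significance is judged with a two-sample t-style statistic against a fixed critical value of 2.0 instead of an exact test.

```python
import random
from statistics import mean, stdev

def simulate_power(effect, geos_per_group=20, noise_sd=10.0,
                   n_sims=1000, critical_value=2.0, seed=42):
    """Share of simulated experiments in which a lift is detected."""
    rng = random.Random(seed)
    detections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, noise_sd) for _ in range(geos_per_group)]
        test = [rng.gauss(effect, noise_sd) for _ in range(geos_per_group)]
        std_err = (stdev(test) ** 2 / geos_per_group
                   + stdev(control) ** 2 / geos_per_group) ** 0.5
        t_stat = (mean(test) - mean(control)) / std_err
        if abs(t_stat) > critical_value:
            detections += 1
    return detections / n_sims

# With zero true effect, the detection rate is the false positive rate.
false_positive_rate = simulate_power(effect=0.0)
power_small = simulate_power(effect=5.0)
power_large = simulate_power(effect=10.0)

print(f"False positive rate: {false_positive_rate:.1%}")
print(f"Power for a small effect: {power_small:.1%}")
print(f"Power for a large effect: {power_large:.1%}")
```

Re-running the function with different values of `geos_per_group`, `noise_sd`, and `effect` shows the tradeoff directly: fewer geos or a smaller effect means more simulated experiments miss a lift that is really there.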

Conclusion

You may reach this point in a state of despair: if I can’t trust geo-testing, what can I trust? The truth is that no attribution method is perfect, and geo-testing is actually one of the better ones! Often, it’s as close to a true randomized controlled trial as you can get, and in many channels it’s hard to do any better than a well-run geo-lift test. The takeaway from this post is not to abandon geo-testing; it’s that all methods are flawed, and you shouldn’t believe anyone who tells you otherwise. You should use geo-testing for what it’s good for, but don’t bet your whole career that it’ll finally reveal the “truth”.

The reality of the post-iOS14 world is that you can’t rely on just one attribution method. Instead you need to do multiple things – geo-testing, MMM, surveys, and yes, even last click cookie-based tracking – in order to triangulate the truth, and work your way up the hierarchy of evidence. If we’re being honest, that’s always been the case: we marketers just got lazy when digital tracking came about, and we turned a blind eye to the fact that just because someone saw or clicked on an ad before buying, that doesn’t mean it was the ad that caused them to buy.

When you use multiple attribution methods they will no doubt disagree, and that’s a good thing! The differences between them present an opportunity to investigate further and learn something about your users. As you do the hard work to figure out attribution, you’ll come out the other side with an unfair advantage your competitors can’t easily replicate. These methods are complicated and confusing, but smart vendors, open source libraries, and blog posts from experts in the field are making things easier and more accessible every day.

About The Author