No, you can’t accurately attribute, nor forecast, revenue to experimentation. Here’s why.
TL;DR: I don’t believe you should attribute revenue to individual experiments, but I accept that we have to; it’s fuzzy. I don’t believe you can accurately forecast that revenue attribution; it gets even fuzzier. And when you collate experiments together to determine overall revenue impact? I certainly don’t believe you can attribute that type of revenue; that’s super fuzzy. Experiment results tell you which way you’re headed, not how far you’re going.
This is part of a 4-article series on whether, and how, you can attribute revenue to experimentation.
I’m expecting debate. Please comment. Aggressively. This is something that, I admit, has been burning a hole in my pocket for many years. Trying to convince stakeholders that revenue attribution is not the main purpose of experimentation (especially ecommerce clients, where the main focus is, indeed, trade) has proven difficult for me personally.
In 2019, I took it upon myself to interview 30 or so of the best minds in the business. Why can we not attribute revenue to experimentation accurately? What does accurately mean? Can we forecast it? What do you do when stakeholders ask? How do we survive as individuals and collectives?
The response was clear: you cannot attribute revenue to AB testing accurately, but we need to.
Fast-forward two years and it would appear it’s less clear-cut. (Poll is here)
I truly believe that it is difficult to attribute revenue to individual experiments. And that it shouldn’t be done.
Adding to that, forecasting that revenue is less possible, with less accuracy.
Adding to that, collating experiments to understand overall revenue impact is nigh impossible, with even less accuracy than the already-reduced accuracy that came before it.
Perhaps the problem is the word accuracy, and determining what degree of confidence is acceptable? From what I’ve seen in the industry, most experiment attribution falls on the wildly inaccurate rather than the moderately inaccurate side. I’m talking about taking a winning experiment, doubling its impact (given it’s a 50:50 split), multiplying it by 12 for each month of the year, and cumulatively adding it to other winning experiments’ levels of inaccuracy.
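To make that “wildly inaccurate” pattern concrete, here it is sketched in code, with made-up numbers throughout. This is the calculation to be wary of, not a method anyone should endorse:

```python
# A sketch of the naive annualisation pattern described above, with
# invented figures. Each step compounds the original estimate's error.

def naive_annualised_uplift(monthly_uplift_in_test, split=0.5, months=12):
    """Scale a variant's observed monthly uplift to 100% of traffic,
    then project it across a year -- the 'double it, times 12' pattern."""
    return monthly_uplift_in_test / split * months

# Three hypothetical 'winning' experiments, each with a one-month
# revenue uplift measured on 50% of traffic:
wins = [12_000, 8_500, 20_000]

projected = sum(naive_annualised_uplift(w) for w in wins)
print(projected)  # 972000.0 -- each experiment's error is doubled,
                  # multiplied by 12, then stacked on the others'
```

A £40,500 measurement (itself uncertain) becomes a £972,000 claim; any inaccuracy in the original reads is multiplied 24-fold before the experiments’ errors are even summed.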
For me, experimentation done poorly is a very expensive way to slow down and overcomplicate bad decisions. Add misattribution into that dangerous mixing pot, and you have an unethical delusion of “we’re doing things right”. Maybe that’s why my paranoia subverts the need to attribute?
In my previous article I wrote about why we assume experimentation is all about financial gain, and the history of where that stigma comes from.
In this article, I’m going to talk about why experimentation isn’t about financial gain; literally. As in:
- why you shouldn’t attribute revenue to individual experiments
- why you cannot accurately forecast that revenue attribution
- and why you cannot collectively attribute from multiple experiments
You shouldn’t, really, attribute revenue to individual experiments
…but we have to. So we have to deal with it. Businesses are run on performance metrics and experimentation is a method of validating performance. Here are some reasons why we shouldn’t attribute revenue to experiments, and why the accuracy of said attribution can throw you under a bus.
# It takes longer, so ask what is the lost opportunity cost
The purpose of experimentation is to prove or disprove a hypothesis. It’s designed to validate. It is not designed to measure the difference of said validation.
“Most of the time, we run an experiment to see if there’s a difference in performance between the variations. Simply put, ‘Is B better than A?’ But experiments aren’t typically designed to show ‘By how much is B better than A?’” Stephen Pavlovich
We’re not looking at attributing revenue uplift because, typically, we’re testing on a binomial metric. Did X improve Y: yes or no. But when we look at non-binomial metrics on a continuous scale, or a range, we introduce another complication altogether.
Revenue, the performance difference, is a non-binomial figure. And thus there are complications.
A single conversion could be £1,000 or £500 or £50,000 (and herein we could debate whether we should remove outliers, reducing our testable sample). Hubert Wassner, Chief Data Scientist at AB Tasty, calls this concept “a statistical nightmare”. You can never be really sure of how much you’re gaining (non-binomial); the only thing you can assess is that you are gaining something (binomial). That something, Wassner continues, is associated with probability.
“When we’re talking about a hypothesis, this is simply a “yes or a no” answer. The user either successfully converted, or they didn’t. When we look at revenue, we can assess that you are gaining or losing something, but the size is near impossible to assess. From a statistical solution, we can assess that there is a difference in the size, but the size of the difference is too difficult to assess because it will be very variable.” Hubert Wassner
That being said, could you report against a non-binomial metric to varying degrees of accuracy? Yes. Take a pharmaceutical trial of, say, a drug that looks to increase lifespan. That lifespan is a range, and whilst we can state whether the drug impacted lifespan, the question of “by how much” is required as part of the hypothesis in and of itself. In other words, whilst the hypothesis is binomial (yes or no), the measurement of the hypothesis can be non-binomial (a continuous scale) and is achievable to degrees of accuracy.
The problem? It takes much longer. Your primary metric for determining the outcome of the hypothesis is on a continuous scale, meaning you need far more data to be certain of the exact outcome.
“To have more confidence in the size of the uplift (not just the fact that it exists), you’d have to run the experiment for far longer. For example, you might continue to serve a losing variation (A) to 50% of your traffic so you can have greater confidence in the value of the winning variation (B). That comes at the expense of your customers and business.” Stephen Pavlovich
Given our experiments are required to reach statistical significance before we can be confident in a validated outcome, treating revenue as the statistically significant metric would take significantly longer than the experiment’s planned duration to prove and accurately attribute.
The largest contributor to the “delay”, when using revenue as the significant measure of test outcome, is that it is much harder to gather sufficient testable data, simply because the vast majority of outcomes (at a session level) will be £0 revenue. Admittedly, we also get “null” outcomes when looking at conversion rate, in other words outcomes generated by users who don’t convert.
In summary: there is more variability in non-binomial metrics (e.g. revenue) than in binomial metrics, so experiments require a larger sample size and take longer, and that lost opportunity cost should certainly be a consideration. At which point accuracy becomes skewed by the added complexity.
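The gap in noise between a yes/no conversion metric and per-session revenue can be sketched with a quick simulation. All the numbers here, the 3% conversion rate and the lognormal order values, are invented for illustration:

```python
import random
import statistics

random.seed(42)

# Invented traffic: ~3% of sessions convert, and order values follow a
# heavy-tailed lognormal, so most sessions contribute £0 revenue and a
# lucky few contribute a lot.
def session():
    converted = random.random() < 0.03
    revenue = random.lognormvariate(4.0, 1.0) if converted else 0.0
    return converted, revenue

sessions = [session() for _ in range(100_000)]
conv = [1.0 if c else 0.0 for c, _ in sessions]
rev = [r for _, r in sessions]

# Relative noise per observation: standard deviation divided by mean.
cv_conversion = statistics.pstdev(conv) / statistics.mean(conv)
cv_revenue = statistics.pstdev(rev) / statistics.mean(rev)

# Required sample size scales roughly with the square of this ratio,
# i.e. how many times longer a revenue-powered test must run than a
# conversion-powered one on the same traffic.
print(round((cv_revenue / cv_conversion) ** 2, 1))
```

On this simulated traffic the multiplier comes out at roughly two to three, and it grows as the order-value distribution gets heavier-tailed; the exact figure depends entirely on the invented parameters above, which is rather the point.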
# The purpose of experimentation is not about revenue gain
At least, not all experiments, certainly. Surely not every decision or improvement we make is designed to make a direct positive impact on revenue? We’ve worked with retailers who needed to appease brands with merchandising or aesthetic changes. We’ve created experiments to impact post-purchase customer satisfaction. We’ve run tests to prove the CEO’s change in navigation right (or wrong); not a tactic I endorse or agree with, by the way.
“The second you start measuring every experiment on just revenue, you lose the ability to think strategically. You have to balance two things: solving your customers’ most urgent pain points and proving your impact to the business. If you let one of these two overwhelm the other, you’re headed for a dead end.” Hazjier Pourkhalkhali
When Bravissimo approached us in 2018, the purpose was to reduce returns: a behemoth metric that took years to impact. It was moved by making customers feel better and more confident in their decisions, to ensure the fit was right (who knew so much went into choosing a bra?!). These were qualitative metrics that did have quantitative values placed on them, but cumulatively they told us we were positively impacting the customer experience. The attribution was less important than the knowledge that we were moving in a positive direction.
Forecasting experiment attribution is even harder
# Experiment results are based on a series of averages.
“Bottom line? What you’re doing is shifting an average. On average, an average customer, exposed to an average AB test, will perform averagely. There’s nothing earth-shattering about that.” Craig Sullivan
It’s very common to hear “conversion is down YOY” or “AOV is down YOY”, but those top-line metrics are a series of averages. Hell, the acronym AOV literally stands for average order value. It’s one metric compiled from hundreds of underlying values.
A common scenario, for example, is that the variation of an experiment has, luckily, received a few £1,000 orders that hugely inflate the revenue uplift and thus AOV. The same is true of conversion rate, where a cohort of users, e.g. a broad-stroke PPC campaign or users landing on a product page from PPC, might hugely skew the average conversion rate.
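A quick illustration of how few outliers that takes, using invented order values:

```python
import statistics

# Hypothetical variant arm: 200 typical orders around £45, plus three
# lucky £1,000 orders that landed in this arm purely by chance.
typical_orders = [45.0] * 200
lucky_orders = [1000.0, 1000.0, 1000.0]

aov_without = statistics.mean(typical_orders)
aov_with = statistics.mean(typical_orders + lucky_orders)

print(aov_without)          # 45.0
print(round(aov_with, 2))   # 59.11
```

Three orders out of 203 move the “average” order value by roughly 31%, easily enough to flip a test from flat to apparent winner.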
Both metrics are non-normal distributions. They change over time based on the composition of users, their behaviour and the underlying metrics. For example, running the same experiment in something as arbitrary as December versus June will result in two very different average results.
“You can’t add the uplifts of winning experiments because you don’t know how they will interact or how your site and users will change. If you want to prove your impact, be conservative. You’re better off reporting a lower number everyone will defend than a higher number only a few believe.” Hazjier Pourkhalkhali
We need to ask questions like: who’s inside the sample, who was exposed to the test, what kind of people are they, how much have they spent with you, how long have they been with you, what cohort or segment are they in? Ben Labay talks about “lowering the denominator”, i.e. if you isolate the analysis to the audience and experience, the analysis and measurement is simplified. Of course, this is true of all experiments, not just non-standard distributions (like AOV), and it’s a reason why I’m a big fan of testing cohorts rather than averages.
Ultimately, some percentage of the segment within the experiment will respond strongly and some will respond poorly. If you don’t know the composition and response of those groups, then you’re reporting on averages, which lacks accuracy.
“Given an experiment we’ve run, how can we make a prediction for future behaviour? You can never do this based on average response on the site. As soon as you start plotting it out over a month, three months, twelve months, either you’ll hopelessly underestimate or hopelessly overestimate. Why? Because averages fundamentally lie.” Craig Sullivan
# Experiment results are a snapshot of time and behaviour
An experiment is a snapshot of time; the results of which are of a specific sample, who reacted in a specific way, providing a specific result.
“We are using the retrospective to forecast the future. There is a limit to how much this discipline can tell you.” Tim Stewart
Forecasting doesn’t take into account all the unforeseen external variables that could, and often do, happen. Regardless of the degree of accuracy, forecasting is still, at best, an educated guess, because it largely involves understanding the past to predict the future. And we’re not Marty McFly here, betting on sports games.
“We can never value the forecasted and expected return at 100% because of the conditional factors that are unexpected, and ultimately we’re talking about attribution here. When we are forecasting business revenue, there are a f*ck-tonne more variables to consider.” Tim Stewart
A new feature could have the “shiny” effect, meaning usage is novelty and novelty wears off. A change in layout could have the “shock” effect, meaning users respond negatively to it because of uncertainty. Conversely, as users become conditioned to a way of behaving (a burger menu as your navigation, or filters at the top of a page), they respond positively over time because it’s what they know. Amazon is a prime (pun intended) example of this, having dictated the way a PDP looks for ecommerce despite not making any significant changes in years.
One consideration is that you can estimate the impact of an experiment within the testing period, but that assumes there are no external factors that would influence it after the experiment concludes, of which there always are. For example, if the company decides to double the price of its products, the experiment will, naturally, not have the same uplift. Think about the things you have control over changing (usually internal, like brand) versus the things you don’t have control over (usually external, like new competitors).
Collective experiment attribution is nigh impossible
Then there’s the issue of, not just forecasting revenue uplift from a single experiment, but forecasting it from multiple experiments.
What happens if we create experiment A, designed to increase add-to-carts, and experiment B, designed to increase product views? While both experiments might be successful in their own right, the approach that increases product views might negatively impact the number of add-to-carts because of the intent of the users now coming through to the PDP. You can get more users through to the product, but it might not be the right product for them, for example.
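A worked example of that interaction, with hypothetical funnel numbers:

```python
# Hypothetical funnel: why two "winning" experiments don't simply sum.
baseline_views = 10_000
baseline_view_to_cart = 0.10                 # 1,000 add-to-carts today

# Experiment B "wins" its own metric: +20% product views...
views_after_b = baseline_views * 1.20
# ...but the extra visitors are lower-intent, dragging the
# view-to-cart rate down (an assumed interaction effect).
view_to_cart_after_b = 0.088

carts_after_b = views_after_b * view_to_cart_after_b
print(round(carts_after_b))  # 1056

# Naively "adding the uplifts" would claim +20% add-to-carts (1,200);
# with the interaction, the funnel actually delivers +5.6% (1,056).
```

Both experiments report green in isolation, yet the collated revenue claim overstates the combined effect nearly fourfold.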
“I’d be hesitant on grouping together multiple experiments and grouping the impact. You are grouping fuzzy results, which adds a lot of… fuzz. It tends to be how a lot of people look at it, and it’s misleading.” Stephen Pavlovich
The more factors you change within your product roadmap — tested or not — or the more noise in the metrics you are using, the greater that degree of fuzziness. I can experiment on the checkout all I like, but if we’re adding a new payment provider because our marketing department gets a kickback from Mastercard, surely this will affect the impact of the experiment, retrospective or predictive.
One option to combat this is to have a hold-back set of, say, 10%. In other words, 10% of users always see the “original”, or control. I question the purpose behind this — is it just navel-gazing? Proving a further degree of accuracy that “we were right to the degree of x%” rather than just “we’re moving in the right direction”.
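For what it’s worth, a minimal sketch of how such a hold-back is commonly bucketed: deterministic hashing so a user always lands in the same group. The 10% share and user-id scheme are placeholders, not a prescription:

```python
import hashlib

def bucket(user_id: str, holdback_share: float = 0.10) -> str:
    """Return 'control' for the hold-back share of users, else 'rollout'.
    Hashing the id means the same user always sees the same experience."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    position = int(digest[:8], 16) / 0xFFFFFFFF  # deterministic value in [0, 1]
    return "control" if position < holdback_share else "rollout"

users = [f"user-{i}" for i in range(10_000)]
held_back = sum(1 for u in users if bucket(u) == "control")
print(held_back)  # roughly 1,000 of 10,000 users stay on the original
```

The mechanics are trivial; the real question is whether permanently withholding the winning experience from 10% of customers is worth the extra decimal place of confidence.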
Bonus: Which leads me to the discussion of the term ‘accurate’…
When I posted the poll for others to see, the majority of comments were along these lines: questioning the term “accuracy”.
It’s my view that if you can’t attribute accurately, then you shouldn’t attribute at all. If we attribute a single experiment wrongly, and then forecast from that, we magnify the issue.
“As experiments are typically designed to show that a difference exists (not the size of the difference), it can be challenging to forecast from a single experiment. When you scale that across an entire experimentation programme, it magnifies that and can be misleading.” Stephen Pavlovich
There needs to be some give and take. We’re managing the risk of making the wrong decision, so we should consider the impact of expected loss versus expected gain. Perhaps even more so.
We’re managing levels of effort, too. How much effort is required to increase your degree of accuracy? Is this just navel-gazing, trying to get as near to 100% as possible? What about the lost opportunity time? Would you rather run more experiments, or be more accurate in whatever forecasting you are able to achieve? I’ll address this in my next article, “Yes. We need to attribute revenue to experimentation. Here’s how”.
“Most of the time, being accurate about the direction of the decision is more important than being precise about the exact nature of the outcome.” Matt Lacey
Am I being a bit over-precise with my language? There’s definitely some semantics in here, right? But I think there’s your answer: if revenue is an indicator of performance, how accurate does your business need you to be with your experimentation attribution, and does it require forecasting?
Revenue attribution from individual experiments is difficult and fuzzy and, in an ideal world, shouldn’t be done.
Revenue attribution from collective experiments is even fuzzier and nigh impossible to understand the impact of.
Forecasting attribution mostly lacks a reasonable degree of accuracy and, like any attribution, relies on the past to predict the future.
Why can you not accurately attribute nor forecast revenue to experimentation?
- Metrics and stimulus change over time
- Experiments are designed as a snapshot in time
We need to:
- Appreciate and agree on what levels of accuracy are needed for revenue attribution within your business
- To that end, appreciate that if we need a high degree of confidence, as revenue is a distribution, experiments will take longer to run, perhaps reducing your cadence
- And thus, what is the lost opportunity cost of doing so
- Not focus on revenue attribution as much as “what the purpose of experimentation is for your business”. Different maturities will yield different approaches to revenue attribution
I’m going to save “what to do” for next Friday’s article. How I, and others, attribute revenue to experiments. Perhaps, more importantly, how to communicate that to the rest of the business.
Be sure to subscribe to my newsletter at https://optimisation.substack.com/ where, every Friday at 7am, I’ll release a new series of thoughts and advice based on my time within CRO.
Leave a comment. Create debate. Have a drink. Disagree.