How do you decide whether an idea should be tested or should just go live without testing?

TL;DR: I don’t believe you should test everything; instead, ideas should go through a swim-lane of prioritisation. Testing is a (brilliant) methodology for decision making. Factors that affect that decision include purpose, traffic, type of test and time.

Reminder: There’s no such thing as right or wrong, just contextualised experiences. I’ll be sure to give my opinion, with others, on questions people have asked at my Slido. Please comment, like, share, debate, drink when you’re done. I’d love us to learn more and share our experiences together to help our CRO community. Particularly, moving away from revenue-only, sticky-filter-changing methodologies who call our beautiful discipline “CROW”.

Twice, I was invited to talk to the product team at ASOS. The first question I got from both sessions was “how do you decide what to test vs what to just do?”

Next to “which AB testing platform is best”, I’d say this is the number one question I get asked in general, and probably why it was upvoted so highly on my Slido.

There was a subscription business who approached us last year. 100,000 sessions a week. Very few conversions, mind you (less than 1%). Booming in a pandemic environment. They were sold on an AB testing platform that they used… twice in 12 months. Each test, we worked out, cost them nearly £10,000 to execute.

Clearly this is an extreme example, but should they have been testing? My argument is no. They lacked the maturity, resource and infrastructure to do so (hence why they needed an agency, I suppose). That AB testing platform, too, I felt, mis-sold them. I’m not sure if you’ve seen this, too, but it really makes my blood boil. Commissioned sales people, lacking an understanding of the true challenge of the business, over-selling a ‘solution’, over-promised, over-sold as a plug n’ play.

Testing is just a methodology to get more data on something and, therefore, validate it. But certain things determine whether you should be testing full stop, or whether you should be testing individual ideas. Those are purpose, traffic, type and time.

It’s sexy, sure. So the theoretical evangelists amongst us will say “test everything”. How idealistic. But in the majority of cases that’s not fair or true, because most businesses lack the process, the culture and, in some cases, the traffic and the time to execute.

“In an ideal world you’d test pretty much everything, because every test you run gives you an opportunity to learn. We don’t live in an ideal world though and are tasked to deliver a commercial impact to a business” Ryan Jordan, Brainlabs

So let’s remove that stigma of testing everything.

We’re not Booking.com, which tests Ten Novenonagintillion versions of its site. I don’t even know what that number is. We live in a practical world where, as much as we’d like a testing culture of “test everything, fail fast”, in 99% of cases, it’s not going to happen.

Purpose

Testing is designed to validate and learn. All experimentation is, is another method of data collection. But I think we can categorise what the perceived purpose is of testing, or an individual experiment, for different people.

1. Validation

You should test when you have a hypothesis that needs to be validated. Proven or disproven. The question therefore becomes: how much validation do you require? It’s a matter of confidence in an idea, and experimentation might be needed if you have little understanding of that idea in the first place.

“Consider the strength of your hypothesis, how much supporting data and insight do you have to support this idea? If it’s an idea your CEO dreamed up over the weekend, there’s more risk involved than an idea backed up by Analytics data and user feedback” Emma Travis, Speero

2. Reduce risk

Experimentation is (also) about mitigating Type 1 errors (false positives), i.e. preventing changes or behaviours that could lead to harm.

Usually we ask “is B better than A”. Flipping that on its head, a method of testing could be about demonstrating that “B is not worse than A”. This is known as non-inferiority testing.
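To make the “B is not worse than A” framing concrete, here is a minimal sketch of a one-sided non-inferiority check on two conversion rates. It assumes a simple normal approximation (a z-test with unpooled variance); the function name, the example counts and the 0.2 percentage point margin are all hypothetical and chosen purely for illustration. Picking the margin is a business decision, not a statistical one.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_p_value(conv_a, n_a, conv_b, n_b, margin):
    """One-sided z-test for non-inferiority of B against A.

    Null hypothesis: B's conversion rate is worse than A's by at least
    `margin` (an absolute difference, e.g. 0.002 = 0.2 percentage points).
    A small p-value supports "B is not worse than A by more than the margin".
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error of the difference between the two rates.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    # Shift the observed difference by the margin before standardising.
    z = (rate_b - rate_a + margin) / std_err
    return 1 - NormalDist().cdf(z)


# Hypothetical numbers: B converts very slightly below A.
p_value = non_inferiority_p_value(
    conv_a=500, n_a=50_000, conv_b=495, n_b=50_000, margin=0.002
)
print(f"p-value for non-inferiority: {p_value:.4f}")
```

A small p-value here only says B is unlikely to be worse than A by more than the chosen margin; it says nothing about B being better.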

“I’m not sure experimentation is really about uplift; a lot of it is about mitigating risk. We may not experience a really harmful event, we still would have had the protection against it. For example, you may have collision car insurance, but not have had any accident over the year. What was the value of the insurance? Certainly not zero” Matt Gershoff, Conductrics

Take the example of AirBnB’s redesigned search page. They all thought it was ‘clearly’ better, and their users agreed in qualitative user studies. In keeping with the spirit of their testing culture, they still tested the new design to ensure it wouldn’t create a negative effect and, more importantly, to gather knowledge about which aspects did and didn’t work.

They ended up with a neutral result. (In the end, they found a bug that affected IE users and noted an increase in click-through of more than 2%.)

I’d advise you to read Georgi Georgiev’s risk vs reward article here and review how to use calculations to reach an optimal ratio between risk and reward.

3. Revenue Attribution

A super quick one here: if you’re testing for revenue attribution, forecastable or not*, this purpose will often dictate what you should and shouldn’t test.

“If you are looking for long-term effects, you need to find metrics that predict these (the OEC problem), or run experiments for long and continue measuring. The latter is hard in practice due to survivorship bias in online cohorts, when cookies are used to track users” Ronny Kohavi, AirBnB

*Not that you should, but let’s be honest: most people, if not everyone, do or, at the very least, append a level of ROI to their results.

It will dictate what you should and shouldn’t test by pushing you towards higher priority pages or segments, looking for attributable revenue to demonstrate return and value. I’m not the biggest fan of this ‘purpose’ as it a) inherently skews your prioritisation modelling towards value and away from risk management, b) is neither the designed nor the foundational purpose of AB testing and c) assumes revenue is attributable from a test, when in reality it is a non-binomial metric.

4. Learning

Experimentation is an explanation. Did we prove that B is better than A?

…but it is limited in its exploration. Why is B better than A?

It’s because you’re proving or disproving a hypothesis, only. You can learn from it, sure, but it is limited in what you learn because your parameters are set. There’s only so much the data can infer, rather than inform.

I encourage you to think about the question “what am I wanting to learn from this test?” That will help you decide whether you answer your questions through experimentation, or something else. It will also help you understand whether your change solves a genuine problem. Do you really want to learn something from changing button copy…?

In which case, other exploratory methods might be better suited to answering “what am I wanting to learn from this?”

“Testing doesn’t have to mean AB testing, it isn’t the only method to validate an idea. User testing, design testing and copy testing can all provide you with data to support decision making and identify optimisation opportunities.” Emma Travis, Speero

In summary, always linking back to “why are you testing?” should help keep you on track when deciding whether you should be testing a solution.

Traffic

Some claim a minimum number of users (e.g. 10,000) is necessary. I only partially agree, as it’s a question of confidence. Presidential polls are done using samples of 1,000–2,000 people. Many studies in sociology are done on hundreds of users.
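As a rough illustration of why polls get away with 1,000–2,000 people, the sketch below computes the margin of error for an estimated proportion using the standard normal approximation. The function name and the example figures are my own assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_size, proportion=0.5, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(proportion * (1 - proportion) / sample_size)


for n in (1_000, 2_000, 10_000):
    poll = margin_of_error(n)                       # a 50/50 style poll question
    conversion = margin_of_error(n, proportion=0.01)  # a 1% conversion rate
    print(f"n={n:>6}: poll ±{poll:.2%}, 1% conversion rate ±{conversion:.2%}")
```

Under these assumptions a poll of 1,000 people lands at roughly ±3 percentage points. That is fine for a close two-horse race, but the same sample on a 1% conversion rate gives a margin of around ±0.6 percentage points, which is more than half the rate itself. It’s a question of how much precision the decision needs.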

“…being comfortable with the fact that you’re not going to get [a] 100% solution, and understanding that you’re dealing with probabilities, so that you don’t get paralyzed trying to think that you’re going to actually solve this perfectly” Barack Obama

Barack goes on to state that being just 51% confident is enough and that, along the way, your approach can be moulded. I’m not suggesting at 51% statistical significance we should stop our tests; I’m advocating that I’d rather be 51% confident than 49% confident. And experimenting gives me confidence that a treatment will have a positive impact on our users.

The alternative to experimentation is opinion.

That all being said, the approach of adequate sample sizes and needing to run tests to a degree of confidence is still preferred in my opinion.
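To put a number on “adequate”, here is a minimal sketch of the textbook sample size formula for comparing two conversion rates (normal approximation, two-sided test). The 10% relative uplift, 95% confidence and 80% power are assumptions for illustration, as are the function names; the 1% baseline and 100,000 sessions a week echo the subscription business mentioned earlier, treating its conversion rate as exactly 1% for simplicity.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_uplift, alpha=0.05, power=0.80):
    """Sessions needed in each variant to detect the given relative uplift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(n_per_variant, weekly_sessions, variants=2):
    """Whole weeks of traffic needed if all sessions are split across variants."""
    return ceil(n_per_variant * variants / weekly_sessions)


n = sample_size_per_variant(baseline_rate=0.01, relative_uplift=0.10)
weeks = weeks_to_run(n, weekly_sessions=100_000)
print(f"{n:,} sessions per variant, about {weeks} week(s) of traffic")
```

On those assumptions the test needs roughly 160,000 sessions per variant, so around a month of traffic even at 100,000 sessions a week. Halve the detectable uplift and the requirement roughly quadruples, which is why low-traffic or low-conversion sites struggle.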

If sample sizes are low and the process is less mature, I would consider other methods of testing (user testing, perhaps) rather than practical AB testing.

Type

What is the change that you’re looking at making? I tend to categorise changes into one of three pots:

There are also bugs and load time, which are foundational to the above.

The first, usability, is inherently linked to just making things easier. These are both a) the changes that the industry ‘tests’ the most and b) the changes that, on average, have the least impact. Sticky filters. Sticky call to actions. Sticky everything. I wrote about how this is causing us to lose our way with experimentation here.

“I’d prioritise any experiments that look to change the way our users think or feel about how they behave. Changes to address usability issues (which, are usually less impactful) should be de-prioritised from testing, because it’s either an already validated behaviour or problem and / or it’s unlikely to carry significant risk” Ryan Jordan, Brainlabs

The second and third are intrinsically linked to attitude, which then affects behaviour. They are 34x and 100x more impactful than usability respectively.

Qubit published a paper in which they found that “Of 6700 experiments, 90% had an effect of less than 1.2% on revenue. Most simple UI changes are ineffective. Of the 29 common categories of treatment included in this paper only 8 have a greater than 50% probability of having a positive impact on revenue per visitor.”

At User Conversion, we also wrote about what type of experiments have the biggest impact here.

“With the small changes, the effort and time to test is often unnecessary. Resources and time are limited, we do not want to spend X man hours on an experiment that is going to run for 60 days only to give us an inconclusive result and little insight. Particularly when running this experiment may have come at the expense of another potential test.” Max Bradley, Zendesk

Time

Resource is often cited as a reason why teams don’t experiment; particularly time.

“At TeamSnap we have 15 million users so we can run AB tests fairly quickly. However, it is still expensive to run lots and lots of AB tests from a labor standpoint”, Ken McDonald, ex-TeamSnap

I genuinely think the biggest barrier to adoption or acceleration of experimentation is the perception that it adds time.

In the short term, it probably does. But it separates the good decisions from the bad: the high-performing development releases from the non-impactful ones.

I’m an advocate of MVP (minimum viable product or feature) and through experimentation, surely you’d want to understand whether something works or not before fully building it and releasing it into the wild? That’s what Dropbox did with their product before releasing it. And Buffer. And Hubble. And eSalon. And Innocent. I wrote a whole article on experimenting everywhere that I presented to the AO team, discussing MVPs.

“Instead of embarking on a lengthy card-sorting process that would have taken us months to re-categorise and implement, we chose to redesign the navigation; how users interacted with it. We saved our team from an estimated £95k on a solution that took just 3 days to build” Adrian Hobson, Travis Perkins

I like to think that experimentation gives us agility.

Whilst I appreciate the notion of exploration vs exploitation in this debate, I think the positive result of saving the time and resource that could have been spent on a poor decision trumps that discussion here, although it is still worth pointing out.

“In order to exploit the situation by moving completely to the best scenario (e.g. show only that ad), we need to invest in exploring the situation until we are confident enough. This can be expensive since we might be paying to deliver an ad impression and/or because of the opportunity cost induced by not maximizing our return due to a “bad” choice of scenario(s). So, until we reach the desired confidence in our decision to move entirely to the designated superior scenario, we are purely paying to explore; after we have reached this point, we can fully exploit that situation (given the scenarios do not change over time). This doesn’t sound like an optimal solution, does it?”
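The alternative the quote hints at is to explore and exploit at the same time. One of the simplest ways to do that is an epsilon-greedy bandit, sketched below as a toy simulation. This is my own illustration rather than anything from the quoted source: the conversion rates, the 10% exploration share and the function name are all hypothetical.

```python
import random


def epsilon_greedy(true_rates, rounds, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over scenarios with known true rates.

    Returns (pulls, wins): how often each scenario was shown and how often
    it converted. The true rates are only used to simulate user behaviour;
    the bandit itself only sees the observed wins and pulls.
    """
    rng = random.Random(seed)
    arms = range(len(true_rates))
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if 0 in pulls or rng.random() < epsilon:
            # Explore: pick a scenario at random.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the scenario with the best observed rate so far.
            arm = max(arms, key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins


# Hypothetical scenarios: the second one truly converts far better.
pulls, wins = epsilon_greedy(true_rates=[0.05, 0.20], rounds=5_000)
for arm, (shown, converted) in enumerate(zip(pulls, wins)):
    print(f"scenario {arm}: shown {shown} times, {converted} conversions")
```

The trade-off is that traffic drifts towards the apparent winner while the test is still running, which reduces the cost of exploring but makes the result harder to read as a clean AB comparison.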

Summary

I’d recommend asking yourself a variety of questions.

It doesn’t have to be in a specific process, it’s just about feeling it and understanding your prioritisation methodology; the sexiest of topics.

“If the primary goal for the CRO programme is to increase the average revenue per user (in which most cases it is) and we are observing data which shows hesitation to complete the purchase on the checkout page then we should test a treatment on the checkout page as it aligns with the CRO goal.” Amit Chhatralia, Cisco

Consider risk and resource as much as possible; even if the potential return may be unknown.

“When the cost of running A/B tests outweighs the potential return [is when you don’t test]” Alhan Keser, ex-American Express

As a recommendation, I’m a big fan of MVP, low-fidelity options. If you’re looking at creating a product finder, consider a quick survey instead. If you’re looking at restructuring your information architecture, consider the front-end look to the user first. If you’re looking at creating a new feature, create a false-door test and gauge user interest first.

One last thing …

Stories and advice within the world of conversion rate optimisation. Founder @ User Conversion. Global VP of CRO @ Brainlabs. Experimenting with 2 x children