The Quagmire of A/B Testing

When I first started as a product manager, I didn’t know what an A/B test was. I read about it briefly and thought, “Okay, all you have to do is compare two versions of the same thing. No big deal.” It wasn’t until I actually implemented my first A/B test that I realized how many nuances I hadn’t even considered. As I’ve learnt (and continue to learn), the concept of an A/B test is simple only if you know where and when to use it.

When you are immersed in a culture where you must test everything before coming to a conclusion, it becomes a habit to test even the most trivial things. If you don’t have a hypothesis to test or if you know for certain that the hypothesis is wrong—don’t A/B test it. This is something I’ve battled with constantly.

Whenever our CTRs (click-through rates) are down, the first culprit hauled up is the user interface, since it’s the easiest, most visible thing to fix (never mind other possible sources of the problem: page load times, seasonality, traffic sources). So you jump in with your designer, revamp the page, and A/B test the old and new versions without an actual hypothesis. A wasted exercise.

The statistical significance of an A/B test is vital, so you can only run these experiments if you have a reasonable number of users. If you use an A/B testing framework like Optimizely, the math is done for you. However, when we ran A/B tests on our own platform, knowing when to start (the minimum number of users) and when to stop (how long to run the test) was far more complicated than we expected. The numbers at the end were shrouded in confusion and the result of the experiment was inconclusive. The lesson I learnt: unless you are confident in the statistical rigour of your model, rely on the experts in the industry.

An excellent post by Kissmetrics, which I’ll quote from here, lays out the steps required to run an A/B test correctly.

  1. Decide the minimum improvement you care about. (Do you care if a variant results in an improvement of less than 10%?)
  2. Determine how many samples you need in order to know within a tolerable percentage of certainty that the variant is better than the original by at least the amount you decided in step 1.
  3. Start your test but DO NOT look at the results until you have the number of examples you determined you need in step 2.
  4. Set a certainty of improvement that you want to use to determine if the variant is better (usually 95%).
  5. After you have seen the observations decided in step 2, then put your results into a t-test (or other favorite significance test) and see if your confidence is greater than the threshold set in step 4.
  6. If the results of step 5 indicate that your variant is better, go with it. Otherwise, keep the original.
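For a CTR-style test (a proportion), the steps above can be sketched in a few lines of Python. This is a minimal illustration, not what the Kissmetrics post itself provides: it uses the standard two-proportion normal-approximation formulas (a z-test rather than the t-test mentioned in step 5, which is the usual choice for proportions), and the function names and default power of 80% are my own assumptions.

```python
from math import sqrt
from statistics import NormalDist

def required_sample_size(p_baseline, min_relative_lift, alpha=0.05, power=0.8):
    """Steps 1-2: approximate per-variant sample size needed to detect
    a given relative lift over a baseline conversion rate
    (two-sided two-proportion z-test, normal approximation)."""
    p1 = p_baseline
    p2 = p_baseline * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * pooled * (1 - pooled))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return int(n) + 1

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Step 5: two-sided p-value for H0 'both variants have the same CTR'.
    Compare it against the threshold chosen in step 4 (e.g. 0.05)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example: baseline CTR of 5%, smallest lift we care about is 10% (step 1).
n_per_variant = required_sample_size(0.05, 0.10)
```

Note how steep the requirement is for small baselines: detecting a 10% relative lift on a 5% CTR needs tens of thousands of users per variant, which is exactly why step 3 (not peeking before the sample size is reached) is so hard to follow in practice.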

Image credit: https://vwo.com/ab-testing/

3 Comments

  1. The sample size calculator linked above works well only for A/B tests dealing with conversion metrics (proportions), as those don’t require an associated standard deviation.

    For A/B tests involving numerical metrics (like revenue per user) with an associated historic standard deviation, the URL below explains how to find the correct sample size.

    http://www.statisticshowto.com/find-sample-size-statistics/#Find a Sample Size Given a Confidence Interval and Width known population standard deviation

    1. Absolutely, you bring up a super valid point. The example I’d considered was of click through rates, which is a proportion like you rightly said.

      It’s interesting though, when you’re just starting out with A/B tests on your product for numerical metrics, it can get tricky when you don’t have historic data to rely on. What do you do in such cases? Look for parallels in the industry and use those values?

      1. For an unknown standard deviation (sigma), below are some of the ways one can estimate it.

        1. Estimate sigma from previous studies of the same population of interest and similar metrics.

        2. Observe the particular metric for some time before running the A/B test. For a very large population, one can create a preliminary sample and find the standard deviation.

        3. Finally, a common guess of sigma is range/4, where range is (maximum-minimum). To avoid outliers, one can take 99th percentile instead of maximum and 1st or 2nd percentile for minimum.
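        Putting these together, here is a rough sketch in Python (function names and the default 80% power are my own; it uses the standard two-sample normal-approximation formula for a difference in means):

        ```python
        from statistics import NormalDist

        def estimate_sigma_from_range(p99, p1):
            """Guess sigma as range/4, using the 99th and 1st percentiles
            instead of max/min to avoid outliers (approach 3 above)."""
            return (p99 - p1) / 4

        def sample_size_for_mean(sigma, min_detectable_diff, alpha=0.05, power=0.8):
            """Approximate per-variant sample size to detect a difference
            in means of `min_detectable_diff`, given an estimated sigma."""
            z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
            z_beta = NormalDist().inv_cdf(power)
            return int(2 * ((z_alpha + z_beta) * sigma / min_detectable_diff) ** 2) + 1

        # Example: revenue per user spans roughly $0 to $40 between the
        # 1st and 99th percentiles, and we want to detect a $1 difference.
        sigma = estimate_sigma_from_range(40.0, 0.0)   # -> 10.0
        n_per_variant = sample_size_for_mean(sigma, 1.0)
        ```

        Since range/4 is only a rough guess, it’s worth re-checking the sample size once some real data has accumulated and a proper standard deviation can be computed.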
