A/B testing

The A/B test (also known as a randomised controlled trial, or RCT, in other fields) is a powerful tool for product development.

Some motivations:

With the rise of digital marketing led by tools such as Google Analytics, Google Adwords, and Facebook Ads, a key competitive advantage for businesses is using A/B testing to determine the effects of digital marketing efforts. Why? In short, small changes can have big effects.

This is where A/B testing provides a huge benefit: it enables us to determine whether changes in landing pages, popup forms, article titles, and other digital marketing decisions improve conversion rates and, ultimately, customer purchasing behavior. A successful A/B testing strategy can lead to massive gains: more satisfied users, more engagement, and more sales. Win-win-win.

A major issue with traditional, statistical-inference approaches to A/B testing is that they only compare two variables: an experiment/control and an outcome. The problem is that customer behavior is vastly more complex than this. Customers take different paths, spend different amounts of time on the site, come from different backgrounds (age, gender, interests), and more. This is where Machine Learning excels: generating insights from complex systems.

If your A/B testing doesn’t seem to work, you might be making one of the common mistakes, such as the peeking problem, wrong split, or wrong interpretation. This can completely destroy the profit from the experiment and can even damage the business.

As a data scientist, I want to describe the design principles of A/B tests based on data science techniques. They will help you ensure that your A/B tests show you statistically significant results and move your business in the right direction.

What you have to understand about the business value of A/B testing:

1. Most A/B tests won’t produce huge gains (and that’s okay)

2. There’s a lot of waiting (until statistical confidence)

3. Trickery doesn’t provide serious lifts; understanding the user does

Concepts to brush up on:

  • alpha: false positive rate; or significance level; or type I error.
  • beta: false negative rate; or type II error.
  • power: true positive rate; 1 – beta.
  • 1 – alpha: true negative rate.

Alpha controls the false positive rate, the case in which we reject the null when we should not. Beta controls the false negative rate, the case in which we fail to reject the null when we should.
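As a minimal numerical sketch of how these quantities interact, consider the power of a one-sided z-test on a difference of means (all numbers below are hypothetical, and scipy is assumed to be available):

```python
from scipy.stats import norm

# Hypothetical inputs: significance level, true effect, noise, and sample size
alpha = 0.05   # false positive rate (type I error) we are willing to accept
effect = 5.0   # assumed true difference in average time spent (seconds)
sigma = 40.0   # standard deviation of time spent per session
n = 1000       # sessions per version

se = sigma * (2 / n) ** 0.5            # standard error of the difference of means
z_crit = norm.ppf(1 - alpha)           # one-sided rejection threshold under H0
power = norm.sf(z_crit - effect / se)  # P(reject H0 | H1 true) = 1 - beta
beta = 1 - power                       # false negative rate (type II error)

print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Raising n increases power, and lowering alpha (being stricter about false positives) decreases it, which is how alpha and power trade off against each other.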

When Null Hypothesis is True

The empirical sampling distribution is the same as the one under the null hypothesis. We are right if we do not reject the null, which happens with probability 1 – alpha. The two tails of the rejection region have total area equal to alpha, which is the probability we reject the null by mistake.

When Null Hypothesis is False

The empirical sampling distribution is different from the one under the null hypothesis. We are right if we reject the null, which happens with probability 1 – beta (the power). We fail to reject the null with probability beta, which is the false negative rate (Type II error).

A subtle difference from the previous case is that if we land in the smaller tail (the right tail in this example) and reject the null, we should not feel lucky. It either means we have observed a highly unusual, unrepeatable event, or we are hypothesizing in the wrong direction.

In reality, of course, we do not know whether the null hypothesis is true. This uncertainty is exactly why we use alpha and beta together to quantify it (soon we will see how alpha affects power). But when speaking of alpha, always remember that we are conditioning on the null being true; vice versa for beta.

The use case

Let us assume timewithai is considering a landing page rearrangement for its professional learners.

The metrics that matter to the company are:

  • Average time spent on the landing page per session
  • Conversion rate, defined as proportion of sessions ending up with a transaction

An A/B test can be used to challenge the current arrangement.

The goal of the A/B test is then to compare the conversion rates of the two groups using statistical inference.

The problem is that the world is not a vacuum involving only the experiment (treatment vs control group) and effect. The situation is vastly more complex and dynamic. Consider these situations:

  • Users have different characteristics: Different ages, genders, new vs returning, etc
  • Users spend different amounts of time on the website: Some hit the page right away, others spend more time on the site
  • Users find your website differently: Some come from email or newsletters, others from web searches, others from social media
  • Users take different paths: Users take different actions and visit different pages before being confronted with the event and goal

Often modeling an A/B test in this vacuum can lead to misunderstanding of the true story.

Note that you can choose the split of traffic not to be 50–50 and allocate more traffic to version A, in case you are concerned about losses due to version B.

However, keep in mind that a very skewed split often leads to longer times before the A/B testing becomes (statistically) significant.
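As a rough illustration of this trade-off, here is an approximate sample-size calculation for a one-sided two-proportion z-test under different traffic splits. The conversion rates, target power, and split values below are hypothetical, and scipy is assumed to be available:

```python
import numpy as np
from scipy.stats import norm

def total_sessions_needed(p_a, p_b, split_a, alpha=0.05, power=0.8):
    """Approximate total sessions needed for a one-sided two-proportion z-test,
    when a fraction split_a of the traffic is sent to version A."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    delta = abs(p_b - p_a)
    for n_total in range(100, 10_000_000, 100):
        n_a = n_total * split_a
        n_b = n_total * (1 - split_a)
        se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        if delta / se >= z_alpha + z_beta:  # test is adequately powered at this size
            return n_total
    return None

# Hypothetical baseline CR of 3% vs. a hoped-for 3.6%
print(total_sessions_needed(0.03, 0.036, split_a=0.5))  # balanced 50-50 split
print(total_sessions_needed(0.03, 0.036, split_a=0.9))  # skewed 90-10 split
```

With these hypothetical numbers, moving from a 50–50 split to a 90–10 split roughly triples the total traffic needed before the test reaches the target power.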

Let us assume that after 7 days of A/B testing, we have collected the tracking metrics of the experiment for both versions. Several questions then arise:

  • If version B exhibits a higher CR, does it mean version B brings an improvement? Similarly, can we conclude anything about the influence on the average time spent?
  • If so, with what level of confidence?
  • Did a higher CR/lower average time spent of version B happen by chance?

Before jumping to conclusions, what you need to keep in mind is that

The raw results we have are only samples of bigger populations. Their statistical properties vary around those of the populations they come from.

The process starts by stating a null hypothesis H₀ about the populations. In general, it is the equality hypothesis, e.g. “the two populations have the same mean”.

The alternative hypothesis H₁ negates the null hypothesis, e.g. “the mean in the second population is higher than in the first”.

The test can be summarised in two steps:

  • 1. Model H₀ as a distribution on a single real-valued random variable (called the test statistic)
  • 2. Assess how likely the samples, or more extreme ones, could have been generated under H₀. This probability is the famous p-value. The lower it is, the more confident we can be in rejecting H₀.

Z-test for average time spent

The hypotheses to test are:

  • H₀: “the average time spent is the same for the two versions”
  • H₁: “the average time spent is higher for version B”

The first step is to model H₀

The Z-test uses the Central Limit Theorem (CLT) to do so.

The CLT establishes that, given a random variable (rv) $X$ with expectation $\mu$ and finite variance $\sigma^2$, and $\{X_1, \dots, X_n\} \sim X$, $n$ independent identically distributed (iid) rvs, the following approximation on their average (also an rv) can be made:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\sim\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{(approximately, for large } n\text{)}$$

Under H₀, we have equality of the true means ($\mu_A = \mu_B$) and therefore the model

$$\bar{X}_B - \bar{X}_A \;\sim\; \mathcal{N}\!\left(0, \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\right)$$

The second step is to see how likely our samples are under H₀

Note that the true expectations and variances for A and B are unknown. We introduce their respective empirical estimators:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \hat{\mu}\right)^2$$
Our samples generated the following test statistic Z, which is compared against the standard normal distribution:

$$Z = \frac{\bar{X}_B - \bar{X}_A}{\sqrt{\hat{\sigma}_A^2 / n_A + \hat{\sigma}_B^2 / n_B}}$$

Conceptually, Z represents the number of standard deviations the observed difference of means lies away from 0. The higher this number, the lower the likelihood of H₀.


In cases where the sample size is not big enough (< 30 per version) and the CLT approximation does not hold, one may take a look at Student’s t-test.
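For illustration, here is a minimal sketch of the one-sided two-sample Z-test described above. The session-time arrays are simulated placeholders; with real data you would plug in the per-session times tracked for each version:

```python
import numpy as np
from scipy.stats import norm

# Placeholder data standing in for the tracked time spent per session (seconds)
rng = np.random.default_rng(0)
time_a = rng.exponential(scale=60.0, size=5000)  # version A sessions
time_b = rng.exponential(scale=63.0, size=5000)  # version B sessions

# Empirical estimators of the means and (unbiased) variances
mean_a, mean_b = time_a.mean(), time_b.mean()
var_a, var_b = time_a.var(ddof=1), time_b.var(ddof=1)
n_a, n_b = len(time_a), len(time_b)

# Test statistic and one-sided p-value against the standard normal distribution
z = (mean_b - mean_a) / np.sqrt(var_a / n_a + var_b / n_b)
p_value = norm.sf(z)  # H1: version B has a higher average time spent

print(f"Z = {z:.3f}, p-value = {p_value:.4f}")
```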

χ² test for conversion rate

The hypotheses to test are:

  • H₀: “the conversion rate is the same for the two versions”
  • H₁: “the conversion rate is higher for version B”

The first step is to model H₀

Under H₀, conversions in version A and version B follow the same binomial distribution B(1, p). We pool the observations from versions A and B and derive the estimator for the CR:

$$\hat{p} = \frac{\text{conversions}_A + \text{conversions}_B}{n_A + n_B}$$

The second step is to see how likely our samples are under H₀

It consists in computing the observed χ² statistic and deriving its corresponding p-value according to the χ² law.

Here is one way it can be done in Python, as a minimal sketch using scipy’s chi2_contingency with hypothetical session and conversion counts (note that chi2_contingency performs a two-sided test on the 2×2 table):
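```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical tracking figures; replace with the real counts from the experiment
sessions_a, conversions_a = 5000, 150
sessions_b, conversions_b = 5000, 172

# 2x2 contingency table: converted vs. not converted, per version
table = np.array([
    [conversions_a, sessions_a - conversions_a],
    [conversions_b, sessions_b - conversions_b],
])

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```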

There is a p-value chance that a result at least as distant from the theoretical distribution as our observation would have happened under H₀. With a common go-to α criterion of 5%, we have p-value > α, and H₀ cannot be rejected.

As A/B testing is often not properly understood, you can check Google's course here.

We will explain it in more detail later.

GitHub: click here.
