Amplitude's A/B testing features rely on standard statistical techniques to determine a variant's chance to outperform the baseline, as well as its statistical significance. This article explains those calculations.
Improvement over baseline
Improvement over baseline is the ratio of the mean of the variant (A) to the mean of the baseline (B): $\mu_A / \mu_B$.
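As a small illustration, the ratio can be computed directly from the two means. The function name and the sample conversion rates below are hypothetical and are not taken from Amplitude.

```python
def improvement_over_baseline(mean_variant: float, mean_baseline: float) -> float:
    """Return the ratio of the variant mean (A) to the baseline mean (B)."""
    if mean_baseline == 0:
        # The ratio is undefined when the baseline mean is zero.
        raise ValueError("baseline mean must be non-zero")
    return mean_variant / mean_baseline


# Hypothetical conversion rates: 12% for the variant, 10% for the baseline.
ratio = improvement_over_baseline(0.12, 0.10)
print(f"improvement over baseline: {ratio:.2f}")
```

A ratio above 1 means the variant's mean is higher than the baseline's; a ratio below 1 means it is lower.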
Chance to outperform
Amplitude calculates the probability of the variant (A) outperforming the baseline (B) using a Bayesian method. This probability is based on the distribution of the difference A – B. If A and B are each assumed to be normally distributed (and independent of each other), then the difference A – B is also normally distributed (Gaussian), with a mean of $\mu_A - \mu_B$ and a variance of $\sigma_A^2 + \sigma_B^2$.
To find the chance of A outperforming B, Amplitude determines the area under the curve that falls to the right of zero.
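The idea can be sketched with Python's standard library. This is only an illustration: the means and standard deviations are made-up values, the difference is taken as variant minus baseline so that the area to the right of zero is the chance that the variant wins, and the two distributions are assumed to be independent.

```python
from statistics import NormalDist

# Hypothetical distributions of the mean conversion rate for each arm.
variant = NormalDist(mu=0.12, sigma=0.010)   # A
baseline = NormalDist(mu=0.10, sigma=0.009)  # B

# Subtracting two independent NormalDist objects yields another NormalDist
# whose mean is the difference of the means and whose variance is the sum
# of the variances.
difference = variant - baseline

# Area under the curve to the right of zero, i.e. P(A - B > 0).
chance_to_outperform = 1.0 - difference.cdf(0.0)
print(f"chance to outperform: {chance_to_outperform:.3f}")
```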
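The area under the curve, or cumulative distribution function, of a normal distribution with mean μ and variance σ² can be expressed in terms of the error function erf:
$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$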
Erf cannot be written in closed form using elementary functions, but it can be calculated with a numerical approximation. Amplitude takes this approach when calculating chance to outperform.
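The specific approximation Amplitude uses is not reproduced here. As one illustrative possibility, the sketch below implements a well-known polynomial approximation (Abramowitz and Stegun, formula 7.1.26, with a maximum absolute error of about 1.5e-7) and compares it with Python's built-in `math.erf`. The choice of this particular formula is an assumption, and the function name `erf_approx` is hypothetical.

```python
import math

# Constants for Abramowitz and Stegun formula 7.1.26 (assumed choice of
# approximation; not confirmed to be the one Amplitude uses).
_P = 0.3275911
_A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)


def erf_approx(x: float) -> float:
    """Approximate erf(x) with a fifth-degree polynomial in t = 1 / (1 + p|x|)."""
    # erf is an odd function, so evaluate at |x| and restore the sign.
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + _P * x)
    # Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
    poly = 0.0
    for coefficient in reversed(_A):
        poly = (poly + coefficient) * t
    return sign * (1.0 - poly * math.exp(-x * x))


for value in (-1.0, 0.0, 0.5, 2.0):
    error = abs(erf_approx(value) - math.erf(value))
    print(f"x={value:+.1f}  approx={erf_approx(value):+.7f}  abs error={error:.1e}")
```

In environments that already provide an error function, such as Python's `math.erf`, the built-in can be used directly instead of a hand-written approximation.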
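Once erf has been determined, the final equation to calculate the chance that the variant (A) is better than the baseline (B) is:
$$P(A > B) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\mu_A - \mu_B}{\sqrt{2\left(\sigma_A^2 + \sigma_B^2\right)}}\right)\right]$$
where $\mu_A$ and $\mu_B$ are the means, and $\sigma_A^2$ and $\sigma_B^2$ the variances, of A and B.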
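A minimal sketch of this final step is shown below, with A as the variant and B as the baseline. It uses Python's built-in `math.erf`. The function name is hypothetical, and the way the variances are derived in the usage example (the binomial variance of a conversion rate, p(1 - p)/n) is an assumption made for illustration rather than a description of Amplitude's internals.

```python
import math


def chance_to_outperform(mean_a: float, var_a: float,
                         mean_b: float, var_b: float) -> float:
    """P(A > B) for independent, normally distributed A (variant) and B (baseline)."""
    diff_var = var_a + var_b  # variance of the difference A - B
    if diff_var <= 0:
        raise ValueError("combined variance must be positive")
    z = (mean_a - mean_b) / math.sqrt(2.0 * diff_var)
    return 0.5 * (1.0 + math.erf(z))


def rate_and_variance(conversions: int, samples: int) -> tuple[float, float]:
    """Conversion rate and its variance, assuming a binomial model (illustrative)."""
    rate = conversions / samples
    return rate, rate * (1.0 - rate) / samples


# Hypothetical data: 120 of 1000 users convert on A, 100 of 1000 on B.
mean_a, var_a = rate_and_variance(120, 1000)
mean_b, var_b = rate_and_variance(100, 1000)
probability = chance_to_outperform(mean_a, var_a, mean_b, var_b)
print(f"chance that A outperforms B: {probability:.3f}")
```

With these made-up numbers the result is roughly 0.92. Swapping the two arms gives the complementary probability, since the two chances sum to 1.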
(Source: O'Connell, Aaron. “The Math of Split Testing Part 2: Chance of Being Better”)
Statistical significance
The A/B test view also indicates, in the top left corner of the chart, whether statistical significance has been achieved. Amplitude uses a two-tailed test at a 95% confidence level to judge results, and it only looks at the best-performing variant.
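Amplitude considers a 97.5% chance to outperform to be the threshold for significance. A 2.5% chance is also a significant result, but it signals a significant chance to underperform. These two thresholds correspond to the 2.5% tails on either side of a two-tailed test at the 95% level.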
To help reduce false positives, Amplitude sets a minimum sample size before it declares significance. Currently, this minimum is set to 30 samples and five conversions.
Sample sizes smaller than 30 are automatically considered not statistically significant.
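The decision rules above can be sketched as a small function. This is an illustration under stated assumptions: the function and constant names are hypothetical, the thresholds are treated as inclusive, and the minimums are applied to whatever sample and conversion counts the caller passes in, since the text does not say whether they apply per variant or to the test as a whole.

```python
MIN_SAMPLES = 30          # minimum sample size stated in the article
MIN_CONVERSIONS = 5       # minimum number of conversions stated in the article
UPPER_THRESHOLD = 0.975   # chance to outperform that counts as significant
LOWER_THRESHOLD = 0.025   # chance that signals significant underperformance


def classify_result(chance_to_outperform: float, samples: int, conversions: int) -> str:
    """Classify a test result using the thresholds and minimums described above."""
    if samples < MIN_SAMPLES or conversions < MIN_CONVERSIONS:
        return "not significant (insufficient data)"
    if chance_to_outperform >= UPPER_THRESHOLD:   # inclusive bound is an assumption
        return "significant (outperforms)"
    if chance_to_outperform <= LOWER_THRESHOLD:   # inclusive bound is an assumption
        return "significant (underperforms)"
    return "not significant"


print(classify_result(0.98, samples=500, conversions=60))
print(classify_result(0.98, samples=20, conversions=10))
print(classify_result(0.60, samples=500, conversions=60))
```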
When a test has reached statistical significance, you will see a message in green text. Otherwise, you will see a message in red text.