Amplitude's A/B testing features rely on standard statistical techniques to determine a variant's chance to outperform the baseline, as well as its statistical significance. This article explains those calculations.
Improvement over baseline
Improvement over baseline is the ratio of the variant's mean (A) to the baseline's mean (B):

$$\text{Improvement over baseline} = \frac{\mu_A}{\mu_B}$$
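As a hypothetical worked example (the numbers are illustrative, not from Amplitude): if the variant's mean conversion rate is 0.12 and the baseline's is 0.10, then

$$\frac{\mu_A}{\mu_B} = \frac{0.12}{0.10} = 1.2,$$

meaning the variant's mean is 1.2 times the baseline's.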
Chance to outperform
Amplitude calculates the probability of the variant (A) outperforming the baseline (B) using a Bayesian method. This probability is based on the distribution of the difference A − B. If the individual distributions of A and B are assumed to be normal (and independent), then the difference A − B is also normally distributed, with a mean of

$$\mu = \mu_A - \mu_B$$

and a variance of

$$\sigma^2 = \sigma_A^2 + \sigma_B^2.$$
To find the chance of A outperforming B, Amplitude determines the area under the curve that falls to the right of zero.
The area under the curve, or cumulative distribution, of a normal distribution with mean $\mu$ and variance $\sigma^2$ can be expressed in terms of the error function erf:

$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$
Erf itself is evaluated with a numerical approximation, and Amplitude uses that same approximation when calculating chance to outperform.
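The article's original approximation formula is not reproduced here, but a widely used choice is the Abramowitz & Stegun polynomial approximation (formula 7.1.26). The Python sketch below shows that approximation; whether Amplitude uses this exact formula is an assumption, not something stated in this article.

```python
# One common numerical approximation of erf (Abramowitz & Stegun, 7.1.26),
# accurate to roughly 1.5e-7. Shown for illustration only.
import math

def erf_approx(x: float) -> float:
    """Approximate the error function erf(x)."""
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)

    # Abramowitz & Stegun 7.1.26 coefficients.
    p = 0.3275911
    a1, a2, a3, a4, a5 = (0.254829592, -0.284496736,
                          1.421413741, -1.453152027, 1.061405429)

    t = 1.0 / (1.0 + p * x)
    poly = a1 * t + a2 * t**2 + a3 * t**3 + a4 * t**4 + a5 * t**5
    return sign * (1.0 - poly * math.exp(-x * x))

# The Python standard library also provides math.erf directly.
print(erf_approx(1.0), math.erf(1.0))  # ~0.8427007 vs 0.8427007929497149
```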
Once erf has been determined, the final equation for the chance that A is better than B (the area to the right of zero) is:

$$P(A > B) = 1 - F(0) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\mu_A - \mu_B}{\sqrt{2\,(\sigma_A^2 + \sigma_B^2)}}\right)\right]$$
(Source: O'Connell, Aaron. “The Math of Split Testing Part 2: Chance of Being Better”)
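Putting the pieces together, here is a minimal Python sketch of the calculation under the independent-normal approximation described above. The function name and inputs are illustrative, not Amplitude's actual implementation.

```python
# Minimal sketch of the "chance to outperform" calculation, assuming
# A - B ~ Normal(mean_a - mean_b, var_a + var_b). Illustrative only.
import math

def chance_to_outperform(mean_a: float, var_a: float,
                         mean_b: float, var_b: float) -> float:
    """P(A > B): the area under the density of A - B to the right of zero."""
    mu = mean_a - mean_b
    sigma = math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

# Example with made-up summary statistics for a conversion metric:
# variant mean 0.12, baseline mean 0.10, with small variances of the means.
print(chance_to_outperform(0.12, 0.0004, 0.10, 0.0004))  # ≈ 0.76
```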
Statistical significance
The A/B test view also shows, in the top-left corner of the chart, whether statistical significance has been reached. Amplitude judges results with a two-tailed t-test at a 5% false positive rate, and it only evaluates the best-performing variant.
Because Amplitude uses a 5% false positive rate, the threshold for significance is a p-value below 0.05 (equivalently, 1 − p-value > 0.95). You can set a different false positive rate in Amplitude Experiment or in Experiment Results.
To help reduce false positives, Amplitude requires a minimum sample size before it declares significance. Currently, this minimum is 30 samples, five conversions, and five non-conversions for each variant. Tests that do not meet these minimums are automatically considered not statistically significant.
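As an illustration of this logic (not Amplitude's actual implementation), the sketch below applies the minimum-sample check and a two-tailed t-test to per-user conversion outcomes for one variant against the baseline. The function names and the use of scipy.stats.ttest_ind are assumptions made for the example.

```python
# Hedged sketch of the significance check described above: a minimum-sample
# gate followed by a two-tailed t-test at a 5% false positive rate.
from scipy import stats

MIN_SAMPLES = 30
MIN_CONVERSIONS = 5
MIN_NON_CONVERSIONS = 5
FALSE_POSITIVE_RATE = 0.05  # configurable in Amplitude Experiment / Experiment Results

def meets_minimums(outcomes: list[int]) -> bool:
    """Each variant needs 30 samples, 5 conversions, and 5 non-conversions."""
    conversions = sum(outcomes)
    return (len(outcomes) >= MIN_SAMPLES
            and conversions >= MIN_CONVERSIONS
            and len(outcomes) - conversions >= MIN_NON_CONVERSIONS)

def is_significant(variant: list[int], baseline: list[int]) -> bool:
    """Two-tailed t-test on per-user conversion outcomes (1 = converted)."""
    if not (meets_minimums(variant) and meets_minimums(baseline)):
        return False  # below the minimums, results are treated as not significant
    _, p_value = stats.ttest_ind(variant, baseline)  # two-sided by default
    return p_value < FALSE_POSITIVE_RATE  # i.e. 1 - p > 0.95 at the 5% rate
```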
When a test has reached statistical significance, the chart displays a confirmation in green text; otherwise, it displays the status in red text.