The method Amplitude Experiment uses to calculate statistical significance differs from the technique you may have learned in introductory statistics courses. This article will explain the difference.
The sequential testing methodology that Amplitude Experiment uses requires it to define a prior distribution on the effect size. Generally speaking, a prior distribution describes your pre-existing expectations about your data: you can imagine it as a histogram of the effect sizes from all your past experiments, built from your historical data.
The standard textbook example of a prior distribution involves coin tosses. If you toss a coin 10 times and see eight heads, then without a prior distribution, your best guess at the probability of heads is 0.8.
But perhaps you’re already familiar with how coin flips work, and you have no reason to suspect the coin is improperly balanced. In that case, you do have a prior, and you’d shift your estimate of the probability of heads from 0.8 to something closer to 0.5.
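The coin-toss intuition can be sketched with a simple beta-binomial model. This is only an illustration of how a prior pulls an estimate toward 0.5; the prior strengths below are made-up values, not anything Amplitude uses.

```python
def posterior_mean_heads(heads, tosses, prior_strength):
    """Posterior mean of P(heads) under a symmetric Beta(prior_strength, prior_strength) prior.

    prior_strength = 0 recovers the raw frequency (no prior);
    larger values encode stronger belief that the coin is fair.
    """
    return (heads + prior_strength) / (tosses + 2 * prior_strength)

print(posterior_mean_heads(8, 10, 0))    # no prior: 8/10 = 0.8
print(posterior_mean_heads(8, 10, 50))   # strong fair-coin prior: ~0.53, pulled toward 0.5
```

With only 10 tosses, even a moderately strong fair-coin prior dominates the data; as the number of tosses grows, the data takes over.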
Amplitude Experiment works from the presupposition that effect sizes are normally distributed, with a mean of zero and a standard deviation equal to tau. With this prior distribution, Amplitude Experiment encodes its understanding of reasonable effect sizes. Essentially, this means Amplitude Experiment expects nearly all effect sizes (about 99.7% of them) to fall between -3 tau and +3 tau.
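The "within three tau" figure is just the three-sigma rule for a normal distribution, and you can verify it with the standard library alone; note that the actual value of tau cancels out:

```python
import math

def mass_within(k, tau):
    """Fraction of a N(0, tau^2) prior that falls within ±k*tau of zero.

    Independent of tau, since the bounds scale with the standard deviation.
    """
    return math.erf(k / math.sqrt(2))

print(round(mass_within(3, 0.02), 4))  # ~0.9973 for any tau
```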
Tau and the confidence interval
Ideally, tau is the standard deviation of all prior effect sizes. Amplitude transforms the minimum detectable effect (MDE) into an approximation of that standard deviation. As the sample size approaches infinity, the optimal value of tau for an individual experiment is the true effect size; that value, however, is unknown in advance.
If tau differs greatly from the true effect size, you will probably get a large confidence interval. For example, if tau is much smaller than the true effect size, your prior says the effect should be small, but the observed data shows the opposite, which suggests your prior may be wrong. In that case, you’ll need much more data to reach significance than you would if your prior were correct.
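A simplified normal-normal conjugate update shows the mechanism. This is a sketch of the general Bayesian behavior, not Amplitude’s actual sequential procedure, and the lift and standard-error numbers are invented for illustration:

```python
def posterior(xbar, se, tau):
    """Posterior for an effect with prior N(0, tau^2), given an observed
    mean lift `xbar` with standard error `se` (normal-normal conjugacy)."""
    w = tau**2 / (tau**2 + se**2)   # shrinkage weight placed on the data
    mean = w * xbar                  # estimate pulled toward the prior mean of 0
    sd = (w * se**2) ** 0.5          # posterior standard deviation
    return mean, sd

# True lift is 0.10, but a tiny tau says effects should be near zero:
print(posterior(0.10, 0.02, 0.005))  # heavily shrunk toward 0
print(posterior(0.10, 0.02, 0.10))   # tau matches the effect scale: close to the data
```

When tau is far too small, the posterior barely moves off zero, so the experiment needs far more data (a much smaller `se`) before it can conclude the effect is real.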
How confidence intervals react to changing the MDE
When you change the MDE, you change tau, which in turn affects the confidence interval. This is one difference between the fixed horizon tests of traditional statistics and the sequential testing methodology Amplitude uses. With fixed horizon tests, the MDE does not affect the confidence interval calculation; the MDE is, by definition, a minimum, so those tests consider the worst case. With sequential testing, we look at the whole distribution of the effect size, which amounts to considering the average case.
Another way to think of this: In fixed horizon testing, we define a point. In sequential testing, we define a distribution.
How to select the prior distribution is a contentious issue in Bayesian statistics. There really is no single, correct choice. Two people can have different prior beliefs, pick different prior distributions, and get different results.
No matter the method used to pick the prior distribution, there will be edge cases where the prior distribution will be wrong. For example, imagine your historical effect sizes are usually between -1 and 1, and you run an experiment with an effect size of 10. This outlier will have a negative impact on the results of any sequential testing you run.
On the other hand, a prior can protect you from some edge cases and essentially act as regularization—the coin toss example at the beginning of this article is a good illustration of this.
Another way of saying this is that if the true effect size is in the tails of the prior distribution, you’ll need a lot of data to convince yourself the prior was wrong.
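Reusing the normal-normal shrinkage weight from before, you can see how much data it takes for the evidence to overwhelm a misspecified prior. The sample sizes and tau below are arbitrary illustrative values:

```python
def shrinkage_weight(se, tau):
    """Weight the posterior places on the observed data (vs. the prior)
    under a N(0, tau^2) prior and observation standard error `se`."""
    return tau**2 / (tau**2 + se**2)

# Standard error shrinks roughly like 1/sqrt(n) as the sample grows:
for n in (100, 10_000, 1_000_000):
    se = 1.0 / n**0.5
    print(n, round(shrinkage_weight(se, tau=0.01), 3))
```

With small samples the prior dominates almost entirely; only at very large sample sizes does the weight on the data approach 1, which is why a true effect sitting in the tails of the prior takes so long to surface.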
- Selecting an MDE should already be part of your experimentation process, especially if you’ve been running fixed horizon tests.
- You should choose your MDE before the experiment starts and keep it until the experiment is over. In Amplitude, it is possible to try out different MDE values to find the smallest confidence interval, but doing so is not recommended: selecting your MDE before looking at the data helps preserve the integrity of your experiment.
- Also, define your MDE so that it acts as a threshold for determining the business significance of your experiment results.