In an experiment, think of each variant or metric you include as its own hypothesis. For example, by adding a new variant, you’re putting forth the hypothesis that whatever potential changes are included in that variant will have a detectable impact on the experiment’s results.

The simplest experiments have only a single hypothesis. Single-hypothesis tests can yield valuable insights. However, it’s often more efficient or enlightening to include more than one metric or variant, which means testing **multiple hypotheses**.

That said, multiple hypothesis testing can introduce errors into your calculations of statistical significance through the multiple comparisons problem (also known as *multiplicity* or *the look-elsewhere effect*): the probability of making an error (by basing a critical business decision on a false positive result) increases rapidly with the number of hypothesis tests you run.

For example, imagine you want to run an experiment around the color of your site’s “Buy now” button. Currently, it’s blue (making it the **control**), but you also want to test out green (variant #1) and purple (variant #2). If your false positive rate is 0.05 (five percent) for each individual hypothesis test, the probability of finding at least one statistically significant result when the null hypothesis is **true** for both variants is:

`1 - 0.95^2 = 0.0975`

(This assumes the tests are independent.)

In other words, if you run enough tests, you’ll eventually get a statistically significant result no matter what. With a 0.05 false positive rate, you can expect one out of every 20 hypothesis tests to be statistically significant by random chance alone, even when there’s no real effect.
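To see how quickly this risk grows, the calculation above can be generalized to any number of tests. The following is a minimal sketch, assuming the tests are independent. The function name `family_wise_error_rate` is illustrative and is not part of any Amplitude API.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests,
    when the null hypothesis is true for every test."""
    return 1 - (1 - alpha) ** num_tests


# Print the uncorrected risk for a growing number of hypothesis tests,
# each run at a 0.05 false positive rate.
for num_tests in (1, 2, 5, 10, 20):
    rate = family_wise_error_rate(0.05, num_tests)
    print(f"{num_tests:>2} tests: {rate:.4f}")
```

With two tests this reproduces the 0.0975 figure from the button color example. With 20 tests, the chance of at least one false positive is roughly 64 percent.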

The question asked by multiple hypothesis correction is, “Is this statistically significant result due to chance, or is it genuine?”

## False positives

You may already be familiar with the idea of a false positive rate. It’s the ratio between:

- the number of negative events **falsely** described as positive, and
- the total number of **actual** negative events.

Every experiment carries the risk of a false positive result. This happens when an experiment reports a conclusive result in either direction, when there’s actually no real difference between variations.

The risk of a false positive result increases with each metric or variant you add to your experiment. This is true even though the false positive rate stays the same for each *individual* metric or variant.
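A small simulation can make this concrete. The sketch below assumes that every null hypothesis is true and that the tests are independent, so each p-value can be drawn uniformly from the interval between 0 and 1. The function name and parameters are hypothetical and chosen only for illustration.

```python
import random


def simulate_false_positive_risk(alpha, num_tests, num_experiments=100_000, seed=0):
    """Estimate how often an experiment with `num_tests` hypothesis tests
    reports at least one false positive when no real effect exists."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    experiments_with_false_positive = 0
    for _ in range(num_experiments):
        # Assumption: under the null hypothesis, each p-value is uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(num_tests)):
            experiments_with_false_positive += 1
    return experiments_with_false_positive / num_experiments


# The per-test false positive rate stays at 0.05 throughout;
# only the number of tests per experiment changes.
for num_tests in (1, 2, 10):
    risk = simulate_false_positive_risk(0.05, num_tests)
    print(f"{num_tests:>2} tests: estimated risk {risk:.3f}")
```

The estimates should land close to the closed-form value `1 - 0.95^n` for `n` tests, and they grow as more tests are added.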

Fortunately, there are statistical tools used to compensate and correct for the multiple comparisons problem. Amplitude uses the **Bonferroni correction** to accomplish this.

## Bonferroni correction

The Bonferroni correction is the simplest statistical method for counteracting the multiple comparisons problem. It’s also one of the more conservative methods, and carries with it a greater risk of false negatives than other techniques. The Bonferroni correction does not, for example, consider the distribution of p-values across all comparisons, which would be uniform if the null hypothesis were true for all hypotheses.

Mathematically, the Bonferroni correction works by dividing the false positive rate by the number of hypothesis tests you are running. This is equivalent to multiplying each p-value by the number of hypothesis tests (capping the result at 1) and comparing it against the original false positive rate.
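Both forms of the correction can be sketched in a few lines. This is an illustrative example, not Amplitude’s implementation: the function names and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after the Bonferroni correction."""
    return alpha / num_tests


def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: each p-value times the number of tests, capped at 1."""
    num_tests = len(p_values)
    return [min(1.0, p * num_tests) for p in p_values]


alpha = 0.05
p_values = [0.012, 0.04]  # hypothetical p-values for two variants

threshold = bonferroni_threshold(alpha, len(p_values))
adjusted = bonferroni_adjust(p_values)

for raw, adj in zip(p_values, adjusted):
    # Comparing the raw p-value to the divided threshold and comparing the
    # adjusted p-value to the original alpha lead to the same decision.
    print(f"p={raw}: raw < {threshold} is {raw < threshold}; "
          f"adjusted {adj:.3f} < {alpha} is {adj < alpha}")
```

In this example the first p-value remains significant after correction, while the second (0.04) would have passed an uncorrected 0.05 threshold but does not pass the corrected one.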

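In the button color example above, each of the two tests is evaluated against a threshold of 0.05 divided by 2, or 0.025, instead of 0.05. This controls the family-wise error rate (the probability of rejecting at least one true null hypothesis, meaning at least one false positive) at or below 0.05, which is what we want. For independent tests, it works out to `1 - 0.975^2 ≈ 0.0494`.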

The proof that the Bonferroni correction controls the family-wise error rate follows from Boole’s inequality, and it holds whether or not the tests are independent.
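The argument via Boole’s inequality can be written out in one line. The notation here is introduced for illustration: suppose there are $m$ hypothesis tests, each run at a false positive rate of $\alpha / m$, and let $A_i$ be the event that test $i$ produces a false positive. Taking the worst case, where every null hypothesis is true:

$$
\text{FWER} = P\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i) \le m \cdot \frac{\alpha}{m} = \alpha
$$

The first inequality is Boole’s inequality, which makes no assumption about independence between the events $A_i$. In the button color example, $m = 2$ and $\alpha = 0.05$, so the bound is $2 \times 0.025 = 0.05$.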

Amplitude Experiment performs Bonferroni corrections on both the number of treatments, and the number of metrics in each of the two metric tiers (primary and secondary).