
Amplitude Experiment uses a sequential testing method of statistical inference. Sequential testing has several advantages over t-tests, another widely used method, chief among them being that you don’t need to know how many observations you’ll need to achieve significance before you start the experiment.
Why is this important? With sequential testing, results are valid whenever you view them. That means you can decide to terminate an experiment early based on observations made to that point, and that the number of observations you’ll need to make an informed decision is, on average, much lower than the number you’d need when using a t-test or similar procedures. You can experiment more quickly, incorporating your new learnings into your product and escalating the pace of your experimentation program.
This article will explain the basics of sequential testing, how it fits into Amplitude Experiment, and how you can make it work for you.
Hypothesis testing in Amplitude Experiment
When you run an A/B test, Experiment conducts a hypothesis test using a randomized control trial, in which users are randomly assigned to either a treatment variant or the control. The control represents your product as it currently is, while each treatment includes a set of potential changes to your current baseline product. With a predetermined metric, Experiment compares the performance of these two populations using a test statistic.
In a hypothesis test, you’re looking for performance differences between the control and your treatment variants. Amplitude Experiment tests the null hypothesis H₀: μ_treatment = μ_control, which states that there is no difference between the treatment’s mean and the control’s mean.
For example, if you’re interested in measuring the conversion rate of a treatment variant, the null hypothesis posits that the conversion rates of your treatment variants and your control are the same.
The alternative hypothesis, H₁: μ_treatment ≠ μ_control, states that there is a difference between the treatment and the control. Experiment’s statistical model uses sequential testing to look for any difference between treatments and control.
There are a number of different sequential testing options. Amplitude Experiment uses a family of sequential tests called the mixture sequential probability ratio test (mSPRT), in which a weight function H, called the mixing distribution, averages the likelihood ratio over possible values of the lift θ. This yields the following mixture of likelihood ratios against the null hypothesis that θ = θ₀:

Λₙ = ∫ [Lₙ(θ) / Lₙ(θ₀)] dH(θ)

where Lₙ(θ) is the likelihood of the first n observations when the true lift is θ.
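As a concrete sketch of the closed form this family takes for normal data (an illustration of the mSPRT, not Amplitude’s exact implementation), the two-sample mSPRT with a normal mixing distribution H = N(0, τ²) and a known per-observation variance σ² has a simple mixture likelihood ratio, and the always-valid p-value is updated as pₙ = min(pₙ₋₁, 1/Λₙ):

```python
import math

def msprt_lambda(n, mean_t, mean_c, sigma2, tau2):
    """Two-sample normal mSPRT mixture likelihood ratio.

    A minimal sketch (not Amplitude's implementation): closed form for a
    N(0, tau2) mixing distribution H over the lift theta, with known
    per-observation variance sigma2 and theta0 = 0 under the null.
    """
    delta = mean_t - mean_c          # observed lift after n users per arm
    v = 2.0 * sigma2                 # variance term for the difference of means
    return math.sqrt(v / (v + n * tau2)) * math.exp(
        n * n * tau2 * delta * delta / (2.0 * v * (v + n * tau2))
    )

def always_valid_p(p_prev, lam):
    # The always-valid p-value never increases: p_n = min(p_{n-1}, 1/Lambda_n).
    # This monotonicity is what makes peeking safe.
    return min(p_prev, 1.0 / lam)
```

With no true lift, Λₙ stays at or below 1 and the p-value stays at 1; a real lift drives Λₙ up and the p-value down toward alpha.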
Currently, Amplitude only supports a comparison of arithmetic means between the treatment and control variants for uniques, average totals, and sum of property.
How does sequential testing compare to a t-test?
As mentioned above, using sequential testing lets you look at the results whenever you like. But fixed-horizon tests, such as t-tests, can give you inflated false positive rates if you peek while your experiment is running.
Below is a visualization of p-values over time in a simulation we ran of 100 A/A tests for a particular configuration (alpha = 0.05, beta = 0.2). As we ran a t-test on the incoming data, we peeked at our results at regular intervals. Whenever the p-value fell below alpha, we stopped the test and concluded that it had reached statistical significance.
You can see the p-values fluctuate quite a bit, even before the end of our test, when we’ve reached 10,000 visitors. By peeking, we inflate the number of false positives. The table below summarizes the number of rejections for different configurations of our experiment when we run a t-test.
Here, “baseline” is the conversion rate of our control variant, and “delta_true” is the absolute difference between our treatment and the control. Since this is an A/A test, there is no true difference. With alpha set to 0.05, the number of rejections far exceeds the threshold we set for our Type 1 error rate when we peek at our results: num_reject should never be higher than 5 in this example.
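The inflation described above is easy to reproduce. The following sketch (function name and parameters are illustrative, not Amplitude’s code) runs A/A trials with a two-sample z-test, peeking at regular intervals:

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_trials=100, n_users=10_000, p=0.10,
                                alpha=0.05, peek_every=500, seed=0):
    # A/A simulation: both arms share the same conversion rate p, so every
    # "significant" result is a false positive. We peek with a two-sample
    # z-test (the large-sample t-test) every `peek_every` users per arm.
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        conv_a = conv_b = 0
        for n in range(1, n_users + 1):
            conv_a += rng.random() < p
            conv_b += rng.random() < p
            if n % peek_every == 0:
                pooled = (conv_a + conv_b) / (2 * n)
                se = (2 * pooled * (1 - pooled) / n) ** 0.5
                if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                    rejections += 1   # false positive: stop early and "reject"
                    break
    return rejections / n_trials
```

Running this yields a false positive rate well above the nominal 5%, matching the behavior described above.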
Now compare that to a sequential testing approach. Again, we have 100 A/A tests, and alpha is set to 0.05. We peek at our results at regular intervals, and if we see the p-value go below alpha, we conclude that the test has reached statistical significance. As a result of using this statistical method, the number of false positives stays below this threshold:
With always-valid results, we can end our test any time the p-value goes below the threshold. Of 100 trials with alpha = 0.05, only four fall below that threshold, so the Type 1 error is still controlled.
The table below summarizes the number of rejections we have for different configurations of our experiment when we run a sequential test with mSPRT:
Using the same basic configurations as before, we see that the number of rejections (out of 100 trials) is within our predetermined threshold of alpha = 0.05. With alpha set to 0.05, at most 5% of our experiments will yield false positives, as opposed to the 30–50% we saw when using a t-test. With sequential testing, we can confidently look at our results and conclude experiments at any time, without worrying about inflating false positives.
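The same A/A simulation, but with decisions driven by an mSPRT always-valid p-value, keeps the false positive rate at or below alpha. This is again an illustrative sketch; the mixing variance tau2 is an arbitrary choice, not a value Amplitude documents:

```python
import math
import random

def sequential_false_positive_rate(n_trials=100, n_users=10_000, p=0.10,
                                   alpha=0.05, tau2=0.001, peek_every=500,
                                   seed=0):
    # Same A/A setup as the t-test peeking simulation, but decisions use an
    # mSPRT always-valid p-value: p_n = min(p_{n-1}, 1 / Lambda_n).
    # Peeking no longer inflates the Type 1 error above alpha.
    rng = random.Random(seed)
    sigma2 = p * (1 - p)             # known-variance normal approximation
    v = 2.0 * sigma2
    rejections = 0
    for _ in range(n_trials):
        conv_a = conv_b = 0
        p_val = 1.0
        for n in range(1, n_users + 1):
            conv_a += rng.random() < p
            conv_b += rng.random() < p
            if n % peek_every == 0:
                delta = (conv_a - conv_b) / n
                lam = math.sqrt(v / (v + n * tau2)) * math.exp(
                    n * n * tau2 * delta * delta / (2.0 * v * (v + n * tau2)))
                p_val = min(p_val, 1.0 / lam)
                if p_val <= alpha:
                    rejections += 1
                    break
    return rejections / n_trials
```

Despite peeking at every interval, the fraction of rejected A/A trials stays within the alpha = 0.05 budget.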
FAQs
What is the statistical power of this approach?
Given enough time, the statistical power of our sequential testing method is 1. If there is an effect size to be detected, this approach will detect it.
Why hasn’t the p-value or confidence interval changed, even though the number of exposures is greater than 0?
For uniques, Amplitude Experiment does not compute p-values and confidence intervals until there are at least 25 conversions and 100 exposures each for both the treatment and control.
For average totals and sum of property, Experiment waits until it has at least 100 exposures each for the treatment and control.
Why don’t I see any confidence interval on the Confidence Interval Over Time chart?
This is because the thresholds haven’t been met yet.
For uniques, Experiment waits until there are at least 25 conversions and 100 exposures each for the treatment and control. Then it will start computing the p-values and confidence intervals.
For average totals and sum of property, Experiment waits until it has at least 100 exposures each for the treatment and control.
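The thresholds above can be summarized as a simple gate; the helper name and input shape here are hypothetical, chosen just to illustrate the rule:

```python
def has_enough_data(metric_type, treatment, control):
    # Hypothetical helper encoding the thresholds above: each variant needs
    # at least 100 exposures, and for "uniques" also at least 25 conversions,
    # before p-values and confidence intervals are computed.
    for variant in (treatment, control):
        if variant["exposures"] < 100:
            return False
        if metric_type == "uniques" and variant.get("conversions", 0) < 25:
            return False
    return True
```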
What are we estimating when we choose Uniques?
This measures whether or not your visitors fired a specific event. The result is the proportion of the population that has taken this action. It’s a comparison of proportions, or the conversion rates between treatment and control.
What are we estimating when we choose Average Totals?
This counts the average number of times a visitor has fired an event. For each visitor, Experiment counts the number of times they took the action you’re interested in, and then averages that across the sample within both the control and treatment. The result is a comparison of the average totals between the treatment and control.
What are we estimating when we choose Average Sum of Property?
This sums the values of a specific property of an event, per user. For example, if you’re interested in a user’s total cart value across all of their sessions, you’d pick the event “add to cart” with the property “cart value”. The result in this example is a comparison of the average cart value between treatment and control.
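The three estimands above can be illustrated with a toy event log for a single variant (all names and values here are made up):

```python
# Hypothetical event log: (user_id, event_type, property_value)
events = [
    ("u1", "add to cart", 20.0),
    ("u1", "add to cart", 15.0),
    ("u2", "add to cart", 40.0),
    # u3 was exposed but never fired the event
]
exposed_users = {"u1", "u2", "u3"}

# Uniques: proportion of users who fired the event at least once
fired = {uid for uid, ev, _ in events if ev == "add to cart"}
uniques = len(fired) / len(exposed_users)                    # 2 / 3

# Average totals: mean event count per exposed user
counts = {u: 0 for u in exposed_users}
for uid, ev, _ in events:
    if ev == "add to cart":
        counts[uid] += 1
average_totals = sum(counts.values()) / len(exposed_users)   # (2+1+0)/3 = 1.0

# Average sum of property: mean of each user's summed property value
totals = {u: 0.0 for u in exposed_users}
for uid, ev, value in events:
    if ev == "add to cart":
        totals[uid] += value
avg_sum_of_property = sum(totals.values()) / len(exposed_users)  # 75/3 = 25.0
```

Experiment computes each of these per variant and then compares treatment against control.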
What is absolute lift?
This is the absolute difference between treatment and control.
What is relative lift?
This is the absolute lift scaled by the mean of the control. It’s useful for determining the relative change a treatment produces with respect to the baseline.
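A quick worked example of both lifts (the conversion rates here are made up):

```python
mean_control = 0.10    # e.g. control conversion rate
mean_treatment = 0.12  # e.g. treatment conversion rate

# Absolute lift: raw difference between treatment and control
absolute_lift = mean_treatment - mean_control   # 0.02 (2 percentage points)

# Relative lift: absolute lift scaled by the control's mean
relative_lift = absolute_lift / mean_control    # 0.20, i.e. a 20% relative change
```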
Why does absolute lift exit the confidence interval?
Occasionally you may see the absolute lift exit the confidence interval, which can cause the confidence bounds to flip. This happens when the parameter you’re estimating (the absolute lift) changes over time, or when the allocation for your treatment and control changes. The statistical model Experiment uses assumes that the absolute lift and the variant allocation do not change over time.
The good news is that Experiment’s approach is robust to symmetric time variation, which occurs when the treatment and control means vary in sync, maintaining their absolute difference over time.
One workaround is to choose a different starting date (or date range) where the absolute lift is more stable and the allocation is static.