This article covers frequently asked questions about the Amplitude Experiment's duration estimate.
- How does it work?
- Why is the duration estimate not showing?
- What do worst case, average case, and best case mean?
- Is there a cap for the duration estimate?
- How does Amplitude Experiment determine the number of exposures per day?
- What types of errors are there?
How does it work?
The experiment duration estimate is designed to predict the length of time your experiment will need to generate statistically significant results. It can only be used with the primary metric and sequential testing. It is not currently supported in Experiment Results.
Amplitude Experiment uses the means, variances, and exposures of your control and variants to forecast expected behavior and calculate how many days your experiment will take to reach statistical significance. As Amplitude Experiment receives more data over time, this prediction will improve. However, if any of these inputs change significantly during the experiment, the accuracy of the prediction will likely decrease.
Why is the duration estimate not showing?
The duration estimate is visible when the follow criteria are met:
The following statistical assumptions are met:
- The absolute lift is outside the confidence interval
- The confidence interval is flipped (lower confidence interval > upper confidence interval). This can happen if, while the experiment is running, the mean for either the treatment or the control fluctuates, or if the experiment’s rollout weights or targeting segments are changed
- The standard error is very small
- The variance is negative
- The conversion rate > 1 or < 0 (if applicable)
- The metric has not yet reached statistical significance
- The end date of the analysis window is in the past
- The experiment has enough observations
- The experiment has been rolled out or rolled back
If the estimate is not showing, it likely means that one or more of these criteria have not been met.
What do worst case, average case, and best case mean?
Amplitude Experiment uses the worst case, average case, and best case to describe the uncertainty inherent in its estimate of the time it will take for a hypothesis test to reach statistical significance:
- Best case estimate of three days: 20% of the time, the experiment will reach stat sig in three days or less
- Average case estimate of seven days: 50% of the time, the experiment will reach stat sig in seven days or less
- Worst case estimate of ten days: 80% of the time, the experiment will reach stat sig in ten days or less
Is there a cap for the duration estimate?
Yes. The duration estimate is currently capped at 40 days, for the following reasons:
- The duration estimate uses real-time simulations, where latency scales with the number of days simulated.
- It’s often not wise to assume that the means and standard deviations won't change over time, especially for experiments with longer running times.
- Short-term predictions are easier to make accurately than long-term predictions. (This is why, for example, you rarely see weather forecasts for more than ten days in the future—and even those change frequently as the day in question grows closer.)
- Most experiments should not take 40 days to complete.
How does Amplitude Experiment determine the number of exposures per day?
Amplitude Experiment assumes a constant number of exposures per day. Exposures per day are calculated by dividing the cumulative exposures (as of today) by the number of days the experiment has been running so far.
What types of errors are there?
The experiment duration estimate is still an estimate; it should not be taken as ground truth. Here is a list of some of the error types you might encounter.
Irreducible error is error inherent to the estimation process; unfortunately, it cannot be corrected for.
In simulations, each one will reach statistical significance at different times: this difference is the main reason to run multiple simulations. This is just how randomness works. In fact, the time it takes for an experiment to reach stat sig is actually a random variable itself. It will depend on the p-value, which in turn depends on the data your experiment collects. Even if we cheat and pretend to know the control mean, control standard deviation, treatment mean, and treatment standard deviation, and if we force normal distribution and independence on everything, we still cannot reduce error all the way to zero.
See this video for more information.
When Amplitude Experiment generates a duration estimate, it estimates the control population mean and control population standard deviation, among other things, with the sample estimates. These estimates are as good as they can be. That said, there is potential for error here also.
For example, if today the control mean equals 5, and ten days from now the control mean equals 15, there is drift in the control mean. A common example of drift is seasonality. If there is any drift in any of the statistics, the estimate will do poorly. The estimate assumes no drift when doing hypothesis testing.