Plan experiments with help from the duration estimator

This article will help you:

  • Understand the components of the duration estimator
  • Utilize the duration estimator to plan experiment sample size and run time needed to reach statistical significance

The duration estimator helps you determine the sample size and run time needed to reach statistical significance in your Amplitude experiment, and decide whether the experiment is worthwhile.

NOTE: While Amplitude Experiment supports sequential testing, the duration estimator solely supports determining the sample size for a T-test. Click here to read more about the difference between sequential tests and T-tests. 

Before you begin

Consider several factors when using the duration estimator: active deployments, the number of variants, the allocation of users that will be included (the rollout percentage), and any relevant mutual exclusion, holdouts, or rule-based targeting in use. These variables can directly affect how many days the experiment takes to run and how many users are exposed.

To find the ideal setup for your experiment, use the duration estimator alongside the experiment's planning and configuration stages. At a minimum, set up the following before using the duration estimator:

  • Choose the primary metric for your experiment
  • Review the default direction (increase) and default MDE (2%), and modify as needed
  • Set a non-zero rollout percentage. 

The more you set up for your experiment, the more accurate the estimate will be. Complete everything on the Plan and Configure tabs for a better estimate.

Using the duration estimator

The duration estimator can be accessed from any tab in Amplitude Experiment. Once the minimum required setup is complete, follow these steps to estimate the duration for your experiment:

  1. Click No Duration Estimate to open the duration estimator.

durationEstimator.png

  2. In the modal that appears, add a proxy exposure event by clicking Exposure. A proxy event fires at the same time a user is exposed to the experiment, and closely resembles the primary metric's exposure event.
  3. If desired, add properties to the proxy exposure event by clicking + where.
  4. Next, review the components needed to estimate the duration. The duration estimator lets you enter values based on your unique business needs and relevant historical data. You can keep the defaults or adjust them manually. Modifications may affect the sample size and run time needed to reach statistical significance. Expect a larger sample size to require a longer run time.

The table below describes the components involved in generating the duration estimate.

Component name and default setting | Definition and data validation | Relation to sample size needed for statistical significance
Confidence Level: 95%

The confidence level is 100% minus the false positive rate you are willing to accept. For example, a confidence level of 95% means that when there is no real difference between variants, about 5% of the time you would still interpret the results as statistically significant (a false positive).

Amplitude recommends a minimum of 80%; below that, the experiment's results may no longer be reliable.

You cannot pick 0% or 100%.

The larger the confidence level, the larger the sample size
Control Mean: Automatically computed for you when you select the primary metric

The control mean is the average value of the selected primary metric over the last 7 days (not including today) for users who completed the proxy exposure event. 

Consider adjusting the mean if there was a recent special event or holiday that may have impacted the average in the last 7 days. 

The control mean cannot be 0, regardless of metric type. For conversion metrics, it also cannot be 1. Note that for conversion metrics, 0.5 means 50%, not 0.5%.

The smaller the control mean, the larger the sample size
Standard Deviation: Automatically computed for you when you select the primary metric

Standard deviation measures the spread in the data: roughly, the typical distance between each data point and the mean. It appears only for numerical metrics, not for binary (0-1) conversion metrics. The automatic calculation is based on the standard deviation of the primary metric over the last 7 days (not including today) for users who completed the proxy exposure event.

Can be any positive number.

The larger the standard deviation, the larger the sample size

Power: 80%

Power is the true positive rate: the probability that the experiment detects an effect when one really exists. Equivalently, 100% minus the power is the false negative rate.

Think of power as how much risk you're willing to take of missing a real effect.

You cannot pick 0% or 100%. Do not set power below 70%.

The larger the power, the larger the sample size

Test Type: 2-sided

A 1-sided t-test looks for either an increase or a decrease of the change compared to the mean, whereas a 2-sided t-test looks for both an increase and a decrease.

A 2-sided test requires a larger sample size than a 1-sided test
Minimum Effect (MDE): 2%

The MDE, also called the minimum goal or effect size, is relative to the control mean of the primary metric; it is neither absolute nor standardized. For example, if the conversion rate for control is 10%, an MDE of 2% means a change is detected if the rate moves outside the range of 9.8% to 10.2%.

The right MDE depends on the context of the experiment. Use the smallest change that you would still consider a success.

Can be any positive number, expressed as a percentage. You cannot pick 0%.

The smaller the MDE, the larger the sample size
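
To see how the components above interact, here is a minimal Python sketch of a textbook two-sample sample-size calculation under a normal approximation. It is not Amplitude's implementation, and the estimator's actual numbers may differ. The function name sample_size_per_variant and its parameters are hypothetical, and treating a conversion metric's variance as p * (1 - p) of the control mean is a simplifying assumption.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(control_mean, mde_relative, std_dev=None,
                            confidence=0.95, power=0.80, two_sided=True):
    """Approximate users needed per variant (normal approximation).

    Hypothetical helper for illustration only. Pass std_dev=None to treat
    the metric as a 0-1 conversion rate, in which case the variance is
    assumed to be control_mean * (1 - control_mean).
    """
    alpha = 1.0 - confidence  # accepted false positive rate
    z = NormalDist()          # standard normal distribution
    # A 2-sided test splits alpha across both tails, which raises the threshold.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)

    # The MDE is relative to the control mean, so convert it to absolute units.
    delta = control_mean * mde_relative

    if std_dev is None:
        variance = control_mean * (1 - control_mean)
    else:
        variance = std_dev ** 2

    return ceil(2 * variance * (z_alpha + z_power) ** 2 / delta ** 2)
```

Under these assumptions, a 10% conversion rate with the default settings, sample_size_per_variant(0.10, 0.02), comes out at roughly 350,000 users per variant. Raising the MDE, lowering the confidence level or power, or switching to a 1-sided test all shrink that number, which matches the relationships described in the table.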

Interpreting estimator results 

Once all components have been entered, the duration estimator displays a result: the estimated number of days needed to reach statistical significance. The Estimate details also display the total number of users expected to be needed for your experiment. This Total traffic estimate is based on users who triggered the proxy exposure event in the last 7 days.

The duration estimator will provide solutions if your results are greater than the optimal 30 days, such as removing a variant or two. If results are within a reasonable timeframe for your organization, the duration estimator will state that the estimated number of days "is the optimal amount of time to run your experiment." 
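
As a rough illustration of how a required sample size might translate into days, the sketch below divides the users needed by the average daily traffic seen on the proxy exposure event, scaled by the rollout percentage. This is an assumption about the general shape of the calculation, not Amplitude's actual formula; the function name estimated_days and all of its parameters are hypothetical, and it assumes every day brings the same number of new users.

```python
from math import ceil


def estimated_days(users_per_variant, num_variants,
                   exposed_users_last_7_days, rollout_pct):
    """Rough run time in days (hypothetical helper, illustration only).

    num_variants counts the control as one of the variants.
    """
    total_users_needed = users_per_variant * num_variants
    # Average daily users on the proxy exposure event, scaled by the rollout.
    daily_eligible_users = (exposed_users_last_7_days / 7) * (rollout_pct / 100)
    if daily_eligible_users <= 0:
        raise ValueError("No eligible traffic: check the proxy event and rollout.")
    return ceil(total_users_needed / daily_eligible_users)


# 20,000 users per variant, control plus one treatment,
# 70,000 users on the proxy event last week, 50% rollout.
print(estimated_days(20_000, 2, 70_000, 50))  # prints 8
```

In this sketch, removing a variant lowers the total users needed, and raising the rollout percentage increases daily eligible traffic; both shorten the estimate.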

primeResults.png

Reducing experiment run time

Sometimes the results of the duration estimator could indicate a run time that is longer than desired. Consider the following to decrease your experiment's run time:

  • Modify error rates (for example, lower the confidence level or the power) to reduce the sample size needed.
  • Change the primary metric and exposure event.
  • Target more users.
  • Modify the standard deviation so that outliers don't carry as much weight.
  • Lastly, decide if the experiment is worth the run time or if it should be scrapped.
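
To get a feel for which of these levers matters most, the following sketch compares how the required sample size scales under a standard normal-approximation formula for a 2-sided test. It is an illustrative assumption rather than the estimator's own calculation; relative_sample_size is a hypothetical helper that returns a value proportional to the sample size, so only the ratios are meaningful.

```python
from statistics import NormalDist


def relative_sample_size(confidence=0.95, power=0.80, mde=0.02, std_dev=1.0):
    """Return a value proportional to the sample size of a 2-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (1 - confidence) / 2)
    z_power = z.inv_cdf(power)
    return (std_dev * (z_alpha + z_power) / mde) ** 2


baseline = relative_sample_size()
scenarios = {
    "confidence 95% -> 90%": relative_sample_size(confidence=0.90),
    "power 80% -> 70%": relative_sample_size(power=0.70),
    "MDE 2% -> 4%": relative_sample_size(mde=0.04),
    "standard deviation halved": relative_sample_size(std_dev=0.5),
}
for change, size in scenarios.items():
    print(f"{change}: {size / baseline:.2f}x the baseline sample size")
```

Because the sample size in this formula scales with the square of the standard deviation divided by the MDE, doubling the MDE or halving the standard deviation cuts the required sample to about a quarter, while relaxing the error rates gives a smaller reduction.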

Ultimately, the value of using the duration estimator to plan your experiment depends on your business goals and the risks you're able to take when running experiments. Click here to read more about the experiment design phase.

Debugging

If you try to run the estimator but get an error message, look for the warning icons and try these steps to debug:

  1. Check that you have data for the proxy exposure event.
  2. Check that you have data for the metric, including users who performed the metric event after the proxy exposure event.
  3. Check that a proxy exposure event has been selected.
  4. Check that the control mean is not 0 (or 1, for conversion metrics).
  5. Check that the MDE is not 0%.
  6. Check that the confidence level is not 0% or 100%.
  7. Check that the power is not 0% or 100%.
  8. Make sure your percentage rollout is non-zero.
  9. Check that there are assignment events for the deployment; a newly created deployment may not have any yet. Try removing your deployments, re-running the duration estimate, and then adding the deployments back in.
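
The input-related checks in this list can be expressed as a small validation routine. The sketch below is purely illustrative: validate_estimator_inputs and its parameters are hypothetical names, not part of any Amplitude API, and the checks that depend on your event data (steps 1, 2, and 9) still need to be verified in Amplitude itself.

```python
def validate_estimator_inputs(proxy_exposure_event, control_mean, mde_pct,
                              confidence_pct, power_pct, rollout_pct,
                              is_conversion_metric=False):
    """Return a list of problems with the entered values (empty if none).

    Hypothetical helper mirroring the debugging checklist; percentages are
    given as numbers such as 95 for 95%.
    """
    problems = []
    if not proxy_exposure_event:
        problems.append("Select a proxy exposure event.")
    if control_mean == 0:
        problems.append("Control mean cannot be 0.")
    if is_conversion_metric and control_mean == 1:
        problems.append("Control mean cannot be 1 for conversion metrics.")
    if mde_pct <= 0:
        problems.append("MDE must be a positive percentage.")
    if not 0 < confidence_pct < 100:
        problems.append("Confidence level must be between 0% and 100%, exclusive.")
    if not 0 < power_pct < 100:
        problems.append("Power must be between 0% and 100%, exclusive.")
    if rollout_pct <= 0:
        problems.append("Rollout percentage must be non-zero.")
    return problems


# Example with a hypothetical event name and the default settings.
for problem in validate_estimator_inputs("Page Viewed", 0.10, 2, 95, 80, 100,
                                         is_conversion_metric=True):
    print(problem)
```

With valid inputs like the ones in the example, the list is empty and nothing is printed.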