This article will explain the cumulative exposures graph in Amplitude Experiment (also available in Experiment Results). We’ll look at several examples and explain the nuances of each.
The graph details the number of users who are exposed to your experiment over time. The x-axis displays the date when the user was first exposed to your experiment; the y-axis displays a cumulative, running total of the number of users exposed to the experiment. (Learn about the difference between an assignment event and an exposure event here.) Each user is only counted once, unless they are exposed to more than one experiment variant; in that case, they are counted once for each variant they see.
In the graph below, each line represents a single variant. March 20 is the first day of the experiment, with 158 users triggering the exposure event for the control variant. A day later, a total of 314 users have been exposed to the control variant. That number is the sum of exposures on March 20 and March 21.
This is a very standard cumulative exposure graph: the lines go up and to the right.
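The counting rule described above (each user counted once per variant, on the date of first exposure) can be sketched in a few lines. This is only an illustration, not how Amplitude computes the chart: the `cumulative_exposures` function, the event tuple layout, and the sample dates (including the year) are all assumptions.

```python
from collections import defaultdict
from datetime import date


def cumulative_exposures(events):
    """events: iterable of (user_id, variant, exposure_date) tuples (hypothetical layout).

    Returns {variant: [(date, cumulative_unique_users), ...]} sorted by date.
    """
    # Keep only the earliest exposure date for each (user, variant) pair.
    first_seen = {}
    for user_id, variant, day in events:
        key = (user_id, variant)
        if key not in first_seen or day < first_seen[key]:
            first_seen[key] = day

    # Count how many users were first exposed to each variant on each day.
    new_per_day = defaultdict(lambda: defaultdict(int))
    for (_, variant), day in first_seen.items():
        new_per_day[variant][day] += 1

    # Turn the daily counts into a running total per variant.
    result = {}
    for variant, per_day in new_per_day.items():
        total = 0
        series = []
        for day in sorted(per_day):
            total += per_day[day]
            series.append((day, total))
        result[variant] = series
    return result


# Made-up events for illustration only.
events = [
    ("u1", "control", date(2023, 3, 20)),
    ("u1", "control", date(2023, 3, 21)),    # repeat exposure, not counted again
    ("u2", "control", date(2023, 3, 21)),
    ("u2", "treatment", date(2023, 3, 21)),  # second variant, counted once there too
]
series = cumulative_exposures(events)
```

Here user `u1` contributes a single count to the control line even though they triggered the exposure event twice, while `u2` appears once on each variant's line.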
Mathematically speaking, the slope of each line is the change in the y-axis divided by the change in the x-axis:
∆y / ∆x
= (cumulative users exposed as of day T1 - cumulative users exposed as of day T0) / (number of days elapsed between T0 and T1)
= number of new users exposed to the experiment, per day, from day T0 to day T1.
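As a quick worked example, the slope calculation can be applied to the control variant figures quoted earlier (158 cumulative users on March 20 and 314 on March 21). The function name `exposure_slope` is just an illustrative choice.

```python
def exposure_slope(cumulative_t0, cumulative_t1, days_elapsed):
    """Average number of new users exposed per day between day T0 and day T1."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return (cumulative_t1 - cumulative_t0) / days_elapsed


# Control variant: 158 users as of March 20, 314 users as of March 21.
slope = exposure_slope(158, 314, 1)
print(slope)  # 156.0 new users exposed per day
```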
What are some other things we can say about this graph?
- It’s cumulative, which means the y-axis value will not decrease. The slope of the line is the number of new users exposed to your experiment every day. The line’s growth may slow down, or even stop completely. But you won’t see a cumulative exposures graph where the line peaks and then drops.
- There’s a dotted line at the end, which means there is incomplete data for those dates. See this Help Center article for more information.
- The two lines do not track each other perfectly. That’s because each line represents a unique variant, and exposures can differ slightly between variants, even when they’re set to receive the same amount of traffic.
- Both variants are on a steady growth path. This suggests there is little or no seasonality in exposures. If, for example, users were more likely to engage with your product (and therefore more likely to be exposed to an experiment) on weekdays, you’d see this in the chart: on weekends, the y-axis value would increase more slowly.
Often, changing the x-axis to an hourly setting, rather than daily, will offer new ways of understanding your chart:
Here, the trend is still fairly linear. But since we are now looking at an hourly graph, we can see that from 9 pm to about 5 am, almost no additional users are being exposed to the experiment. This is probably when people are sleeping, so it stands to reason they are not using the product. This is something we couldn’t have seen in the daily version of the graph.
This is a more extreme example. Here, the exposures look like a step function. In this case, it could be that the users who have already been exposed to your experiment at least once are evaluating the feature flag again during these “flat” time periods.
Sometimes, lines will have an inflection point:
You can see here that on February 27, the slope of all three lines changed a bit, from around 70 users per day per variant, to about 100 users per day per variant. (Note that the slope can also flatten after an inflection point.)
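If you have the cumulative counts for a variant, one rough way to locate an inflection point like this is to compare the average daily slope before and after each possible split. The sketch below is a simple heuristic, not a feature of Amplitude, and the numbers are invented to mimic the 70-to-100 users-per-day change described above.

```python
def daily_slopes(cumulative):
    """New users per day, derived from consecutive cumulative totals."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def largest_slope_change(cumulative):
    """Find the split with the biggest change in average daily slope.

    Returns (index, average_before, average_after), where index is the
    position in the daily-slope list at which the new slope starts.
    """
    slopes = daily_slopes(cumulative)
    best = None
    for i in range(1, len(slopes)):
        before = sum(slopes[:i]) / i
        after = sum(slopes[i:]) / (len(slopes) - i)
        change = abs(after - before)
        if best is None or change > best[0]:
            best = (change, i, before, after)
    if best is None:
        raise ValueError("need at least three cumulative data points")
    _, index, before, after = best
    return index, before, after


# Invented series: 70 new users per day for three days, then 100 per day.
cumulative = [70, 140, 210, 280, 380, 480, 580]
index, before, after = largest_slope_change(cumulative)
print(index, before, after)  # 3 70.0 100.0
```

A heuristic like this only tells you when the slope changed; the questions below are still needed to work out why.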
Why did this happen? There are several possibilities:
- Did you increase the traffic to your experiment?
- Did you increase the traffic allocations for each variant? (If you had increased traffic to a single variant, then only one line would show this inflection point.)
- Did you change the targeting criteria (i.e., originally you were targeting users from California, but then decided to target users from California and Florida)?
- Was there some external event, like an increase in the advertising budget or the release of a new feature, that could have driven more users to your experiment?
It’s strongly recommended that you do not change settings for traffic or traffic allocations to variants in the middle of an experiment; in fact, you will see warnings within Amplitude Experiment to discourage this behavior. Doing so can introduce Simpson’s paradox into your results (for a more technical explanation and analysis, see this article). If you have changed the traffic allocation, we suggest you restart the experiment by choosing a new start date. Avoid including any users who were already targeted.
Likewise, you shouldn’t change the targeting criteria during an experiment because the sample is then not representative of what would happen if you rolled out a variant to 100% of your users. Instead, consider gradually increasing traffic to the entire experiment, or doing a gradual feature rollout instead of an experiment.
- Imagine that in the first week of your experiment, you target only Android users, and your experiment is seen by 100 of them. The following week, you change the targeting criteria to include iOS users, and your experiment is seen by another 100 Android users and 20 iOS users. After two weeks, your experiment has been seen by 220 users; 9% of them (20/220 = 1/11 ≈ 9%) are iOS users. However, when you release your experiment to 100% traffic, you discover the true percentage of iOS users is actually 16.7% (20/120 = 1/6, the mix seen in the second week). In this case, iOS users are underrepresented in your results. If the experiment shows a positive lift for Android users but a negative lift for iOS users, you may roll out a feature based on what looks like a positive experiment when its overall effect is in fact negative.
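To see how the platform mix can flip the sign of the overall result, here is a small sketch of the Android/iOS example. It assumes the 220 users consist of 200 Android users and 20 iOS users, and the per-platform lifts are hypothetical values chosen purely for illustration.

```python
def blended_lift(n_android, n_ios, lift_android, lift_ios):
    """Overall lift as a user-weighted average of the per-platform lifts."""
    total = n_android + n_ios
    return (n_android * lift_android + n_ios * lift_ios) / total


# Hypothetical per-platform lifts (not taken from any real experiment).
lift_android = 0.01   # small positive lift on Android
lift_ios = -0.08      # larger negative lift on iOS

# Mix observed during the experiment: 20 of 220 users (about 9%) are iOS.
experiment_lift = blended_lift(200, 20, lift_android, lift_ios)

# Mix at full rollout: 20 of every 120 users (about 16.7%) are iOS.
rollout_lift = blended_lift(100, 20, lift_android, lift_ios)

print(experiment_lift > 0)  # True: the experiment looks positive
print(rollout_lift < 0)     # True: the rollout is actually negative
```

With these assumed lifts, the experiment reports a small positive result only because iOS users are underrepresented in the sample.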
Once you have answered the question of why the slope changed, consider whether the end date of the experiment should be adjusted. With more traffic, you’ll reach statistical significance faster. If you are getting less traffic, you will reach statistical significance more slowly.
Sometimes, an experiment’s cumulative exposures can start out strong but then slow down over time.
When this experiment launched, each variant was exposed to about 280 new users each day. But toward the end, those exposure rates were down to about 40 new users per variant, per day.
This can happen when you’re targeting a static cohort (one that does not grow or shrink on its own) that you created in Amplitude. For example, imagine a static cohort with 100 members. On the first day, your experiment was shown to 40 of those users. That leaves only 60 more users eligible to be included in the future. With each passing day, there are fewer and fewer users who can enter into the experiment in the first place, and the slope of your cumulative exposures graph will inevitably flatten.
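The flattening effect is easy to reproduce with a toy simulation. The sketch below assumes, purely for illustration, that 40% of the remaining eligible cohort members are exposed each day; only the 100-member cohort and the 40 users on day one come from the example above.

```python
def simulate_static_cohort(cohort_size, daily_exposure_rate, days):
    """Cumulative exposures when a fixed share of the remaining cohort is exposed daily."""
    remaining = cohort_size
    cumulative = 0
    series = []
    for _ in range(days):
        # Only users who have not been exposed yet can be newly exposed.
        newly_exposed = round(remaining * daily_exposure_rate)
        cumulative += newly_exposed
        remaining -= newly_exposed
        series.append(cumulative)
    return series


series = simulate_static_cohort(cohort_size=100, daily_exposure_rate=0.4, days=5)
print(series)  # [40, 64, 78, 87, 92]
```

The daily increments shrink (40, 24, 14, 9, 5) even though nothing about the experiment changed, which is the curve shape described above.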
If you’re using a static cohort in an experiment, consider rethinking how you’re using the sample size calculator. Instead of solving for the sample size, you should ask what level of lift you can reasonably detect with this fixed sample size.
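One way to ask "what lift can I detect with this fixed sample size?" is to invert the usual sample size formula. The sketch below uses a common textbook approximation for a two-sided test on conversion rates with equal-sized variants; it is not the formula used by Amplitude's sample size calculator, and the function name, defaults, and example inputs are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_variant, alpha=0.05, power=0.8):
    """Approximate absolute lift in conversion rate detectable with a fixed sample.

    Uses the normal approximation with the baseline variance for both variants.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)
    return (z_alpha + z_beta) * standard_error


# A 100-member static cohort split evenly across two variants, 10% baseline rate.
small_cohort_mde = minimum_detectable_effect(baseline_rate=0.10, users_per_variant=50)

# The same baseline with a much larger cohort, for comparison.
large_cohort_mde = minimum_detectable_effect(baseline_rate=0.10, users_per_variant=5000)
```

Under these assumptions, the 100-member cohort can only detect an absolute lift of roughly 17 percentage points, which illustrates why a small static cohort constrains what an experiment can tell you.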
Static cohorts also limit the impact of your experiment.
Whenever you use a cohort in this way, ask yourself whether the cohort is actually representative of a larger population that would show a similar lift if more users were exposed to the winning variant. You can’t assume this; doing so would be like running an experiment in one country and then assuming you’ll see the same impact in any other country.
Other possible causes include:
- Using a dynamic cohort that isn’t growing quickly enough, or is effectively a static cohort
- How you handle sticky bucketing: If users enter the cohort and then exit, do you want them to continue to see the experiment (for consistency’s sake) even though they no longer meet the targeting criteria?
- The number of users that interact with your experiment might be limited
- The experiment is initially shown to a group of users who are not representative of users exposed later. Users who have been using your product for 30 days may interact with the feature you’re testing differently than those who’ve been around for 100 days, for example. Consider running your experiment for longer than you had originally planned, to make sure you’re studying the effect of the treatment on a steady state of users.
- Users gradually become numb to your experiment and stop responding to it after repeated exposures. But the flattening of your cumulative exposures graph doesn’t necessarily mean the experiment’s impact is limited—even if no additional users are being exposed to your experiment, those who are still in it might still be responding to it and delivering value.
Bear in mind that just because the cumulative exposures graph has flattened out does not mean that the experiment has a limited impact. It all depends on the specifics of your users’ behavior.
What does this mean for your pre-experiment duration estimate and current duration estimate? Seeing this kind of graph has serious implications for how long you will need to run your experiment. The standard method of calculating the duration of an experiment is to use a sample size calculator and divide the required sample size by the average traffic per day. That method assumes traffic stays constant, which is not the case here. Generally, you’ll need to run the experiment for longer than expected, since the average daily traffic in the denominator was overestimated.
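The gap between the two estimates can be illustrated with a short sketch. It compares the standard calculation with one where daily exposures shrink by a constant factor each day. The 280 users per day matches the launch rate mentioned earlier; the required sample size and the decay factors are hypothetical.

```python
import math


def naive_duration_days(required_sample, average_daily_exposures):
    """Standard estimate: required sample divided by a constant daily traffic."""
    return math.ceil(required_sample / average_daily_exposures)


def duration_with_decay(required_sample, first_day_exposures, daily_decay, max_days=365):
    """Days needed when each day's new exposures are daily_decay times the previous day's.

    Returns None if the sample size is not reached within max_days.
    """
    cumulative = 0.0
    rate = first_day_exposures
    for day in range(1, max_days + 1):
        cumulative += rate
        if cumulative >= required_sample:
            return day
        rate *= daily_decay
    return None


required = 2000  # hypothetical required users per variant

print(naive_duration_days(required, 280))        # 8
print(duration_with_decay(required, 280, 0.9))   # 12
print(duration_with_decay(required, 280, 0.8))   # None
```

With a 10% daily decline the experiment needs 12 days instead of 8, and with a 20% daily decline the total exposures level off at about 1,400, so the required sample is never reached.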
Focusing on the data from March 4 to March 11, the graph is fairly flat. This indicates that very few new users were added to the experiment during that time period. Potential explanations include:
- You’ve run out of users to add to the experiment
- There is a bug in the sending of exposure events, or
- Your product’s usage is strongly affected by seasonality.
You can see a strong illustration of that last bullet point in this hourly chart:
Between March 21 at 7 pm and March 22 at 9 am (the rightmost section of the graph), very few users were exposed to this experiment. But just before that, starting at around 5 am, a large number of users were exposed. Yet on the left-hand side of the graph, the pattern is one of users slowly trickling in. When you consider that this experiment is run by an online gambling company, it makes sense that there would be these spikes in traffic when they run their jackpots.
Divergent lines with similar slopes
In this example, the two variants began receiving traffic on two separate days, February 23 and February 28, resulting in a pair of staggered lines on the graph.
Ideally, you should not begin an experiment until all variants are ready to receive traffic. Adding a new variant after the experiment is underway can present a misleading picture of the results, since all variants were not subject to the same conditions for the same length of time. It’s entirely possible that something could have occurred prior to adding the new variant that would have had an effect on the results.
Another potential issue is the novelty effect: users exposed to the green variant may have had more time to adjust to the new experience. Finally, users exposed to that green variant have had more opportunities to trigger the primary metric, especially if it’s an unbounded time metric, thus making the comparison between variants unequal.
A central requirement of experimentation is to make sure that the only difference between treatment and control is the feature you are testing. This way, you know any differences you see are the result of causation, and not just correlation.
Divergent lines with different slopes
This situation could arise for a number of reasons. If you’re using a custom exposure event, users may see old, cached variants of your experiment if they keep triggering the exposure event without triggering the assignment event.
For example, a user could be assigned to the control variant without triggering the exposure event. If in the future you set the traffic allocation for the control variant to 0%, that user could return afterwards and trigger the exposure event without triggering a new assignment event. That user will be counted as a control exposure.
This reasoning also holds for experiments with sticky bucketing on.
In this example, on March 15, the experiment owner rolled out the control variant to 100% of traffic. Since sticky bucketing was turned on, the number of “on” users keeps increasing even after the traffic allocation for the “on” variant is set to 0%. This happens because allocation occurs when the variant is requested via the SDK or API, so the variant can become “stuck” even without the user having been exposed to it.
You should keep in mind that when you select sticky bucketing and change the traffic allocation, you will not get the desired traffic allocation. Instead, you’ll get a weighted average between the two allocations, since the users who were previously bucketed will stay in their bucket. You’ll have to wait a bit to get close to the desired traffic allocation.
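The weighted average can be made concrete with a short sketch. It assumes a simplified model in which a known fraction of users were bucketed under the old allocation and stay there, while everyone else follows the new allocation. The function name and the example allocations are hypothetical.

```python
def effective_allocation(old_allocation, new_allocation, sticky_fraction):
    """Blend two allocations, weighting the old one by the share of sticky users.

    old_allocation, new_allocation: {variant: share of traffic}, each summing to 1.
    sticky_fraction: share of users who were bucketed before the change.
    """
    variants = set(old_allocation) | set(new_allocation)
    return {
        variant: sticky_fraction * old_allocation.get(variant, 0.0)
        + (1 - sticky_fraction) * new_allocation.get(variant, 0.0)
        for variant in variants
    }


old = {"control": 0.5, "treatment": 0.5}
new = {"control": 0.2, "treatment": 0.8}

# 60% of the users seen so far were bucketed before the allocation changed.
blended = effective_allocation(old, new, sticky_fraction=0.6)
```

In this example the observed split is about 38% control and 62% treatment rather than the desired 20/80. As new users arrive and the sticky fraction shrinks, the observed split moves toward the new allocation.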
If your experiment has sticky bucketing turned on, and you’re planning to roll out a variant after it ends, you should delete the appropriate branch in the code and remove the feature flag. If you do not want to make a code deployment, you can also turn off sticky bucketing.