Alternative experiment protocols
I came across an interesting list of alternatives to vanilla A/B testing in the blog post “Choose your experiment platform wisely”:
Good experiment platforms go beyond the minimum functionality in many ways. Among other things, they may…allow experiment protocols other than simple A/B testing. Crossover, switchback, bandit, and staged rollout protocols (among others) have advantages over A/B designs in relevant situations
Here is a short summary of what exactly these alternative “experiment protocols” mean:
Example referenced: Crossover study - Wikipedia
“…a longitudinal study in which subjects receive a sequence of different treatments.”
“Nearly all crossover trials are designed to have “balance”, whereby all subjects receive the same number of treatments and participate for the same number of periods. In most crossover trials each subject receives all treatments, in a random order.”
“These studies are often done to improve the symptoms of patients with chronic conditions. For curative treatments or rapidly changing conditions, cross-over trials may be infeasible or unethical.”
Takeaway: this is useful if you have hypotheses about sequencing a repeatable treatment (like applying a medicine to a patient over time).
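The balanced-design idea can be sketched in a few lines of Python. This is an illustrative assignment scheme of my own, not from the Wikipedia article: each subject receives every treatment exactly once, in a random order, one treatment per period.

```python
import random

def assign_crossover_sequences(subjects, treatments=("A", "B"), seed=0):
    """Assign each subject a randomized order of all treatments (one per
    period). The design is balanced: every subject receives every treatment
    exactly once and participates for the same number of periods."""
    rng = random.Random(seed)
    schedule = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)
        # e.g. ["B", "A"] means: treatment B in period 1, treatment A in period 2
        schedule[subject] = order
    return schedule

schedule = assign_crossover_sequences(["s1", "s2", "s3", "s4"])
for subject, order in schedule.items():
    print(subject, order)
```

In a real crossover trial you would also enforce washout periods between treatments and analyze for carryover effects, which this sketch omits.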
Example referenced: Switchback Tests and Randomized Experimentation Under Network Effects at DoorDash
“To be able to experiment in the face of network effects, we use a technique known as switchback testing, where we switch back and forth between treatment and control in particular regions over time. This approach resembles A/B tests in many ways, but requires certain adjustments to the analysis.”
“…the core concept is that we switch back and forth between control and treatment algorithms in a certain region at alternating time periods. For example, in the SOS pricing example, we switch back and forth every 30 minutes between having SOS pricing and not having SOS pricing. We then compare the customer experience and marketplace efficiency between the control time bucket and treatment time bucket metrics corresponding to the decisions made by the algorithm during the two periods.”
Takeaway: this is useful if your variants affect each other, for example when one variant siphons available resources away from another.
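A minimal sketch of switchback bucketing, assuming a deterministic hash of (region, time window). The 30-minute window length mirrors the SOS pricing example from the DoorDash post, but the function name and hashing scheme here are my own illustration, not their implementation:

```python
import hashlib
from datetime import datetime, timezone

WINDOW_MINUTES = 30  # window length; 30 minutes as in the SOS pricing example

def switchback_variant(region: str, ts: datetime, salt: str = "sos-pricing") -> str:
    """Map a (region, time window) pair to 'control' or 'treatment'.
    Every event in the same region during the same 30-minute window gets
    the same variant, and windows flip pseudo-randomly between variants."""
    window = int(ts.timestamp() // (WINDOW_MINUTES * 60))
    key = f"{salt}:{region}:{window}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return "treatment" if bucket % 2 else "control"

now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
print(switchback_variant("sf_bay", now))
```

Analysis then compares metrics between control-window and treatment-window buckets rather than between user groups.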
Example referenced: Multi-Armed Bandits and the Stitch Fix Experimentation Platform
“a multi-armed bandit learns to divert traffic away from poorly-performing arms and towards the better-performing ones.”
“The multi-armed bandit model is a simplified version of reinforcement learning, in which there is an agent interacting with an environment by choosing from a finite set of actions and collecting a non-deterministic reward depending on the action taken. The goal of the agent is to maximize the total collected reward over time.”
“The agent must make decisions based on incomplete information, resulting in a dilemma known as the explore vs. exploit problem. Early in the process, reward estimates are highly uncertain, so the agent is compelled to gather evidence about all the available actions, in order to be more certain about which action is likely to produce the highest average reward.”
“can provide more efficient optimization than standard A/B tests in some circumstances, and a more flexible method for long-term optimization”
Takeaway: this protocol is useful if you have many variants, are okay with letting the test run for a longer time period (perhaps indefinitely), and wish to minimize “regret” (opportunity cost) from over-allocating to under-performing variants.
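To make the explore-vs-exploit tradeoff concrete, here is a minimal epsilon-greedy bandit. This is a deliberately simple strategy of my own choosing (Stitch Fix's production system is more sophisticated), and all names, parameters, and the simulated conversion rates are illustrative assumptions:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: with probability epsilon, explore a
    random arm; otherwise exploit the arm with the best observed mean reward."""

    def __init__(self, n_arms: int, epsilon: float = 0.1, seed: int = 0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.counts = [0] * n_arms     # pulls per arm
        self.totals = [0.0] * n_arms   # cumulative reward per arm

    def choose(self) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: untried arms rank first (infinite optimism), then best mean.
        means = [t / c if c else float("inf") for t, c in zip(self.totals, self.counts)]
        return means.index(max(means))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward

# Simulated conversion rates for three variants (arm 2 is best).
true_rates = [0.02, 0.05, 0.08]
bandit = EpsilonGreedyBandit(n_arms=3, epsilon=0.1, seed=42)
for _ in range(5000):
    arm = bandit.choose()
    reward = 1.0 if bandit.rng.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts)  # traffic typically concentrates on the best arm over time
```

The diversion of traffic away from poorly-performing arms is exactly the "regret" minimization described in the takeaway: the bandit spends most pulls on the arm it currently believes is best while reserving a small exploration budget.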
Example referenced: Data Science for Startups
“…it may not be possible to control which users are part of a treatment group for an experiment…two different approaches for drawing conclusions when you do not have direct control of assigning experiment groups”
“Staged rollouts enable developers to release a new build to a subset of the user base. Rather than implementing A/B logic within the application itself, by writing ‘if’ blocks that maintain separate treatment and control code paths, developers can build two separate versions and deploy them simultaneously. This feature is useful when making major changes to an application, such as redesigning the UI. A drawback of staged rollouts is that we no longer control the A/B splits for experiments. Common use cases for staged rollouts include:
- Testing the stability of a new release.
- Measuring the usage of new features.
- Measuring the impact of a new release on metrics.”
Takeaway: this protocol is useful if you lack granular control over applying your variants’ treatment. For example, if your app is distributed through an app store, you are beholden to its limited or nonexistent experimentation tooling.
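Staged rollouts are usually implemented by the distribution platform itself, but the underlying gating logic can be sketched as a deterministic hash bucket. In this sketch (the function name, feature key, and thresholds are assumptions, not any store's API), a user's decision stays stable as the rollout percentage ramps up:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_pct: float) -> bool:
    """Deterministically decide whether a user receives the new build.
    Hashing (feature, user) to a stable bucket in [0, 100) means the same
    user keeps the same decision as rollout_pct ramps from 1% toward 100%."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < rollout_pct

# Ramp the hypothetical "new_ui" release to 5% of a simulated user base.
users = [f"user-{i}" for i in range(10000)]
enrolled = sum(in_rollout(u, "new_ui", 5.0) for u in users)
print(enrolled)  # roughly 5% of the 10,000 simulated users
```

Note the monotonic property: every user enrolled at 5% remains enrolled at any higher percentage, which is what lets you ramp a release gradually without churning users between builds.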
It’s always interesting to learn about these variations on the classic A/B testing protocol. As more software-enabled businesses crop up, more distinctive experimentation protocols will be applied; knowing about these innovations may help you when you encounter such distinctive situations.