A book by Ron Kohavi, Diane Tang, and Ya Xu that focuses on how to run online experiments (e.g. A/B testing and the related statistics).
Definitions:
Overall Evaluation Criterion (OEC) : a quantitative measure of an experiment’s objectives (e.g. “active days per user”). Known as a “response” or “dependent variable” in statistics; called an outcome, evaluation, or fitness function elsewhere.
Parameter : A controllable experimental variable thought to influence the OEC (e.g. “font color”); aka factors or variables.
Level : a value of a parameter (e.g. “helvetica” or “arial”); aka “value”.
Univariable test : An A/B-style test across a single parameter with multiple levels.
Multivariate test (MVT) : An A/B-style test that considers both multiple parameters and multiple levels.
Variant : A user experience being tested by assigning values to parameters.
Control : A special variant which has the existing functionality for baseline testing.
Treatment : another word for variant, typically used in contrast to “control”.
randomization unit : the unit (e.g. a user or page) to which a hashing process is applied in order to map it to a variant (see the sketch after these definitions).
expected value of information (EVI) : captures how additional information can help you in decision making
internal validity : the correctness of the experimental results without attempting to generalize to other populations or time periods. See Threats to internal validity in experiments
external validity : the extent to which the results of a controlled experiment can be generalized along axes such as different populations and over time. See Threats to external validity
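The randomization-unit definition above mentions hashing units to variants; here is a minimal sketch of that idea. This is my own illustration, not the book’s implementation: the function name, experiment id, and variant names are made up, and a real system would also handle things like traffic allocation percentages.

```python
import hashlib

def assign_variant(unit_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a randomization unit (e.g. a user id) to a variant.

    Hashing unit_id together with experiment_id keeps a user's assignment stable
    within an experiment while staying independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # roughly uniform bucket index
    return variants[bucket]

# The same user always lands in the same variant for a given experiment.
print(assign_variant("user-12345", "font-color-test", ["control", "treatment"]))
```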
They reference the Bing search-relevance team’s target to improve an OEC metric by 2% per year. That 2% is the cumulative result of many experiments and their treatment effects. To determine how much a given experiment/feature contributed to the goal, they run a “certification experiment”: re-running the experiment with a single treatment after iteration on the feature is complete.
They mention that it’s beneficial to use experiments to drive site iteration, finding that full redesigns not only fail to achieve their goals, but often fail even to reach parity with the old site on key metrics (pg 21).
“Strategic Integrity” is the marrying of strategy to the OEC, in a way that seems quite similar to OKRs.
Strategic integrity is not about crafting brilliant strategy or about having the perfect organization: it’s about getting the right strategies done by an organization that is aligned and knows how to get them done. It is about matching top-down-directed perspectives with bottom-up tasks. (Sinofsky and Iansiti 2009)
Tenets of online experimentation
The organization wants to make data-driven decisions and has formalized an OEC
Many organizations will not spend the resources required to define and measure progress. It is often easier to generate a plan, execute against it, and declare success, with the key metric being “percent of plan delivered”, ignoring whether the feature has any positive impact on key metrics.
They cite customer lifetime value (LTV) as a strategically powerful OEC.
The organization is willing to invest in the infrastructure and tests to run controlled experiments and ensure that the results are trustworthy.
When running online experiments, getting numbers is easy; getting numbers you can trust is hard.
The organization recognizes that it is poor at assessing the value of ideas.
Only one third of the ideas tested at Microsoft improved the metric(s) they were designed to improve (Kohavi, Crook and Longbotham 2009). Success is even harder to find in well-optimized domains like Bing and Google, where by some measures the success rate is only about 10-20% (Manzi 2012).
“If you are on an experiment-driven team, get used to, at best, 70% of your work being thrown away. Build your processes accordingly” (Mosavat 2019)
Quotes I liked
If we have data, let’s look at data. If all we have are opinions, let’s go with mine
- Jim Barksdale, Former CEO of Netscape
Confidence intervals can overlap by as much as 29% and yet the delta will be statistically significant. Non-overlap, however, does imply significance: if the 95% confidence intervals do not overlap, then the treatment effect is statistically significant with p-value < 0.05.
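A quick numeric sketch of this point (the numbers are made up for illustration): two 95% confidence intervals that overlap while the delta between them is still significant at p < 0.05, because the standard error of the difference grows like sqrt(se1² + se2²) rather than se1 + se2.

```python
from math import sqrt, erf

# Two variant means with equal standard errors (illustrative values, not from the book).
mean_control, mean_treatment = 0.0, 3.0
se = 1.0
z95 = 1.96

ci_control = (mean_control - z95 * se, mean_control + z95 * se)        # (-1.96, 1.96)
ci_treatment = (mean_treatment - z95 * se, mean_treatment + z95 * se)  # (1.04, 4.96)
# The two intervals overlap on (1.04, 1.96)...

# ...yet the difference is statistically significant.
se_delta = sqrt(se**2 + se**2)
z_stat = (mean_treatment - mean_control) / se_delta          # ~2.12
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z_stat) / sqrt(2))))   # ~0.034 < 0.05
print(ci_control, ci_treatment, z_stat, p_value)
```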
Open questions
- How does one conduct a “power analysis” to determine the sample size needed to detect a 1% change in revenue per user? (A rough sketch follows this list.)
- I don’t fully understand pages 41-42 where they say “When running an online controlled experiment, you could continuously monitor the p-values. In fact, early versions of the commercial product Optimizely encouraged this (Johari et al 2017). Such multiple hypothesis testing results in significant bias (by 5-10x) in declaring results to be statistically significant.” They offer two alternatives: sequential tests with always-valid p-values, or a predetermined experiment duration (e.g. a week) for determining statistical significance. My question: what’s wrong with just observing the p-value along the way? (See the simulation sketch after this list.)
- Look up what “False Discovery Rate” is and how it relates to dealing with multiple tests.
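On the power-analysis question above: a commonly used rule of thumb for roughly 80% power at α = 0.05 (two-sided) is n ≈ 16σ²/δ² samples per variant, where σ is the metric’s standard deviation and δ is the absolute difference you want to detect. A rough sketch with made-up numbers:

```python
# Rule-of-thumb sample size per variant for ~80% power at alpha = 0.05 (two-sided):
#   n ≈ 16 * sigma^2 / delta^2
# Made-up numbers: revenue per user with standard deviation $30 and mean $3.00,
# so a 1% relative change is an absolute delta of $0.03.
sigma, delta = 30.0, 0.03
n_per_variant = 16 * sigma**2 / delta**2
print(f"{n_per_variant:,.0f} users per variant")  # 16,000,000
```

And on the “what’s wrong with continuously monitoring the p-value” question: a small A/A simulation (my own sketch, not from the book) shows the problem. With no true effect and a single fixed-horizon test, about 5% of runs are false positives; if you peek repeatedly and stop at the first p < 0.05, the false-positive rate ends up several times higher, which is the bias the book describes.

```python
import random
from math import sqrt, erf

def p_value(delta: float, se: float) -> float:
    """Two-sided z-test p-value for an observed difference in means."""
    z = abs(delta) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def aa_test_with_peeking(n_per_arm: int = 5_000, peeks: int = 50) -> bool:
    """Run one A/A test (no true effect) and peek at the p-value `peeks` times.

    Returns True if any interim look hits p < 0.05, i.e. a false positive for an
    experimenter who stops as soon as the result looks significant.
    """
    sum_a = sum_b = 0.0
    n = 0
    step = n_per_arm // peeks
    for _ in range(peeks):
        for _ in range(step):
            sum_a += random.gauss(0, 1)
            sum_b += random.gauss(0, 1)
            n += 1
        se = sqrt(2 / n)  # both arms have known unit variance
        if p_value(sum_b / n - sum_a / n, se) < 0.05:
            return True
    return False

runs = 200
rate = sum(aa_test_with_peeking() for _ in range(runs)) / runs
print(f"false-positive rate with peeking: {rate:.0%}")  # well above the nominal 5%
```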
Follow up reading
- A Dirty Dozen: Twelve P-Value Misconceptions (Goodman 2008)