OpenFeature Summit at KubeCon NA 2024.
How to rollout an update for a CNI without breaking the World Wide Web
Project Calico: Jen Luther Thomas (product marketing, Tigera) and Reza Ramezanpour (dev advocate, Tigera). Tigera handles the commercial side of Calico (https://www.tigera.io/project-calico/), which is container networking and network security tooling.
CNI w/ a pluggable dataplane. You can change the networking engine w/ flags
- eBPF
- standard Linux (iptables)
- Windows
- VPP (Vector Packet Processing), similar to DPDK
- recent support for nftables
They do feature flags based on env vars.
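As an aside, a minimal sketch of the env-var flag pattern in TypeScript (illustrative only; Calico itself is Go, and the variable name below is made up):

```ts
// Illustrative env-var feature flag (hypothetical variable name, not Calico's).
function envFlag(name: string, defaultValue: boolean): boolean {
  const raw = process.env[name];
  if (raw === undefined) return defaultValue; // unset flag keeps the default behavior
  return ["1", "true", "enabled"].includes(raw.toLowerCase());
}

// e.g. pick a dataplane behind a flag
const useNftables = envFlag("EXAMPLE_ENABLE_NFTABLES", false);
console.log(useNftables ? "nftables dataplane" : "iptables dataplane");
```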
Discussion of how to use feature flags.
Feature flag don'ts:
- don't re-use flags (if a flag name is recycled, users may get a surprise)
- complementary configs (@@ I don't understand this)
- note: flags add code complexity
Reddit Pi Day incident
- Trying to upgrade k8s v1.23 -> v1.24
- k8s 1.24 dropped the legacy 'master' label from control-plane nodes
- Calico expected that label to be present
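For context, k8s 1.24 removed the legacy node-role.kubernetes.io/master label in favor of node-role.kubernetes.io/control-plane. A hedged sketch of the defensive pattern (illustrative, not Calico's actual fix):

```ts
// Tolerate both the legacy and the current control-plane labels instead of
// assuming the old one exists (illustrative, not Calico's actual code).
const CONTROL_PLANE_LABELS = [
  "node-role.kubernetes.io/control-plane", // current label
  "node-role.kubernetes.io/master",        // legacy label, removed in k8s 1.24
];

function isControlPlaneNode(nodeLabels: Record<string, string>): boolean {
  return CONTROL_PLANE_LABELS.some((label) => label in nodeLabels);
}

// usage: isControlPlaneNode(node.metadata.labels ?? {})
```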
Test Smarter, Not Harder: QA Enhancements with OpenFeature
Meha Bhalodiya, QA @ Red Hat, part of the k8s release team
Challenges in traditional QA
- Balancing speed & quality: How do we ensure software is bug-free without slowing down delivery?
- Complex & costly test environments: creating/managing/maintaining different environments for testing -> resource-intensive
- Risk of releasing new features: uncertainty and potential risk to existing functionality
Understanding OpenFeature
- separate deploy from release
Setting up flagging strategies
- need a strategy for managing the lifecycle of a flag
- naming conventions / flag governance
Implementing flags w/ OpenFeature (JS code examples)
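A minimal sketch of the kind of JS/TS evaluation shown, using the OpenFeature Node server SDK with its in-memory provider (package, provider, and flag names as I recall them from the docs; treat as a sketch rather than the talk's exact code):

```ts
import { OpenFeature, InMemoryProvider } from "@openfeature/server-sdk";

// The in-memory provider is handy for QA: flag state is fully controlled in code.
const provider = new InMemoryProvider({
  "new-checkout": {
    variants: { on: true, off: false },
    defaultVariant: "off",
    disabled: false,
  },
});

async function main() {
  await OpenFeature.setProviderAndWait(provider);
  const client = OpenFeature.getClient();

  // Deploy vs. release: the new code path ships dark and is turned on via the flag.
  const enabled = await client.getBooleanValue("new-checkout", false);
  console.log(enabled ? "new checkout flow" : "old checkout flow");
}

main();
```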
Migrating to OpenFeature at Scale (my talk, with Chetan Kapoor)
TL;DR: We ran a program that resulted in an internal flagging solution built on OpenFeature. It now handles trillions of calls a day.
Experimentation Programs at Scale: Lessons learned at top companies
Graham @ growthbook.io
“Most popular open source feature flagging and A/B testing platform”; Y Combinator 2022
Q: How many tests do you run? (asked of companies that self-describe as doing this well):
- financial institution: “3-5 tests a year”, and they were very proud
- large social network: 50,000-60,000 running at any given time
Why companies A/B test
GoodUI.org - examples of A/B tests that did well in the world
For a non-optimized product, A/B test results split roughly evenly: ~33% win, ~33% lose, ~33% show no difference.
The win rate drops a lot as the site gets more optimized.
“No one launches a feature they don’t think will win” … yet features are only successful about 33% of the time or less.
If you’re not testing, you’re getting it wrong 1/3rd of the time and don’t know to back out the change.
“Without testing, you’re guessing”
Instead of won/loss, they use won/save or won/learned
- “Did the experiment help us make the right decision”
Why run at scale?
Wymyn’s law (??): the probability of success is low in general, and each individual change has a low probability of winning as well. To compensate, run tons of experiments, so there are more changes in flight and a better chance of finding something that works.
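As a rough illustration of that logic (numbers are mine, not from the talk), the chance that at least one of n experiments wins grows quickly with n:

```ts
// If each experiment wins with probability p, then
// P(at least one win in n experiments) = 1 - (1 - p)^n. Illustrative numbers only.
const pWin = 0.1; // assume a 10% per-experiment win rate
for (const n of [1, 10, 50]) {
  const atLeastOne = 1 - Math.pow(1 - pWin, n);
  console.log(`${n} experiments -> ${(atLeastOne * 100).toFixed(0)}% chance of >= 1 win`);
}
// 1 -> 10%, 10 -> ~65%, 50 -> ~99%
```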
How to run at scale?
Experimentation can go wrong in many ways
- bias: confirmation, selection
- assignment issues (how we bucket users)
- stats issues: SRM (sample ratio mismatch), multiple exposures, p-value corrections, variance reduction techniques (CUPED), priors, Winsorization (see the SRM check sketch after this list)
- Metrics: ratio metrics (delta method), quantiles
- decision risks: making product decisions when the experimentation platform itself has a bug
- data engineering stuff
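Since SRM came up, a minimal sketch of the standard check, a chi-square goodness-of-fit test on assignment counts (threshold and API are my own, not GrowthBook's):

```ts
// Sample Ratio Mismatch (SRM): compare observed assignment counts to the expected
// split with a chi-square goodness-of-fit statistic (df = variants - 1).
function srmChiSquare(observed: number[], expectedRatios: number[]): number {
  const total = observed.reduce((a, b) => a + b, 0);
  return observed.reduce((chi2, obs, i) => {
    const expected = total * expectedRatios[i];
    return chi2 + (obs - expected) ** 2 / expected;
  }, 0);
}

// Example: a 50/50 test that landed 50,000 vs 51,000 users.
const chi2 = srmChiSquare([50_000, 51_000], [0.5, 0.5]);
// Critical value for df=1 at p=0.001 is ~10.83; a strict threshold is common because
// SRM signals broken assignment, not a product effect.
console.log(chi2 > 10.83 ? "possible SRM, investigate" : "no SRM detected");
```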
So they try to bring the incremental cost per experiment as close to $0 as possible. Flags are helpful for A/B tests b/c you're already bucketing users.
Lifecycle:
- crawl: basic analytics
- walk: manual a/b tests; optimizing some parts
- run: common experimentation; important features tested; may have growth team
- fly: ubiquitous experimentation: a/b testing is the default for every feature; tests can be run and read by anyone
Netflix is trying to build a platform that will 1000x the number of experiments they run
For many processes, “done” means “shipped”
- he recommends that the Definition of Done (DoD) define success based on A/B test results
Experimentation program structures:
- isolated team: better than not testing.. but isolated and low frequency of tests
- centralized experimentation team (Microsoft): oversees all experiments; easy to get best practices w/r/t stats, but can be a bottleneck
- decentralized
- center of excellence: central team to advise teams, but the teams have control; good for transferable skills and training, but requires diligence and patience from data scientists
Top lessons learned?
- Without experimentation, they’re guessing
- Have a high experiment frequency to iterate quickly and ensure impact
- running experiments at scale is hard
- Try to bring cost per experiment close to $0
- Choose a program structure that works for the group
The hidden cost of feature flags: Understanding and managing adoption challenges
Shreya; Student/Product designer
complexity in version management:
- multiple states for each app version (e.g. can't version w/ semver easily)
- backwards compatibility issues
Impact on user experience & support
- fragmented user experience (e.g. What flags do they have enabled?)
- increased support complexity
- Potential for feature fatigue for users
Tech debt
- unused or obsolete flags increase code complexity
- Extra maintenance burden for dev teams
Testing complexity
- exponential growth in test cases due to multiple flags (see the sketch after this list)
- increased QA burden & risk of untested scenarios
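To make the exponential growth concrete (a minimal sketch; flag names are made up):

```ts
// n independent boolean flags => 2^n flag states to cover in testing.
function flagCombinations(flags: string[]): Record<string, boolean>[] {
  return flags.reduce<Record<string, boolean>[]>(
    (combos, flag) =>
      combos.flatMap((c) => [{ ...c, [flag]: false }, { ...c, [flag]: true }]),
    [{}],
  );
}

const combos = flagCombinations(["darkMode", "newCheckout", "betaSearch"]);
console.log(combos.length); // 8 states for just 3 flags; 10 flags => 1024
```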
Ways to address:
- Version complexity
- Treat feature flags as code changes
- Document flag lifecycle (intro, rollout, retirement)
- Regularly clean up stale flags
- Governance
- Define criteria for flag usage
- limit flags to major, user-facing changes
- Schedule regular audits and expiration dates
- Perf
- cache frequently used flags (see the caching sketch after this list)
- use automated testing for critical flag states
- set up staging environments to validate flag behavior
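A hedged sketch of the flag-caching idea (the evaluateFlag call and the TTL are hypothetical placeholders for whatever SDK is in use):

```ts
// Simple TTL cache in front of a flag evaluation call so hot flags don't hit
// the flag service on every request.
type CacheEntry = { value: boolean; expiresAt: number };
const cache = new Map<string, CacheEntry>();
const TTL_MS = 30_000; // 30s of staleness is often acceptable for flags; tune per flag

// Placeholder: swap in the real SDK/service call.
async function evaluateFlag(key: string, defaultValue: boolean): Promise<boolean> {
  return defaultValue;
}

async function cachedFlag(key: string, defaultValue: boolean): Promise<boolean> {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;
  const value = await evaluateFlag(key, defaultValue);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```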
Would a gradual rollout by any other hashing algorithm still smell as sweet?
Chris, dev advocate at GitKraken
Built Vexilla on his live stream
Gradual release:
- different from blue/green tests: done at runtime, not at build time
wrote his own SDKs (for streaming content) & algo
- original algo: (hashValue * seed) mod 100
- algo: graham (aka from Minecraft)
- algo: FNV-1a (GrowthBook uses this)
- algo: DJB2
- algo: everything in Node's stdlib
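A sketch of FNV-1a-based percentage bucketing in TypeScript (standard 32-bit FNV-1a constants; the seed/bucket scheme is illustrative, not necessarily Vexilla's or GrowthBook's exact one):

```ts
// FNV-1a (32-bit): standard offset basis and prime. Math.imul keeps the multiply
// in 32-bit integer semantics; >>> 0 forces an unsigned result.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Gradual rollout: hash (flagSeed + userId) into a stable 0-99 bucket and compare
// against the rollout percentage. The same user always lands in the same bucket.
function inRollout(userId: string, flagSeed: string, rolloutPercent: number): boolean {
  return fnv1a32(`${flagSeed}:${userId}`) % 100 < rolloutPercent;
}

console.log(inRollout("user-123", "new-checkout", 25)); // true for roughly 25% of users
```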
wanted to understand speed and distribution
many algos use magic numbers
If you use numbers instead of strings, the distribution changes a lot.
Hiccup in JS:
- JS numbers are 64-bit floats; bitwise ops coerce them to 32-bit ints, then the result widens back to a 64-bit float
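A quick illustration of that 64-to-32-bit round trip (plain Node/TypeScript, nothing library-specific):

```ts
// JS numbers are 64-bit floats, but bitwise operators coerce to signed 32-bit ints,
// and the result widens back to a 64-bit float.
const big = 2 ** 40;
console.log(big | 0);          // 0 -- the high bits are dropped by the 32-bit coercion
console.log(0xffffffff | 0);   // -1 -- comes back as a *signed* 32-bit value
console.log(0xffffffff >>> 0); // 4294967295 -- >>> 0 reinterprets as unsigned
// Plain * can exceed 2^53 and silently lose precision mid-hash;
// Math.imul does the multiply with 32-bit integer semantics instead.
console.log(Math.imul(0x01000193, 0x811c9dc5) >>> 0);
```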