OpenFeature Summit at KubeCon NA 2024.

How to roll out an update for a CNI without breaking the World Wide Web

Project Calico: Jen Luther Thomas (product marketing, Tigera) and Reza Ramezanpour (dev advocate, Tigera). Tigera handles the commercial side of Calico (https://www.tigera.io/project-calico/), which is container network security tooling.

Calico is a CNI w/ a pluggable dataplane. You can change the networking engine w/ flags:

  • eBPF
  • standard Linux (iptables)
  • Windows
  • VPP (Vector Packet Processing), similar to DPDK
  • recent support for nftables

They do feature flags based on env vars.
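
A minimal sketch of the env-var flag pattern (not Calico’s actual code, which is Go; the variable name here is made up for illustration):

```ts
// Gate a code path on an environment variable. Env-var flags are read once
// at process start, so flipping them means restarting the component.
const EBPF_DATAPLANE_ENABLED =
  (process.env.ENABLE_EBPF_DATAPLANE ?? 'false').toLowerCase() === 'true'; // hypothetical variable name

export function selectDataplane(): 'ebpf' | 'iptables' {
  return EBPF_DATAPLANE_ENABLED ? 'ebpf' : 'iptables';
}
```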

Discussion of how to use feature flags.

Feature flag don’ts:

  1. Don’t re-use flags (b/c if you re-use a flag name, users may get a surprise)
  2. Don’t use them for complementary configs (@@ I don’t understand this)
  3. Note that flags add code complexity

Reddit Pi Day incident

  • Trying to upgrade k8s v1.23 -> v1.24
  • k8s 1.24 dropped the ‘master’ label (node-role.kubernetes.io/master) from control-plane nodes
  • Calico expected this label to be there

Test Smarter, Not Harder: QA Enhancements with OpenFeature

Meha Bhalodiya, QA @ Red Hat, part of the k8s release team

Challenges in traditional QA

  • Balancing speed & quality: How do we ensure software is bug-free without slowing down delivery?
  • Complex & costly test environments: creating/managing/maintaining different environments for testing is resource-intensive
  • Risk of releasing new features: uncertainty and potential risk to existing functionality

Understanding OpenFeature

  • separate deploy from release (ship code dark, then turn it on via a flag)

Setting up flagging strategies

  • need a strategy for managing the lifecycle of a flag
  • naming conventions / flag governance

Implementing flags w/ OpenFeature (JS code examples)
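
Roughly what such examples look like with the OpenFeature Node SDK and its in-memory provider (a sketch, not the speaker’s code; the flag key and variants are made up, and any provider such as flagd or a vendor SDK could be substituted):

```ts
import { OpenFeature, InMemoryProvider } from '@openfeature/server-sdk';

// Flag definitions live in the provider rather than in application code,
// which is what separates deploy (code is shipped) from release (flag is on).
const provider = new InMemoryProvider({
  'new-checkout-flow': {
    variants: { on: true, off: false },
    defaultVariant: 'off',
    disabled: false,
  },
});

async function main() {
  await OpenFeature.setProviderAndWait(provider);
  const client = OpenFeature.getClient();

  // The second argument is the fallback value if evaluation fails.
  const enabled = await client.getBooleanValue('new-checkout-flow', false);
  console.log(enabled ? 'serving new checkout flow' : 'serving legacy flow');
}

main();
```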

Migrating to OpenFeature at Scale (mine/Chetan Kapoor’s)

TL;DR: We ran a program that resulted in an internal flagging solution built on OpenFeature. It now serves trillions of flag evaluations a day.

Experimentation Programs at Scale: Lessons learned at top companies

Graham @ growthbook.io

“Most popular open source feature flagging and A/B testing platform”; Y Combinator 2022

Q: How many tests do you run? Answers from companies that self-describe as “we do this well”:

  • Financial institution: “3-5 tests a year”, and they were very proud
  • Large social network: 50,000-60,000 running at any given time

Why companies A/B test

GoodUI.org - examples of A/B tests that did well in the world

For a non-optimized product, A/B test results split roughly into thirds: ~33% win, ~33% lose, ~33% no detectable difference.

The win rate drops a lot as the site gets optimized.

“No one launches a feature they don’t think will win” … yet features are only successful about 33% of the time or less.

If you’re not testing, you’re getting it wrong 1/3rd of the time and don’t know to back out the change.

“Without testing, you’re guessing”

Instead of won/loss, they use won/save or won/learned

  • “Did the experiment help us make the right decision”

Why run at scale?

Wymyn’s law (??): low probability of success in general, and each individual change has a low probability as well. So to adjust, we run tons of experiments (more changes tried means more chances to find something that works).

How to run at scale?

Experimentation can go wrong in many ways

  • bias: confirmation, selection
  • assignment issues (how we bucket users)
  • stats issues: SRM (sample ratio mismatch; see the sketch after this list), multiple exposures, p-value corrections, variance reduction techniques (CUPED), priors, Winsorization
  • Metrics: ratio metrics (delta method), quantiles
  • decision risks: making product decisions based on a bug in the experimentation platform
  • data engineering stuff
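
As an example of the stats issues above, a minimal SRM check for a two-variant test with an intended 50/50 split (not from the talk; the p < 0.001 threshold, i.e. chi-square > ~10.83 at 1 degree of freedom, is a common convention rather than something the speaker specified):

```ts
// Chi-square statistic for observed control/treatment counts vs. a 50/50 split.
function srmChiSquare(controlCount: number, treatmentCount: number): number {
  const expected = (controlCount + treatmentCount) / 2; // expected count per arm
  return (
    (controlCount - expected) ** 2 / expected +
    (treatmentCount - expected) ** 2 / expected
  );
}

const stat = srmChiSquare(50_000, 48_700);
if (stat > 10.83) {
  // Assignment is likely broken; don't trust this experiment's results.
  console.warn(`Possible SRM: chi-square = ${stat.toFixed(2)}`);
}
```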

So they try to bring the incremental cost per experiment as close to $0 as possible. Flags are helpful for A/B tests b/c you’re already bucketing users.

Lifecycle:

  • crawl: basic analytics
  • walk: manual a/b tests; optimizing some parts
  • run: common experimentation; important features tested; may have growth team
  • fly: ubiquitous experimentation: a/b testing is the default for every feature; tests can be run and read by anyone

Netflix is trying to build a platform that will 1000x the number of experiments they run

For many processes, “done” means “shipped”

  • he recommends that the Definition of Done (DoD) define success based on A/B test results

Experimentation program structures:

  • isolated team: better than not testing.. but isolated and low frequency of tests
  • centralized experimentation team (Microsoft): oversees all experiments; easy to get best practices w/r/t stats, but can be a bottleneck
  • decentralized
  • center of excellence: a central team advises, but individual teams have control; good for transferable skills and training, but requires diligence and patience from data scientists

Top lessons learned?

  1. Without experimentation, they’re guessing
  2. Have a high experiment frequency to iterate quickly and ensure impact
  3. running experiments at scale is hard
  4. Try to bring cost per experiment close to $0
  5. Choose a program structure that works for the group

The hidden cost of feature flags: Understanding and managing adoption challenges

Shreya, student/product designer

complexity in version management:

  • multiple states for each app version (e.g. can’t version w/ semver easily)
  • backwards compatibility issues

Impact on user experience & support

  • fragmented user experience (e.g. What flags do they have enabled?)
  • increased support complexity
  • Potential for feature fatigue for users

Tech debt

  • unused or obsolete flags increase code complexity
  • Extra maintenance burden for dev teams

Testing complexity

  • exponential growth in test cases due to multiple flags (n independent boolean flags -> 2^n possible combinations)
  • increased QA burden & risk of untested scenarios

Ways to address:

  • Version complexity
    • Treat feature flags as code changes
    • Document flag lifecycle (intro, rollout, retirement)
    • Regularly clean up stale flags
  • Governance
    • Define criteria for flag usage
    • limit flags for major, user-facing changes
    • Schedule regular audits and expiration dates
  • Perf
    • cache frequently used flags
    • use automated testing for critical flag states (see the sketch after this list)
    • set up staging environments to validate flag behavior
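
A sketch of the “automated testing for critical flag states” idea (the flags, render function, and expected outputs are hypothetical):

```ts
// With n boolean flags there are 2^n combinations; pin down only the
// combinations that can actually ship, instead of testing all of them.
type FlagState = { newCheckout: boolean; darkMode: boolean };

// Hypothetical app logic gated by the flags.
function render(flags: FlagState): string {
  return `${flags.newCheckout ? 'checkout-v2' : 'checkout-v1'}/${flags.darkMode ? 'dark' : 'light'}`;
}

const criticalStates: Array<[FlagState, string]> = [
  [{ newCheckout: false, darkMode: false }, 'checkout-v1/light'], // current default
  [{ newCheckout: true, darkMode: false }, 'checkout-v2/light'],  // next rollout step
];

for (const [flags, expected] of criticalStates) {
  console.assert(render(flags) === expected, `unexpected output for ${JSON.stringify(flags)}`);
}
```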

Would a gradual rollout by any other hashing algorithm still smell as sweet?

Chris, dev advocate at GitKraken

Built Vexilla on his live stream

Gradual release:

  • different from blue/green deployments: done at runtime, not at build/deploy time

Wrote his own SDKs & hashing algorithm (as content for the live stream)

Algorithms evaluated:

  • Original: (hashValue * seed) mod 100
  • graham (aka from Minecraft)
  • FNV-1a (GrowthBook uses this)
  • DJB2
  • everything in Node’s stdlib

wanted to understand speed and distribution

many algos use magic numbers

If you use numbers instead of strings, the distribution changes a lot.

Hiccup in JS:

  • JS numbers are 64-bit floats; bitwise operators coerce them to 32-bit integers for the operation, and the result then goes back to a 64-bit float
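
A sketch of FNV-1a-based bucketing in TypeScript (not Vexilla’s or GrowthBook’s exact code; the flag key and user ID are illustrative) that also shows the 32-bit coercion dance:

```ts
// 32-bit FNV-1a over the string's UTF-16 code units. JS numbers are 64-bit
// floats, so Math.imul and `>>> 0` keep the intermediate hash in unsigned
// 32-bit space; the "magic numbers" are FNV's offset basis and prime.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep as u32
  }
  return hash >>> 0;
}

// Deterministic bucketing: the same user always lands in the same bucket,
// so a 20% gradual rollout stays stable across evaluations.
function inRollout(flagKey: string, userId: string, percentage: number): boolean {
  const bucket = fnv1a32(`${flagKey}:${userId}`) % 100;
  return bucket < percentage;
}

console.log(inRollout('new-dashboard', 'user-42', 20));
```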