OpenFeature Summit at KubeCon NA 2024.

How to roll out an update for a CNI without breaking the World Wide Web

Project Calico: Jen Luther Thomas (product marketing, Tigera) and Reza Ramezanpour (dev advocate, Tigera). Tigera handles the commercial side of Calico (https://www.tigera.io/project-calico/), which is container network security tooling.

Calico is a CNI w/ a pluggable dataplane. You can change the networking engine w/ flags:

  • eBPF
  • standard Linux (iptables)
  • Windows
  • VPP (Vector Packet Processing), similar to DPDK
  • recent support for nftables

They do feature flags based on env vars.
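
A minimal sketch of the env-var flag pattern (not Calico’s actual code, which is Go; the variable name here is made up for illustration):

```ts
// Gate a code path on an environment variable. Env-var flags are read once
// at process start, so flipping them means restarting the component.
const EBPF_DATAPLANE_ENABLED =
  (process.env.ENABLE_EBPF_DATAPLANE ?? 'false').toLowerCase() === 'true'; // hypothetical variable name

export function selectDataplane(): 'ebpf' | 'iptables' {
  return EBPF_DATAPLANE_ENABLED ? 'ebpf' : 'iptables';
}
```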

Discussion of how to use feature flags.

Feature flag don’ts:

  1. Don’t re-use flags (b/c if you re-use a flag name, users may get a surprise)
  2. Don’t use them for complementary configs (@@ I don’t understand this)
  3. Note that flags add code complexity

Reddit Pi Day incident

  • Trying to upgrade k8s v1.23 -> v1.24
  • k8s 1.24 dropped the ‘master’ label (node-role.kubernetes.io/master) from control-plane nodes
  • Calico expected this label to be there

Test Smarter, Not Harder: QA Enhancements with OpenFeature

Meha Bhalodiya, QA @ Red Hat, part of the k8s release team

Challenges in traditional QA

  • Balancing speed & quality: How do we ensure software is bug-free without slowing down delivery?
  • Complex & costly test environments: creating/managing/maintaining different environments for testing is resource-intensive
  • Risk of releasing new features: uncertainty and potential risk to existing functionality

Understanding OpenFeature

  • separate deploy from release (ship code dark, then turn it on via a flag)

Setting up flagging strategies

  • need a strategy for managing the lifecycle of a flag
  • naming conventions / flag governance

Implementing flags w/ OpenFeature (JS code examples)
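
Roughly what such examples look like with the OpenFeature Node SDK and its in-memory provider (a sketch, not the speaker’s code; the flag key and variants are made up, and any provider such as flagd or a vendor SDK could be substituted):

```ts
import { OpenFeature, InMemoryProvider } from '@openfeature/server-sdk';

// Flag definitions live in the provider rather than in application code,
// which is what separates deploy (code is shipped) from release (flag is on).
const provider = new InMemoryProvider({
  'new-checkout-flow': {
    variants: { on: true, off: false },
    defaultVariant: 'off',
    disabled: false,
  },
});

async function main() {
  await OpenFeature.setProviderAndWait(provider);
  const client = OpenFeature.getClient();

  // The second argument is the fallback value if evaluation fails.
  const enabled = await client.getBooleanValue('new-checkout-flow', false);
  console.log(enabled ? 'serving new checkout flow' : 'serving legacy flow');
}

main();
```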

Migrating to OpenFeature at Scale (mine/Chetan Kapoor’s)

TL;DR: We ran a program that resulted in an internal flagging solution built on OpenFeature. It now serves trillions of flag evaluations a day.

Experimentation Programs at Scale: Lessons learned at top companies

Graham @ growthbook.io

“Most popular open source feature flagging and A/B testing platform”; Y Combinator 2022

Q: How many tests do you run? Answers from companies that self-describe as “we do this well”:

  • Financial institution: “3-5 tests a year”, and they were very proud
  • Large social network: 50,000-60,000 running at any given time

Why companies A/B test

GoodUI.org - examples of A/B tests that did well in the world

For a non-optimized product, A/B test results split roughly into thirds: ~33% win, ~33% lose, ~33% no detectable difference.

The win rate drops a lot as the site gets optimized.

“No one launches a feature they don’t think will win” … yet features are only successful about 33% of the time or less.

If you’re not testing, you’re getting it wrong 1/3rd of the time and don’t know to back out the change.

“Without testing, you’re guessing”

Instead of won/loss, they use won/save or won/learned

  • “Did the experiment help us make the right decision”

Why run at scale?

Wymyn’s law (??): low probability of success in general, and each individual change has a low probability as well. So to adjust, we run tons of experiments (more changes tried means more chances to find something that works).

How to run at scale?

Experimentation can go wrong in many ways

  • bias: confirmation, selection
  • assignment issues (how we bucket users)
  • stats issues: SRM (sample ratio mismatch; see the sketch after this list), multiple exposures, p-value corrections, variance reduction techniques (CUPED), priors, Winsorization
  • Metrics: ratio metrics (delta method), quantiles
  • decision risks: making product decisions based on a bug in the experimentation platform
  • data engineering stuff
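
As an example of the stats issues above, a minimal SRM check for a two-variant test with an intended 50/50 split (not from the talk; the p < 0.001 threshold, i.e. chi-square > ~10.83 at 1 degree of freedom, is a common convention rather than something the speaker specified):

```ts
// Chi-square statistic for observed control/treatment counts vs. a 50/50 split.
function srmChiSquare(controlCount: number, treatmentCount: number): number {
  const expected = (controlCount + treatmentCount) / 2; // expected count per arm
  return (
    (controlCount - expected) ** 2 / expected +
    (treatmentCount - expected) ** 2 / expected
  );
}

const stat = srmChiSquare(50_000, 48_700);
if (stat > 10.83) {
  // Assignment is likely broken; don't trust this experiment's results.
  console.warn(`Possible SRM: chi-square = ${stat.toFixed(2)}`);
}
```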

So they try to bring the incremental cost per experiment as close to $0 as possible. Flags are helpful for A/B tests b/c you’re already bucketing users.

Lifecycle:

  • crawl: basic analytics
  • walk: manual a/b tests; optimizing some parts
  • run: common experimentation; important features tested; may have growth team
  • fly: ubiquitous experimentation: a/b testing is the default for every feature; tests can be run and read by anyone

Netflix is trying to build a platform that will 1000x the number of experiments they run

For many processes, “done” means “shipped”

  • he recommends that the Definition of Done (DoD) define success based on A/B test results

Experimentation program structures:

  • isolated team: better than not testing.. but isolated and low frequency of tests
  • centralized experimentation team (Microsoft): oversees all experiments; easy to get best practices w/r/t stats, but can be a bottleneck
  • decentralized
  • center of excellence: a central team advises, but individual teams have control; good for transferable skills and training, but requires diligence and patience from data scientists

Top lessons learned?

  1. Without experimentation, they’re guessing
  2. Have a high experiment frequency to iterate quickly and ensure impact
  3. running experiments at scale is hard
  4. Try to bring cost per experiment close to $0
  5. Choose a program structure that works for the group

The hidden cost of feature flags: Understanding and managing adoption challenges

Shreya, student/product designer

complexity in version management:

  • multiple states for each app version (e.g. can’t version w/ semver easily)
  • backwards compatibility issues

Impact on user experience & support

  • fragmented user experience (e.g. What flags do they have enabled?)
  • increased support complexity
  • Potential for feature fatigue for users

Tech debt

  • unused or obsolete flags increase code complexity
  • Extra maintenance burden for dev teams

Testing complexity

  • exponential growth in test cases due to multiple flags (n independent boolean flags -> 2^n possible combinations)
  • increased QA burden & risk of untested scenarios

Ways to address:

  • Version complexity
    • Treat feature flags as code changes
    • Document flag lifecycle (intro, rollout, retirement)
    • Regularly clean up stale flags
  • Governance
    • Define criteria for flag usage
    • limit flags for major, user-facing changes
    • Schedule regular audits and expiration dates
  • Perf
    • cache frequently used flags
    • use automated testing for critical flag states (see the sketch after this list)
    • set up staging environments to validate flag behavior
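
A sketch of the “automated testing for critical flag states” idea (the flags, render function, and expected outputs are hypothetical):

```ts
// With n boolean flags there are 2^n combinations; pin down only the
// combinations that can actually ship, instead of testing all of them.
type FlagState = { newCheckout: boolean; darkMode: boolean };

// Hypothetical app logic gated by the flags.
function render(flags: FlagState): string {
  return `${flags.newCheckout ? 'checkout-v2' : 'checkout-v1'}/${flags.darkMode ? 'dark' : 'light'}`;
}

const criticalStates: Array<[FlagState, string]> = [
  [{ newCheckout: false, darkMode: false }, 'checkout-v1/light'], // current default
  [{ newCheckout: true, darkMode: false }, 'checkout-v2/light'],  // next rollout step
];

for (const [flags, expected] of criticalStates) {
  console.assert(render(flags) === expected, `unexpected output for ${JSON.stringify(flags)}`);
}
```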

Would a gradual rollout by any other hashing algorithm still smell as sweet?

Chris, dev advocate at GitKraken

Built Vexilla on his live stream

Gradual release:

  • different from blue/green deployments: done at runtime, not at build/deploy time

Wrote his own SDKs & hashing algorithm (as content for the live stream)

Algorithms evaluated:

  • Original: (hashValue * seed) mod 100
  • graham (aka from Minecraft)
  • FNV-1a (GrowthBook uses this)
  • DJB2
  • everything in Node’s stdlib

wanted to understand speed and distribution

many algos use magic numbers

If you use numbers instead of strings, the distribution changes a lot.

Hiccup in JS:

  • JS numbers are 64-bit floats; bitwise operators coerce them to 32-bit integers for the operation, and the result then goes back to a 64-bit float
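
A sketch of FNV-1a-based bucketing in TypeScript (not Vexilla’s or GrowthBook’s exact code; the flag key and user ID are illustrative) that also shows the 32-bit coercion dance:

```ts
// 32-bit FNV-1a over the string's UTF-16 code units. JS numbers are 64-bit
// floats, so Math.imul and `>>> 0` keep the intermediate hash in unsigned
// 32-bit space; the "magic numbers" are FNV's offset basis and prime.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep as u32
  }
  return hash >>> 0;
}

// Deterministic bucketing: the same user always lands in the same bucket,
// so a 20% gradual rollout stays stable across evaluations.
function inRollout(flagKey: string, userId: string, percentage: number): boolean {
  const bucket = fnv1a32(`${flagKey}:${userId}`) % 100;
  return bucket < percentage;
}

console.log(inRollout('new-dashboard', 'user-42', 20));
```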