Towards Zero Change Incidents: Intuit’s Strategy for Implementing AI-Driven Progressive Delivery
Progressive delivery:
- gradual release of a new version
- quick rollbacks
- canary, blue-green, feature flags
- e.g. argo rollouts
They found 1/3 of P0/P1 issues were caused by changes:
- new features
- bug fixes
- dependency updates
Options:
- set a hard threshold for every metric
- e.g. +4% error rate; +400ms of latency
- but hard thresholds fall short: every service is unique (different metrics/thresholds)
- metrics may be seasonal (daily/weekly)
- it’s not just one metric; it’s really a bunch
Multi-variate anomaly detection
multi-variate = several different statistics/metrics considered together
ML requirements:
- completely unsupervised
- able to handle multiple features (b/c multi-variate)
- quick to train
- needs to require < 8 days of data
- generate an anomaly score that’s decipherable.
- Auto-model lifecycle management (retrain itself as needed)
Eng requirements
- stream data processing system
- support custom sources / sinks for metrics
- sliding window aggregation support
- lightweight pipeline (deployed widely across clusters)
- needs to hook into pipelines
Arch:
- service outputs metrics to prometheus
- canary version also outputs a canary set of metrics
- ML pipeline (numaproj) reads from prom store
- ML pipeline outputs an “anomaly store”
Input data processing:
- assume sliding window = 3
- 2 multivariate metrics to process
- prom outputs error rate & latency
- data passes through the “sliding window reducer”, which outputs a matrix of those metrics by time (see the sketch after this list)
- preprocess (normalize & smooth spikes); gets the data into the shape the model requires
- neural network interface (raw output from the model; same model does predictions for stable & canary payloads)
- If the model exists already, use it.
- Otherwise, pull data from the prom store and save the resulting model
- postprocess (threshold the data for classification; normalize the score; score for individual metrics; single score for each window)
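A minimal sketch of the reducer + preprocess steps in plain Python/numpy (function names, shapes, and the clipping threshold are illustrative, not Intuit’s actual pipeline code):

```python
import numpy as np

WINDOW = 3  # sliding window length from the example above

def reduce_window(samples):
    """Sliding-window reducer: keep the last WINDOW (error_rate, latency)
    samples and stack them into a (window x metric) matrix."""
    return np.array(samples[-WINDOW:])

def preprocess(matrix):
    """Normalize each metric column (z-score) and clip extremes to smooth
    spikes, matching the shape/scale the model expects."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0) + 1e-9  # avoid division by zero
    return np.clip((matrix - mean) / std, -3.0, 3.0)

samples = [(0.01, 120.0), (0.02, 130.0), (0.09, 400.0)]  # (error_rate, latency_ms)
print(preprocess(reduce_window(samples)))  # 3x2 matrix, one row per timestamp
```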
For the model, they needed it to be robust to anomalies in the training data (the whole point of the system) and to train without GPUs. They also needed to be able to weight the underlying metrics. Tech stuff: CNN/RNN & autoencoder networks (rough sketch below).
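Not from the talk: a sketch of the reconstruction-error idea behind such autoencoders, using a tiny dense network in place of their CNN/RNN models (shapes, the per-metric weights, and all names here are assumptions). It trains fine on CPU:

```python
import torch
import torch.nn as nn

class WindowAutoEncoder(nn.Module):
    """Reconstructs a flattened (window x metric) matrix; a high weighted
    reconstruction error marks the window as anomalous."""
    def __init__(self, window=3, n_metrics=2, hidden=4):
        super().__init__()
        flat = window * n_metrics
        self.encoder = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, flat)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = WindowAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Per-metric weights (error rate weighted above latency), tiled across the window.
weights = torch.tensor([2.0, 1.0]).repeat(3)

def weighted_loss(x):
    return (((model(x) - x) ** 2) * weights).mean()

batch = torch.randn(16, 6)  # 16 flattened 3x2 windows of training data
loss = weighted_loss(batch)
optimizer.zero_grad(); loss.backward(); optimizer.step()

with torch.no_grad():
    print(weighted_loss(torch.randn(1, 6)).item())  # anomaly score for one window
```

The weighted loss is what lets them emphasize some metrics over others; the anomaly score is the same weighted reconstruction error evaluated on a fresh window.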
— had to stop listening. Couldn’t understand the speaker’s accent.
Effective Data Platforming with Open Source Tools For Faster Insights
Architectural aims shared between software & data engineering
- modularity
- separation of concerns (ingestion/processing/storage)
- loose coupling
- scaling
- low bar for entry and maintenance
- low cost
Common requirements from stakeholders:
- ability to handle and join/enrich data from varied sources (documents/warehouses/uncollected tool data)
- transmission of varied data sources to points of usage (reporting/third party tools)
- quick ingestion, processing & availability of lots of data sources
Stages of development of such a data system:
- Understand the data / pre-processing
- gather the tools
- identify platform build tools
- assemble
- ingestion
- streaming / stream processing
- storage / dissemination
- monitoring / security
- other features
- data catalog (for non-technical users)
—
Gather the tools:
Schema is very important
- Confluent Schema Registry
- Red Hat Apicurio Registry (Red Hat claims it’s a drop-in replacement)
- confluent avro converter (schema detection)
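For reference, registering a schema with a Confluent-compatible Schema Registry is a single REST call (URL and subject name below are made up):

```python
import json
import requests

# Register an Avro schema under the subject "orders-value" (illustrative).
schema = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "id", "type": "string"}],
}
resp = requests.post(
    "http://schema-registry:8081/subjects/orders-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(schema)},
)
print(resp.json())  # e.g. {"id": 1}, the registered schema id
```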
Open Source & Cloud Native
- k8s on the cloud
- kafka
- camel connectors (connecting to various data sources)
- ksqlDB, Flink (stream processing; ksqlDB is good for local testing; Flink is suggested for prod)
- Strimzi (kafka on k8s)
- CloudEvents
Considerations:
- inflight processing, optimized transports of large data feeds w/ built-in scalability (kafka)
- ability to handle streaming & data at rest (kafka connect)
- observability (data catalog; platform monitoring w/ grafana / prometheus)
Build & Deploy
- helm (crds for everything; including Schema Registry!)
- terraform
- pipeline tools
Ingestion:
- push events into kafka via https
- kafka connect for pull-based streams
- apache camel or confluent hub are places for connectors
- … missed this.
Stream processing: input is data from a kafka topic; output goes to another kafka topic. Work is amortized (stream processing) vs done all at once (batch).
- optimal resource usage
- quicker insights on incoming data
- can use lookup tables via ksqldb/flink for referential or master data
- Streaming data + lookup tables = enrichment
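The talk did this in ksqlDB/Flink; here’s a plain-Python sketch of the same enrichment idea (all data here is made up):

```python
# Lookup table of reference/master data, keyed by customer id.
customers = {"c-1": {"region": "EMEA"}, "c-2": {"region": "AMER"}}

def enrich(events):
    """Join each streamed event against the lookup table and emit it enriched."""
    for event in events:
        ref = customers.get(event["customer_id"], {})
        yield {**event, **ref}

stream = [{"customer_id": "c-1", "amount": 42}]
for enriched in enrich(stream):
    print(enriched)  # {'customer_id': 'c-1', 'amount': 42, 'region': 'EMEA'}
```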
Storage:
- just like source connectors, there are sink connectors to dump data out.
- support for outputting to a data warehouse / BigQuery or some other system’s APIs
Monitoring / security
- secure things w/ rbac
- can use time-based revocations for adhoc access
Learnings
- Kafka Connect is awesome and has great integrations
- resolve schemas automatically w/ avro converter & a schema registry (don’t write them by hand)
- data catalog is good for easy lookup for stakeholders who are hands off
- optimized transformations w/ stream processing
Q: Experience w/ managed kafka vs hosting it yourself?
Creating paved paths for platform engineers
Lots of solution architects from: Syntasso, Upbound, AWS, Red Hat, Nirmata
How do you make sense of the CNCF landscape? It’s giant.
- you don’t, more-or-less;
- there are folks (TAGs) who are trying to make recommendations
What do you spend time on in your IDP?
- give people latitude
- build the platform piece when enough people have the problem
- use vendors for undifferentiated lifting
- “If you buy, you are going to adapt to them; they’re not going to adapt to you”
- “be careful how much pride you have in building..” (there are other concerns your business probably has)
“As a platform engineer, you’re not building a platform; you’re building a framework for someone who knows things to bring the solution to all the users. Enable them.”
DX vs security?
Developers are important for the success of the business. Platforms should focus on their actual needs, not the look and feel.
.. Went on to talk about how we should treat IDPs as actual products;
Are they your customers? I mean.. they’re definitely users.. but your customers might be the business;
How do you know that the thing you did was good? If users have no choice.. you don’t know. There must be choice so you know you’ve done something useful;
OTel w/ Java & Python
Great overview of how otel works and practical hands-on of it working.
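Not the talk’s exact demo, but the Python side of the hands-on boils down to something like this (a console exporter stands in for a real backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to stdout.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("user.id", "u-123")  # attributes ride along on the span
```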
Achieving and Maintaining a Healthy CI with Zero Test Flakes
About the testing strategy used for the k8s project.
Good tools:
- https://prow.k8s.io/ (historical build/test logs)
- https://testgrid.k8s.io/ (historical test runs broken down by suite and test)
- https://go.k8s.io/triage (aggregated failure information)
Micro-Segmentation and Multi-Tenancy: The Brown M&Ms of Platform Engineering
“brown M&Ms” in the Van Halen sense.
Challenges
- more services/tools/languages
- different teams operate differently
- microservices can make for sprawl
- networks can be complex/highly latent
- security is more important
- need info for auditing purposes
Standardizing:
- more maintainable b/c less unique config
- easier to onboard
- easier to roll out common fixes
- more predictable spend
k8s multi-tenancy:
- cluster-as-a-service: single tenant per cluster (can do different versions of CRDs)
- namespace-as-a-service
- simple way to isolate dev environments
- can have centralized governance & standardization
- avoids cluster sprawl (which the community has seen cause problems)
- control plane as a service (shared node resources, but unique control planes)
Microsegmentation: divide the network into segments (tiers/zones), e.g. frontend tier vs backend tier
- enforce security checks to prevent unauthorized movement
Cilium: CNI + observability
Kyverno: policy as code
- works based on labels
Workspace segmentation:
- FE vs BE tier
- Across namespaces
Demo:
- Tries to create a namespace.. but it’s rejected b/c it lacks the right labels.
- by default, kyverno disallows all traffic; adding labels allows different traffic in or out
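What that admission check looks like from a client’s side, as a hedged sketch with the kubernetes Python client (the label key is illustrative, not the demo’s actual policy):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# With a Kyverno validate policy in place, this create only succeeds because
# the required label is present; without it, the API server rejects the request.
ns = client.V1Namespace(
    metadata=client.V1ObjectMeta(
        name="team-a",
        labels={"network-tier": "frontend"},  # the label the policy checks for
    )
)
v1.create_namespace(body=ns)
```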
Neat network debug tool: https://github.com/nicolaka/netshoot
Q: Seems quite magical. How much underlying code was in those policies?
- not much code
For business workloads, she still likes namespace as a service; For infrastructure workloads, she thinks she’d use control plane as a service
This Platform Goes to 11: Boost Developer Productivity with Lessons from Salesforce
Joe Kutner, DX Architect @ Salesforce; co-founder of buildpacks.io
8k users; 400-500 people in the platform org (50 folks specifically on DX)
We made a platform. They hated us. We worked on it for 2 years. Now they still hate us, but they’re more productive.
To onboard a new service, they had to onboard onto many internal platforms. They needed a tool which integrates the underlying platforms
“Keeping it simple for developers often means making it complicated for yourself”
Values:
- unified interfaces
- extension points
- meet devs where they are
- reveal complexity gradually
- ephemeralization
Underlying platforms:
- CI
- CD
- terraform
- k8s
- network
- observability
Principles:
Unified interfaces:
- gitops
- cli
- web ui
- notification API
API-second design (aka experience-first design)
Extension points: “If it doesn’t have extension points.. it’s not really a platform”
- cli plugins
- buildpacks
- take source code and shove it into a docker image
- auto-adds helm charts, terraform stuff, etc.
- Outputs a container image which does the relevant provisioning stuff that they need to recreate the service (Salesforce eng blog: “Hyperpacks”)
- notifications API
- add-ons (e.g. Heroku)
- module which encompasses all concerns within a backing resource (APM, Database, etc)
- Encapsulates all the developer stuff (monitoring/scaling/decomm/etc)
- similar to the KRO blog post from AWS
Meet devs where they are (use industry standard tools like kubectl;helm;terraform;argo;buildpacks)
“The places we went wrong are almost always where we went bespoke”
Reveal complexity gradually;
- The wrong abstractions create “cliffs” in the experience
- blog post: Jean Yang, “The Case for ‘Developer Experience’”; argues against strong abstractions and silver bullets
- example: they use go cdk
- generally abstracted
- but offer specific logic segmentation via directory structure
ephemeralization
- “do more and more with less and less until you can do everything with nothing” ~ Buckminster Fuller
- ex: reel-to-reel > VHS > DVDs > streaming
Offered falcon init
- sets up their existing repos with the stuff they needed
git push falcon main
- post-receive hooks set up helm/terraform/etc platform stuff
- CD begins when those resources are made
- there’s a web portal (IDP) to view the results
- “using the git sha kind of like a trace-id across the platform”
falcon addons init <resource>
- e.g. rds, s3, dynamo, object-storage (cloud agnostic)
- prompts w/ questions
results:
- 85% faster onboarding
- 6x faster inner loop cycle time
- 55% reduction to release failures
They know ^^ b/c they measure things, via the SPACE framework for developer productivity (Forsgren). They have a dashboard of these metrics to see how devs or teams are doing.
SPACE:
- Satisfaction: individual “do you feel productive?” / team: “sprint retro success”
- Productivity:
- indiv: inner (start -> pr) & outer loop cycle time (outer: commit -> prod);
- team: p80 build time, p80 deploy time
- Activity:
- indiv: # of PRs; # of bug fixes (not used to stack rank devs)
- team: deploy frequency
- Collab:
- indiv: # of reviews, PR pickup time
- team: distribution of PR reviews
- Efficiency
- indiv: focus time, wait time for reviews
- team: deploy fail rate, sprint forecast ratio
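A toy sketch of computing two of the team-level numbers (sample data made up):

```python
from statistics import quantiles

build_secs = [210, 180, 540, 200, 260, 190, 230, 480, 220, 205]
p80_build = quantiles(build_secs, n=10)[7]  # 80th-percentile cut point

deploys, failures = 40, 3
fail_rate = failures / deploys  # deploy fail rate

print(f"p80 build: {p80_build:.0f}s, deploy fail rate: {fail_rate:.1%}")
```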
inner loop investment:
- “difficult to have code running locally that worked”
- invested in containerizing stuff
- telepresence https://www.telepresence.io/
Migratory Patterns: Making Architectural Transitions with Confidence and Grace
Pete Hodgson from OpenFeature. Previously worked at a bank where they deployed literally everything (Fortran, ATM code, mobile API) once a quarter.
Example:
genAI service -> vector store <- data ingestion
For the PoC, they used “pinecone” for the db. Then they wanted to move to “pgvector”.
Simplest: Big bang!
- stop the world
- backfill the data
- cut over
- start the world
Downsides:
- downtime
- very stressful during the downtime
- cut-over requires testing in real time (eek!)
- No plan B if it goes wrong once you start writing into the new system
Expand/Contract (no downtime!!)
- dual write
- backfill
- cutover reader
- single write (you can cut back whenever you want)
Downside:
- potential instability during dual write
“Expand/contract enables confident migrations” safety -> courage -> speed
Code examples.
- Reader uses feature flags to decide which to read from
- Could use feature flag as a slider for progressive rollout
- Shows example of telemetry broken out by feature flags
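A hedged sketch of that flag-gated reader with the OpenFeature Python SDK (the flag name and the in-memory stand-ins for Pinecone/pgvector are made up):

```python
from openfeature import api

pinecone_store = {"doc-1": [0.12, 0.34]}   # old system
pgvector_store = {"doc-1": [0.12, 0.34]}   # new system (dual-written)

client = api.get_client()

def read_embedding(doc_id):
    # The flag flips reads over to the new store; flipping it back is the
    # instant rollback path while dual writes keep both stores in sync.
    if client.get_boolean_value("read-from-pgvector", False):
        return pgvector_store[doc_id]
    return pinecone_store[doc_id]

print(read_embedding("doc-1"))
```

Turning the boolean into a percentage-based flag gives the progressive-rollout slider mentioned above.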
Example 2: extract a microservice
- internal module vs service
- put a shim in front so consumers use the shim
- Put the feature flag in the shim.
Parallel run pattern:
- call both services, compare responses
- discard the result for the new service
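A minimal sketch of the pattern (names illustrative):

```python
import logging

def parallel_run(old_fn, new_fn, *args, **kwargs):
    """Serve from the old system, shadow-call the new one, log mismatches."""
    primary = old_fn(*args, **kwargs)
    try:
        shadow = new_fn(*args, **kwargs)
        if shadow != primary:
            logging.warning("parallel-run mismatch: old=%r new=%r", primary, shadow)
    except Exception:
        logging.exception("shadow call failed (primary result unaffected)")
    return primary  # the new system's result is always discarded

print(parallel_run(lambda x: x * 2, lambda x: x + x, 21))  # 42
```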
Dark launch: originally from FB chat launch;
- “simulate the backend usage from javascript without any UI elements”
^^ An example of how to do distributed load tests; make your users do it.
Q: instability during dual write?
- “it’s good to know now”