Towards Zero Change Incidents: Intuit’s Strategy for Implementing AI-Driven Progressive Delivery
Progressive delivery:
- gradual release of a new version
- quick rollbacks
- canary, blue-green, feature flags
- e.g. argo rollouts
They found 1/3 of P0/P1 issues were caused by changes:
- new features
- bug fixes
- dependency updates
Options:
- set a hard threshold for every metric
- e.g. +4% error rate; +400ms of latency
- but hard thresholds fall short: every service is unique (different metrics/thresholds)
- metrics may be seasonal (daily/weekly)
- it’s not just one metric; it’s really a bunch
Multi-variate anomaly detection
multi-variate = several different statistics/metrics considered together
ML requirements:
- completely unsupervised
- able to handle multiple features (b/c multi-variate)
- quick to train
- needs to require < 8 days of data
- generate an anomaly score that’s decipherable.
- Auto-model lifecycle management (retrain itself as needed)
Eng requirements
- stream data processing system
- support custom sources / sinks for metrics
- sliding window aggregation support
- lightweight pipeline (deployed widely across clusters)
- needs to hook into pipelines
Arch:
- service outputs metrics to prometheus
- canary version also outputs a canary set of metrics
- ML pipeline (numaproj) reads from prom store
- ML pipeline outputs an “anomaly store”
Input data processing:
- assume sliding window = 3
- 2 multivariate metrics to process
- prom outputs error rate & latency
- data passes through the “sliding window reducer”, which outputs a matrix of those metrics by time (see the sketch after this list)
- preprocess (normalize & smooth spikes); gets the data into the shape the model requires
- neural network interface (raw output from the model; same model does predictions for stable & canary payloads)
- If the model exists already, use it.
- Otherwise, pull data from the prom store and save the resulting model
- postprocess (threshold the data for classification; normalize the score; score for individual metrics; single score for each window)
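A minimal sketch of the reducer + preprocess steps in plain Python/numpy (function names, shapes, and the clipping threshold are illustrative, not Intuit’s actual pipeline code):

```python
import numpy as np

WINDOW = 3  # sliding window length from the example above

def reduce_window(samples):
    """Sliding-window reducer: keep the last WINDOW (error_rate, latency)
    samples and stack them into a (window x metric) matrix."""
    return np.array(samples[-WINDOW:])

def preprocess(matrix):
    """Normalize each metric column (z-score) and clip extremes to smooth
    spikes, matching the shape/scale the model expects."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0) + 1e-9  # avoid division by zero
    return np.clip((matrix - mean) / std, -3.0, 3.0)

samples = [(0.01, 120.0), (0.02, 130.0), (0.09, 400.0)]  # (error_rate, latency_ms)
print(preprocess(reduce_window(samples)))  # 3x2 matrix, one row per timestamp
```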
For the model, they needed it to be robust to anomalies in the training data (the whole point of the system) and to train without GPUs. They also needed to be able to weight the underlying metrics. Tech stuff: CNN/RNN & autoencoder networks (rough sketch below).
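Not from the talk: a sketch of the reconstruction-error idea behind such autoencoders, using a tiny dense network in place of their CNN/RNN models (shapes, the per-metric weights, and all names here are assumptions). It trains fine on CPU:

```python
import torch
import torch.nn as nn

class WindowAutoEncoder(nn.Module):
    """Reconstructs a flattened (window x metric) matrix; a high weighted
    reconstruction error marks the window as anomalous."""
    def __init__(self, window=3, n_metrics=2, hidden=4):
        super().__init__()
        flat = window * n_metrics
        self.encoder = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, flat)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = WindowAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Per-metric weights (error rate weighted above latency), tiled across the window.
weights = torch.tensor([2.0, 1.0]).repeat(3)

def weighted_loss(x):
    return (((model(x) - x) ** 2) * weights).mean()

batch = torch.randn(16, 6)  # 16 flattened 3x2 windows of training data
loss = weighted_loss(batch)
optimizer.zero_grad(); loss.backward(); optimizer.step()

with torch.no_grad():
    print(weighted_loss(torch.randn(1, 6)).item())  # anomaly score for one window
```

The weighted loss is what lets them emphasize some metrics over others; the anomaly score is the same weighted reconstruction error evaluated on a fresh window.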
— had to stop listening. Couldn’t understand the speaker’s accent.
Effective Data Platforming with Open Source Tools For Faster Insights
Architectural aims shared between software & data engineering
- modularity
- separation of concerns (ingestion/processing/storage)
- loose coupling
- scaling
- low bar for entry and maintenance
- low cost
Common requirements from stakeholders:
- ability to handle and join/enrich data from varied sources (documents/warehouses/uncollected tool data)
- transmission of varied data sources to points of usage (reporting/third party tools)
- quick ingestion, processing & availability of lots of data sources
Stages of development of such a data system:
- Understand the data / pre-processing
- gather the tools
- identify platform build tools
- assemble
- ingestion
- streaming / stream processing
- storage / dissemination
- monitoring / security
- other features
- data catalog (for non-technical users)
—
Gather the tools:
Schema is very important
- Confluent Schema Registry
- Red Hat Apicurio Registry (Red Hat claims it’s a drop-in replacement)
- confluent avro converter (schema detection)
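For reference, registering a schema with a Confluent-compatible Schema Registry is a single REST call (URL and subject name below are made up):

```python
import json
import requests

# Register an Avro schema under the subject "orders-value" (illustrative).
schema = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "id", "type": "string"}],
}
resp = requests.post(
    "http://schema-registry:8081/subjects/orders-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(schema)},
)
print(resp.json())  # e.g. {"id": 1}, the registered schema id
```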
Open Source & Cloud Native
- k8s on the cloud
- kafka
- camel connectors (connecting to various data sources)
- ksqlDB, Flink (stream processing; ksqlDB is good for local testing; Flink is suggested for prod)
- Strimzi (kafka on k8s)
- CloudEvents
Considerations:
- inflight processing, optimized transports of large data feeds w/ built-in scalability (kafka)
- ability to handle streaming & data at rest (kafka connect)
- observability (data catalog; platform monitoring w/ grafana / prometheus)
Build & Deploy
- helm (crds for everything; including Schema Registry!)
- terraform
- pipeline tools
Ingestion:
- push events into kafka via https
- kafka connect for pull-based streams
- apache camel or confluent hub are places for connectors
- … missed this.
Stream processing: input is data from a kafka topic; output goes to another kafka topic. Work is amortized (stream processing) vs done all at once (batch).
- optimal resource usage
- quicker insights on incoming data
- can use lookup tables via ksqldb/flink for referential or master data
- Streaming data + lookup tables = enrichment
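The talk did this in ksqlDB/Flink; here’s a plain-Python sketch of the same enrichment idea (all data here is made up):

```python
# Lookup table of reference/master data, keyed by customer id.
customers = {"c-1": {"region": "EMEA"}, "c-2": {"region": "AMER"}}

def enrich(events):
    """Join each streamed event against the lookup table and emit it enriched."""
    for event in events:
        ref = customers.get(event["customer_id"], {})
        yield {**event, **ref}

stream = [{"customer_id": "c-1", "amount": 42}]
for enriched in enrich(stream):
    print(enriched)  # {'customer_id': 'c-1', 'amount': 42, 'region': 'EMEA'}
```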
Storage:
- just like source connectors, there are sink connectors to dump data out.
- support for outputting to a data warehouse / BigQuery or some other system’s APIs
Monitoring / security
- secure things w/ rbac
- can use time-based revocations for adhoc access
Learnings
- Kafka Connect is awesome and has great integrations
- resolve schemas automatically w/ avro converter & a schema registry (don’t write them by hand)
- data catalog is good for easy lookup for stakeholders who are hands off
- optimized transformations w/ stream processing
Q: Experience w/ managed kafka vs hosting it yourself?
Creating paved paths for platform engineers
Lots of solution architects from: Syntasso, Upbound, AWS, Red Hat, Nirmata
How do you make sense of the CNCF landscape? It’s giant.
- you don’t, more-or-less;
- there are folks (TAGs) who are trying to make recommendations
What do you spend time on in your IDP?
- give people latitude
- build the platform piece when enough people have the problem
- use vendors for undifferentiated lifting
- “If you buy, you are going to adapt to them; they’re not going to adapt to you”
- “be careful how much pride you have in building..” (there are other concerns your business probably has)
“As a platform engineer, you’re not building a platform; you’re building a framework for someone who knows things to bring the solution to all the users. Enable them.”
DX vs security?
Developers are important for the success of the business. Platforms should focus on their actual needs, not the look and feel.
.. Went on to talk about how we should treat IDPs as actual products;
Are they your customers? I mean.. they’re definitely users.. but your customers might be the business;
How do you know that the thing you did was good? If users have no choice.. you don’t know. There must be choice so you know you’ve done something useful;
OTel w/ Java & Python
Great overview of how otel works and practical hands-on of it working.
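Not the talk’s exact demo, but the Python side of the hands-on boils down to something like this (a console exporter stands in for a real backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to stdout.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("user.id", "u-123")  # attributes ride along on the span
```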
Achieving and Maintaining a Healthy CI with Zero Test Flakes
About the testing strategy used for the k8s project.
Good tools:
- https://prow.k8s.io/ (historical build/test logs)
- https://testgrid.k8s.io/ (historical test runs broken down by suite and test)
- https://go.k8s.io/triage (aggregated failure information)
Micro-Segmentation and Multi-Tenancy: The Brown M&Ms of Platform Engineering
“brown M&Ms” in the Van Halen sense.
Challenges
- more services/tools/languages
- different teams operate differently
- microservices can make for sprawl
- networks can be complex/highly latent
- security is more important
- need info for auditing purposes
Standardizing:
- more maintainable b/c less unique config
- easier to onboard
- easier to roll out common fixes
- more predictable spend
k8s multi-tenancy:
- cluster-as-a-service: single tenant per cluster (can do different versions of CRDs)
- namespace-as-a-service
- simple way to isolate dev environments
- can have centralized governance & standardization
- avoids cluster sprawl (which the community has seen cause problems)
- control plane as a service (shared node resources, but unique control planes)
Microsegmentation: divide the network into segments (tiers/zones), e.g. frontend tier vs backend tier
- enforce security checks to prevent unauthorized movement
Cilium: CNI + observability
Kyverno: policy as code
- works based on labels
Workspace segmentation:
- FE vs BE tier
- Across namespaces
Demo:
- Tries to create a namespace.. but it’s rejected b/c it lacks the right labels.
- by default, kyverno disallows all traffic; adding labels allows different traffic in or out
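What that admission check looks like from a client’s side, as a hedged sketch with the kubernetes Python client (the label key is illustrative, not the demo’s actual policy):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# With a Kyverno validate policy in place, this create only succeeds because
# the required label is present; without it, the API server rejects the request.
ns = client.V1Namespace(
    metadata=client.V1ObjectMeta(
        name="team-a",
        labels={"network-tier": "frontend"},  # the label the policy checks for
    )
)
v1.create_namespace(body=ns)
```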
Neat network debug tool: https://github.com/nicolaka/netshoot
Q: Seems quite magical. How much underlying code was in those policies?
- not much code
For business workloads, she still likes namespace as a service; For infrastructure workloads, she thinks she’d use control plane as a service
This Platform Goes to 11: Boost Developer Productivity with Lessons from Salesforce
Joe Kutner, DX Architect @ Salesforce; co-founder of buildpacks.io
8k users; 400-500 people in the platform org (50 folks specifically on DX)
We made a platform. They hated us. We worked on it for 2 years. Now they still hate us, but they’re more productive.
To onboard a new service, they had to onboard onto many internal platforms. They needed a tool which integrates the underlying platforms
“Keeping it simple for developers often means making it complicated for yourself”
Values:
- unified interfaces
- extension points
- meet devs where they are
- reveal complexity gradually
- ephemeralization
Underlying platforms:
- CI
- CD
- terraform
- k8s
- network
- observability
Principles:
Unified interfaces:
- gitops
- cli
- web ui
- notification API
API-second design (aka experience-first design)
Extension points: “If it doesn’t have extension points.. it’s not really a platform”
- cli plugins
- buildpacks
- take source code and shove it into a docker image
- auto-adds helm charts, terraform stuff, etc.
- Outputs a container image which does the relevant provisioning stuff that they need to recreate the service (Salesforce eng blog: “Hyperpacks”)
- notifications API
- add-ons (e.g. Heroku)
- module which encompasses all concerns within a backing resource (APM, Database, etc)
- Encapsulates all the developer stuff (monitoring/scaling/decomm/etc)
- similar to the KRO blog post from AWS
Meet devs where they are (use industry standard tools like kubectl;helm;terraform;argo;buildpacks)
“The places we went wrong are almost always where we went bespoke”
Reveal complexity gradually;
- The wrong abstractions create “cliffs” in the experience
- blog post: Jean Yang, “The Case for ‘Developer Experience’”; argues against strong abstractions and silver bullets
- example: they use go cdk
- generally abstracted
- but offer specific logic segmentation via directory structure
ephemeralization
- “do more and more with less and less until you can do everything with nothing” ~ Buckminster Fuller
- ex: reel-to-reel > VHS > DVDs > streaming
Offered falcon init
- sets up their existing repos with the stuff they needed
git push falcon main
- post-receive hooks set up helm/terraform/etc platform stuff
- CD begins when those resources are made
- there’s a web portal (IDP) to view the results
- “using the git sha kind of like a trace-id across the platform”
falcon addons init <resource>
- e.g. rds, s3, dynamo, object-storage (cloud agnostic)
- prompts w/ questions
results:
- 85% faster onboarding
- 6x faster inner loop cycle time
- 55% reduction to release failures
They know ^^ b/c they measure things, via the SPACE framework for developer productivity (Forsgren). They have a dashboard of these metrics to see how devs or teams are doing.
SPACE:
- Satisfaction: individual “do you feel productive?” / team: “sprint retro success”
- Productivity:
- indiv: inner (start -> pr) & outer loop cycle time (outer: commit -> prod);
- team: p80 build time, p80 deploy time
- Activity:
- indiv: # of PRs; # of bug fixes (not used to stack rank devs)
- team: deploy frequency
- Collab:
- indiv: # of reviews, PR pickup time
- team: distribution of PR reviews
- Efficiency
- indiv: focus time, wait time for reviews
- team: deploy fail rate, sprint forecast ratio
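A toy sketch of computing two of the team-level numbers (sample data made up):

```python
from statistics import quantiles

build_secs = [210, 180, 540, 200, 260, 190, 230, 480, 220, 205]
p80_build = quantiles(build_secs, n=10)[7]  # 80th-percentile cut point

deploys, failures = 40, 3
fail_rate = failures / deploys  # deploy fail rate

print(f"p80 build: {p80_build:.0f}s, deploy fail rate: {fail_rate:.1%}")
```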
inner loop investment:
- “difficult to have code running locally that worked”
- invested in containerizing stuff
- telepresence https://www.telepresence.io/
Migratory Patterns: Making Architectural Transitions with Confidence and Grace
Pete Hodgson from OpenFeature. Previously worked at a bank where they deployed literally everything (Fortran, ATM code, mobile API) once a quarter.
Example:
genAI service -> vector store <- data ingestion
For the PoC, they used “pinecone” for the db. Then they wanted to move to “pgvector”.
Simplest: Big bang!
- stop the world
- backfill the data
- cut over
- start the world
Downsides:
- downtime
- very stressful during the downtime
- cut-over requires testing in real time (eek!)
- No plan B if it goes wrong once you start writing into the new system
Expand/Contract (no downtime!!)
- dual write
- backfill
- cutover reader
- single write (you can cut back whenever you want)
Downside:
- potential instability during dual write
“Expand/contract enables confident migrations” safety -> courage -> speed
Code examples.
- Reader uses feature flags to decide which to read from
- Could use feature flag as a slider for progressive rollout
- Shows example of telemetry broken out by feature flags
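A hedged sketch of that flag-gated reader with the OpenFeature Python SDK (the flag name and the in-memory stand-ins for Pinecone/pgvector are made up):

```python
from openfeature import api

pinecone_store = {"doc-1": [0.12, 0.34]}   # old system
pgvector_store = {"doc-1": [0.12, 0.34]}   # new system (dual-written)

client = api.get_client()

def read_embedding(doc_id):
    # The flag flips reads over to the new store; flipping it back is the
    # instant rollback path while dual writes keep both stores in sync.
    if client.get_boolean_value("read-from-pgvector", False):
        return pgvector_store[doc_id]
    return pinecone_store[doc_id]

print(read_embedding("doc-1"))
```

Turning the boolean into a percentage-based flag gives the progressive-rollout slider mentioned above.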
Example 2: extract a microservice
- internal module vs service
- put a shim in front so consumers use the shim
- Put the feature flag in the shim.
Parallel run pattern:
- call both services, compare responses
- discard the result for the new service
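A minimal sketch of the pattern (names illustrative):

```python
import logging

def parallel_run(old_fn, new_fn, *args, **kwargs):
    """Serve from the old system, shadow-call the new one, log mismatches."""
    primary = old_fn(*args, **kwargs)
    try:
        shadow = new_fn(*args, **kwargs)
        if shadow != primary:
            logging.warning("parallel-run mismatch: old=%r new=%r", primary, shadow)
    except Exception:
        logging.exception("shadow call failed (primary result unaffected)")
    return primary  # the new system's result is always discarded

print(parallel_run(lambda x: x * 2, lambda x: x + x, 21))  # 42
```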
Dark launch: originally from FB chat launch;
- “simulate the backend usage from javascript without any UI elements”
^^ An example of how to do distributed load tests; make your users do it.
Q: instability during dual write?
- “it’s good to know now”