srecon

Talk by Liz Fong-Jones.

kafka allows for separation of stateless & stateful systems

kafka is (or supports) an ordered log.

  • This allows for consistency to within an event stream.

You only get so many “innovation tokens”

  • don’t want to spend them on stuff that isn’t important (e.g. undifferentiated heavy lifting)

Look into: apache pulsar

They were able to add additional consumers on the data feed without configuration or coordination on the producer side.

Pros:

  • rolling restarts of most things
  • replay in case of incorrect consumer behavior

Cons:

  • if it breaks, everything breaks
  • running third party software is harder to debug/understand

Their use-case is atypical:

  • don’t keep weeks or months of history
  • usually only read the last hour (though sometimes up to 1 day)
  • 1-2 main topics, not tons.
  • self-managed partition allocation (~100 partitions)
  • high throughput per partition (50k msg/sec/partition)
  • no ksql
  • no librdkafka (the C kafka client); they use Shopify’s sarama, a pure-Go client

What happens when kafka fails:

  • best case: events are dropped on the floor
  • worst case: we fill the memory buffer in customers’ client apps

== issues they’ve seen

producer problems

use “least outstanding requests” load balancing for ingest

tune batch sizes (MB, seconds)

  • this is about being economical about chattiness w/ brokers; see the producer sketch below
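A minimal sketch of the batch-size tuning with Shopify’s sarama (the client named in these notes); the broker address, topic, and the specific byte/time limits are placeholders, not their production values:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Batch by size and by time: whichever limit is hit first flushes the batch.
	// Bigger batches mean fewer, larger requests to the brokers (less chattiness),
	// at the cost of a little extra latency. Placeholder values only.
	cfg.Producer.Flush.Bytes = 1 << 20                    // ~1 MB per batch
	cfg.Producer.Flush.Frequency = 500 * time.Millisecond // ...or flush at least every 500ms

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Errors come back asynchronously; always drain this channel.
	go func() {
		for err := range producer.Errors() {
			log.Println("produce error:", err)
		}
	}()

	producer.Input() <- &sarama.ProducerMessage{
		Topic: "events", // placeholder topic
		Value: sarama.StringEncoder(`{"hello":"world"}`),
	}
}
```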

queue depth

  • kafka has buffers and we need metrics/slos around them
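sarama doesn’t expose the producer’s buffer depth directly (as far as I know), so one rough way to get a metric/SLO around it is to count messages in vs. acks out yourself; the topic, interval, and counter here are illustrative:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"

	"github.com/Shopify/sarama"
)

// pending approximates the producer's queue depth: +1 when we hand sarama a
// message, -1 when kafka acks it or it errors out. Export it as a gauge and
// alert/SLO on it staying high.
var pending int64

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // acks must come back on Successes()

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	go func() {
		for range producer.Successes() {
			atomic.AddInt64(&pending, -1)
		}
	}()
	go func() {
		for err := range producer.Errors() {
			atomic.AddInt64(&pending, -1)
			log.Println("produce error:", err)
		}
	}()
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("producer queue depth ≈ %d", atomic.LoadInt64(&pending))
		}
	}()

	for i := 0; i < 1000; i++ {
		atomic.AddInt64(&pending, 1)
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "events", // placeholder
			Value: sarama.StringEncoder("payload"),
		}
	}
	time.Sleep(2 * time.Second) // let acks drain before exiting (sketch only)
}
```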

guard against ooms

  • lame duck yourself before going oom
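A rough Go sketch of the lame-duck idea: watch heap usage on a timer and flip a health flag well before the OOM killer steps in; the soft limit and check interval are placeholders:

```go
package main

import (
	"log"
	"runtime"
	"sync/atomic"
	"time"
)

// lameDuck flips when heap usage crosses a soft limit, well before the kernel
// OOM killer would take the process out. A readiness check can read it and
// drain traffic away while in-flight work finishes.
var lameDuck atomic.Bool

const softHeapLimit = 6 << 30 // e.g. ~6 GiB on an 8 GiB host; placeholder

func watchMemory() {
	var m runtime.MemStats
	for range time.Tick(5 * time.Second) {
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > softHeapLimit && !lameDuck.Load() {
			lameDuck.Store(true)
			log.Printf("entering lame-duck mode: heap=%d bytes", m.HeapAlloc)
			// From here: fail readiness checks, stop taking new work, drain.
		}
	}
}

func main() {
	go watchMemory()
	select {} // stand-in for the real worker loop
}
```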

avoid persistently bad partitions

  • if something is struggling, give it cooldown time

allocate load between partitions
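A sketch of what both of those could look like as a sarama custom partitioner: round-robin load across partitions, but bench any partition that’s been marked as struggling until its cooldown expires. Names and policy here are invented for illustration, not their actual allocator:

```go
package main

import (
	"sync"
	"time"

	"github.com/Shopify/sarama"
)

// cooldownPartitioner spreads load round-robin across partitions, but skips
// any partition marked as struggling until its cooldown expires.
type cooldownPartitioner struct {
	mu       sync.Mutex
	next     int32
	cooldown map[int32]time.Time // partition -> time it may be used again
}

func newCooldownPartitioner(topic string) sarama.Partitioner {
	return &cooldownPartitioner{cooldown: make(map[int32]time.Time)}
}

// MarkBad benches a partition; a real system would call this from its
// produce-error / latency feedback loop.
func (p *cooldownPartitioner) MarkBad(partition int32, d time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cooldown[partition] = time.Now().Add(d)
}

func (p *cooldownPartitioner) Partition(msg *sarama.ProducerMessage, numPartitions int32) (int32, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := int32(0); i < numPartitions; i++ {
		candidate := p.next % numPartitions
		p.next = (p.next + 1) % numPartitions
		if until, benched := p.cooldown[candidate]; !benched || time.Now().After(until) {
			return candidate, nil
		}
	}
	// Everything is cooling down; fall back to plain round-robin.
	return p.next % numPartitions, nil
}

// RequiresConsistency is false because we don't key messages to partitions.
func (p *cooldownPartitioner) RequiresConsistency() bool { return false }

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Partitioner = newCooldownPartitioner // plugs in as a sarama.PartitionerConstructor
	_ = cfg                                           // ...then build a producer from cfg as usual
}
```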

broker problems

cpu bound?

  • use graviton (beefy cpu instances, i4+ in aws)

on-disk storage

  • use smaller NVMe for predictable latency
  • use tiered storage for bulk/less frequently accessed
  • does NOT recommend EBS volumes for scaling out (latency/cost)

autobalance brokers

  • load shed if you need to

horizontal scale-out (if needed)

  • not great to scale up and down.

consumer problems

kafka limits consumer-broker bandwidth regardless of distinct partitions

  • run more brokers or map consumers:partitions as 1:1

watch out for consumer group rebalancing

  • rolling restarts aren’t great here, b/c every restarted consumer triggers another rebalance; instead, just kill the world and bring it all back up
  • this may change with kafka’s upcoming sticky / cooperative rebalancing features
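For reference, opting into sarama’s sticky assignment strategy looks roughly like this (group ID, topic, and broker address are placeholders; fully cooperative/incremental rebalancing is the newer protocol-level feature the note alludes to):

```go
package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
)

// handler implements sarama.ConsumerGroupHandler; the pain of rebalancing is
// mostly how much work Setup/Cleanup redo on every new generation.
type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// ... process msg ...
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_5_0_0
	// Sticky assignment keeps partitions on the same consumer across rebalances
	// where possible, so fewer partitions move on each restart.
	cfg.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategySticky

	group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "my-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		if err := group.Consume(context.Background(), []string{"events"}, handler{}); err != nil {
			log.Println("consume error:", err)
		}
	}
}
```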

You could checkpoint to local disk, if possible, which will make startup faster.
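A minimal sketch of that kind of checkpointing: periodically persist the last-processed offsets to a local file (write-then-rename so a crash can’t corrupt it) and reload them at startup. The path and format are made up for illustration:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// checkpoint holds whatever lets a restarted consumer warm-start instead of
// rebuilding from scratch; here, just last-processed offset per partition.
type checkpoint struct {
	Offsets map[int32]int64 `json:"offsets"`
}

const checkpointPath = "/var/lib/consumer/checkpoint.json" // hypothetical path

func loadCheckpoint() checkpoint {
	cp := checkpoint{Offsets: map[int32]int64{}}
	if data, err := os.ReadFile(checkpointPath); err == nil {
		_ = json.Unmarshal(data, &cp)
	}
	return cp
}

func saveCheckpoint(cp checkpoint) {
	data, _ := json.Marshal(cp)
	// Write then rename so a crash mid-write can't leave a corrupt checkpoint.
	tmp := checkpointPath + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		log.Println("checkpoint write failed:", err)
		return
	}
	if err := os.Rename(tmp, checkpointPath); err != nil {
		log.Println("checkpoint rename failed:", err)
	}
}

func main() {
	cp := loadCheckpoint() // resume from here instead of the start of retention

	for range time.Tick(30 * time.Second) {
		saveCheckpoint(cp) // in real code, cp is updated as messages are processed
	}
}
```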

Optimizations to consider

Don’t run brokers if you can avoid it. “Run. Less. Software.”

use zstd compression. CPU is ~cheaper than network. 20%+ savings on bandwidth vs snappy

^ the bandwidth savings matter because cloud providers charge for cross-AZ traffic.
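Turning zstd on in sarama is roughly a one-line config change; a sketch, with the compression level and broker version shown as illustrative values:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// zstd needs Kafka >= 2.1 (message format v2), so tell sarama the broker version.
	cfg.Version = sarama.V2_1_0_0
	cfg.Producer.Compression = sarama.CompressionZSTD
	cfg.Producer.CompressionLevel = 3 // illustrative: more CPU for less bandwidth
	cfg.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}
```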

Fully use kafka replication

  • don’t re-copy data between AZs; read from followers, not the leader (which may be cross-AZ)
  • don’t pay for more durability than you need; kafka already provides R=3
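Client-side, follower fetching (KIP-392) amounts to telling the consumer which rack/AZ it lives in; a sketch assuming a sarama version that supports it, with a placeholder broker address and rack source. The brokers also need broker.rack set and a rack-aware replica.selector.class configured:

```go
package main

import (
	"log"
	"os"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_4_0_0 // fetch-from-follower needs Kafka >= 2.4

	// Tell the client which AZ/rack it is in so brokers can point it at an
	// in-zone follower instead of a cross-AZ leader.
	cfg.RackID = os.Getenv("AVAILABILITY_ZONE") // e.g. "us-east-1a"; placeholder source

	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
}
```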

Burst balances within cloud providers may result in metastable failure states.

read more at: https://go.hny.co/kafka-lessons

Profile! Profile! Profile!

  • use the Corretto JVM, not AdoptOpenJDK
  • use a well-tuned GC algo
  • upgrade your JNI deps (e.g. zstd)
  • replace java crypto libs with the Corretto crypto provider

Chaos test your DR strategy

  • they do this weekly.
  • they do live kafka broker and consumer migrations by killing nodes.

To generate traces for streaming events / long-lived workers, they emit a “tick” event every minute; this is the root of the trace.
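A sketch of that tick pattern with OpenTelemetry’s Go API (span names and the per-minute work are made up; a real setup would also configure a TracerProvider/exporter):

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
)

func main() {
	tracer := otel.Tracer("stream-worker") // hypothetical instrumentation name

	// A long-lived worker has no natural request to root a trace on, so start a
	// synthetic "tick" root span each minute and parent that minute's work under it.
	for range time.Tick(time.Minute) {
		ctx, span := tracer.Start(context.Background(), "tick")
		processBatch(ctx)
		span.End()
	}
}

func processBatch(ctx context.Context) {
	// Child spans created from ctx hang off the current tick's root span.
	_, span := otel.Tracer("stream-worker").Start(ctx, "process-batch")
	defer span.End()
	// ... handle whatever arrived in the last minute ...
}
```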

They don’t have SLOs for kafka. They instead have SLOs on things customers care about. Can we put data in? Is it fresh when we query it?