srecon

Talk by Liz Fong-Jones.

kafka allows for separation of stateless & stateful systems

kafka is (or supports) an ordered log.

  • This allows for consistency to within an event stream.

You only get so many “innovation tokens”

  • don’t want to spend them on stuff that isn’t important (e.g. undifferentiated heavy lifting)

Look into: apache pulsar

They were able to add additional consumers on the data feed without configuration or coordination on the producer side.

Pros:

  • rolling restarts of most things
  • replay in case of incorrect consumer behavior

Cons:

  • if it breaks, everything breaks
  • running third party software is harder to debug/understand

Their use-case is atypical:

  • don’t keep weeks or months of history
  • usually only read the last hour (though sometimes up to 1 day)
  • 1-2 main topics, not tons.
  • self-managed partition allocation (~100 partitions)
  • high throughput per partition (50k msg/sec/partition)
  • no ksql
  • no librdkafka (the C kafka client); they use Shopify’s sarama, a pure-Go client

What happens when kafka fails:

  • best case: events are dropped on the floor
  • worst case: we fill the memory buffer in customers’ client apps

== issues they’ve seen

producer problems

use “least outstanding requests” load balancing for ingest

tune batch sizes (MB, seconds)

  • this is about being economical about chattiness w/ brokers; see the producer sketch below
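A minimal sketch of the batch-size tuning with Shopify’s sarama (the client named in these notes); the broker address, topic, and the specific byte/time limits are placeholders, not their production values:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Batch by size and by time: whichever limit is hit first flushes the batch.
	// Bigger batches mean fewer, larger requests to the brokers (less chattiness),
	// at the cost of a little extra latency. Placeholder values only.
	cfg.Producer.Flush.Bytes = 1 << 20                    // ~1 MB per batch
	cfg.Producer.Flush.Frequency = 500 * time.Millisecond // ...or flush at least every 500ms

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Errors come back asynchronously; always drain this channel.
	go func() {
		for err := range producer.Errors() {
			log.Println("produce error:", err)
		}
	}()

	producer.Input() <- &sarama.ProducerMessage{
		Topic: "events", // placeholder topic
		Value: sarama.StringEncoder(`{"hello":"world"}`),
	}
}
```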

queue depth

  • kafka has buffers and we need metrics/slos around them
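sarama doesn’t expose the producer’s buffer depth directly (as far as I know), so one rough way to get a metric/SLO around it is to count messages in vs. acks out yourself; the topic, interval, and counter here are illustrative:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"

	"github.com/Shopify/sarama"
)

// pending approximates the producer's queue depth: +1 when we hand sarama a
// message, -1 when kafka acks it or it errors out. Export it as a gauge and
// alert/SLO on it staying high.
var pending int64

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // acks must come back on Successes()

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	go func() {
		for range producer.Successes() {
			atomic.AddInt64(&pending, -1)
		}
	}()
	go func() {
		for err := range producer.Errors() {
			atomic.AddInt64(&pending, -1)
			log.Println("produce error:", err)
		}
	}()
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("producer queue depth ≈ %d", atomic.LoadInt64(&pending))
		}
	}()

	for i := 0; i < 1000; i++ {
		atomic.AddInt64(&pending, 1)
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "events", // placeholder
			Value: sarama.StringEncoder("payload"),
		}
	}
	time.Sleep(2 * time.Second) // let acks drain before exiting (sketch only)
}
```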

guard against ooms

  • lame duck yourself before going oom
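A rough Go sketch of the lame-duck idea: watch heap usage on a timer and flip a health flag well before the OOM killer steps in; the soft limit and check interval are placeholders:

```go
package main

import (
	"log"
	"runtime"
	"sync/atomic"
	"time"
)

// lameDuck flips when heap usage crosses a soft limit, well before the kernel
// OOM killer would take the process out. A readiness check can read it and
// drain traffic away while in-flight work finishes.
var lameDuck atomic.Bool

const softHeapLimit = 6 << 30 // e.g. ~6 GiB on an 8 GiB host; placeholder

func watchMemory() {
	var m runtime.MemStats
	for range time.Tick(5 * time.Second) {
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > softHeapLimit && !lameDuck.Load() {
			lameDuck.Store(true)
			log.Printf("entering lame-duck mode: heap=%d bytes", m.HeapAlloc)
			// From here: fail readiness checks, stop taking new work, drain.
		}
	}
}

func main() {
	go watchMemory()
	select {} // stand-in for the real worker loop
}
```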

avoid persistently bad partitions

  • if something is struggling, give it cooldown time

allocate load between partitions
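A sketch of what both of those could look like as a sarama custom partitioner: round-robin load across partitions, but bench any partition that’s been marked as struggling until its cooldown expires. Names and policy here are invented for illustration, not their actual allocator:

```go
package main

import (
	"sync"
	"time"

	"github.com/Shopify/sarama"
)

// cooldownPartitioner spreads load round-robin across partitions, but skips
// any partition marked as struggling until its cooldown expires.
type cooldownPartitioner struct {
	mu       sync.Mutex
	next     int32
	cooldown map[int32]time.Time // partition -> time it may be used again
}

func newCooldownPartitioner(topic string) sarama.Partitioner {
	return &cooldownPartitioner{cooldown: make(map[int32]time.Time)}
}

// MarkBad benches a partition; a real system would call this from its
// produce-error / latency feedback loop.
func (p *cooldownPartitioner) MarkBad(partition int32, d time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cooldown[partition] = time.Now().Add(d)
}

func (p *cooldownPartitioner) Partition(msg *sarama.ProducerMessage, numPartitions int32) (int32, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := int32(0); i < numPartitions; i++ {
		candidate := p.next % numPartitions
		p.next = (p.next + 1) % numPartitions
		if until, benched := p.cooldown[candidate]; !benched || time.Now().After(until) {
			return candidate, nil
		}
	}
	// Everything is cooling down; fall back to plain round-robin.
	return p.next % numPartitions, nil
}

// RequiresConsistency is false because we don't key messages to partitions.
func (p *cooldownPartitioner) RequiresConsistency() bool { return false }

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Partitioner = newCooldownPartitioner // plugs in as a sarama.PartitionerConstructor
	_ = cfg                                           // ...then build a producer from cfg as usual
}
```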

broker problems

cpu bound?

  • use graviton (beefy cpu instances, i4+ in aws)

on-disk storage

  • use smaller NVMe for predictable latency
  • use tiered storage for bulk/less frequently accessed
  • does NOT recommend EBS volumes for scaling out (latency/cost)

autobalance brokers

  • load shed if you need to

horizontal scale-out (if needed)

  • not great to scale up and down.

consumer problems

kafka limits consumer-broker bandwidth regardless of distinct partitions

  • run more brokers or map consumers:partitions as 1:1

watch out for consumer group rebalancing

  • rolling restarts aren’t great here, b/c every restarted consumer triggers another rebalance; instead, just kill the world and bring it all back up
  • this may change with kafka’s upcoming sticky / cooperative rebalancing features
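For reference, opting into sarama’s sticky assignment strategy looks roughly like this (group ID, topic, and broker address are placeholders; fully cooperative/incremental rebalancing is the newer protocol-level feature the note alludes to):

```go
package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
)

// handler implements sarama.ConsumerGroupHandler; the pain of rebalancing is
// mostly how much work Setup/Cleanup redo on every new generation.
type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// ... process msg ...
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_5_0_0
	// Sticky assignment keeps partitions on the same consumer across rebalances
	// where possible, so fewer partitions move on each restart.
	cfg.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategySticky

	group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "my-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		if err := group.Consume(context.Background(), []string{"events"}, handler{}); err != nil {
			log.Println("consume error:", err)
		}
	}
}
```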

You could checkpoint to local disk, if possible, which will make startup faster.
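A minimal sketch of that kind of checkpointing: periodically persist the last-processed offsets to a local file (write-then-rename so a crash can’t corrupt it) and reload them at startup. The path and format are made up for illustration:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// checkpoint holds whatever lets a restarted consumer warm-start instead of
// rebuilding from scratch; here, just last-processed offset per partition.
type checkpoint struct {
	Offsets map[int32]int64 `json:"offsets"`
}

const checkpointPath = "/var/lib/consumer/checkpoint.json" // hypothetical path

func loadCheckpoint() checkpoint {
	cp := checkpoint{Offsets: map[int32]int64{}}
	if data, err := os.ReadFile(checkpointPath); err == nil {
		_ = json.Unmarshal(data, &cp)
	}
	return cp
}

func saveCheckpoint(cp checkpoint) {
	data, _ := json.Marshal(cp)
	// Write then rename so a crash mid-write can't leave a corrupt checkpoint.
	tmp := checkpointPath + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		log.Println("checkpoint write failed:", err)
		return
	}
	if err := os.Rename(tmp, checkpointPath); err != nil {
		log.Println("checkpoint rename failed:", err)
	}
}

func main() {
	cp := loadCheckpoint() // resume from here instead of the start of retention

	for range time.Tick(30 * time.Second) {
		saveCheckpoint(cp) // in real code, cp is updated as messages are processed
	}
}
```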

Optimizations to consider

Don’t run brokers if you can avoid it. “Run. Less. Software.”

use zstd compression. CPU is ~cheaper than network. 20%+ savings on bandwidth vs snappy

^ the bandwidth savings matter because cloud providers charge for cross-AZ traffic.
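Turning zstd on in sarama is roughly a one-line config change; a sketch, with the compression level and broker version shown as illustrative values:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// zstd needs Kafka >= 2.1 (message format v2), so tell sarama the broker version.
	cfg.Version = sarama.V2_1_0_0
	cfg.Producer.Compression = sarama.CompressionZSTD
	cfg.Producer.CompressionLevel = 3 // illustrative: more CPU for less bandwidth
	cfg.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}
```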

Fully use kafka replication

  • don’t re-copy data between AZs; read from followers, not the leader (which may be cross-AZ)
  • don’t pay for more durability than you need; kafka already provides R=3
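Client-side, follower fetching (KIP-392) amounts to telling the consumer which rack/AZ it lives in; a sketch assuming a sarama version that supports it, with a placeholder broker address and rack source. The brokers also need broker.rack set and a rack-aware replica.selector.class configured:

```go
package main

import (
	"log"
	"os"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_4_0_0 // fetch-from-follower needs Kafka >= 2.4

	// Tell the client which AZ/rack it is in so brokers can point it at an
	// in-zone follower instead of a cross-AZ leader.
	cfg.RackID = os.Getenv("AVAILABILITY_ZONE") // e.g. "us-east-1a"; placeholder source

	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
}
```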

Burst balances within cloud providers may result in metastable failure states.

read more at: https://go.hny.co/kafka-lessons

Profile! Profile! Profile!

  • use the Corretto JVM, not AdoptOpenJDK
  • use a well-tuned GC algo
  • upgrade your JNI deps (e.g. zstd)
  • replace java crypto libs with the Corretto crypto provider

Chaos test your DR strategy

  • they do this weekly.
  • they do live kafka broker and consumer migrations by killing nodes.

To generate traces for streaming events / long-lived workers, they emit a “tick” event every minute; this is the root of the trace.
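A sketch of that tick pattern with OpenTelemetry’s Go API (span names and the per-minute work are made up; a real setup would also configure a TracerProvider/exporter):

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
)

func main() {
	tracer := otel.Tracer("stream-worker") // hypothetical instrumentation name

	// A long-lived worker has no natural request to root a trace on, so start a
	// synthetic "tick" root span each minute and parent that minute's work under it.
	for range time.Tick(time.Minute) {
		ctx, span := tracer.Start(context.Background(), "tick")
		processBatch(ctx)
		span.End()
	}
}

func processBatch(ctx context.Context) {
	// Child spans created from ctx hang off the current tick's root span.
	_, span := otel.Tracer("stream-worker").Start(ctx, "process-batch")
	defer span.End()
	// ... handle whatever arrived in the last minute ...
}
```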

They don’t have SLOs for kafka. They instead have SLOs on things customers care about. Can we put data in? Is it fresh when we query it?