System Design

Principles, patterns, and trade-offs for designing scalable, reliable software systems.


Core Concept

System design is the process of defining the architecture, components, modules, interfaces, and data flows of a system so that it satisfies specified requirements. It ranges from high-level architecture down to detailed component design.


Foundational Theory

Architecture Patterns

Data & Storage

Reliability

Observability

Scaling Patterns

Security Considerations


Fundamental Trade-offs

CAP Theorem (Restated)

In a distributed system experiencing a network partition, you must choose between:

  • CP: Consistent but unavailable during partition
  • AP: Available but potentially inconsistent during partition

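Note: This is not “choose 2 of 3”. Network partitions cannot be ruled out in a distributed system, so partition tolerance is a given; the real choice is between consistency and availability while a partition lasts.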
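
A toy sketch of the two choices, assuming a hypothetical `Replica` class that cannot reach its peers while `partitioned` is true. It is not a real replication protocol; it only illustrates what a client might observe in each mode.

```python
class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest value."""


class Replica:
    """Hypothetical replica used only for illustration."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None         # last value this replica has seen
        self.partitioned = False  # True while peers are unreachable

    def replicate(self, value):
        """Apply a write coming from a peer; it is lost while partitioned."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise Unavailable("cannot reach peers during partition")
        # AP (or a healthy CP replica): answer with what we have, possibly stale.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
for replica in (cp, ap):
    replica.replicate("v1")
    replica.partitioned = True
    replica.replicate("v2")  # this write never arrives

print("AP read:", ap.read())
try:
    cp.read()
except Unavailable as exc:
    print("CP read refused:", exc)
```

The AP replica keeps answering with the value it last saw, while the CP replica gives up availability until the partition heals.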

Latency vs Throughput

  • Optimizing for latency often sacrifices throughput (and vice versa)
  • Batching increases throughput but adds latency
  • Caching reduces latency but adds complexity (invalidation, stale reads)
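
The batching trade-off can be worked through with simple arithmetic. The sketch below assumes a made-up cost model: each call pays a fixed overhead (5 ms) plus a per-item cost (1 ms). Both numbers are illustrative, not measurements.

```python
def batch_stats(batch_size, overhead_ms=5.0, per_item_ms=1.0):
    """Return (time to finish one batch in ms, throughput in items/second).

    Assumed cost model: fixed overhead per call plus a linear per-item cost.
    """
    batch_ms = overhead_ms + batch_size * per_item_ms
    throughput = batch_size / batch_ms * 1000
    return batch_ms, throughput


for size in (1, 10, 100):
    batch_ms, throughput = batch_stats(size)
    print(f"batch={size:>3}  completes in {batch_ms:6.1f} ms  "
          f"throughput={throughput:6.1f} items/s")
```

Under these assumptions, larger batches amortize the fixed overhead, so throughput climbs toward 1,000 items/s, but every item now waits for its whole batch to finish. Time spent waiting for a batch to fill is not modeled here and would add further latency.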

Consistency Models

Model    | Guarantee                                      | Use Case
Strong   | All reads see the latest write                 | Financial transactions
Eventual | Replicas converge over time; reads may be stale | Social media feeds
Causal   | Causally related operations are seen in order  | Collaborative editing

Synchronous vs Asynchronous

Approach | Pros                       | Cons
Sync     | Simple, immediate feedback | Coupling, cascading failures
Async    | Decoupled, resilient       | Complexity, eventual consistency
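
A minimal sketch of the difference, assuming a hypothetical `EmailService` dependency and an in-memory outbox standing in for a real message queue. All names are illustrative.

```python
from collections import deque


class EmailService:
    """Hypothetical downstream dependency that can be switched off."""

    def __init__(self):
        self.up = True
        self.sent = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("email service down")
        self.sent.append(message)


def signup_sync(user, email):
    """Synchronous: the caller fails whenever the downstream call fails."""
    email.send(f"welcome {user}")
    return "ok"


def signup_async(user, outbox):
    """Asynchronous: record the intent and return immediately."""
    outbox.append(f"welcome {user}")
    return "accepted"


def drain(outbox, email):
    """Worker: deliver queued messages in order; stop at the first failure."""
    delivered = 0
    while outbox:
        try:
            email.send(outbox[0])
        except ConnectionError:
            break  # leave the message queued for a later attempt
        outbox.popleft()
        delivered += 1
    return delivered


email, outbox = EmailService(), deque()
email.up = False
print(signup_async("ann", outbox))   # accepted even though email is down
email.up = True
print(drain(outbox, email), email.sent)
```

The asynchronous path survives the outage, at the cost of a queue to operate and a window in which the user has signed up but the email has not yet been sent.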

Common Patterns

Communication

  • Request/Response: REST, gRPC
  • Event-Driven: Pub/sub, event sourcing
  • Streaming: WebSockets, SSE, Kafka
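
As an illustration of the event-driven style, here is a minimal in-process pub/sub sketch. The `Broker` class and the `order.placed` topic are hypothetical; a real system would put a message broker between publisher and subscribers.

```python
from collections import defaultdict


class Broker:
    """Minimal in-process pub/sub: each topic maps to a list of handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver the event to every subscriber; return how many received it."""
        handlers = list(self._handlers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = Broker()
emails, audit = [], []
broker.subscribe("order.placed", emails.append)
broker.subscribe("order.placed", audit.append)

delivered = broker.publish("order.placed", {"order_id": 1})
print(delivered, emails, audit)
```

The publisher does not know who consumes the event, which is the decoupling that separates this style from request/response.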

Data Management

  • CQRS: Separate read/write models
  • Event Sourcing: Store events, derive state
  • Saga Pattern: Distributed transactions via compensation
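
To make "store events, derive state" concrete, here is a small event-sourcing sketch for a hypothetical account balance. The event kinds and the `replay` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (illustrative event types)
    amount: int


def apply(balance, event):
    """Pure transition function: current state + event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events):
    """Derive the current balance by folding over the event log."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(log))       # current state
print(replay(log[:2]))   # state as of the second event
```

Because the log is the source of truth, past states can be reconstructed by replaying a prefix. A CQRS read model could be built from the same log by a separate fold.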

Resilience

  • Circuit Breaker: Fail fast when downstream is unhealthy
  • Bulkhead: Isolate failures to prevent cascade
  • Retry with Backoff: Handle transient failures
  • Timeout: Bound waiting time
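
A hedged sketch of two of these patterns: retry with exponential backoff and a simple circuit breaker. Thresholds, delays, and the injectable `sleep` and `clock` parameters are illustrative assumptions; production code would typically rely on a tested library.

```python
import random
import time


def retry_with_backoff(fn, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n plus jitter, then retry."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** n)
            sleep(delay + random.uniform(0, delay / 10))


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("downstream unhealthy, failing fast")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting. Retries suit transient failures, while the breaker stops callers from piling onto a dependency that is already failing.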

Scaling

  • Horizontal: Add more instances
  • Vertical: Add more resources to instance
  • Sharding: Partition data across nodes
  • Caching: Reduce load on primary stores
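
One simple way to shard is to hash the key and take the result modulo the number of shards. The sketch below assumes string keys and a hypothetical `shard_for` helper; it is one possible scheme, not the only one.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a key to a shard index in [0, num_shards).

    Uses SHA-256 rather than the built-in hash(), which is salted per
    process for strings and so is not stable across restarts.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Rough check of how evenly 10,000 synthetic keys spread over 4 shards.
counts = Counter(shard_for(f"user:{i}", 4) for i in range(10_000))
print(sorted(counts.items()))
```

Modulo hashing moves most keys when `num_shards` changes. Consistent hashing is a common alternative when shards are added or removed often.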

Estimation Framework

Back-of-Envelope Numbers (2024)

Operation              | Latency
L1 cache               | 1 ns
L2 cache               | 4 ns
RAM                    | 100 ns
SSD random read        | 16 μs
HDD seek               | 4 ms
SF to NYC round trip   | 40 ms
SF to London round trip | 80 ms
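
These numbers are most useful as ratios. The sketch below copies the table's values into a dictionary (the `LATENCY_NS` name is illustrative) and derives how many strictly sequential operations fit in one second.

```python
# Values copied from the table above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 4,
    "RAM": 100,
    "SSD random read": 16_000,
    "HDD seek": 4_000_000,
    "SF to NYC round trip": 40_000_000,
    "SF to London round trip": 80_000_000,
}

NS_PER_SECOND = 1_000_000_000


def sequential_ops_per_second(operation):
    """How many back-to-back operations of this kind fit in one second."""
    return NS_PER_SECOND // LATENCY_NS[operation]


def ratio(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


for name in LATENCY_NS:
    print(f"{name:<24} {sequential_ops_per_second(name):>13,} ops/s sequentially")
```

With these figures, a single SF to NYC round trip costs as much as 400,000 sequential RAM reads, which is why chatty cross-region protocols tend to dominate end-to-end latency.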

Capacity Planning

  1. Estimate daily active users
  2. Estimate actions per user per day
  3. Calculate average QPS (queries per second): users × actions per day ÷ 86,400 seconds
  4. Multiply by 10 to plan for growth
  5. Multiply by a further 3 for peak traffic
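
The steps above can be sketched as a small calculation. The input figures (1 million daily active users, 20 actions each) are made-up examples, and the growth and peak factors are read as multipliers.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(daily_active_users, actions_per_user_per_day,
                 growth_factor=10, peak_factor=3):
    """Return (average QPS today, QPS after growth, peak QPS after growth)."""
    average = daily_active_users * actions_per_user_per_day / SECONDS_PER_DAY
    planned = average * growth_factor
    peak = planned * peak_factor
    return average, planned, peak


average, planned, peak = estimate_qps(1_000_000, 20)
print(f"average today:   {average:8.0f} QPS")
print(f"after growth:    {planned:8.0f} QPS")
print(f"peak to design:  {peak:8.0f} QPS")
```

For these example inputs the average is roughly 231 QPS, and the design target after both multipliers is roughly 6,900 QPS. The result is an order-of-magnitude guide, not a forecast.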

Design Process

1. Requirements Clarification

  • Functional: What should it do?
  • Non-functional: Scale, latency, availability targets
  • Constraints: Budget, timeline, team skills

2. High-Level Design

  • Major components and their responsibilities
  • Data flow between components
  • API boundaries

3. Deep Dive

  • Database schema design
  • API specifications
  • Caching strategy
  • Security model

4. Trade-off Discussion

  • What are we optimizing for?
  • What are we sacrificing?
  • How might this fail?

Questions This Note Raises

  • How do you choose between microservices and monolith?
  • When is eventual consistency acceptable?
  • How do you design for multi-region?
  • What’s the role of AI in system design now?

Resources to Explore

  • “Designing Data-Intensive Applications” (Kleppmann)
  • System Design Interview books (Alex Xu)
  • High Scalability blog
  • Papers We Love repository

Hub note created 2026-01-31 by Claude. Connects system design concepts scattered across this vault.