System Design

Principles, patterns, and trade-offs for designing scalable, reliable software systems.

Core Concept

System design is the process of defining architecture, components, modules, interfaces, and data flows to satisfy specified requirements. It spans from high-level architecture to detailed component design.

Foundational Theory

CAP theorem - Consistency, Availability, Partition Tolerance trade-offs
Little’s Law - Queueing theory fundamental (L = λW: average items in the system = arrival rate × average time in the system)
Hyrum’s Law - All observable behaviors become dependencies
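
As a quick worked instance of Little’s Law, here is a minimal sketch. The function name and the example figures (200 requests per second, 50 ms in the system) are illustrative assumptions, not numbers from these notes.

```python
def items_in_system(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W, valid for long-run averages of a stable system."""
    return arrival_rate_per_s * avg_time_in_system_s


# Hypothetical service: 200 requests/s arrive, each spends 50 ms in the system.
in_flight = items_in_system(arrival_rate_per_s=200, avg_time_in_system_s=0.050)
print(f"average requests in flight: {in_flight:.1f}")
```

Under these assumed numbers the service holds about 10 requests in flight on average, which is a useful first guess for sizing worker pools or connection limits.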

Architecture Patterns

saga pattern - Distributed transaction management
c4 model (diagramming) - Architecture documentation approach
Building an HFT system with Go and Java (Coinbase) - Real-world high-performance design
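
The saga pattern listed above can be sketched in a few lines: run each step in order and, if one fails, run the compensations of the steps that already completed, in reverse. This is a toy in-process sketch; `run_saga` and the order-flow step names are hypothetical, and a real saga would persist its progress and handle failures inside compensations.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    On the first failing action, run the compensations of all completed
    steps in reverse order and return False. Return True if every step succeeds.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def fail_shipping():
    # Simulated failure of the third step.
    raise RuntimeError("carrier unavailable")


ok = run_saga([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
])
print(ok, log)
```

The failed third step never completed, so its own compensation is not run; only the card charge and the stock reservation are undone, newest first.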

Data & Storage

statistics - Foundation for capacity planning
Distributed Hash Table - P2P data distribution (from p2p/ notes)
Content Identifier - Content-addressable storage (CIDs)

Reliability

availability - Measuring and achieving uptime
Disaster Recovery - Business continuity and disaster recovery (BCDR) planning
post-incident analysis - Learning from failures
synthetic testing (CXA) - Proactive reliability testing
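
To make availability targets concrete, here is a small sketch that converts an availability fraction into a yearly downtime budget. The function name and the three sample targets are assumptions chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Return the downtime budget implied by an availability fraction (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for target in (0.99, 0.999, 0.9999):
    budget = downtime_minutes_per_year(target)
    print(f"{target:.2%} available -> {budget:,.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why the cost of reaching a target tends to rise sharply with every step.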

Observability

OpenTelemetry - Unified observability standard
OpenTracing and OpenTelemetry in async use cases - Distributed tracing patterns
Wireshark - Network-level debugging

Scaling Patterns

Kubernetes - Container orchestration for scaling
batch size - Why smaller is often better

Security Considerations

End-to-End Encryption - Data readable only by the communicating endpoints, not by intermediaries
Workload Identity Federation - Service authentication
OAuth - Authorization patterns

Fundamental Trade-offs

CAP Theorem (Restated)

In a distributed system experiencing a network partition, you must choose between:

CP: Consistent but unavailable during partition
AP: Available but potentially inconsistent during partition

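Note: This is not a free “choose 2 of 3”. Network partitions cannot be ruled out in a distributed system, so partition tolerance is required and the real choice is between consistency and availability while a partition lasts.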
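
The CP versus AP choice can be illustrated with a toy replica that either refuses writes without a majority or accepts them locally. This is a deliberately simplified sketch: the `Replica` class, its fields, and the majority rule are assumptions for illustration, and it omits replication and conflict resolution entirely.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised by a CP replica that cannot reach a majority of the cluster."""


@dataclass
class Replica:
    mode: str              # "CP" or "AP"
    cluster_size: int      # total replicas in the cluster, including this one
    reachable_peers: int   # peers this replica can currently contact
    data: dict = field(default_factory=dict)

    def has_majority(self) -> bool:
        # Count this replica plus the peers it can still reach.
        return (self.reachable_peers + 1) > self.cluster_size // 2

    def write(self, key, value):
        if self.mode == "CP" and not self.has_majority():
            # CP: give up availability rather than risk divergent state.
            raise Unavailable("no majority reachable; refusing write")
        # AP, or CP with a majority: accept the write locally.
        self.data[key] = value


# A 3-node cluster where this replica is cut off from both peers.
isolated_ap = Replica(mode="AP", cluster_size=3, reachable_peers=0)
isolated_ap.write("cart", ["book"])  # accepted, may conflict with other replicas later

isolated_cp = Replica(mode="CP", cluster_size=3, reachable_peers=0)
try:
    isolated_cp.write("cart", ["book"])
except Unavailable as exc:
    print("CP replica rejected the write:", exc)
```

The AP replica stays writable but must reconcile with its peers once the partition heals; the CP replica stays consistent by returning an error.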

Latency vs Throughput

Optimizing for latency often sacrifices throughput (and vice versa)
Batching increases throughput but adds latency
Caching reduces latency but adds complexity
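
The batching trade-off can be seen in a simple cost model where every batch pays a fixed overhead plus a per-item cost. The 5 ms overhead and 0.1 ms per-item cost are made-up assumptions, and the model ignores the time items spend waiting for a batch to fill.

```python
def batch_metrics(batch_size: int, fixed_overhead_ms: float = 5.0, per_item_ms: float = 0.1):
    """Return (time to finish one batch in ms, throughput in items per second)."""
    batch_time_ms = fixed_overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / batch_time_ms * 1000
    return batch_time_ms, throughput_per_s


for size in (1, 10, 100, 1000):
    latency_ms, throughput = batch_metrics(size)
    print(f"batch={size:>4}  latency={latency_ms:6.1f} ms  throughput={throughput:8.0f} items/s")
```

Larger batches amortize the fixed overhead, so throughput rises, but every item in the batch waits for the whole batch to finish, so latency rises too.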

Consistency Models

Model	Guarantee	Use Case
Strong	All reads see latest write	Financial transactions
Eventual	Replicas converge once writes stop; reads may be stale in the meantime	Social media feeds
Causal	Causally related ops ordered	Collaborative editing

Synchronous vs Asynchronous

Approach	Pros	Cons
Sync	Simple, immediate feedback	Coupling, cascading failures
Async	Decoupled, resilient	Complexity, eventual consistency

Common Patterns

Communication

Request/Response: REST, gRPC
Event-Driven: Pub/sub, event sourcing
Streaming: WebSockets, SSE, Kafka
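
For the event-driven style, here is a minimal in-process publish/subscribe sketch. The `Broker` class and the `order.created` topic are hypothetical; a production system would typically use a message broker with durable delivery rather than direct function calls.

```python
from collections import defaultdict


class Broker:
    """Toy synchronous pub/sub: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver message to every handler on the topic; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
received = []
broker.subscribe("order.created", received.append)
broker.subscribe("order.created", lambda msg: print("send email for order", msg["id"]))

delivered = broker.publish("order.created", {"id": 42})
print(delivered, received)
```

The publisher does not know who consumes the event, which is the decoupling that distinguishes this style from request/response.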

Data Management

CQRS: Separate read/write models
Event Sourcing: Store events, derive state
Saga Pattern: Distributed transactions via compensation
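
Event sourcing, as described above, keeps an append-only log of events and derives current state by replaying it. The bank-account domain, the `Event` type, and the function names below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn"
    amount: int  # whole currency units


def apply(balance: int, event: Event) -> int:
    """Pure state transition: old state plus one event gives new state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events) -> int:
    """Derive the current balance by folding over the full event log."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(log))      # 75
print(replay(log[:2]))  # 70, the state as of the second event
```

Because state is derived rather than stored, replaying a prefix of the log reconstructs any past state, and a separate read model (as in CQRS) can be rebuilt from the same events.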

Resilience

Circuit Breaker: Fail fast when downstream is unhealthy
Bulkhead: Isolate failures to prevent cascade
Retry with Backoff: Handle transient failures
Timeout: Bound waiting time
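
Two of the resilience patterns above can be sketched together: retry with exponential backoff and a basic circuit breaker. The names, thresholds, and delays are assumptions for illustration; the `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting.

```python
import random
import time


def retry_with_backoff(fn, attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Call fn until it succeeds or attempts run out.

    The delay ceiling doubles after each failure, and a random fraction of it
    is used (jitter) so that many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("downstream marked unhealthy; failing fast")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a downstream request as `breaker.call(lambda: retry_with_backoff(fetch))`, where `fetch` is a hypothetical function that performs the request with its own timeout.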

Scaling

Horizontal: Add more instances
Vertical: Add more resources to instance
Sharding: Partition data across nodes
Caching: Reduce load on primary stores
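
Sharding needs a stable rule that maps each key to a node. A minimal sketch using a hash and a modulo follows; the function name and the user keys are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, "-> shard", shard_for(user_id, num_shards=4))
```

A cryptographic hash is used here only because it is stable across processes, unlike Python's built-in `hash()` for strings. The modulo step means that changing `num_shards` remaps most keys, which is the problem consistent hashing is designed to reduce.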

Estimation Framework

Back-of-Envelope Numbers (2024)

Operation	Latency
L1 cache	1 ns
L2 cache	4 ns
RAM	100 ns
SSD random read	16 μs
HDD seek	4 ms
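SF to NYC round trip	~40 ms (close to the fiber propagation floor; measured round trips are usually higher)
SF to London round trip	~80 ms (close to the fiber propagation floor; measured round trips are usually higher)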
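
The table is easier to reason about as ratios. This sketch re-expresses the figures above relative to a RAM access; the values are copied from the table and are order-of-magnitude estimates, not measurements.

```python
# Latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 4,
    "RAM": 100,
    "SSD random read": 16_000,
    "HDD seek": 4_000_000,
    "SF to NYC round trip": 40_000_000,
    "SF to London round trip": 80_000_000,
}

ram_ns = LATENCY_NS["RAM"]
for operation, ns in LATENCY_NS.items():
    print(f"{operation:<26}{ns / ram_ns:>14,.2f} x RAM")
```

On these numbers a single cross-country round trip costs as much as roughly 400,000 RAM accesses, which is why chatty cross-region protocols dominate end-to-end latency.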

Capacity Planning

Estimate daily active users
Estimate actions per user per day
Calculate QPS (queries per second)
Plan for 10x growth over the current average
Multiply by 3x to cover peak traffic
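
The steps above reduce to a short calculation. In this sketch the one million daily active users and 20 actions per user are made-up inputs, and the function name is hypothetical; the 10x growth and 3x peak factors come from the list.

```python
SECONDS_PER_DAY = 86_400


def plan_capacity(dau: int, actions_per_user_per_day: float,
                  growth_factor: float = 10, peak_factor: float = 3) -> dict:
    """Turn usage estimates into average, peak, and provisioning-target QPS."""
    average_qps = dau * actions_per_user_per_day / SECONDS_PER_DAY
    return {
        "average_qps": average_qps,
        "peak_qps": average_qps * peak_factor,
        "target_qps": average_qps * growth_factor * peak_factor,
    }


plan = plan_capacity(dau=1_000_000, actions_per_user_per_day=20)
for name, qps in plan.items():
    print(f"{name:<12}{qps:>10,.0f}")
```

With these inputs the average is about 231 QPS, today's peak about 694 QPS, and the design target with growth and peak headroom about 6,944 QPS.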

Design Process

1. Requirements Clarification

Functional: What should it do?
Non-functional: Scale, latency, availability targets
Constraints: Budget, timeline, team skills

2. High-Level Design

Major components and their responsibilities
Data flow between components
API boundaries

3. Deep Dive

Database schema design
API specifications
Caching strategy
Security model

4. Trade-off Discussion

What are we optimizing for?
What are we sacrificing?
How might this fail?

Questions This Note Raises

How do you choose between microservices and monolith?
When is eventual consistency acceptable?
How do you design for multi-region?
What’s the role of AI in system design now?

Resources to Explore

“Designing Data-Intensive Applications” (Kleppmann)
System Design Interview books (Alex Xu)
High Scalability blog
Papers We Love repository

Hub note created 2026-01-31 by Claude. Connects system design concepts scattered across this vault.

The notes of Justin Abrahms
