System Design
Principles, patterns, and trade-offs for designing scalable, reliable software systems.
Core Concept
System design is the process of defining architecture, components, modules, interfaces, and data flows to satisfy specified requirements. It ranges from high-level architecture down to detailed component design.
Related Notes in This Vault
Foundational Theory
- CAP theorem - Consistency, Availability, Partition Tolerance trade-offs
- Little’s Law - Queueing theory fundamental (L = λW: average items in the system = arrival rate × average time in the system)
- Hyrum’s Law - All observable behaviors become dependencies
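As a quick worked example of Little's Law, the sketch below computes the average number of in-flight requests from an arrival rate and a mean time in the system. The function name and the sample figures (200 requests/s, 50 ms) are illustrative assumptions, not measurements from any note in this vault.

```python
def items_in_system(arrival_rate_per_s: float, mean_time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W, valid for long-run averages of a stable system."""
    return arrival_rate_per_s * mean_time_in_system_s


# Hypothetical service: 200 requests/s arrive, each spends 50 ms in the system.
in_flight = items_in_system(arrival_rate_per_s=200, mean_time_in_system_s=0.050)
print(f"average requests in flight: {in_flight:.1f}")
```

Read the other way around, the same relation gives a rough sizing rule: a worker pool smaller than the computed value would be expected to build a queue under that load.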
Architecture Patterns
- saga pattern - Distributed transaction management
- c4 model (diagramming) - Architecture documentation approach
- Building a HFT system w go and java (coinbase) - Real-world high-performance design
Data & Storage
- statistics - Foundation for capacity planning
- Distributed Hash Table - P2P data distribution (from p2p/ notes)
- Content Identifier - Content-addressable storage (CIDs)
Reliability
- availability - Measuring and achieving uptime
- Disaster Recovery - BCDR planning
- post-incident analysis - Learning from failures
- synthetic testing (CXA) - Proactive reliability testing
Observability
- Open Telemetry - Unified observability standard
- opentracing and opentelemetry in async use-cases - Distributed tracing patterns
- wireshark - Network-level debugging
Scaling Patterns
- kubernetes - Container orchestration for scaling
- batch size - Why smaller is often better
Security Considerations
- End-to-End Encryption - Data readable only by the communicating endpoints
- Workload Identity Federation - Service authentication
- OAuth - Authorization patterns
Fundamental Trade-offs
CAP Theorem (Restated)
In a distributed system experiencing a network partition, you must choose between:
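- CP: Stays consistent, but some requests are rejected or time out during a partition
- AP: Stays available, but may return stale or conflicting data during a partition
Note: You don’t “choose 2 of 3”. Partitions will happen in any distributed system, so the real choice is between consistency and availability while a partition lasts.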
Latency vs Throughput
- Optimizing for latency often sacrifices throughput (and vice versa)
- Batching increases throughput but adds latency
- Caching reduces latency but adds invalidation complexity and the risk of stale reads
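The batching trade-off can be made concrete with a toy cost model. The sketch below assumes each batch pays a fixed overhead plus a per-item cost; both constants (5 ms and 0.1 ms) and the function name are hypothetical, chosen only to show the shape of the curve.

```python
def batch_stats(batch_size: int, overhead_ms: float = 5.0, per_item_ms: float = 0.1):
    """Return (throughput in items/s, processing latency of one batch in ms).

    Toy model: every batch pays a fixed overhead plus a per-item cost.
    """
    batch_latency_ms = overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / batch_latency_ms * 1000
    return throughput_per_s, batch_latency_ms


for size in (1, 10, 100, 1000):
    throughput, latency = batch_stats(size)
    print(f"batch={size:>4}  throughput={throughput:8.0f}/s  latency={latency:6.1f} ms")
```

Larger batches amortize the fixed overhead, so throughput rises, but every item waits for the whole batch to finish. The model ignores the time items spend waiting for a batch to fill, which adds further latency in practice.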
Consistency Models
| Model | Guarantee | Use Case |
|---|---|---|
| Strong | All reads see latest write | Financial transactions |
| Eventual | Replicas converge once writes stop; reads may be stale in the meantime | Social media feeds |
| Causal | Causally related ops ordered | Collaborative editing |
Synchronous vs Asynchronous
| Approach | Pros | Cons |
|---|---|---|
| Sync | Simple, immediate feedback | Coupling, cascading failures |
| Async | Decoupled, resilient | Complexity, eventual consistency |
Common Patterns
Communication
- Request/Response: REST, gRPC
- Event-Driven: Pub/sub, event sourcing
- Streaming: WebSockets, SSE, Kafka
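To illustrate the event-driven style, here is a minimal in-process pub/sub sketch. The `Broker` class, the topic name, and the handlers are all hypothetical; a real system would put a message bus between publisher and subscribers and deliver asynchronously.

```python
from collections import defaultdict


class Broker:
    """In-process topic broker; a stand-in for a real message bus."""

    def __init__(self):
        self._handlers = defaultdict(list)  # topic -> list of callables

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver event to every subscriber of topic; return how many received it."""
        handlers = list(self._handlers.get(topic, ()))
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = Broker()
emails, audit_log = [], []
broker.subscribe("order.created", lambda e: emails.append(f"receipt for order {e['id']}"))
broker.subscribe("order.created", audit_log.append)

delivered = broker.publish("order.created", {"id": 42})
print(delivered, emails, audit_log)
```

The publisher only knows the topic name, not who is listening, which is the decoupling that distinguishes this style from request/response.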
Data Management
- CQRS: Separate read/write models
- Event Sourcing: Store events, derive state
- Saga Pattern: Distributed transactions via compensation
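"Store events, derive state" can be sketched in a few lines. The example below assumes a hypothetical bank-account domain with two event kinds; the names `Event`, `apply_event`, and `replay` are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event kinds)
    amount: int  # in cents


def apply_event(balance: int, event: Event) -> int:
    """Pure transition function: current state + event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events) -> int:
    """Derive the current balance by folding over the event log from zero."""
    balance = 0
    for event in events:
        balance = apply_event(balance, event)
    return balance


log = [Event("deposited", 1000), Event("withdrawn", 300), Event("deposited", 50)]
print(replay(log))      # 750
print(replay(log[:2]))  # 700, the state as of the second event
```

Because the log is the source of truth, any past state can be rebuilt by replaying a prefix of it. A CQRS read model is typically built the same way, by folding the same events into a query-friendly shape.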
Resilience
- Circuit Breaker: Fail fast when downstream is unhealthy
- Bulkhead: Isolate resources (threads, connections) per dependency so one failure cannot cascade
- Retry with Backoff: Handle transient failures
- Timeout: Bound waiting time
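The circuit breaker and retry-with-backoff patterns above can be sketched together. This is a simplified illustration, not a production implementation: the class and function names, the thresholds, and the choice of full jitter are all assumptions, and the clock and sleep functions are injectable only to make the sketch easy to test.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry_with_backoff(fn, attempts=4, base_delay_s=0.1, max_delay_s=2.0, sleep=time.sleep):
    """Retry fn on failure, sleeping a random time up to an exponentially growing cap."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying an open circuit would defeat failing fast
        except Exception:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retry storms
```

A typical combination wraps the downstream call in the breaker and the breaker call in the retry helper, for example `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a placeholder for the real downstream call.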
Scaling
- Horizontal: Add more instances
- Vertical: Add more resources to instance
- Sharding: Partition data across nodes
- Caching: Reduce load on primary stores
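As a small illustration of sharding, the sketch below maps a key to a shard with a stable hash and a modulo. The function name and the four-shard setup are assumptions for the example. `hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process by default.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in ("alice", "bob", "carol"):
    print(user_id, "-> shard", shard_for(user_id, 4))
```

Plain modulo placement remaps most keys whenever the shard count changes. Consistent hashing is a common alternative when shards are added or removed often.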
Estimation Framework
Back-of-Envelope Numbers (2024)
| Operation | Latency |
|---|---|
| L1 cache | 1 ns |
| L2 cache | 4 ns |
| RAM | 100 ns |
| SSD random read | 16 μs |
| HDD seek | 4 ms |
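| SF to NYC round trip | 40 ms (roughly the speed-of-light floor in fiber; measured round trips are usually higher) |
| SF to London round trip | 80 ms (optimistic; measured round trips are often well above 100 ms) |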
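One way to use these numbers is to ask how many strictly sequential operations of each kind fit into one second. The sketch below copies the table's values into a dictionary; the function name is an assumption, and the results are order-of-magnitude guides only.

```python
# Latencies from the table above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 4,
    "RAM": 100,
    "SSD random read": 16_000,
    "HDD seek": 4_000_000,
    "SF to NYC round trip": 40_000_000,
    "SF to London round trip": 80_000_000,
}

NS_PER_SECOND = 1_000_000_000


def sequential_ops_per_second(latency_ns: int) -> int:
    """How many back-to-back operations of this latency fit into one second."""
    return NS_PER_SECOND // latency_ns


for name, ns in LATENCY_NS.items():
    print(f"{name:<26}{sequential_ops_per_second(ns):>15,} ops/s")
```

The gap between rows is the useful part: a request that makes a dozen sequential cross-continent round trips has already spent about a second on the network alone, while the same number of RAM accesses is negligible.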
Capacity Planning
- Estimate daily active users
- Estimate actions per user per day
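- Calculate average QPS (queries per second): daily active users × actions per user ÷ 86,400 seconds per day
- Plan for 10x growth
- Multiply average QPS by about 3 to estimate peak traffic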
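The steps above can be sketched as a short calculation. The sketch treats both the 10x growth figure and the 3x peak figure as multipliers on average QPS, which is an interpretation of the list rather than something it states. The function name and the sample inputs (1 million daily active users, 20 actions each) are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, actions_per_user_per_day: float,
                 growth_factor: float = 10, peak_factor: float = 3) -> dict:
    """Back-of-envelope QPS: average today, peak today, and peak after growth."""
    average = dau * actions_per_user_per_day / SECONDS_PER_DAY
    return {
        "average_qps": average,
        "peak_qps": average * peak_factor,
        "peak_qps_after_growth": average * growth_factor * peak_factor,
    }


# Hypothetical product: 1 million daily active users, 20 actions each per day.
estimate = estimate_qps(dau=1_000_000, actions_per_user_per_day=20)
for name, value in estimate.items():
    print(f"{name:<24}{value:>10,.0f}")
```

With these inputs the average comes to roughly 231 QPS. The design target is the last figure, since the system has to survive peak traffic at the planned scale, not the average at today's scale.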
Design Process
1. Requirements Clarification
- Functional: What should it do?
- Non-functional: Scale, latency, availability targets
- Constraints: Budget, timeline, team skills
2. High-Level Design
- Major components and their responsibilities
- Data flow between components
- API boundaries
3. Deep Dive
- Database schema design
- API specifications
- Caching strategy
- Security model
4. Trade-off Discussion
- What are we optimizing for?
- What are we sacrificing?
- How might this fail?
Questions This Note Raises
- How do you choose between microservices and monolith?
- When is eventual consistency acceptable?
- How do you design for multi-region?
- What’s the role of AI in system design now?
Resources to Explore
- “Designing Data-Intensive Applications” (Kleppmann)
- System Design Interview books (Alex Xu)
- High Scalability blog
- Papers We Love repository
Hub note created 2026-01-31 by Claude. Connects system design concepts scattered across this vault.