root.system / 0x10 / system

No single machine
knows everything.

A distributed system is a set of independent computers that, to anyone using them, appear as one coherent system. There is no boss, no shared clock, no shared memory; nodes can crash, lie, lose messages, or drop off the network entirely. Despite all of that, the system has to keep working and keep agreeing on a single truth. Everything that follows is the techniques humanity has invented to make that possible.

Beginner// level 01

What is a distributed system?

Imagine a village where everyone keeps the same ledger. When you spend money, you announce it; everyone updates their copy; the next time you try to spend, everyone checks their own copy. No bank. No central record. The truth is whatever a majority of ledgers say. That, in one sentence, is a distributed system.

Three things make distributed systems hard, and they're hard in a way that single-machine programming never prepared you for:

  • No shared memory. Nodes can only know what other nodes tell them, and the telling happens over a network that drops, delays, and reorders messages.
  • No shared clock. Different machines disagree about what time it is, sometimes by seconds. You cannot rely on "I sent this first" being globally true.
  • Partial failure. In a single program, either it runs or it crashes. In a distributed system, half of the nodes might be working while the other half are unreachable, and they cannot tell which side of that partition they are on.

The Byzantine Generals problem

The classical statement of the hardest version of this problem is from 1982 (Lamport, Shostak, Pease). Imagine several generals of the Byzantine army surrounding an enemy city. They must agree to attack at the same time or not at all. They communicate only by messenger. Some of the messengers may be intercepted. Some of the generals may be traitors, sending different orders to different generals to sow confusion.

Can the loyal generals still reach an agreement, even with traitors among them? The paper proved that yes, they can; but only if more than two thirds of the generals are honest. That two-thirds threshold reappears in every distributed system ever since.

// the ledger analogy
Replace "general" with "computer", "messenger" with "TCP connection", and "loyal vs traitor" with "honest vs malicious node", and you have the model that underlies every modern distributed database, every blockchain, and most cluster managers. The agreement isn't social; it's mathematical, secured by a quorum.

See it: clicking a node propagates a message

// gossip propagation
N1N2N3N4N5N6

click any node, it broadcasts to its three peers; each of those forwards to theirs; within a few hops every node has the message. no coordinator, just forwarding.

That is, in miniature, the basic mechanic. A node receives a message, then forwards it to the peers it knows about. Each of those forwards it onward. There is no coordinator and no central server; the message simply spreads, like a rumour in a crowded room, until everyone has it.

Most of the rest of this page is about the things that go wrong with this picture and the tricks to fix them.

Intermediate// level 02

How nodes agree: CAP, gossip, partitions

The CAP theorem: pick any two

Eric Brewer's CAP theorem (2000) is the most famous result in distributed systems. It states that a system can guarantee at most two of the following three properties:

  • Consistency (C): every read sees the most recent write.
  • Availability (A): every request gets some response (not an error).
  • Partition tolerance (P): the system keeps working when the network splits.

In practice, partition tolerance is non-negotiable; networks will partition. So the real choice is between C and A under partition. That's the actual tradeoff every database designer makes.

// CAP theorem - pick any two
CconsistencyAavailabilityPpartition tolerance

Bitcoin / blockchain: every node sees the same canonical chain; sacrifices availability under partition.

// CAP, refined: PACELC
Daniel Abadi's PACELC formulation (2010) makes CAP more precise: under a Partition you choose between A and C, else (no partition) you choose between Latency and Consistency. Most production systems pick AP under partition and EL (low latency over strong consistency) otherwise: DynamoDB, Cassandra, Riak. Strongly consistent systems like Spanner or Bitcoin pick CP under partition and pay the latency cost.

What happens when the network splits

// network partition + reconciliation
777777

partition the network, then write on either side - values diverge. heal the partition and the system reconciles (here: highest value wins; real CRDTs use more sophisticated merge rules).

Partitions are the heart of the problem. While the network is split, the two halves cannot tell which side has the "true" history. Each side can keep accepting writes (preserving availability) but those writes will diverge. When the partition heals, the system has to reconcile.

Strategies for reconciliation, roughly from simplest to most clever:

  • Last write wins. Use a wall-clock timestamp on every write; the latest timestamp wins. Simple, but lossy: the older write is silently dropped.
  • Version vectors / Lamport clocks. Track logical causality between updates so you can tell which write happened "before" the other even without synchronised clocks.
  • CRDTs (Conflict-free Replicated Data Types). Data structures designed so that merging two divergent copies always produces the same result regardless of order. Used in collaborative editors like Figma and Linear's sync layer.
  • Consensus protocols. Don't allow divergence in the first place: every write requires a quorum of nodes to agree. Paxos, Raft, PBFT, HotStuff. Stronger consistency, higher latency.

Gossip: how information spreads without a coordinator

The interactive widget at the top of the page was a sketch of gossip. Concrete properties of real gossip protocols:

  • Epidemic. Each node periodically picks a few peers at random and exchanges state. Information spreads exponentially; full network coverage is reached in O(log n) rounds.
  • Resilient. No central node to fail. Drop half the nodes; the survivors keep gossiping.
  • Tunable. Choose the fanout (how many peers per round) and the round duration to trade bandwidth for convergence speed.
  • Eventually consistent. All live nodes will agree, eventually. The word "eventually" is doing real work.

Real gossip protocols include SWIM (used by HashiCorp memberlist, Consul, Cassandra), HyParView + Plumtree (Erlang clusters), and the custom flooding used by Bitcoin and Ethereum. The networking page set up TCP and IP; gossip is what most distributed systems do on top.

// connect back to the hashing page
Many gossip implementations use consistent hashing (a hash ring) to decide which node "owns" which piece of state. Each node and each key get a position on a ring computed from hash(id) mod 2^32; a key is owned by the next node clockwise from its position. Adding or removing one node only moves 1/N of the keys, not all of them. The hashing page is the prerequisite.
Advanced// level 03

Distributed systems in code, and why Bitcoin is one

1. The connection: every node is a TCP listener and a TCP dialer

Every node in a distributed system has two faces. It listens on a port (for inbound peers) and dials peers on their ports (for outbound traffic). The OS handles the byte-level transport; the application speaks a protocol on top. Below: the C and Rust skeletons that every node, from a Raft replica to a Bitcoin Core instance, is built around.

Rust• • •
// One node, two roles: a TCP listener (the "server")
// and a TCP dialer (the "client"). Every node in a real
// distributed system runs both at once.
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

fn run_server(addr: &str) -> std::io::Result<()> {
    let listener = TcpListener::bind(addr)?;
    for incoming in listener.incoming() {
        let mut sock = incoming?;
        let mut buf = [0u8; 1024];
        let n = sock.read(&mut buf)?;
        // Echo back; in a real node we'd parse a protocol message
        // (ping, gossip-tx, request-block, etc.) and dispatch.
        sock.write_all(&buf[..n])?;
    }
    Ok(())
}

fn dial(peer: &str, payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut sock = TcpStream::connect(peer)?;
    sock.write_all(payload)?;
    let mut reply = Vec::new();
    sock.read_to_end(&mut reply)?;
    Ok(reply)
}
C• • •
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

// Listening side of a node.
int run_server(uint16_t port) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(s, (struct sockaddr*)&addr, sizeof addr);
    listen(s, 32);

    for (;;) {
        int c = accept(s, NULL, NULL);
        char buf[1024];
        ssize_t n = recv(c, buf, sizeof buf, 0);
        send(c, buf, n, 0);             // echo for the demo
        close(c);
    }
}

// Dialing side. Open a connection to a peer, send bytes, read a reply.
int dial(const char *ip, uint16_t port, const char *payload) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    connect(s, (struct sockaddr*)&addr, sizeof addr);

    send(s, payload, strlen(payload), 0);
    char buf[1024];
    ssize_t n = recv(s, buf, sizeof buf, 0);
    close(s);
    return (int)n;
}

2. The hash ring: where does this key live?

Once you have many nodes, the next question is "which node owns which data?" The naive answer (hash(key) mod N) breaks every time you add or remove a node: all keys need to be remapped. Consistent hashing fixes this by placing both nodes and keys on a virtual ring and assigning each key to the next node clockwise from its position. Add a new node and only the keys near its ring position move.

Rust• • •
// Consistent hashing: a fixed ring of slots from 0 to 2^32 - 1.
// Each node is placed at hash(node_id) on the ring; each key is placed
// at hash(key); a key belongs to the first node clockwise from its
// position. Adding or removing a node only relocates 1/N of keys.
use sha2::{Sha256, Digest};

fn hash_to_u32(s: &str) -> u32 {
    let mut h = Sha256::new();
    h.update(s.as_bytes());
    let digest = h.finalize();
    u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]])
}

struct HashRing {
    nodes: Vec<(u32, String)>,   // (position_on_ring, node_id)
}

impl HashRing {
    fn add(&mut self, node: &str) {
        self.nodes.push((hash_to_u32(node), node.to_string()));
        self.nodes.sort_by_key(|&(pos, _)| pos);
    }

    fn owner(&self, key: &str) -> &str {
        let kpos = hash_to_u32(key);
        // First node whose position >= key position; wrap to first if none.
        self.nodes
            .iter()
            .find(|&&(pos, _)| pos >= kpos)
            .map(|(_, id)| id.as_str())
            .unwrap_or(&self.nodes[0].1)
    }
}

// In Cassandra, DynamoDB, memcached clusters: this is how you find the
// node that owns a given key. Add a node and only ~1/N of keys move.
C• • •
// Consistent hashing: a fixed ring of slots from 0 to 2^32 - 1.
// Each node is placed at hash(node_id) on the ring; each key is placed
// at hash(key); a key belongs to the first node clockwise from its
// position. Adding or removing a node only relocates 1/N of keys.
use sha2::{Sha256, Digest};

fn hash_to_u32(s: &str) -> u32 {
    let mut h = Sha256::new();
    h.update(s.as_bytes());
    let digest = h.finalize();
    u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]])
}

struct HashRing {
    nodes: Vec<(u32, String)>,   // (position_on_ring, node_id)
}

impl HashRing {
    fn add(&mut self, node: &str) {
        self.nodes.push((hash_to_u32(node), node.to_string()));
        self.nodes.sort_by_key(|&(pos, _)| pos);
    }

    fn owner(&self, key: &str) -> &str {
        let kpos = hash_to_u32(key);
        // First node whose position >= key position; wrap to first if none.
        self.nodes
            .iter()
            .find(|&&(pos, _)| pos >= kpos)
            .map(|(_, id)| id.as_str())
            .unwrap_or(&self.nodes[0].1)
    }
}

// In Cassandra, DynamoDB, memcached clusters: this is how you find the
// node that owns a given key. Add a node and only ~1/N of keys move.

3. A gossip protocol, simplified

Rust• • •
// A toy gossip loop. Each tick: pick a random peer, send our current
// view. Over enough ticks, every node converges to the same state.
use std::collections::HashSet;

struct Node {
    id: u32,
    peers: Vec<u32>,
    seen: HashSet<u32>,   // ids of messages we've already received
}

impl Node {
    fn receive(&mut self, msg_id: u32) -> bool {
        // Return true if this is a new message (worth forwarding).
        self.seen.insert(msg_id)
    }

    fn gossip(&self, msg_id: u32, network: &mut [Node]) {
        // Forward to a small random subset of peers.
        for &peer_id in self.peers.iter().take(3) {
            let peer = &mut network[peer_id as usize];
            if peer.receive(msg_id) {
                // peer also gossips on next tick; epidemic spread.
            }
        }
    }
}

// Real implementations:
//  - Bitcoin: send "inv" announcement, peer requests with "getdata"
//  - Ethereum devp2p: same pattern, richer message types
//  - SWIM (HashiCorp memberlist, Cassandra): ping + indirect ping
//  - HyParView + Plumtree (Erlang): structured gossip overlays
C• • •
// A toy gossip loop. Each tick: pick a random peer, send our current
// view. Over enough ticks, every node converges to the same state.
use std::collections::HashSet;

struct Node {
    id: u32,
    peers: Vec<u32>,
    seen: HashSet<u32>,   // ids of messages we've already received
}

impl Node {
    fn receive(&mut self, msg_id: u32) -> bool {
        // Return true if this is a new message (worth forwarding).
        self.seen.insert(msg_id)
    }

    fn gossip(&self, msg_id: u32, network: &mut [Node]) {
        // Forward to a small random subset of peers.
        for &peer_id in self.peers.iter().take(3) {
            let peer = &mut network[peer_id as usize];
            if peer.receive(msg_id) {
                // peer also gossips on next tick; epidemic spread.
            }
        }
    }
}

// Real implementations:
//  - Bitcoin: send "inv" announcement, peer requests with "getdata"
//  - Ethereum devp2p: same pattern, richer message types
//  - SWIM (HashiCorp memberlist, Cassandra): ping + indirect ping
//  - HyParView + Plumtree (Erlang): structured gossip overlays

Byzantine fault tolerance: 3f + 1

A non-Byzantine consensus algorithm (Paxos, Raft) tolerates crash failures: a node either runs honestly or stops responding. With n nodes, Raft survives (n-1)/2 crashes.

Byzantine consensus is stricter: nodes might lie, send conflicting messages to different peers, or collude. The classical result (Lamport et al., 1982) says you need at least 3f + 1 total nodes to tolerate f Byzantine faults:

Rust• • •
// Byzantine fault tolerance: with f faulty nodes you need at least
// 3f + 1 total to guarantee agreement.
//
//   f = 0  ->  n = 1   (trivial, single trusted node)
//   f = 1  ->  n = 4   (3 honest, 1 byzantine)
//   f = 2  ->  n = 7   (5 honest, 2 byzantine)
//   f = 3  ->  n = 10  (PBFT can guarantee safety + liveness)
//
// The intuition: each "phase" needs 2f+1 acknowledgements to be sure
// the agreement isn't being faked, and the network must still progress
// even with f silent nodes. 2f+1 (for safety) + f (to outvote bad
// quorum) = 3f+1 minimum.
//
// PBFT (Castro & Liskov, 1999) made this practical. Tendermint,
// HotStuff, Diem BFT, and many modern proof-of-stake chains
// are descendants of PBFT.

#include <stdbool.h>
#include <stdint.h>

bool can_agree(int total_nodes, int faulty) {
    return total_nodes >= 3 * faulty + 1;
}
C• • •
// Byzantine fault tolerance: with f faulty nodes you need at least
// 3f + 1 total to guarantee agreement.
//
//   f = 0  ->  n = 1   (trivial, single trusted node)
//   f = 1  ->  n = 4   (3 honest, 1 byzantine)
//   f = 2  ->  n = 7   (5 honest, 2 byzantine)
//   f = 3  ->  n = 10  (PBFT can guarantee safety + liveness)
//
// The intuition: each "phase" needs 2f+1 acknowledgements to be sure
// the agreement isn't being faked, and the network must still progress
// even with f silent nodes. 2f+1 (for safety) + f (to outvote bad
// quorum) = 3f+1 minimum.
//
// PBFT (Castro & Liskov, 1999) made this practical. Tendermint,
// HotStuff, Diem BFT, and many modern proof-of-stake chains
// are descendants of PBFT.

#include <stdbool.h>
#include <stdint.h>

bool can_agree(int total_nodes, int faulty) {
    return total_nodes >= 3 * faulty + 1;
}

State machine replication: the trick behind every cluster

The unifying idea behind every consensus protocol is state machine replication: take a deterministic state machine (a database, a key-value store, a blockchain ledger), apply the same sequence of inputs on every node, and every node ends up in the same state. The consensus protocol's only job is to agree on the order of inputs.

  • Paxos / Raft: agree on the log entries (crash-fault tolerant). Backbone of etcd, ZooKeeper, CockroachDB, Spanner.
  • PBFT / HotStuff / Tendermint: agree on the log entries (Byzantine fault tolerant). Used by Diem, Cosmos, Aptos, modern proof-of-stake chains.
  • Nakamoto consensus (Bitcoin): agree on the longest valid chain. Eventually consistent, probabilistic finality, no quorum required.

Bitcoin is a distributed system

Once you have the vocabulary above, Bitcoin slots into it cleanly:

  • The ledger everyone keeps is the blockchain.
  • Each node is a computer running Bitcoin Core (written in C++) or a compatible implementation.
  • The gossip protocol spreads transactions and blocks. inv, getdata, tx, block are message types on a TCP overlay.
  • The consensus mechanism is proof-of-work: find a nonce that hashes the block header below a target, broadcast the block, and the rest of the network either accepts it (extends the chain) or rejects it.
  • The CAP choice: Bitcoin is CP. Under a network partition, both halves keep producing blocks but only one chain will eventually be canonical. Availability of writes degrades; consistency wins.

Three blockchains, three PACELC opinions:

  • Bitcoin: CP under partition, EC otherwise. Strong eventual consistency through proof-of-work + longest chain. Latency is in tens of minutes by design.
  • Ethereum (post-Merge): CP under partition, EC otherwise. Proof-of-stake + Casper finality. Latency in seconds, deterministic finality after a few epochs.
  • Solana: leans AP under stress (occasional halts notwithstanding), EL otherwise. Single global leader at a time, very high throughput, occasional liveness loss when the leader misbehaves.

The poster: no single machine knows everything

NO SINGLE MACHINE KNOWS EVERYTHINGmany nodes, one shared truth - and the network routes around the failed oneB1B2B3N1N2N3N4N5offlineN6N7N8SHARED CHAINevery live node holds a copyN5 is offline; gossip flows around it. when it returns it asks peers for what it missed.
// connect back to the blockchain page
The blockchain page builds Bitcoin from the inside out (blocks, mining, the chain). This page is the other half of the picture: Bitcoin from the outside in, as one example of the distributed-systems pattern. The two pages share the same gossip diagram for a reason: gossip is the bridge between them.

Every previous topic, one click away

This page synthesises material from across the site. If a callback above caught you flat-footed, the original is one click away:

Where to dig in next

Distributed systems is the deepest rabbit hole on the site. Pick any of these and you can spend years:

  • Lamport's "Time, Clocks and the Ordering of Events" (1978). The paper that started rigorous distributed-systems thinking.
  • Paxos and Raft. The two canonical crash-fault-tolerant consensus algorithms. Raft was designed to be more teachable than Paxos; both are worth reading.
  • The PBFT paper (Castro & Liskov, 1999). The classic practical Byzantine algorithm; everything modern is a descendant.
  • HotStuff and the modern BFT line. Linear message complexity, chain-style consensus; the basis of Diem, Aptos, several proof-of-stake chains.
  • Designing Data-Intensive Applications (Kleppmann). Single best book on the everything of distributed databases.
  • The Raft demo at raft.github.io. An interactive simulator; play with leader elections, partitions, and log replication.
  • Jepsen reports (jepsen.io). Aphyr's tests of real distributed databases under partition. Equal parts technical and entertaining.
next up / 0x11
You can only guarantee two. The silent tradeoff every system makes.
cap theorem