root.system / 0x0D / structure

A fingerprint for
anything.

A hash function turns arbitrary input (a string, a file, a struct, a gigabyte of transactions) into a small, fixed-size number. From that one idea you get hash maps (the workhorse of every dynamic language), content-addressed storage, password hashing, Merkle trees, and the chain in blockchain. Every layer is the same trick, just applied with more or less paranoia.

Beginner// level 01

Hashing as a fingerprint function

Imagine a librarian with a peculiar talent. You hand them any book (a thin pamphlet, a thousand-page novel, a comic) and within a second they hand back a five-digit number. The same book always gets the same number. Different books almost always get different numbers. The number tells you nothing about the book itself, and you can't reconstruct the book from the number. That number is a hash, and the librarian is a hash function.

A hash function:

Takes input of arbitrary size: bytes are bytes.
Returns a fixed-size number: 32, 64, or 256 bits, depending on the function.
Is deterministic: same input → same output, every time, on every machine.
Spreads inputs uniformly: similar inputs get wildly different outputs.
Is one-way: cheap to compute forward, useless to run in reverse.

A picture of a hash function

Notice the third row: "alicf" differs from "alice" by a single character, but the output is completely different. This is the avalanche property: flip one input bit and roughly half the output bits flip. It's what makes hashes useful for both lookup tables and tamper detection.

// connect back to the arrays page

A hash map starts with the array from the arrays page: a contiguous block of buckets, addressable by index. Hashing is the trick that turns a key like "alice" into a bucket index. From there it's the same O(1) indexed access an array always gave you.

A hash map: hash → index → array slot

This is the whole magic, in one picture. To store the pair ("alice", 27):

Compute hash("alice"): say it returns the 64-bit number 0x5d1a8b….
Reduce it to a bucket index with hash % capacity, which here gives 3.
Write ("alice", 27) into buckets[3].

To look it up again, you do the same thing: hash "alice", get the same 3, read buckets[3]. No walking. No searching. The lookup time doesn't depend on how many keys are stored.

Build one in Rust and C

Rust• • •
use std::collections::HashMap;

fn main() {
    // HashMap<K, V> is Rust's hash map: an array of buckets,
    // indexed by hash(key). Insert and lookup are amortised O(1).
    let mut ages: HashMap<&str, u32> = HashMap::new();

    ages.insert("alice", 27);
    ages.insert("bob",   31);
    ages.insert("carol", 42);

    // Lookup: hash "alice", land in a bucket, return the value.
    // No walking 3 entries comparing names. The hash tells us
    // exactly which bucket to look in.
    if let Some(age) = ages.get("alice") {
        println!("alice is {age}");
    }

    // What the standard library does under the hood:
    //   1. compute hasher.finish() for "alice"            (a u64)
    //   2. reduce it to a bucket index by & (capacity - 1)
    //   3. jump straight to that bucket
    //   4. compare keys (in case of collision) until match or empty
}

C• • •
#include <stdio.h>
#include <string.h>
#include <stdint.h>

// A toy fixed-size hash map. Real ones grow and handle collisions
// far more carefully; see the intermediate section.
#define BUCKETS 8

typedef struct {
    const char *key;
    int         value;
    int         used;
} Entry;

// A small, fast (non-cryptographic) string hash: FNV-1a.
uint64_t fnv1a(const char *s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void) {
    Entry table[BUCKETS] = {0};

    // Insert: hash the key, mod by capacity, drop into that slot.
    const char *k = "alice";
    uint64_t h = fnv1a(k);
    int idx = h & (BUCKETS - 1);   // works because BUCKETS is a power of 2
    table[idx].key   = k;
    table[idx].value = 27;
    table[idx].used  = 1;

    // Lookup: same hash, same index, read the value.
    int q = fnv1a("alice") & (BUCKETS - 1);
    if (table[q].used && strcmp(table[q].key, "alice") == 0)
        printf("alice -> %d\n", table[q].value);

    return 0;
}

// hash maps are everywhere, even when they're invisible

Python dict, JavaScript objects and Map, Go map[K]V, Rust HashMap, Java HashMap, C# Dictionary, Lua tables, Redis itself: all hash maps under the hood. Every time you write obj.foo in JavaScript or d["key"] in Python, you're doing the diagram above. The "key" gets hashed, the hash points at a slot, and the value gets read in constant time.

Intermediate// level 02

Collisions, chaining, and the load factor

The beginner picture has a hole in it. If buckets has 8 slots and you insert 100 keys, the pigeonhole principle says at least one bucket has to hold more than one key. Two distinct keys whose hashes land in the same bucket is called a collision, and every real hash map needs a strategy for handling them.

Strategy 1: Separate chaining

Each bucket isn't a single slot; it's a list of entries. To insert, you hash, find the bucket, walk the list, append (or update if the key already exists). To look up, you hash, find the bucket, walk the list comparing keys.

This is where the previous two pages pay off. A chained hash map is literally an array of linked lists. The arrays page gives you O(1) bucket lookup. The linked-list page gives you O(1) splicing of new entries into a bucket. Both data structures, both doing what they're best at, side by side.

// connect back to the linked-list page

The chain at buckets[3] is a singly linked list, the same shape as the diagram on the linked-list page. The good news: chains are usually tiny. With a sensible load factor (next section), the average chain length is well under 2. The bad news: each chain node is a separate heap allocation, with all the cache penalty the linked-list page warned about. That's why modern implementations prefer open addressing.

Rust• • •
// A hash map with separate chaining, built by hand.
// Buckets is a Vec; each bucket is a Vec of (key, value) entries.
// This is exactly Vec<LinkedList<(K, V)>> in shape: array + linked-list
// chains from the two previous pages, working together.
struct ChainMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
}

impl<K: std::hash::Hash + Eq, V> ChainMap<K, V> {
    fn new(cap: usize) -> Self {
        Self { buckets: (0..cap).map(|_| Vec::new()).collect() }
    }

    fn bucket_for(&self, key: &K) -> usize {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: K, value: V) {
        let i = self.bucket_for(&key);
        let bucket = &mut self.buckets[i];
        // Update if key already present, otherwise push.
        for (k, v) in bucket.iter_mut() {
            if *k == key { *v = value; return; }
        }
        bucket.push((key, value));
    }

    fn get(&self, key: &K) -> Option<&V> {
        let i = self.bucket_for(key);
        self.buckets[i].iter().find(|(k, _)| k == key).map(|(_, v)| v)
    }
}

C• • •
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

// Same idea in C: array of bucket-heads, each one a singly linked list.
typedef struct Entry {
    char         *key;
    int           value;
    struct Entry *next;       // chain pointer, the linked-list page
} Entry;

#define BUCKETS 256
static Entry *table[BUCKETS] = {0};

uint64_t fnv1a(const char *s);   // from the previous snippet

void put(const char *key, int value) {
    uint64_t h = fnv1a(key) & (BUCKETS - 1);

    // Walk the chain in case the key already exists.
    for (Entry *e = table[h]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) { e->value = value; return; }
    }

    // New entry: prepend to the bucket's chain (O(1)).
    Entry *e = malloc(sizeof *e);
    e->key   = strdup(key);
    e->value = value;
    e->next  = table[h];
    table[h] = e;
}

Strategy 2: Open addressing

The alternative: no linked lists at all. The bucket array is the storage. When a bucket is taken, probe the next one, then the next, until you find an empty slot. To look up a key, hash to the start position and walk forward (using the same probe pattern) until you find the key, or hit an empty slot, or wrap around.

This is the strategy used by Rust's std::collections::HashMap, Google's Abseil (SwissTable), Python dict (since 3.6), and most other modern implementations. Why? Cache locality. Everything lives in one contiguous Vec. The CPU's cache prefetcher loves it. The linked-list version, with its scattered heap allocations, doesn't.

Rust• • •
// Open addressing: NO linked lists. When a bucket is taken, probe the
// next slot. Modern hash maps (Swiss tables, Rust's std HashMap,
// Google's Abseil) use this. Cache-friendly because everything stays
// in one contiguous Vec, exactly what the memory page would approve of.
struct OpenMap<K, V> {
    buckets: Vec<Option<(K, V)>>,
}

impl<K: std::hash::Hash + Eq + Clone, V: Clone> OpenMap<K, V> {
    fn insert(&mut self, key: K, value: V) {
        let cap = self.buckets.len();
        let mut i = self.hash(&key) % cap;
        loop {
            match &self.buckets[i] {
                None => { self.buckets[i] = Some((key, value)); return; }
                Some((k, _)) if *k == key => {
                    self.buckets[i] = Some((key, value));
                    return;
                }
                _ => i = (i + 1) % cap, // linear probe to the next slot
            }
        }
    }

    fn hash(&self, key: &K) -> usize {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        h.finish() as usize
    }
}

// Why this beats chaining on real machines:
//   - one contiguous array → every probe step hits prefetched cache lines
//   - no per-entry allocation, no pointer chases
//   - the linked-list page's cache penalty applies to chaining, not here

C• • •
// Open addressing: NO linked lists. When a bucket is taken, probe the
// next slot. Modern hash maps (Swiss tables, Rust's std HashMap,
// Google's Abseil) use this. Cache-friendly because everything stays
// in one contiguous Vec, exactly what the memory page would approve of.
struct OpenMap<K, V> {
    buckets: Vec<Option<(K, V)>>,
}

impl<K: std::hash::Hash + Eq + Clone, V: Clone> OpenMap<K, V> {
    fn insert(&mut self, key: K, value: V) {
        let cap = self.buckets.len();
        let mut i = self.hash(&key) % cap;
        loop {
            match &self.buckets[i] {
                None => { self.buckets[i] = Some((key, value)); return; }
                Some((k, _)) if *k == key => {
                    self.buckets[i] = Some((key, value));
                    return;
                }
                _ => i = (i + 1) % cap, // linear probe to the next slot
            }
        }
    }

    fn hash(&self, key: &K) -> usize {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        h.finish() as usize
    }
}

// Why this beats chaining on real machines:
//   - one contiguous array → every probe step hits prefetched cache lines
//   - no per-entry allocation, no pointer chases
//   - the linked-list page's cache penalty applies to chaining, not here

Chaining vs open addressing, in one table

	chaining	open addressing
collision handling	linked list per bucket	probe the next slot
memory layout	array of buckets + heap-allocated chains	one contiguous array
cache friendliness	poor: pointer chases through the heap	excellent: every probe hits prefetched cache lines
delete	easy: unlink the node	trickier: tombstones, or shift entries back
used by	Java `HashMap`, classic textbooks	Rust, Python 3.6+, Swiss tables, Abseil

The load factor and resizing

How full a hash map is matters enormously. The load factor is num_entries / num_buckets. As it rises, collisions get more frequent, and the "constant-time" lookup gets slower. As it falls, you waste memory.

The standard trick: once the load factor passes some threshold (typically 0.7 to 0.9), resize. Allocate a new bucket array, usually twice as big, and re-hash every entry into it. The resize is O(n), but it happens rarely enough that the average insert is still amortised O(1), exactly the same argument as Vec::push from the arrays page.

// the worst case is still O(n)

If an attacker can choose keys that all hash to the same bucket, your O(1) hash map degrades to an O(n) linked-list walk. This isn't theoretical: "hash flooding" was a real DoS vector in PHP, Java, and others. Modern languages defend against it by randomly seeding their hash functions at process start (SipHash in Rust, Python, etc.). The hash is unpredictable across runs, so an attacker can't precompute a collision set.

Advanced// level 03

Cryptographic hashes, Merkle trees, and the blockchain

When 'good enough' isn't enough

The hash functions inside HashMap (FNV-1a, SipHash, fxhash, wyhash) are fast. A few nanoseconds per key. They have to be; you'd never use one of them if it cost a microsecond. They distribute keys well, they're hard for an attacker to flood, and that's where their job ends.

Cryptographic hashes (SHA-256, BLAKE3, Keccak) are a different beast. They're slower (10s of nanoseconds per byte, not per key) and they add three more properties:

Pre-image resistance: given a hash h, you cannot find an input x such that hash(x) = h. (Better than that: you cannot find one in less than ~2²⁵⁶ attempts.)
Collision resistance: you cannot find two distinct inputs that produce the same hash, in any reasonable amount of compute.
Avalanche: change one bit of input, roughly half the bits of the output flip, chaotically and unpredictably.

These three properties turn the hash into something stronger than an index. They turn it into an identity.

Rust• • •
// Cryptographic hashes are like map hashes, but stronger.
//
// Properties HashMap doesn't need but blockchains do:
//   1. Pre-image resistance: given H(x), can't find x.
//   2. Collision resistance: can't find x, y with H(x) == H(y).
//   3. Avalanche: flip one bit of x → ~half the bits of H(x) flip.
//
// The price is speed: SHA-256 takes ~10 ns per byte; FNV-1a takes
// fractions of a nanosecond. You'd never use SHA-256 in a HashMap.
use sha2::{Sha256, Digest};

fn main() {
    let block_data = b"prev_hash || merkle_root || nonce=42";
    let mut hasher = Sha256::new();
    hasher.update(block_data);
    let digest = hasher.finalize();

    // 32 bytes (256 bits) regardless of input length.
    println!("{:x}", digest);

    // Bitcoin block-mining loop, in 4 lines:
    //   for nonce in 0.. {
    //       let h = sha256(block_with(nonce));
    //       if h.starts_with_zero_bits(target) { break }
    //   }
}

C• • •
// Cryptographic hashes are like map hashes, but stronger.
//
// Properties HashMap doesn't need but blockchains do:
//   1. Pre-image resistance: given H(x), can't find x.
//   2. Collision resistance: can't find x, y with H(x) == H(y).
//   3. Avalanche: flip one bit of x → ~half the bits of H(x) flip.
//
// The price is speed: SHA-256 takes ~10 ns per byte; FNV-1a takes
// fractions of a nanosecond. You'd never use SHA-256 in a HashMap.
use sha2::{Sha256, Digest};

fn main() {
    let block_data = b"prev_hash || merkle_root || nonce=42";
    let mut hasher = Sha256::new();
    hasher.update(block_data);
    let digest = hasher.finalize();

    // 32 bytes (256 bits) regardless of input length.
    println!("{:x}", digest);

    // Bitcoin block-mining loop, in 4 lines:
    //   for nonce in 0.. {
    //       let h = sha256(block_with(nonce));
    //       if h.starts_with_zero_bits(target) { break }
    //   }
}

// content-addressed storage

Git, IPFS, Docker image layers, Nix package store: every one of these stores its objects under the hash of their contents. Two files with the same contents have the same hash, which means they're literally the same file in the store. Tampering with a file changes its hash, which means tampering is instantly detectable. The whole architecture rests on one assumption: nobody can find two inputs with the same hash. So far, for SHA-256, nobody has.

Merkle trees: hashing, but recursive

You have a million transactions. You want a single 32-byte fingerprint that proves none of them have been tampered with, and you want it to be efficient to verify any one transaction's membership in the set without re-hashing the whole million.

The Merkle tree solves this. Hash each transaction. Pair the leaf hashes and hash those pairs. Repeat until one hash is left: the Merkle root.

The root is a 32-byte fingerprint of the entire tree. Two superpowers fall out of this structure:

If any leaf changes (flip a single bit on tx2), every hash on the path from that leaf to the root changes. The root becomes unrecognisable.
To prove tx2 is part of the tree, you don't need the other million transactions. You need H(tx1), the sibling hash one level up, and the sibling hash above that. That's O(log n) hashes: a few dozen, even for trees of billions of leaves. This is the Merkle proof, and it's how light wallets, Certificate Transparency, BitTorrent, and Git all prove inclusion cheaply.

Rust• • •
// A tiny Merkle root, in 20 lines.
// Each leaf is hashed. Pairs of hashes are concatenated and hashed.
// Repeat until one hash remains: the root.
use sha2::{Sha256, Digest};

fn h(bytes: &[u8]) -> Vec<u8> {
    Sha256::digest(bytes).to_vec()
}

fn merkle_root(leaves: &[&[u8]]) -> Vec<u8> {
    // Hash all leaves.
    let mut layer: Vec<Vec<u8>> = leaves.iter().map(|l| h(l)).collect();

    // Combine pairs until only the root is left.
    while layer.len() > 1 {
        let mut next = Vec::with_capacity(layer.len() / 2 + 1);
        for pair in layer.chunks(2) {
            let mut combined = pair[0].clone();
            // If odd number of nodes, duplicate the last one (Bitcoin's rule).
            combined.extend_from_slice(pair.get(1).unwrap_or(&pair[0]));
            next.push(h(&combined));
        }
        layer = next;
    }
    layer.pop().unwrap()
}

// Change any leaf, even by one bit, and the root changes completely.
// That single 32-byte root is enough to prove the integrity of
// gigabytes of transactions underneath it.

C• • •
// A tiny Merkle root, in 20 lines.
// Each leaf is hashed. Pairs of hashes are concatenated and hashed.
// Repeat until one hash remains: the root.
use sha2::{Sha256, Digest};

fn h(bytes: &[u8]) -> Vec<u8> {
    Sha256::digest(bytes).to_vec()
}

fn merkle_root(leaves: &[&[u8]]) -> Vec<u8> {
    // Hash all leaves.
    let mut layer: Vec<Vec<u8>> = leaves.iter().map(|l| h(l)).collect();

    // Combine pairs until only the root is left.
    while layer.len() > 1 {
        let mut next = Vec::with_capacity(layer.len() / 2 + 1);
        for pair in layer.chunks(2) {
            let mut combined = pair[0].clone();
            // If odd number of nodes, duplicate the last one (Bitcoin's rule).
            combined.extend_from_slice(pair.get(1).unwrap_or(&pair[0]));
            next.push(h(&combined));
        }
        layer = next;
    }
    layer.pop().unwrap()
}

// Change any leaf, even by one bit, and the root changes completely.
// That single 32-byte root is enough to prove the integrity of
// gigabytes of transactions underneath it.

The chain in blockchain

A blockchain is a hash-linked list. Take the linked-list page, replace the next pointer with a cryptographic hash of the previous block's contents, and you have it.

Every block contains, among other things:

A prev-hash: the SHA-256 of the previous block's header.
A Merkle root: a fingerprint of every transaction in this block.
A nonce: a number whose only purpose is to make the block's own hash come out a certain way.

Tamper with any historical transaction → its leaf hash changes → the Merkle root of its block changes → that block's overall hash changes → the next block's prev-hash no longer matches → the chain breaks. And it breaks not just here but for every block after the tampered one. Re-faking that requires re-mining every subsequent block, faster than the rest of the network is mining new ones. That's the security argument, in one sentence.

// proof-of-work, in one paragraph

Bitcoin's mining is just: keep trying nonces until sha256(block_header) starts with enough zero bits. Because SHA-256 has the avalanche property, the output is effectively random, and the probability of a random 256-bit number starting with N zero bits is 1/2^N. So finding a valid block takes ~2^N hash attempts on average. That's the work in "proof of work". The hash function does all the heavy lifting; everything else is just a knob to dial the difficulty.

Where hashing actually lives in your stack

every dynamic language

dict / object / map

Python <code>dict</code>, JavaScript object property access, Ruby Hash, Lua tables, Go maps. Every <code>obj.foo</code> or <code>d[key]</code> is a hash + bucket lookup. Hash maps are the most-used data structure in software, by a wide margin.

content-addressed systems

Git, IPFS, Docker, Nix

Every Git commit, file, and tree is stored by its SHA-1 (now SHA-256) hash. Docker image layers, IPFS objects, Nix store paths: same pattern. The address <em>is</em> the contents' fingerprint, so deduplication, integrity, and synchronisation all fall out of the same primitive.

blockchains and crypto

Bitcoin, Ethereum, signatures

Block headers, Merkle roots, transaction IDs, addresses, proof-of-work, proof-of-stake: every layer of every blockchain is layered hash functions. Digital signatures hash the message before signing it; certificate transparency hashes certs into Merkle trees; ZK proofs hash everything in sight.

system internals

Bloom filters, dedup, caches

Bloom filters (probabilistic set membership) in databases and CDNs. Block-level dedup in filesystems like ZFS and Btrfs. Cache keys in browsers and CDNs. Password storage (bcrypt, scrypt, Argon2, all built on cryptographic hashes). Even <code>rsync</code> uses hashes to detect changed file ranges.

The connection back to the whole site

Step back and the picture is clean. A hash map is an array of linked lists, with a function for picking the right slot. Cryptographic hashes are the same function, slower and harder. A Merkle tree is that function applied recursively. A blockchain is a Merkle-rooted linked list with hashes for next-pointers, secured by a brute-force search the CPU page would recognise as embarrassingly parallel work.

Every layer is the same trick, applied with more or less paranoia, against different threat models. From the home-page topic cards' bottom row up: pointers became linked lists became hash maps; arrays plus linked lists plus hashes became the architecture of every database, every package manager, every cryptocurrency. The pieces don't change. They just stack.

Where to dig in next

Hashing is the topic with the longest tail of follow-ons. If any of these caught your eye:

SipHash, wyhash, BLAKE3. The cutting edge of fast, attack-resistant non-cryptographic and cryptographic hashes respectively.
Bloom filters and cuckoo filters. Probabilistic set membership in a fraction of the memory of a hash map. The basis of every "have I seen this before?" check at scale.
Consistent hashing. The trick that lets distributed systems (Cassandra, DynamoDB, memcached clusters) reshard when nodes come and go, without remapping every key.
Distributed hash tables. Kademlia, BitTorrent's mainline DHT, Chord. How peer-to-peer networks find each other without a central index.
Zero-knowledge proofs (SNARKs, STARKs). Cryptographic hashes are at the core of every modern zk system: Merkle trees of computation traces, hashed into proofs you can verify in milliseconds.
Robin Hood and Hopscotch hashing. Open-addressing variants that bound the worst-case probe distance, used in some of the fastest production hash maps.

next up / 0x0E

One word, three scales. What 'node' means in data structures, networking, and blockchain.

nodes →