A fingerprint for
anything.
A hash function turns arbitrary input (a string, a file, a struct, a gigabyte of transactions) into a small, fixed-size number. From that one idea you get hash maps (the workhorse of every dynamic language), content-addressed storage, password hashing, Merkle trees, and the chain in blockchain. Every layer is the same trick, just applied with more or less paranoia.
Hashing as a fingerprint function
Imagine a librarian with a peculiar talent. You hand them any book (a thin pamphlet, a thousand-page novel, a comic) and within a second they hand back a five-digit number. The same book always gets the same number. Different books almost always get different numbers. The number tells you nothing about the book itself, and you can't reconstruct the book from the number. That number is a hash, and the librarian is a hash function.
A hash function:
- Takes input of arbitrary size: bytes are bytes.
- Returns a fixed-size number: 32, 64, or 256 bits, depending on the function.
- Is deterministic: same input → same output, every time, on every machine.
- Spreads inputs uniformly: similar inputs get wildly different outputs.
- Is one-way: cheap to compute forward, useless to run in reverse.
A picture of a hash function
Notice the third row: "alicf" differs from "alice" by a single character, but the output is completely different. This is the avalanche property: flip one input bit and roughly half the output bits flip. It's what makes hashes useful for both lookup tables and tamper detection.
"alice" into a bucket index. From there it's the same O(1) indexed access an array always gave you.A hash map: hash → index → array slot
This is the whole magic, in one picture. To store the pair ("alice", 27):
- Compute
hash("alice"): say it returns the 64-bit number0x5d1a8b…. - Reduce it to a bucket index with
hash % capacity, which here gives 3. - Write
("alice", 27)intobuckets[3].
To look it up again, you do the same thing: hash "alice", get the same 3, read buckets[3]. No walking. No searching. The lookup time doesn't depend on how many keys are stored.
Build one in Rust and C
use std::collections::HashMap;
fn main() {
// HashMap<K, V> is Rust's hash map: an array of buckets,
// indexed by hash(key). Insert and lookup are amortised O(1).
let mut ages: HashMap<&str, u32> = HashMap::new();
ages.insert("alice", 27);
ages.insert("bob", 31);
ages.insert("carol", 42);
// Lookup: hash "alice", land in a bucket, return the value.
// No walking 3 entries comparing names. The hash tells us
// exactly which bucket to look in.
if let Some(age) = ages.get("alice") {
println!("alice is {age}");
}
// What the standard library does under the hood:
// 1. compute hasher.finish() for "alice" (a u64)
// 2. reduce it to a bucket index by & (capacity - 1)
// 3. jump straight to that bucket
// 4. compare keys (in case of collision) until match or empty
}#include <stdio.h>
#include <string.h>
#include <stdint.h>
// A toy fixed-size hash map. Real ones grow and handle collisions
// far more carefully; see the intermediate section.
#define BUCKETS 8
typedef struct {
const char *key;
int value;
int used;
} Entry;
// A small, fast (non-cryptographic) string hash: FNV-1a.
uint64_t fnv1a(const char *s) {
uint64_t h = 0xcbf29ce484222325ULL;
while (*s) {
h ^= (uint8_t)*s++;
h *= 0x100000001b3ULL;
}
return h;
}
int main(void) {
Entry table[BUCKETS] = {0};
// Insert: hash the key, mod by capacity, drop into that slot.
const char *k = "alice";
uint64_t h = fnv1a(k);
int idx = h & (BUCKETS - 1); // works because BUCKETS is a power of 2
table[idx].key = k;
table[idx].value = 27;
table[idx].used = 1;
// Lookup: same hash, same index, read the value.
int q = fnv1a("alice") & (BUCKETS - 1);
if (table[q].used && strcmp(table[q].key, "alice") == 0)
printf("alice -> %d\n", table[q].value);
return 0;
}dict, JavaScript objects and Map, Go map[K]V, Rust HashMap, Java HashMap, C# Dictionary, Lua tables, Redis itself: all hash maps under the hood. Every time you write obj.foo in JavaScript or d["key"] in Python, you're doing the diagram above. The "key" gets hashed, the hash points at a slot, and the value gets read in constant time.Collisions, chaining, and the load factor
The beginner picture has a hole in it. If buckets has 8 slots and you insert 100 keys, the pigeonhole principle says at least one bucket has to hold more than one key. Two distinct keys whose hashes land in the same bucket is called a collision, and every real hash map needs a strategy for handling them.
Strategy 1: Separate chaining
Each bucket isn't a single slot; it's a list of entries. To insert, you hash, find the bucket, walk the list, append (or update if the key already exists). To look up, you hash, find the bucket, walk the list comparing keys.
This is where the previous two pages pay off. A chained hash map is literally an array of linked lists. The arrays page gives you O(1) bucket lookup. The linked-list page gives you O(1) splicing of new entries into a bucket. Both data structures, both doing what they're best at, side by side.
buckets[3] is a singly linked list, the same shape as the diagram on the linked-list page. The good news: chains are usually tiny. With a sensible load factor (next section), the average chain length is well under 2. The bad news: each chain node is a separate heap allocation, with all the cache penalty the linked-list page warned about. That's why modern implementations prefer open addressing.// A hash map with separate chaining, built by hand.
// Buckets is a Vec; each bucket is a Vec of (key, value) entries.
// This is exactly Vec<LinkedList<(K, V)>> in shape: array + linked-list
// chains from the two previous pages, working together.
struct ChainMap<K, V> {
buckets: Vec<Vec<(K, V)>>,
}
impl<K: std::hash::Hash + Eq, V> ChainMap<K, V> {
fn new(cap: usize) -> Self {
Self { buckets: (0..cap).map(|_| Vec::new()).collect() }
}
fn bucket_for(&self, key: &K) -> usize {
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut h = DefaultHasher::new();
key.hash(&mut h);
(h.finish() as usize) % self.buckets.len()
}
fn insert(&mut self, key: K, value: V) {
let i = self.bucket_for(&key);
let bucket = &mut self.buckets[i];
// Update if key already present, otherwise push.
for (k, v) in bucket.iter_mut() {
if *k == key { *v = value; return; }
}
bucket.push((key, value));
}
fn get(&self, key: &K) -> Option<&V> {
let i = self.bucket_for(key);
self.buckets[i].iter().find(|(k, _)| k == key).map(|(_, v)| v)
}
}#include <stdlib.h>
#include <string.h>
#include <stdint.h>
// Same idea in C: array of bucket-heads, each one a singly linked list.
typedef struct Entry {
char *key;
int value;
struct Entry *next; // chain pointer, the linked-list page
} Entry;
#define BUCKETS 256
static Entry *table[BUCKETS] = {0};
uint64_t fnv1a(const char *s); // from the previous snippet
void put(const char *key, int value) {
uint64_t h = fnv1a(key) & (BUCKETS - 1);
// Walk the chain in case the key already exists.
for (Entry *e = table[h]; e; e = e->next) {
if (strcmp(e->key, key) == 0) { e->value = value; return; }
}
// New entry: prepend to the bucket's chain (O(1)).
Entry *e = malloc(sizeof *e);
e->key = strdup(key);
e->value = value;
e->next = table[h];
table[h] = e;
}Strategy 2: Open addressing
The alternative: no linked lists at all. The bucket array is the storage. When a bucket is taken, probe the next one, then the next, until you find an empty slot. To look up a key, hash to the start position and walk forward (using the same probe pattern) until you find the key, or hit an empty slot, or wrap around.
This is the strategy used by Rust's std::collections::HashMap, Google's Abseil (SwissTable), Python dict (since 3.6), and most other modern implementations. Why? Cache locality. Everything lives in one contiguous Vec. The CPU's cache prefetcher loves it. The linked-list version, with its scattered heap allocations, doesn't.
// Open addressing: NO linked lists. When a bucket is taken, probe the
// next slot. Modern hash maps (Swiss tables, Rust's std HashMap,
// Google's Abseil) use this. Cache-friendly because everything stays
// in one contiguous Vec, exactly what the memory page would approve of.
struct OpenMap<K, V> {
buckets: Vec<Option<(K, V)>>,
}
impl<K: std::hash::Hash + Eq + Clone, V: Clone> OpenMap<K, V> {
fn insert(&mut self, key: K, value: V) {
let cap = self.buckets.len();
let mut i = self.hash(&key) % cap;
loop {
match &self.buckets[i] {
None => { self.buckets[i] = Some((key, value)); return; }
Some((k, _)) if *k == key => {
self.buckets[i] = Some((key, value));
return;
}
_ => i = (i + 1) % cap, // linear probe to the next slot
}
}
}
fn hash(&self, key: &K) -> usize {
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut h = DefaultHasher::new();
key.hash(&mut h);
h.finish() as usize
}
}
// Why this beats chaining on real machines:
// - one contiguous array → every probe step hits prefetched cache lines
// - no per-entry allocation, no pointer chases
// - the linked-list page's cache penalty applies to chaining, not here// Open addressing: NO linked lists. When a bucket is taken, probe the
// next slot. Modern hash maps (Swiss tables, Rust's std HashMap,
// Google's Abseil) use this. Cache-friendly because everything stays
// in one contiguous Vec, exactly what the memory page would approve of.
struct OpenMap<K, V> {
buckets: Vec<Option<(K, V)>>,
}
impl<K: std::hash::Hash + Eq + Clone, V: Clone> OpenMap<K, V> {
fn insert(&mut self, key: K, value: V) {
let cap = self.buckets.len();
let mut i = self.hash(&key) % cap;
loop {
match &self.buckets[i] {
None => { self.buckets[i] = Some((key, value)); return; }
Some((k, _)) if *k == key => {
self.buckets[i] = Some((key, value));
return;
}
_ => i = (i + 1) % cap, // linear probe to the next slot
}
}
}
fn hash(&self, key: &K) -> usize {
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut h = DefaultHasher::new();
key.hash(&mut h);
h.finish() as usize
}
}
// Why this beats chaining on real machines:
// - one contiguous array → every probe step hits prefetched cache lines
// - no per-entry allocation, no pointer chases
// - the linked-list page's cache penalty applies to chaining, not hereChaining vs open addressing, in one table
| chaining | open addressing | |
|---|---|---|
| collision handling | linked list per bucket | probe the next slot |
| memory layout | array of buckets + heap-allocated chains | one contiguous array |
| cache friendliness | poor: pointer chases through the heap | excellent: every probe hits prefetched cache lines |
| delete | easy: unlink the node | trickier: tombstones, or shift entries back |
| used by | Java HashMap, classic textbooks | Rust, Python 3.6+, Swiss tables, Abseil |
The load factor and resizing
How full a hash map is matters enormously. The load factor is num_entries / num_buckets. As it rises, collisions get more frequent, and the "constant-time" lookup gets slower. As it falls, you waste memory.
The standard trick: once the load factor passes some threshold (typically 0.7 to 0.9), resize. Allocate a new bucket array, usually twice as big, and re-hash every entry into it. The resize is O(n), but it happens rarely enough that the average insert is still amortised O(1), exactly the same argument as Vec::push from the arrays page.
SipHash in Rust, Python, etc.). The hash is unpredictable across runs, so an attacker can't precompute a collision set.Cryptographic hashes, Merkle trees, and the blockchain
When 'good enough' isn't enough
The hash functions inside HashMap (FNV-1a, SipHash, fxhash, wyhash) are fast. A few nanoseconds per key. They have to be; you'd never use one of them if it cost a microsecond. They distribute keys well, they're hard for an attacker to flood, and that's where their job ends.
Cryptographic hashes (SHA-256, BLAKE3, Keccak) are a different beast. They're slower (10s of nanoseconds per byte, not per key) and they add three more properties:
- Pre-image resistance: given a hash
h, you cannot find an inputxsuch thathash(x) = h. (Better than that: you cannot find one in less than ~2256 attempts.) - Collision resistance: you cannot find two distinct inputs that produce the same hash, in any reasonable amount of compute.
- Avalanche: change one bit of input, roughly half the bits of the output flip, chaotically and unpredictably.
These three properties turn the hash into something stronger than an index. They turn it into an identity.
// Cryptographic hashes are like map hashes, but stronger.
//
// Properties HashMap doesn't need but blockchains do:
// 1. Pre-image resistance: given H(x), can't find x.
// 2. Collision resistance: can't find x, y with H(x) == H(y).
// 3. Avalanche: flip one bit of x → ~half the bits of H(x) flip.
//
// The price is speed: SHA-256 takes ~10 ns per byte; FNV-1a takes
// fractions of a nanosecond. You'd never use SHA-256 in a HashMap.
use sha2::{Sha256, Digest};
fn main() {
let block_data = b"prev_hash || merkle_root || nonce=42";
let mut hasher = Sha256::new();
hasher.update(block_data);
let digest = hasher.finalize();
// 32 bytes (256 bits) regardless of input length.
println!("{:x}", digest);
// Bitcoin block-mining loop, in 4 lines:
// for nonce in 0.. {
// let h = sha256(block_with(nonce));
// if h.starts_with_zero_bits(target) { break }
// }
}// Cryptographic hashes are like map hashes, but stronger.
//
// Properties HashMap doesn't need but blockchains do:
// 1. Pre-image resistance: given H(x), can't find x.
// 2. Collision resistance: can't find x, y with H(x) == H(y).
// 3. Avalanche: flip one bit of x → ~half the bits of H(x) flip.
//
// The price is speed: SHA-256 takes ~10 ns per byte; FNV-1a takes
// fractions of a nanosecond. You'd never use SHA-256 in a HashMap.
use sha2::{Sha256, Digest};
fn main() {
let block_data = b"prev_hash || merkle_root || nonce=42";
let mut hasher = Sha256::new();
hasher.update(block_data);
let digest = hasher.finalize();
// 32 bytes (256 bits) regardless of input length.
println!("{:x}", digest);
// Bitcoin block-mining loop, in 4 lines:
// for nonce in 0.. {
// let h = sha256(block_with(nonce));
// if h.starts_with_zero_bits(target) { break }
// }
}Merkle trees: hashing, but recursive
You have a million transactions. You want a single 32-byte fingerprint that proves none of them have been tampered with, and you want it to be efficient to verify any one transaction's membership in the set without re-hashing the whole million.
The Merkle tree solves this. Hash each transaction. Pair the leaf hashes and hash those pairs. Repeat until one hash is left: the Merkle root.
The root is a 32-byte fingerprint of the entire tree. Two superpowers fall out of this structure:
- If any leaf changes (flip a single bit on
tx2), every hash on the path from that leaf to the root changes. The root becomes unrecognisable. - To prove
tx2is part of the tree, you don't need the other million transactions. You needH(tx1), the sibling hash one level up, and the sibling hash above that. That'sO(log n)hashes: a few dozen, even for trees of billions of leaves. This is the Merkle proof, and it's how light wallets, Certificate Transparency, BitTorrent, and Git all prove inclusion cheaply.
// A tiny Merkle root, in 20 lines.
// Each leaf is hashed. Pairs of hashes are concatenated and hashed.
// Repeat until one hash remains: the root.
use sha2::{Sha256, Digest};
fn h(bytes: &[u8]) -> Vec<u8> {
Sha256::digest(bytes).to_vec()
}
fn merkle_root(leaves: &[&[u8]]) -> Vec<u8> {
// Hash all leaves.
let mut layer: Vec<Vec<u8>> = leaves.iter().map(|l| h(l)).collect();
// Combine pairs until only the root is left.
while layer.len() > 1 {
let mut next = Vec::with_capacity(layer.len() / 2 + 1);
for pair in layer.chunks(2) {
let mut combined = pair[0].clone();
// If odd number of nodes, duplicate the last one (Bitcoin's rule).
combined.extend_from_slice(pair.get(1).unwrap_or(&pair[0]));
next.push(h(&combined));
}
layer = next;
}
layer.pop().unwrap()
}
// Change any leaf, even by one bit, and the root changes completely.
// That single 32-byte root is enough to prove the integrity of
// gigabytes of transactions underneath it.// A tiny Merkle root, in 20 lines.
// Each leaf is hashed. Pairs of hashes are concatenated and hashed.
// Repeat until one hash remains: the root.
use sha2::{Sha256, Digest};
fn h(bytes: &[u8]) -> Vec<u8> {
Sha256::digest(bytes).to_vec()
}
fn merkle_root(leaves: &[&[u8]]) -> Vec<u8> {
// Hash all leaves.
let mut layer: Vec<Vec<u8>> = leaves.iter().map(|l| h(l)).collect();
// Combine pairs until only the root is left.
while layer.len() > 1 {
let mut next = Vec::with_capacity(layer.len() / 2 + 1);
for pair in layer.chunks(2) {
let mut combined = pair[0].clone();
// If odd number of nodes, duplicate the last one (Bitcoin's rule).
combined.extend_from_slice(pair.get(1).unwrap_or(&pair[0]));
next.push(h(&combined));
}
layer = next;
}
layer.pop().unwrap()
}
// Change any leaf, even by one bit, and the root changes completely.
// That single 32-byte root is enough to prove the integrity of
// gigabytes of transactions underneath it.The chain in blockchain
A blockchain is a hash-linked list. Take the linked-list page, replace the next pointer with a cryptographic hash of the previous block's contents, and you have it.
Every block contains, among other things:
- A prev-hash: the SHA-256 of the previous block's header.
- A Merkle root: a fingerprint of every transaction in this block.
- A nonce: a number whose only purpose is to make the block's own hash come out a certain way.
Tamper with any historical transaction → its leaf hash changes → the Merkle root of its block changes → that block's overall hash changes → the next block's prev-hash no longer matches → the chain breaks. And it breaks not just here but for every block after the tampered one. Re-faking that requires re-mining every subsequent block, faster than the rest of the network is mining new ones. That's the security argument, in one sentence.
sha256(block_header) starts with enough zero bits. Because SHA-256 has the avalanche property, the output is effectively random, and the probability of a random 256-bit number starting with N zero bits is 1/2N. So finding a valid block takes ~2N hash attempts on average. That's the work in "proof of work". The hash function does all the heavy lifting; everything else is just a knob to dial the difficulty.Where hashing actually lives in your stack
The connection back to the whole site
Step back and the picture is clean. A hash map is an array of linked lists, with a function for picking the right slot. Cryptographic hashes are the same function, slower and harder. A Merkle tree is that function applied recursively. A blockchain is a Merkle-rooted linked list with hashes for next-pointers, secured by a brute-force search the CPU page would recognise as embarrassingly parallel work.
Every layer is the same trick, applied with more or less paranoia, against different threat models. From the home-page topic cards' bottom row up: pointers became linked lists became hash maps; arrays plus linked lists plus hashes became the architecture of every database, every package manager, every cryptocurrency. The pieces don't change. They just stack.
Where to dig in next
Hashing is the topic with the longest tail of follow-ons. If any of these caught your eye:
- SipHash, wyhash, BLAKE3. The cutting edge of fast, attack-resistant non-cryptographic and cryptographic hashes respectively.
- Bloom filters and cuckoo filters. Probabilistic set membership in a fraction of the memory of a hash map. The basis of every "have I seen this before?" check at scale.
- Consistent hashing. The trick that lets distributed systems (Cassandra, DynamoDB, memcached clusters) reshard when nodes come and go, without remapping every key.
- Distributed hash tables. Kademlia, BitTorrent's mainline DHT, Chord. How peer-to-peer networks find each other without a central index.
- Zero-knowledge proofs (SNARKs, STARKs). Cryptographic hashes are at the core of every modern zk system: Merkle trees of computation traces, hashed into proofs you can verify in milliseconds.
- Robin Hood and Hopscotch hashing. Open-addressing variants that bound the worst-case probe distance, used in some of the fastest production hash maps.