Bits in a loop.
A machine that runs.
You have bits, you have encodings, you have gates. Wire them into a circuit that fetches a bit pattern from memory, decides what it means, and acts on it. Then put that whole thing on a clock and let it loop. That's a CPU. This page builds one, conceptually, from the parts you already have.
Fetch, decode, execute
Your CPU has one job. One. It reads an instruction, it executes it, it moves to the next one. That is the entire CPU.
But here is what nobody tells you about that one job. That instruction is not English. It is not Python. It is not even assembly. It is this: 10110000 01100001. Raw binary. Two bytes. Sixteen switches, on and off.
Your CPU reads those 16 bits, decodes them through logic gates, and in a single clock tick executes the operation they describe. Do you see what is happening? Everything connects. Transistors built the logic gates. Logic gates built the circuits. Circuits decode the binary. Binary carries the instruction. The instruction changes the state.
All of that, in one clock cycle, at four billion cycles per second. And it has been doing this since the moment you powered on. Without stopping. Without resting. Without ever asking what it all means.
A CPU has exactly one job: it sits in a loop. Each tick of its clock, it does three things:
- Fetch the next instruction (a bit pattern) from memory.
- Decode it: figure out what those bits mean.
- Execute the operation, possibly updating memory or a register.
That's the whole machine. Every program (your browser, this page, the OS scheduler) is a sequence of those instructions, fetched one by one, billions of times a second.
What's an instruction?
An instruction is just a byte (or a few bytes) where different bits mean different things. Same logic as the ASCII table on page 2, except instead of "this byte is the letter A," the convention is "this byte is the operation ADD, on registers 0 and 1."
A real x86 instruction can be 1 to 15 bytes; RISC-V is a clean 4 bytes; ARM Thumb is 2. They all share the same shape: opcode + operands, packed into bits.
Step through a program
Decoding an instruction is just bit shifts and masks, the same bitwise operators from the binary page. (ins >> 5) & 0b111 extracts the opcode; (ins >> 3) & 0b11 extracts the register. The CPU reads instructions exactly the way you learned to read bits. ← see: binary
| bits | field | meaning |
|---|---|---|
| 7..5 | opcode | 011 = ADD, 001 = LOAD, … |
| 4..3 | dest reg | which register receives the result |
| 2..0 | src reg / immediate | the other operand |
(x >> 3) & 1.A 6-instruction CPU you can read in 5 minutes
Here's a complete software simulation of a tiny CPU. It has 4 registers, 6 opcodes, and a single byte per instruction. The whole loop fits on one screen, and every concept is real. Every line maps to something a real chip does.
// A toy CPU with 4 registers and 6 ops.
// Each instruction is a single byte:
// bits 7..5 = opcode (3 bits)
// bits 4..3 = destination register (2 bits)
// bits 2..0 = source register OR small immediate
const HALT: u8 = 0b000;
const LOAD: u8 = 0b001; // dst = imm
const MOV : u8 = 0b010; // dst = src
const ADD : u8 = 0b011; // dst = dst + src
const SUB : u8 = 0b100; // dst = dst - src
const PRT : u8 = 0b101; // print dst
fn run(program: &[u8]) {
let mut reg = [0u8; 4];
let mut pc = 0usize;
loop {
// FETCH
let ins = program[pc];
pc += 1;
// DECODE: pure bit-shifts, the same trick from page 1.
let op = (ins >> 5) & 0b111;
let dst = ((ins >> 3) & 0b11) as usize;
let src = (ins & 0b111) as usize;
// EXECUTE
match op {
x if x == HALT => break,
x if x == LOAD => reg[dst] = src as u8,
x if x == MOV => reg[dst] = reg[src & 0b11],
x if x == ADD => reg[dst] = reg[dst].wrapping_add(reg[src & 0b11]),
x if x == SUB => reg[dst] = reg[dst].wrapping_sub(reg[src & 0b11]),
x if x == PRT => println!("r{} = {}", dst, reg[dst]),
_ => panic!("bad opcode"),
}
}
}
fn main() {
// r0 = 5; r1 = 7; r0 = r0 + r1; print r0; halt
let program = [
0b001_00_101, // LOAD r0, 5
0b001_01_111, // LOAD r1, 7
0b011_00_001, // ADD r0, r1
0b101_00_000, // PRT r0 → 12
0b000_00_000, // HALT
];
run(&program);
}// Same toy CPU in C.
#include <stdio.h>
#include <stdint.h>
#define HALT 0
#define LOAD 1
#define MOV 2
#define ADD 3
#define SUB 4
#define PRT 5
void run(const uint8_t *program) {
uint8_t reg[4] = {0};
size_t pc = 0;
for (;;) {
// FETCH
uint8_t ins = program[pc++];
// DECODE
uint8_t op = (ins >> 5) & 0x7;
uint8_t dst = (ins >> 3) & 0x3;
uint8_t src = ins & 0x7;
// EXECUTE
switch (op) {
case HALT: return;
case LOAD: reg[dst] = src; break;
case MOV : reg[dst] = reg[src & 0x3]; break;
case ADD : reg[dst] += reg[src & 0x3]; break;
case SUB : reg[dst] -= reg[src & 0x3]; break;
case PRT : printf("r%u = %u\n", dst, reg[dst]); break;
}
}
}
int main(void) {
uint8_t program[] = {
0b00100101, // LOAD r0, 5
0b00101111, // LOAD r1, 7
0b01100001, // ADD r0, r1
0b10100000, // PRT r0 → 12
0b00000000, // HALT
};
run(program);
return 0;
}Registers, ALU, control unit: the parts of a CPU
Open up the toy simulator above and ask: where does that loop live in silicon? Each piece maps to a physical block of gates from the previous page.
The ALU is the full adder you built on the logic gates page, repeated 64 times side by side for a 64-bit processor. Every ADD instruction your program executes is those XOR and AND gates firing in sequence. ← see: logic gates
The clock: what makes it all step forward
A CPU is a sea of combinational gates plus banks of flip-flops. Combinational gates have no memory; their output is a pure function of input. So nothing changes on its own. What drives the loop forward is the clock: a square-wave signal toggling at a fixed rate (3 GHz = 3 billion times a second). On every rising edge, the flip-flops latch their inputs and the next state begins.
"Faster CPU" mostly means "shorter clock period," which only works if every signal can propagate through every gate before the next tick. Get that wrong and you read garbage out of the latches.
Memory: where the program actually lives
Registers are tiny and fast (a few hundred bytes, accessed in 1 cycle). Real programs need more, gigabytes more. So the CPU is wired to a separate chip called RAM, addressed by a number. Each instruction fetch is really:
Compared to a register access, RAM is slow. Hundreds of cycles. The whole subject of computer architecture is, mostly, about hiding that latency.
Every value the CPU reads from memory is binary bytes at an address. That address is a hex number, 0x7fff5fbff8a4, a pointer: a number that means somewhere. The CPU follows millions of these every second. ← see: pointers
Pipelining, caches & branch prediction
Why a 3 GHz CPU runs more than 3 billion ops/s
Naively, fetch-decode-execute takes at least three clock cycles per instruction. So a 3 GHz CPU should top out at 1 billion instructions per second. In practice, modern CPUs sustain several instructions per cycle. How?
Pipelining: an assembly line for instructions
The trick: while instruction N is executing, instruction N+1 can be decoding, and N+2 can be fetching. Each pipeline stage runs in parallel, like a factory assembly line. A 5-stage pipeline finishes (ideally) one instruction per cycle.
Real CPUs go further. Modern x86 cores have 14 to 20 pipeline stages and issue 4+ instructions per cycle (superscalar, with multiple ALUs).
Caches: bridging the RAM gap
RAM is far away (electrically) and slow (hundreds of cycles). To hide that, the CPU keeps recently-used bytes in small, fast SRAM banks right next to the cores: caches. Most programs touch the same bytes repeatedly, so this works extraordinarily well.
| level | size | latency | where |
|---|---|---|---|
| L1 | 32-64 KB | ~4 cycles | per-core, split into instruction and data |
| L2 | 256 KB to 1 MB | ~12 cycles | per-core |
| L3 | 8-64 MB | ~40 cycles | shared between cores |
| RAM | GBs | ~200+ cycles | off-chip |
This is why cache-friendly code matters. A linear walk over an array hits L1 on every access. A pointer-chase through a linked list, where every node sits in a different cache line, pays the full RAM tax. Same algorithmic complexity, 50× difference in wall time.
Branch prediction
Pipelining fights one big enemy: the branch. When the CPU hits an if, it doesn't yet know which way to go, but the pipeline needs to keep fetching something. So the CPU guesses, using a tiny on-chip lookup table that tracks the recent history of every branch. Right guess: full speed. Wrong guess: flush the pipeline and start over (10 to 20 cycle penalty).
Modern predictors hit 95%+. They're so good that data layout can change performance dramatically. Sort an array first, and the same hot loop runs several times faster, because the same branch now goes the same way for many iterations in a row.
// The classic branch-prediction demo.
// Sorting the array makes the branch predictable, so the same
// loop runs ~3-6× faster on real hardware.
use std::time::Instant;
fn main() {
let mut data: Vec<i32> = (0..32_768).map(|i| (i * 1103515245 + 12345) % 256).collect();
// Run once UNSORTED, once SORTED.
for label in ["unsorted", "sorted"] {
if label == "sorted" { data.sort(); }
let t = Instant::now();
let mut sum: u64 = 0;
for _ in 0..1_000 {
for &x in &data {
if x >= 128 { // ← the unpredictable branch
sum += x as u64;
}
}
}
println!("{label:>9}: sum={sum} in {:.2?}", t.elapsed());
}
}#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static int cmp(const void *a, const void *b) {
return *(const int*)a - *(const int*)b;
}
int main(void) {
enum { N = 32768 };
int *data = malloc(N * sizeof *data);
for (int i = 0; i < N; i++)
data[i] = (int)((i * 1103515245u + 12345u) % 256u);
for (int phase = 0; phase < 2; phase++) {
if (phase == 1) qsort(data, N, sizeof *data, cmp);
clock_t t = clock();
unsigned long long sum = 0;
for (int k = 0; k < 1000; k++)
for (int i = 0; i < N; i++)
if (data[i] >= 128) // ← the unpredictable branch
sum += (unsigned)data[i];
double secs = (double)(clock() - t) / CLOCKS_PER_SEC;
printf("%s: sum=%llu in %.2fs\n",
phase ? " sorted" : "unsorted", sum, secs);
}
free(data);
return 0;
}The CPU inside Bitcoin miners
Every Bitcoin miner on Earth is a CPU running one function, over and over, as fast as possible, forever. That function is SHA-256. And SHA-256 is pure CPU work. Here is what a miner is actually computing, one round of the 64 that make up a single hash:
// One SHA-256 round: pure CPU operations
// on registers a through h.
fn sha256_round(
a: u32, b: u32, c: u32, d: u32,
e: u32, f: u32, g: u32, h: u32,
k: u32, w: u32,
) -> (u32, u32, u32, u32, u32, u32, u32, u32) {
// Every line below is an ALU operation.
let s1 = e.rotate_right(6)
^ e.rotate_right(11)
^ e.rotate_right(25);
let ch = (e & f) ^ (!e & g); // AND, XOR, NOT
let temp1 = h
.wrapping_add(s1)
.wrapping_add(ch)
.wrapping_add(k)
.wrapping_add(w);
let s0 = a.rotate_right(2)
^ a.rotate_right(13)
^ a.rotate_right(22);
let maj = (a & b) ^ (a & c) ^ (b & c);
let temp2 = s0.wrapping_add(maj);
(
temp1.wrapping_add(temp2), // new a
a, b, c,
d.wrapping_add(temp1), // new e
e, f, g,
)
}#include <stdint.h>
static inline uint32_t rotr(uint32_t x, int n) {
return (x >> n) | (x << (32 - n));
}
void sha256_round(
uint32_t *a, uint32_t *b,
uint32_t *c, uint32_t *d,
uint32_t *e, uint32_t *f,
uint32_t *g, uint32_t *h,
uint32_t k, uint32_t w
) {
uint32_t s1 = rotr(*e,6) ^ rotr(*e,11) ^ rotr(*e,25);
uint32_t ch = (*e & *f) ^ (~*e & *g);
uint32_t t1 = *h + s1 + ch + k + w;
uint32_t s0 = rotr(*a,2) ^ rotr(*a,13) ^ rotr(*a,22);
uint32_t maj = (*a & *b) ^ (*a & *c) ^ (*b & *c);
uint32_t t2 = s0 + maj;
*h = *g; *g = *f; *f = *e;
*e = *d + t1;
*d = *c; *c = *b; *b = *a;
*a = t1 + t2;
}Every operation in those functions is a single CPU instruction:
rotate_right/rotr: a barrel shifter in the ALU.&,^,!: AND, XOR, NOT gates firing.wrapping_add: the full adder circuit from the logic gates page.
SHA-256 runs 64 of these rounds per hash attempt. Bitcoin miners attempt trillions of hashes. Each attempt is 64 rounds; each round is dozens of ALU operations; each ALU operation is logic gates; each gate is transistors. A modern Bitcoin ASIC runs tens of trillions of SHA-256 hashes per second.
All of it is the fetch-decode-execute loop you learned on this page, running as fast as physics allows. This is what secures hundreds of billions of dollars. Not cryptographic magic. Not financial engineering. Just a CPU, doing one job, very very fast.
SHA-256 is a CPU workload: 64 rounds of rotations, AND, XOR, NOT, and additions per hash, run trillions of times per second by mining hardware to find a hash with enough leading zeros. ← see: hashing
The full stack, again
- Electrons gated by transistors form logic gates.
- Gates form adders, latches, multiplexers.
- Those compose into registers, ALUs, control units.
- A clock drives them through fetch-decode-execute.
- Bits in memory encode instructions (this page) and data.
- Data bits encode numbers (page 1) and characters (page 2).
- Sequences of instructions become a program.
- Run a few billion of them per second and you get the experience of using a computer.
Where to dig in next
You've seen the whole stack now. Natural deep-dives:
- RISC-V: a clean, open ISA you can read in an afternoon.
- nand2tetris: build a CPU from NAND gates and run software on it.
- Agner Fog's microarchitecture manuals, the canonical reference on how real x86 cores actually work.
- Compiler Explorer (godbolt.org): see what bytes your high-level code actually becomes.
Where the CPU appears in BitRoot
The CPU is the engine the whole site runs on. Every other topic either feeds it instructions or runs on top of its loop: