root.system / 0x05 / cpu

Bits in a loop.
A machine that runs.

You have bits, you have encodings, you have gates. Wire them into a circuit that fetches a bit pattern from memory, decides what it means, and acts on it. Then put that whole thing on a clock and let it loop. That's a CPU. This page builds one, conceptually, from the parts you already have.

Beginner// level 01

Fetch, decode, execute

Your CPU has one job. One. It reads an instruction, it executes it, it moves to the next one. That is the entire CPU.

But here is what nobody tells you about that one job. That instruction is not English. It is not Python. It is not even assembly. It is this: 10110000 01100001. Raw binary. Two bytes. Sixteen switches, on and off.

Your CPU reads those 16 bits, decodes them through logic gates, and in a single clock tick executes the operation they describe. Do you see what is happening? Everything connects. Transistors built the logic gates. Logic gates built the circuits. Circuits decode the binary. Binary carries the instruction. The instruction changes the state.

All of that, in one clock cycle, at four billion cycles per second. And it has been doing this since the moment you powered on. Without stopping. Without resting. Without ever asking what it all means.

A CPU has exactly one job: it sits in a loop. Each tick of its clock, it does three things:

  1. Fetch the next instruction (a bit pattern) from memory.
  2. Decode it: figure out what those bits mean.
  3. Execute the operation, possibly updating memory or a register.

That's the whole machine. Every program (your browser, this page, the OS scheduler) is a sequence of those instructions, fetched one by one, billions of times a second.

What's an instruction?

An instruction is just a byte (or a few bytes) where different bits mean different things. Same logic as the ASCII table on page 2, except instead of "this byte is the letter A," the convention is "this byte is the operation ADD, on registers 0 and 1."

A real x86 instruction can be 1 to 15 bytes; RISC-V is a clean 4 bytes; ARM Thumb is 2. They all share the same shape: opcode + operands, packed into bits.

Step through a program

// step through a program
program counter: 0x00
memory
0x0000100101LOAD r0, 5
0x0100101111LOAD r1, 7
0x0201100001ADD r0, r1
0x0310100000PRT r0
0x0400000000HALT
cpu
fetchPC = 0x00reading from memory…
decodeopcode: 001 = LOADdst: 00 = r0 · imm: 101 = 5
executer0 ← 5r0 = 00000000
registers
r000000000(0)
r100000000(0)
r200000000(0)
r300000000(0)
event log
press step to begin.

the exact program from the toy-CPU code below: LOAD r0 5, LOAD r1 7, ADD, print, halt. step through it one instruction at a time and watch fetch, decode, and execute light up in turn.

Decoding an instruction is just bit shifts and masks, the same bitwise operators from the binary page. (ins >> 5) & 0b111 extracts the opcode; (ins >> 3) & 0b11 extracts the register. The CPU reads instructions exactly the way you learned to read bits. ← see: binary

bitsfieldmeaning
7..5opcode011 = ADD, 001 = LOAD, …
4..3dest regwhich register receives the result
2..0src reg / immediatethe other operand
// remember from page 1?
Decoding an instruction is just bit shifts and masks: exactly the bitwise operators from the binary page. The CPU pulls fields out of a byte the same way you tested individual bits in (x >> 3) & 1.

A 6-instruction CPU you can read in 5 minutes

Here's a complete software simulation of a tiny CPU. It has 4 registers, 6 opcodes, and a single byte per instruction. The whole loop fits on one screen, and every concept is real. Every line maps to something a real chip does.

Rust• • •
// A toy CPU with 4 registers and 6 ops.
// Each instruction is a single byte:
//   bits 7..5 = opcode (3 bits)
//   bits 4..3 = destination register (2 bits)
//   bits 2..0 = source register OR small immediate

const HALT: u8 = 0b000;
const LOAD: u8 = 0b001; // dst = imm
const MOV : u8 = 0b010; // dst = src
const ADD : u8 = 0b011; // dst = dst + src
const SUB : u8 = 0b100; // dst = dst - src
const PRT : u8 = 0b101; // print dst

fn run(program: &[u8]) {
    let mut reg = [0u8; 4];
    let mut pc  = 0usize;

    loop {
        // FETCH
        let ins = program[pc];
        pc += 1;

        // DECODE: pure bit-shifts, the same trick from page 1.
        let op  = (ins >> 5) & 0b111;
        let dst = ((ins >> 3) & 0b11) as usize;
        let src = (ins & 0b111) as usize;

        // EXECUTE
        match op {
            x if x == HALT => break,
            x if x == LOAD => reg[dst] = src as u8,
            x if x == MOV  => reg[dst] = reg[src & 0b11],
            x if x == ADD  => reg[dst] = reg[dst].wrapping_add(reg[src & 0b11]),
            x if x == SUB  => reg[dst] = reg[dst].wrapping_sub(reg[src & 0b11]),
            x if x == PRT  => println!("r{} = {}", dst, reg[dst]),
            _ => panic!("bad opcode"),
        }
    }
}

fn main() {
    // r0 = 5; r1 = 7; r0 = r0 + r1; print r0; halt
    let program = [
        0b001_00_101, // LOAD r0, 5
        0b001_01_111, // LOAD r1, 7
        0b011_00_001, // ADD  r0, r1
        0b101_00_000, // PRT  r0   → 12
        0b000_00_000, // HALT
    ];
    run(&program);
}
C• • •
// Same toy CPU in C.
#include <stdio.h>
#include <stdint.h>

#define HALT 0
#define LOAD 1
#define MOV  2
#define ADD  3
#define SUB  4
#define PRT  5

void run(const uint8_t *program) {
    uint8_t reg[4] = {0};
    size_t pc = 0;

    for (;;) {
        // FETCH
        uint8_t ins = program[pc++];

        // DECODE
        uint8_t op  = (ins >> 5) & 0x7;
        uint8_t dst = (ins >> 3) & 0x3;
        uint8_t src = ins & 0x7;

        // EXECUTE
        switch (op) {
            case HALT: return;
            case LOAD: reg[dst] = src; break;
            case MOV : reg[dst] = reg[src & 0x3]; break;
            case ADD : reg[dst] += reg[src & 0x3]; break;
            case SUB : reg[dst] -= reg[src & 0x3]; break;
            case PRT : printf("r%u = %u\n", dst, reg[dst]); break;
        }
    }
}

int main(void) {
    uint8_t program[] = {
        0b00100101, // LOAD r0, 5
        0b00101111, // LOAD r1, 7
        0b01100001, // ADD  r0, r1
        0b10100000, // PRT  r0   → 12
        0b00000000, // HALT
    };
    run(program);
    return 0;
}
// takeaway
The fetch-decode-execute loop is the entire machine. Every other CPU feature (caches, pipelines, predictors) is just an optimization on top of those three steps.
Intermediate// level 02

Registers, ALU, control unit: the parts of a CPU

Open up the toy simulator above and ask: where does that loop live in silicon? Each piece maps to a physical block of gates from the previous page.

block 01
Registers
A bank of flip-flops, exactly the SR-latch + clock circuits from the logic-gates page. 32 flip-flops side by side = one 32-bit register.
block 02
ALU
The arithmetic-logic unit. The full adder you built from XOR + AND + OR is right here, repeated 32 times for a 32-bit add.
block 03
Control unit
Reads the opcode field and decides which wires to enable. Pure combinational logic: a tree of NAND gates routing bits to the right block.
block 04
Program counter
A register that holds the address of the next instruction. After each fetch, an adder bumps it by the instruction size.

The ALU is the full adder you built on the logic gates page, repeated 64 times side by side for a 64-bit processor. Every ADD instruction your program executes is those XOR and AND gates firing in sequence. ← see: logic gates

The clock: what makes it all step forward

A CPU is a sea of combinational gates plus banks of flip-flops. Combinational gates have no memory; their output is a pure function of input. So nothing changes on its own. What drives the loop forward is the clock: a square-wave signal toggling at a fixed rate (3 GHz = 3 billion times a second). On every rising edge, the flip-flops latch their inputs and the next state begins.

"Faster CPU" mostly means "shorter clock period," which only works if every signal can propagate through every gate before the next tick. Get that wrong and you read garbage out of the latches.

Memory: where the program actually lives

Registers are tiny and fast (a few hundred bytes, accessed in 1 cycle). Real programs need more, gigabytes more. So the CPU is wired to a separate chip called RAM, addressed by a number. Each instruction fetch is really:

PCaddress busRAMdata businstructionregisterdecode

Compared to a register access, RAM is slow. Hundreds of cycles. The whole subject of computer architecture is, mostly, about hiding that latency.

Every value the CPU reads from memory is binary bytes at an address. That address is a hex number, 0x7fff5fbff8a4, a pointer: a number that means somewhere. The CPU follows millions of these every second. ← see: pointers

// the loop, restated
PC tells RAM which bits to send. The control unit decodes them. The ALU crunches numbers. The registers remember results. The clock keeps everything in lockstep. That's it. That's a CPU.
Advanced// level 03

Pipelining, caches & branch prediction

Why a 3 GHz CPU runs more than 3 billion ops/s

Naively, fetch-decode-execute takes at least three clock cycles per instruction. So a 3 GHz CPU should top out at 1 billion instructions per second. In practice, modern CPUs sustain several instructions per cycle. How?

Pipelining: an assembly line for instructions

The trick: while instruction N is executing, instruction N+1 can be decoding, and N+2 can be fetching. Each pipeline stage runs in parallel, like a factory assembly line. A 5-stage pipeline finishes (ideally) one instruction per cycle.

Real CPUs go further. Modern x86 cores have 14 to 20 pipeline stages and issue 4+ instructions per cycle (superscalar, with multiple ALUs).

Caches: bridging the RAM gap

RAM is far away (electrically) and slow (hundreds of cycles). To hide that, the CPU keeps recently-used bytes in small, fast SRAM banks right next to the cores: caches. Most programs touch the same bytes repeatedly, so this works extraordinarily well.

levelsizelatencywhere
L132-64 KB~4 cyclesper-core, split into instruction and data
L2256 KB to 1 MB~12 cyclesper-core
L38-64 MB~40 cyclesshared between cores
RAMGBs~200+ cyclesoff-chip

This is why cache-friendly code matters. A linear walk over an array hits L1 on every access. A pointer-chase through a linked list, where every node sits in a different cache line, pays the full RAM tax. Same algorithmic complexity, 50× difference in wall time.

Branch prediction

Pipelining fights one big enemy: the branch. When the CPU hits an if, it doesn't yet know which way to go, but the pipeline needs to keep fetching something. So the CPU guesses, using a tiny on-chip lookup table that tracks the recent history of every branch. Right guess: full speed. Wrong guess: flush the pipeline and start over (10 to 20 cycle penalty).

Modern predictors hit 95%+. They're so good that data layout can change performance dramatically. Sort an array first, and the same hot loop runs several times faster, because the same branch now goes the same way for many iterations in a row.

Rust• • •
// The classic branch-prediction demo.
// Sorting the array makes the branch predictable, so the same
// loop runs ~3-6× faster on real hardware.
use std::time::Instant;

fn main() {
    let mut data: Vec<i32> = (0..32_768).map(|i| (i * 1103515245 + 12345) % 256).collect();

    // Run once UNSORTED, once SORTED.
    for label in ["unsorted", "sorted"] {
        if label == "sorted" { data.sort(); }

        let t = Instant::now();
        let mut sum: u64 = 0;
        for _ in 0..1_000 {
            for &x in &data {
                if x >= 128 {        // ← the unpredictable branch
                    sum += x as u64;
                }
            }
        }
        println!("{label:>9}: sum={sum}  in {:.2?}", t.elapsed());
    }
}
C• • •
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const int*)a - *(const int*)b;
}

int main(void) {
    enum { N = 32768 };
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = (int)((i * 1103515245u + 12345u) % 256u);

    for (int phase = 0; phase < 2; phase++) {
        if (phase == 1) qsort(data, N, sizeof *data, cmp);

        clock_t t = clock();
        unsigned long long sum = 0;
        for (int k = 0; k < 1000; k++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)        // ← the unpredictable branch
                    sum += (unsigned)data[i];

        double secs = (double)(clock() - t) / CLOCKS_PER_SEC;
        printf("%s: sum=%llu  in %.2fs\n",
               phase ? "  sorted" : "unsorted", sum, secs);
    }
    free(data);
    return 0;
}
// the famous side channel
Branch prediction speculates ahead, and the speculative path leaves traces in the cache even when it's discarded. That's Spectre: a 2018 vulnerability that turned a performance optimization into a way to read protected memory. Every CPU shipped before mid-2018 was affected. Mitigations cost real performance.

The CPU inside Bitcoin miners

Every Bitcoin miner on Earth is a CPU running one function, over and over, as fast as possible, forever. That function is SHA-256. And SHA-256 is pure CPU work. Here is what a miner is actually computing, one round of the 64 that make up a single hash:

Rust• • •
// One SHA-256 round: pure CPU operations
// on registers a through h.
fn sha256_round(
    a: u32, b: u32, c: u32, d: u32,
    e: u32, f: u32, g: u32, h: u32,
    k: u32, w: u32,
) -> (u32, u32, u32, u32, u32, u32, u32, u32) {
    // Every line below is an ALU operation.
    let s1 = e.rotate_right(6)
           ^ e.rotate_right(11)
           ^ e.rotate_right(25);

    let ch = (e & f) ^ (!e & g); // AND, XOR, NOT

    let temp1 = h
        .wrapping_add(s1)
        .wrapping_add(ch)
        .wrapping_add(k)
        .wrapping_add(w);

    let s0 = a.rotate_right(2)
           ^ a.rotate_right(13)
           ^ a.rotate_right(22);

    let maj = (a & b) ^ (a & c) ^ (b & c);

    let temp2 = s0.wrapping_add(maj);

    (
        temp1.wrapping_add(temp2), // new a
        a, b, c,
        d.wrapping_add(temp1),     // new e
        e, f, g,
    )
}
C• • •
#include <stdint.h>

static inline uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

void sha256_round(
    uint32_t *a, uint32_t *b,
    uint32_t *c, uint32_t *d,
    uint32_t *e, uint32_t *f,
    uint32_t *g, uint32_t *h,
    uint32_t k,  uint32_t w
) {
    uint32_t s1  = rotr(*e,6) ^ rotr(*e,11) ^ rotr(*e,25);
    uint32_t ch  = (*e & *f) ^ (~*e & *g);
    uint32_t t1  = *h + s1 + ch + k + w;
    uint32_t s0  = rotr(*a,2) ^ rotr(*a,13) ^ rotr(*a,22);
    uint32_t maj = (*a & *b) ^ (*a & *c) ^ (*b & *c);
    uint32_t t2  = s0 + maj;

    *h = *g; *g = *f; *f = *e;
    *e = *d + t1;
    *d = *c; *c = *b; *b = *a;
    *a = t1 + t2;
}

Every operation in those functions is a single CPU instruction:

  • rotate_right / rotr: a barrel shifter in the ALU.
  • &, ^, !: AND, XOR, NOT gates firing.
  • wrapping_add: the full adder circuit from the logic gates page.

SHA-256 runs 64 of these rounds per hash attempt. Bitcoin miners attempt trillions of hashes. Each attempt is 64 rounds; each round is dozens of ALU operations; each ALU operation is logic gates; each gate is transistors. A modern Bitcoin ASIC runs tens of trillions of SHA-256 hashes per second.

All of it is the fetch-decode-execute loop you learned on this page, running as fast as physics allows. This is what secures hundreds of billions of dollars. Not cryptographic magic. Not financial engineering. Just a CPU, doing one job, very very fast.

SHA-256 is a CPU workload: 64 rounds of rotations, AND, XOR, NOT, and additions per hash, run trillions of times per second by mining hardware to find a hash with enough leading zeros. ← see: hashing

The full stack, again

// from electron to executable, with the loop on top
  1. Electrons gated by transistors form logic gates.
  2. Gates form adders, latches, multiplexers.
  3. Those compose into registers, ALUs, control units.
  4. A clock drives them through fetch-decode-execute.
  5. Bits in memory encode instructions (this page) and data.
  6. Data bits encode numbers (page 1) and characters (page 2).
  7. Sequences of instructions become a program.
  8. Run a few billion of them per second and you get the experience of using a computer.

Where to dig in next

You've seen the whole stack now. Natural deep-dives:

  • RISC-V: a clean, open ISA you can read in an afternoon.
  • nand2tetris: build a CPU from NAND gates and run software on it.
  • Agner Fog's microarchitecture manuals, the canonical reference on how real x86 cores actually work.
  • Compiler Explorer (godbolt.org): see what bytes your high-level code actually becomes.

Where the CPU appears in BitRoot

The CPU is the engine the whole site runs on. Every other topic either feeds it instructions or runs on top of its loop:

0x02 / binary
Instructions are binary
Every CPU instruction is binary: opcode, registers, and operands packed into bits. A running program is a stream of binary flowing through the fetch-decode-execute loop.
0x01 / number systems
Hex for humans, binary for silicon
Instruction addresses are hex, register values are binary, immediates are decimal in source. The CPU sees only binary; everything else is human convenience.
0x03 / ascii
Letters are just integers
When you type a character the CPU processes its ASCII code as a binary integer. 72 for H, 65 for A. The CPU has no idea it is working with letters.
0x04 / logic gates
The CPU is gates
The ALU is logic gates, the control unit is logic gates, the registers are flip-flops (gates with memory). The entire CPU is gates organised to compute.
0x06 / memory
The CPU waits on RAM
Every instruction fetch and data access is a memory operation. The CPU spends most of its time waiting for memory; caches exist because RAM is too slow to wait on.
0x07 / operating system
The OS shares the CPU
The OS schedules which program gets CPU time. Your process runs for a slice, gets paused, another runs, then yours again. The CPU just keeps looping.
0x09 / pointers
The PC is a pointer
The program counter points at the next instruction. Every memory access the CPU makes follows a binary address (a pointer) to find its data.
0x0D / hashing
SHA-256 is a CPU workload
Every hash is dozens of ALU operations. Bitcoin miners run SHA-256 trillions of times per second. The CPU is the engine of every blockchain in existence.
0x0F / networking
Your NIC has its own CPU
Your network card runs its own fetch-decode-execute loop while the main CPU handles everything else. Parallel computation, two CPUs sharing one machine.
0x12 / blockchain
Mining is the loop at its limit
Bitcoin mining is the fetch-decode-execute loop pushed to its physical limit. An ASIC runs SHA-256 trillions of times per second to find one hash with enough leading zeros. The CPU is why mining costs electricity.
next up / 0x06
Where state lives between cycles: memory
memory