root.system / 0x03 / encoding

Numbers become language.

You learned that everything is bits. But how does 01000001 become the letter A? Through a convention: a shared agreement that says "this number means that letter." That agreement is called an encoding, and the most famous one is ASCII.

Beginner // level 01

What is ASCII?

In the early 1960s engineers had a problem. IBM's computers used one code for the letter A. Honeywell used a different one. They literally could not talk to each other.

So a committee sat down and built a universal dictionary. 128 characters. One number each. Agreed on by everyone. Forever.

They called it the American Standard Code for Information Interchange. ASCII.

Your computer has never read a single letter in its entire life. It only ever reads numbers. ASCII is how numbers pretend to be language.

ASCII is a lookup table, first published in 1963, that maps the numbers 0 to 127 to characters: letters, digits, punctuation, and a handful of control codes for old teletype machines.

Why 0 to 127? Because that's exactly what fits in 7 bits (2⁷ = 128). When a character travelled in an 8-bit unit, the 8th bit was often used as a parity bit for error checking. Today computers use the full 8-bit byte, with the upper half (128 to 255) left for extensions. That's where later encodings (UTF-8 included) take over.
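The parity idea can be sketched in a few lines of Rust. This is a minimal illustration, not a historical implementation: the helper name `with_even_parity` is made up for this page, and even parity is only one of the schemes that were in use.

```rust
/// Returns the 7-bit ASCII code with bit 7 set so that the total
/// number of 1-bits is even. Hypothetical helper for illustration.
fn with_even_parity(code: u8) -> u8 {
    let seven = code & 0x7F; // keep only the 7 ASCII bits
    if seven.count_ones() % 2 == 1 {
        seven | 0x80 // odd number of 1-bits: set the parity bit
    } else {
        seven // already even: leave bit 7 clear
    }
}

fn main() {
    println!("7 bits give {} codes", 1u32 << 7); // 7 bits give 128 codes
    for ch in [b'A', b'C'] {
        println!("{} -> {:08b}", ch as char, with_even_parity(ch));
    }
    // A -> 01000001  (two 1-bits, parity bit stays 0)
    // C -> 11000011  (three 1-bits, parity bit becomes 1)
}
```

If a single bit flips in transit, the count of 1-bits turns odd and the receiver knows something went wrong.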

The famous letters

character     decimal  binary    hex
A             65       01000001  0x41
B             66       01000010  0x42
a             97       01100001  0x61
0             48       00110000  0x30
(space)       32       00100000  0x20
\n (newline)  10       00001010  0x0A

Notice A = 65 and a = 97. Exactly 32 apart. Their binary forms differ by one bit (bit 5). That's why uppercase ↔ lowercase conversion is a single XOR operation: 'A' ^ 0x20 == 'a'. Cleverness baked right into the table.
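One caveat worth seeing in code: the XOR trick only makes sense for letters. A minimal sketch in Rust, where `toggle_case` is a hypothetical helper that guards against everything else:

```rust
/// Flips the case of an ASCII letter by toggling bit 5 (0x20).
/// Non-letters are returned unchanged, because XOR-ing them would
/// produce an unrelated character. Hypothetical helper.
fn toggle_case(b: u8) -> u8 {
    if b.is_ascii_alphabetic() {
        b ^ 0x20
    } else {
        b
    }
}

fn main() {
    let flipped: String = "Hello, World!"
        .bytes()
        .map(|b| toggle_case(b) as char)
        .collect();
    println!("{}", flipped); // hELLO, wORLD!
}
```

Without the guard, '1' ^ 0x20 would give you the control code 0x11, not a digit.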

Try it: one character, one byte

// character explorer: the letter H
H
decimal  72
hex      0x48
binary   01001000

bit weight  2⁷  2⁶  2⁵  2⁴  2³  2²  2¹  2⁰
bit         0   1   0   0   1   0   0   0
value       ·   64  ·   ·   8   ·   ·   ·

64 + 8 = 72.

8 transistors in your CPU, in one of 256 patterns.
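To see where 72 comes from, add up the place values of the 1-bits in 01001000. A minimal sketch in Rust; `set_bit_weights` is a hypothetical helper name:

```rust
/// Lists the place values (powers of two) of the bits that are set
/// in `b`, most significant bit first. Hypothetical helper.
fn set_bit_weights(b: u8) -> Vec<u32> {
    (0..8u32)
        .rev()
        .filter(|&i| (b >> i) & 1 == 1) // keep positions holding a 1
        .map(|i| 1u32 << i)             // position i is worth 2^i
        .collect()
}

fn main() {
    let weights = set_bit_weights(b'H');
    let total: u32 = weights.iter().sum();
    println!("{:08b} = {:?} = {}", b'H', weights, total);
    // 01001000 = [64, 8] = 72
}
```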

this is the first byte of 'Hi'. the next page shows how your CPU processes this exact bit pattern. · see: cpu

// text encoder: every character becomes 8 bits
char  decimal  binary
H     72       01001000
e     101      01100101
l     108      01101100
l     108      01101100
o     111      01101111

5 characters · 40 bits → 5 bytes in memory.

Each of those bytes lives at a specific memory address in RAM. The OS allocated that space when your program started. ← see: memory

Print "Hi" character by character

Rust
fn main() {
    let msg = "Hi";
    for b in msg.bytes() {
        println!("'{}' = {} = 0x{:02X} = {:08b}",
                 b as char, b, b, b);
    }
    // 'H' = 72 = 0x48 = 01001000
    // 'i' = 105 = 0x69 = 01101001
}

C
#include <stdio.h>

int main(void) {
    const char *msg = "Hi";
    for (int i = 0; msg[i] != '\0'; i++) {
        unsigned char b = msg[i];
        printf("'%c' = %d = 0x%02X = ", b, b, b);
        for (int j = 7; j >= 0; j--)
            putchar((b >> j) & 1 ? '1' : '0');
        putchar('\n');
    }
    return 0;
}
// the leap
A string is just a sequence of bytes. The screen draws letters because someone, somewhere, agreed that byte 0x48 would mean "H". No magic. Just convention.

Intermediate // level 02

The full table & control codes

ASCII is split into printable characters (32 to 126) and control codes (0 to 31, plus 127). Control codes don't draw glyphs. They were instructions for printers and teletypes: ring a bell, move the carriage, start a new line. Many are obsolete. Some are still everywhere.

Control codes you still see today

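dec  name  escape  still used?
0    NUL   \0      String terminator in C
7    BEL   \a      Terminal beep
8    BS    \b      Backspace
9    HT    \t      Tab
10   LF    \n      Unix newline
13   CR    \r      Windows uses CRLF (\r\n)
27   ESC   \e      Start of ANSI escape sequences (terminal colors!)
127  DEL   (none)  Delete

Note: \e is a compiler extension (GCC and Clang accept it) and Rust does not have it. The portable spelling is \x1b, which is what the examples on this page use.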

The ESC character (27) powers every terminal color you have ever seen. \x1b[31m turns text red. \x1b[0m resets it. Your terminal is just a stream of ASCII bytes with ESC sequences as the control channel. ← see: operating system
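A minimal sketch of that control channel in Rust. `colorize` is a hypothetical helper, and the color numbers (31 for red, 32 for green) are the standard ANSI foreground codes:

```rust
/// Wraps `text` in an ANSI "set foreground color" sequence followed
/// by a reset. Hypothetical helper for illustration.
fn colorize(text: &str, code: u8) -> String {
    format!("\x1b[{}m{}\x1b[0m", code, text)
}

fn main() {
    let red = colorize("ERROR", 31);
    println!("{}", red);

    // The escape sequences are ordinary bytes inside the string.
    let hex: Vec<String> = red.bytes().map(|b| format!("{:02X}", b)).collect();
    println!("{}", hex.join(" "));
    // 1B 5B 33 31 6D 45 52 52 4F 52 1B 5B 30 6D
}
```

The first byte, 1B, is ESC (27). Everything else is printable ASCII.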

The printable ASCII table (32 to 126)

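The table runs from space (32) to tilde (126), with digits at 48 to 57, uppercase letters at 65 to 90, and lowercase letters at 97 to 122. Each cell pairs a character with its decimal value.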
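The table is easy to regenerate yourself, because each printable character is just its code cast to a char. A minimal sketch in Rust; `printable_table` and its `per_row` parameter are assumptions made for this example:

```rust
/// Builds the printable ASCII table (codes 32 to 126) as text,
/// `per_row` cells to a line. `per_row` must be greater than zero.
/// Hypothetical helper for illustration.
fn printable_table(per_row: usize) -> String {
    let mut out = String::new();
    for (i, code) in (32u8..=126).enumerate() {
        // Each cell: right-aligned decimal value, then the character.
        out.push_str(&format!("{:>3} {}  ", code, code as char));
        if (i + 1) % per_row == 0 {
            out.push('\n');
        }
    }
    out
}

fn main() {
    print!("{}", printable_table(8));
    println!();
}
```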

Working with ASCII in code

Because characters are numbers, you can do arithmetic on them. The classic example: converting a digit character ('0' to '9') to its integer value.

Rust
fn main() {
    let ch: char = '7';
    let digit = ch as u8 - b'0';
    println!("{} → {}", ch, digit); // 7 → 7

    // uppercase ↔ lowercase via the bit-5 trick
    let upper = b'a' ^ 0x20;             // 'A'
    let lower = b'A' | 0x20;             // 'a'
    println!("{} {}", upper as char, lower as char);

    // ANSI escape: red text in the terminal
    println!("\x1b[31mERROR\x1b[0m");
}

C
#include <stdio.h>

int main(void) {
    char ch = '7';
    int digit = ch - '0';
    printf("%c → %d\n", ch, digit); // 7 → 7

    // uppercase ↔ lowercase via the bit-5 trick
    char upper = 'a' ^ 0x20;          // 'A'
    char lower = 'A' | 0x20;          // 'a'
    printf("%c %c\n", upper, lower);

    // ANSI escape: red text in the terminal
    printf("\x1b[31mERROR\x1b[0m\n");
    return 0;
}
// trivia worth keeping
The ESC control code (27) is the gateway to ANSI escape sequences. That's how every CLI tool, from git to htop, draws colors and moves the cursor. They're literally just bytes: ESC [ 31 m = "switch to red".

HTTP, the protocol your browser uses, was designed around plain ASCII text. In HTTP/1.1, the request line GET /index.html HTTP/1.1 and the header Host: bitroot.dev are ASCII bytes, carried in TCP segments and sent as binary across the internet. ← see: networking
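A minimal sketch of what those bytes look like, in Rust. `build_get_request` is a hypothetical helper; it only builds the request and does not open a connection:

```rust
/// Builds a minimal HTTP/1.1 GET request as raw bytes.
/// Hypothetical helper for illustration.
fn build_get_request(path: &str, host: &str) -> Vec<u8> {
    // Each line ends with CR LF (13, 10); an empty line ends the headers.
    format!("GET {} HTTP/1.1\r\nHost: {}\r\n\r\n", path, host).into_bytes()
}

fn main() {
    let req = build_get_request("/index.html", "bitroot.dev");
    println!("all ASCII: {}", req.is_ascii()); // all ASCII: true
    println!("first bytes: {:?}", &req[..4]);  // first bytes: [71, 69, 84, 32]
}
```

Those first four bytes are G, E, T, and a space. Note the CR and LF control codes from the table above doing the line endings.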

Advanced // level 03

Beyond ASCII: UTF-8 & the world's text

ASCII has 128 slots. The world has more than 100,000 characters in active use: Devanagari, Chinese, Arabic, emoji, math symbols, ancient scripts. Unicode is the modern standard that gives every character a unique number called a code point (e.g. U+0905 for अ). UTF-8 is one way to encode those code points as bytes.
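The split between "code point" and "bytes" is easy to poke at in Rust, where a char is a code point. A minimal sketch; `describe` is a hypothetical helper and the four sample characters are arbitrary picks:

```rust
/// Returns (code point, number of UTF-8 bytes) for one character.
/// Hypothetical helper for illustration.
fn describe(c: char) -> (u32, usize) {
    (c as u32, c.len_utf8())
}

fn main() {
    // '\u{00E9}' is é and '\u{1F600}' is an emoji, written as escapes.
    for c in ['A', '\u{00E9}', 'अ', '\u{1F600}'] {
        let (cp, n) = describe(c);
        println!("U+{:04X} takes {} byte(s) in UTF-8", cp, n);
    }
    // U+0041 takes 1 byte(s) in UTF-8
    // U+00E9 takes 2 byte(s) in UTF-8
    // U+0905 takes 3 byte(s) in UTF-8
    // U+1F600 takes 4 byte(s) in UTF-8
}
```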

The brilliance of UTF-8

UTF-8 was designed by Ken Thompson and Rob Pike on a placemat in a New Jersey diner in 1992. It's a variable-length encoding: 1 to 4 bytes per code point, with two crucial properties:

  1. ASCII compatibility. Any valid ASCII file is also a valid UTF-8 file. The first 128 code points encode as a single byte, identical to ASCII.
  2. Self-synchronizing. You can drop into any byte stream and immediately tell whether you're at the start of a character or in the middle of one, just by looking at the high bits.

code point range      bytes  byte pattern
U+0000 to U+007F      1      0xxxxxxx
U+0080 to U+07FF      2      110xxxxx 10xxxxxx
U+0800 to U+FFFF      3      1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF   4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The leading bits act as a length tag. Continuation bytes always start with 10. That's the self-synchronization: if you see a byte starting with 10, you know you're mid-character; back up until you find a byte that doesn't.
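Both ideas fit in a short Rust sketch: filling in the x bits of each pattern, and backing up to the start of a character. The function names are made up for this page, and the encoder is deliberately simplified (it does not reject surrogates or values above U+10FFFF, which a real encoder must). In practice you would use the standard library instead.

```rust
/// Encodes one code point by hand using the UTF-8 bit patterns.
/// Simplified sketch: no validation of the input range.
fn encode_utf8(cp: u32) -> Vec<u8> {
    match cp {
        0x0000..=0x007F => vec![cp as u8],
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}

/// Backs up from index `i` until the byte is not a continuation
/// byte (10xxxxxx), returning the index where the character starts.
fn char_start(bytes: &[u8], mut i: usize) -> usize {
    while i > 0 && (bytes[i] & 0b1100_0000) == 0b1000_0000 {
        i -= 1;
    }
    i
}

fn main() {
    let bytes = encode_utf8(0x0905); // अ
    let hex: Vec<String> = bytes.iter().map(|b| format!("{:02X}", b)).collect();
    println!("{}", hex.join(" ")); // E0 A4 85

    let text = "aअb".as_bytes(); // 61 E0 A4 85 62
    println!("{}", char_start(text, 3)); // 1
}
```

Index 3 lands on 85, a continuation byte. Two steps back is E0, the lead byte of अ.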

"नमस्ते" in bytes

Rust
fn main() {
    let s = "नमस्ते";

    println!("chars: {}", s.chars().count()); // 6 code points (2 are combining marks)
    println!("bytes: {}", s.len());            // 18

    for b in s.bytes() {
        print!("{:02X} ", b);
    }
    // E0 A4 A8 E0 A4 AE E0 A4 B8 ...
    // each devanagari char = 3 bytes
}

C
#include <stdio.h>
#include <string.h>

int main(void) {
    // C strings are just byte arrays;
    // the compiler stores UTF-8 verbatim.
    const char *s = "नमस्ते";

    printf("bytes: %zu\n", strlen(s));    // 18
    for (size_t i = 0; s[i]; i++)
        printf("%02X ", (unsigned char)s[i]);
    putchar('\n');
    // strlen counts BYTES, not characters!
    return 0;
}
// the trap
In C, strlen("नमस्ते") returns 18, not 6. str[0] gives you a single byte, which is only a third of the first character. Slicing UTF-8 strings naively will corrupt them. Rust's &str guarantees valid UTF-8 at the type level; that's one of the language's quiet superpowers.
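Here is what slicing safely can look like in Rust. A minimal sketch: `first_chars` is a hypothetical helper, and it counts code points, which is still not the same as what a reader would call a "letter" (combining marks count separately).

```rust
/// Returns the prefix of `s` holding its first `n` code points,
/// cutting only at a character boundary. Hypothetical helper.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_offset, _)) => &s[..byte_offset],
        None => s, // fewer than n code points: return everything
    }
}

fn main() {
    let s = "नमस्ते";

    // Byte offset 1 falls inside the first 3-byte character.
    println!("{}", s.is_char_boundary(1)); // false
    println!("{}", s.is_char_boundary(3)); // true

    // `get` returns None on a bad boundary; `&s[0..1]` would panic.
    println!("{}", s.get(0..1).is_none()); // true

    println!("{}", first_chars(s, 2)); // नम
}
```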

ASCII in blockchain and networking

ASCII shows up everywhere in the infrastructure that runs Bitcoin. When your Bitcoin node connects to another node it sends a handshake message. The command name in that message's header is ASCII text.

Bitcoin Core uses ASCII command names in its network protocol: version, verack, inv, tx, block. Each command is a 12-byte ASCII string padded with null bytes (0x00) to fill the field. NUL, the very first control code in ASCII, is still doing its job inside the Bitcoin network protocol sixty years after ASCII was invented.

Rust
// The Bitcoin P2P network message header, sketched in Rust.
#[repr(C)]
struct MessageHeader {
    magic:    u32,        // 0xD9B4BEF9 for mainnet
    command:  [u8; 12],   // ASCII, NUL-padded
    length:   u32,        // payload size
    checksum: [u8; 4],    // first 4 bytes of SHA256d
}

// "version" command name as a 12-byte ASCII literal.
const VERSION_COMMAND: [u8; 12] = *b"version\0\0\0\0\0";
// b"..."  creates a byte array;
// each character is its ASCII value;
// \0 is the NUL control code as a padding byte.

// 76 65 72 73 69 6F 6E 00 00 00 00 00 = the bytes on the wire.

C
#include <stdint.h>

// The Bitcoin P2P network message header,
// sketched as a plain C struct (Bitcoin Core itself is written in C++).
struct MessageHeader {
    uint32_t magic;        // 0xD9B4BEF9 for mainnet
    char     command[12];  // ASCII, NUL-padded
    uint32_t length;       // payload size
    uint32_t checksum;     // first 4 bytes of SHA256d
};

// "version" command name as 12 ASCII bytes:
//   76 65 72 73 69 6F 6E 00 00 00 00 00
//   v  e  r  s  i  o  n  \0 \0 \0 \0 \0
//
// NUL (0x00, the first control code in ASCII)
// pads the command name to fill the field.
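
Filling and reading that 12-byte field takes only a few lines. A minimal sketch in Rust, assuming the 24-byte layout described above with integers written little-endian; the function names are made up for this page, and this is not how Bitcoin Core itself is written.

```rust
/// Pads an ASCII command name with NUL bytes to fill the 12-byte field.
/// Returns None if the name is not ASCII or is longer than 12 bytes.
fn pad_command(name: &str) -> Option<[u8; 12]> {
    if !name.is_ascii() || name.len() > 12 {
        return None;
    }
    let mut field = [0u8; 12]; // starts out as twelve NULs
    field[..name.len()].copy_from_slice(name.as_bytes());
    Some(field)
}

/// Reads the name back by cutting the field at the first NUL.
fn command_name(field: &[u8; 12]) -> &str {
    let end = field.iter().position(|&b| b == 0).unwrap_or(12);
    std::str::from_utf8(&field[..end]).unwrap_or("")
}

/// Lays the four header fields out as the 24 bytes sent on the wire.
fn header_bytes(magic: u32, command: [u8; 12], length: u32, checksum: [u8; 4]) -> [u8; 24] {
    let mut out = [0u8; 24];
    out[0..4].copy_from_slice(&magic.to_le_bytes());
    out[4..16].copy_from_slice(&command);
    out[16..20].copy_from_slice(&length.to_le_bytes());
    out[20..24].copy_from_slice(&checksum);
    out
}

fn main() {
    let command = pad_command("verack").expect("short ASCII name");
    // verack has an empty payload; 5D F6 E0 E2 is the checksum of zero bytes.
    let header = header_bytes(0xD9B4BEF9, command, 0, [0x5D, 0xF6, 0xE0, 0xE2]);

    let hex: Vec<String> = header.iter().map(|b| format!("{:02X}", b)).collect();
    println!("{}", hex.join(" "));
    println!("{}", command_name(&command)); // verack
}
```

The magic number comes out as F9 BE B4 D9 on the wire because the integer is stored least significant byte first.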

And the checksum in that header? SHA-256, applied twice. The same hash function built from AND gates and XOR gates that you will see on the hashing page.

ASCII named the commands. Binary carries the bytes. SHA-256 verifies the integrity. TCP/IP delivers the packet. All four concepts. One message header.

Where ASCII appears in BitRoot

ASCII is the most quoted page on the site. Every later topic uses it for something.

0x02 / binary
Letters as bit patterns
ASCII codes are binary numbers. 'A' = 65 = 01000001. Seven bits that carry the weight of an entire alphabet.
0x01 / number systems
Three masks, one number
ASCII codes are decimal (65), hex (0x41), and binary (01000001). The same number in three masks.
0x04 / logic gates
Case toggle is one XOR
Uppercase to lowercase is one XOR operation. 'A' ^ 0x20 = 'a'. XOR is a logic gate. A logic gate is transistors. The alphabet runs on silicon.
0x06 / memory
Strings live in RAM
A string is a sequence of bytes at consecutive memory addresses. 'Hello' is five bytes at five consecutive addresses, starting at the address of the H.
0x09 / pointers
char* is just an address
In C a string is a pointer. char* str = "Hello" makes str the address of the H. The string only exists because the pointer knows where it starts.
0x0B / arrays
char arrays + NUL
A string is a char array. Each element one ASCII byte. C strings end with NUL (0x00), the first control code, still working after sixty years.
0x0D / hashing
Bytes in, hash out
SHA-256 hashes strings as bytes. 'Hello' becomes its ASCII bytes (72 101 108 108 111) then gets hashed to 256 bits. For text, the input is always its ASCII or UTF-8 bytes.
0x0F / networking
HTTP is ASCII text
HTTP/1.1 headers are ASCII text. 'GET / HTTP/1.1' is ASCII. A classic web request starts as ASCII characters, stored as bytes and carried in TCP segments.
0x11 / blockchain
12-byte ASCII commands
Bitcoin network commands are 12-byte ASCII strings. 'version', 'tx', 'block', NUL-padded to fill the field. ASCII is inside the protocol that moves every Bitcoin transaction.

Connecting back to bits

Step back and notice the layering. A character (अ) is a Unicode code point (U+0905). That code point gets encoded as bytes (E0 A4 85) by UTF-8. Each byte is 8 bits. Each bit is a voltage (high or low) sitting on a wire connected to a transistor. The next page is where we finally get to that wire.
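
The same walk down the layers, as a minimal Rust sketch. `layers` is a hypothetical helper that stops at the bits, since the voltages are out of software's reach:

```rust
/// Walks one character down the layers: code point, UTF-8 bytes,
/// and the bits of each byte. Hypothetical helper for illustration.
fn layers(c: char) -> (u32, Vec<u8>, String) {
    let mut buf = [0u8; 4]; // a char needs at most 4 bytes in UTF-8
    let bytes = c.encode_utf8(&mut buf).as_bytes().to_vec();
    let bits = bytes
        .iter()
        .map(|b| format!("{:08b}", b))
        .collect::<Vec<_>>()
        .join(" ");
    (c as u32, bytes, bits)
}

fn main() {
    let (code_point, bytes, bits) = layers('अ');
    println!("U+{:04X}", code_point); // U+0905
    println!("{} bytes", bytes.len()); // 3 bytes
    println!("{}", bits);              // 11100000 10100100 10000101
}
```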

next up / 0x04
Bits become physical: transistors & logic gates