Numbers become language.
You learned that everything is bits. But how does 01000001 become the letter A? Through a convention: a shared agreement that says "this number means that letter." That agreement is called an encoding, and the most famous one is ASCII.
What is ASCII?
In 1963 engineers had a problem. IBM's computers used one code for the letter A. Honeywell used a different one. They literally could not talk to each other.
So a committee sat down and built a universal dictionary. 128 characters. One number each. Agreed on by everyone. Forever.
They called it the American Standard Code for Information Interchange. ASCII.
Your computer has never read a single letter in its entire life. It only ever reads numbers. ASCII is how numbers pretend to be language.
ASCII is a lookup table from 1963 that maps the numbers 0 to 127 to characters: letters, digits, punctuation, and a handful of control codes for old teletype machines.
Why 0 to 127? Because that's exactly what fits in 7 bits (2⁷ = 128). On early systems the 8th bit was often used as a parity bit for error checking. Today most computers use the full 8-bit byte, with the upper half (128 to 255) left for extensions. That's where the modern world's encodings (UTF-8 included) take over.
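To make the parity idea concrete, here is a minimal sketch of even parity: the 8th bit is set only when needed to make the total number of 1 bits even. The helper name `with_even_parity` is made up for this example, and real hardware used both even and odd parity depending on the link.

```rust
/// Sets the 8th bit (bit 7) so the whole byte has an even number of 1 bits.
/// `with_even_parity` is a hypothetical helper, not a standard library function.
fn with_even_parity(ascii: u8) -> u8 {
    assert!(ascii < 128, "ASCII values fit in 7 bits");
    // count_ones() returns how many bits are set in the 7-bit value.
    if ascii.count_ones() % 2 == 1 {
        ascii | 0x80 // odd count: set the parity bit to make it even
    } else {
        ascii // already even: leave the parity bit clear
    }
}

fn main() {
    println!("2^7 = {}", 1u32 << 7);
    for c in [b'A', b'C'] {
        println!("{} -> {:08b}", c as char, with_even_parity(c));
    }
}
```

A receiver can recount the 1 bits: an odd total means a bit flipped somewhere in transit.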
The famous letters
| character | decimal | binary | hex |
|---|---|---|---|
| A | 65 | 01000001 | 0x41 |
| B | 66 | 01000010 | 0x42 |
| a | 97 | 01100001 | 0x61 |
| 0 | 48 | 00110000 | 0x30 |
| (space) | 32 | 00100000 | 0x20 |
| \n (newline) | 10 | 00001010 | 0x0A |
Notice A = 65 and a = 97. Exactly 32 apart. Their binary forms differ by one bit (bit 5). That's why uppercase ↔ lowercase conversion is a single XOR operation: 'A' ^ 0x20 == 'a'. Cleverness baked right into the table.
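One caveat: the XOR trick only makes sense for letters. Flipping bit 5 of '1' (0x31) gives 0x11, a control code. A small sketch with a guard, assuming a helper named `toggle_case` (an illustrative name, not a standard function):

```rust
/// Flips bit 5 (0x20) to swap case, but only for ASCII letters.
/// `toggle_case` is an illustrative name, not a standard library function.
fn toggle_case(b: u8) -> u8 {
    if b.is_ascii_alphabetic() {
        b ^ 0x20
    } else {
        b // digits, punctuation, and control codes are left alone
    }
}

fn main() {
    println!("{:08b} A", b'A');
    println!("{:08b} a", b'a');
    println!("{:08b} A ^ a (only bit 5 differs)", b'A' ^ b'a');

    let flipped: String = "Hi, ASCII!"
        .bytes()
        .map(|b| toggle_case(b) as char)
        .collect();
    println!("{}", flipped);
}
```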
Try it: one character, one byte
Each of those bytes lives at a specific memory address in RAM. The OS allocated that space when your program started. ← see: memory
Print "Hi" character by character
fn main() {
    let msg = "Hi";
    for b in msg.bytes() {
        println!("'{}' = {} = 0x{:02X} = {:08b}",
                 b as char, b, b, b);
    }
    // 'H' = 72 = 0x48 = 01001000
    // 'i' = 105 = 0x69 = 01101001
}

#include <stdio.h>
int main(void) {
    const char *msg = "Hi";
    for (int i = 0; msg[i] != '\0'; i++) {
        unsigned char b = msg[i];
        printf("'%c' = %d = 0x%02X = ", b, b, b);
        for (int j = 7; j >= 0; j--)
            putchar((b >> j) & 1 ? '1' : '0');
        putchar('\n');
    }
    return 0;
}
0x48 would mean "H". No magic. Just convention.
The full table & control codes
ASCII is split into printable characters (32 to 126) and control codes (0 to 31, plus 127). Control codes don't draw glyphs. They were instructions for printers and teletypes: ring a bell, move the carriage, start a new line. Many are obsolete. Some are still everywhere.
Control codes you still see today
| dec | name | escape | still used? |
|---|---|---|---|
| 0 | NUL | \0 | String terminator in C |
| 7 | BEL | \a | Terminal beep |
| 8 | BS | \b | Backspace |
| 9 | HT | \t | Tab |
| 10 | LF | \n | Unix newline |
| 13 | CR | \r | Windows uses CRLF (\r\n) |
| 27 | ESC | \e (non-standard; \x1b is portable) | Start of ANSI escape sequences (terminal colors!) |
| 127 | DEL | (none) | Delete |
The ESC character (27) powers every terminal color you have ever seen. \x1b[31m turns text red. \x1b[0m resets it. Your terminal is just a stream of ASCII bytes with ESC sequences as the control channel. ← see: operating system
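As a rough sketch of how a tool might wrap text in such a sequence, assuming a helper called `colorize` (a name invented here) and the common color codes 31 for red and 32 for green:

```rust
/// Wraps `text` in an ANSI color sequence and a reset.
/// `colorize` is a hypothetical helper; 31 = red, 32 = green, 0 = reset.
fn colorize(text: &str, code: u8) -> String {
    // ESC (0x1B), then '[', the code in decimal digits, then 'm'.
    format!("\x1b[{}m{}\x1b[0m", code, text)
}

fn main() {
    let red = colorize("ERROR", 31);
    println!("{}", red);

    // The same string as raw bytes: the first byte is ESC (1B).
    for b in red.bytes() {
        print!("{:02X} ", b);
    }
    println!();
}
```

Whether the color actually shows up depends on the terminal; a file or pipe just receives the raw bytes.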
The printable ASCII table (32 to 126)
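The printable range is small enough to list with a few lines of code. A minimal sketch; the function name `printable_rows` and the 16-per-row layout are choices made for this example only.

```rust
/// Builds rows of up to 16 cells, each cell shown as "char=decimal".
/// The name and the row width are illustrative choices.
fn printable_rows() -> Vec<String> {
    let printable: Vec<u8> = (32u8..=126).collect();
    printable
        .chunks(16)
        .map(|row| {
            row.iter()
                .map(|&b| format!("{}={:<3}", b as char, b))
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect()
}

fn main() {
    for row in printable_rows() {
        println!("{}", row);
    }
}
```

The first cell is the space character (32), so the first row appears to start with a blank.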
Working with ASCII in code
Because characters are numbers, you can do arithmetic on them. The classic example: converting a digit character ('0' to '9') to its integer value.
fn main() {
    let ch: char = '7';
    let digit = ch as u8 - b'0';
    println!("{} → {}", ch, digit); // 7 → 7
    // uppercase ↔ lowercase via the bit-5 trick
    let upper = b'a' ^ 0x20; // 'A'
    let lower = b'A' | 0x20; // 'a'
    println!("{} {}", upper as char, lower as char);
    // ANSI escape: red text in the terminal
    println!("\x1b[31mERROR\x1b[0m");
}

#include <stdio.h>
int main(void) {
    char ch = '7';
    int digit = ch - '0';
    printf("%c → %d\n", ch, digit); // 7 → 7
    // uppercase ↔ lowercase via the bit-5 trick
    char upper = 'a' ^ 0x20; // 'A'
    char lower = 'A' | 0x20; // 'a'
    printf("%c %c\n", upper, lower);
    // ANSI escape: red text in the terminal
    printf("\x1b[31mERROR\x1b[0m\n");
    return 0;
}
The ESC control code (27) is the gateway to ANSI escape sequences. That's how every CLI tool, from git to htop, draws colors and moves the cursor. They're literally just bytes: ESC [ 31 m = "switch to red".
HTTP/1.1, the classic form of the protocol your browser uses, sends its headers as plain ASCII text. GET /index.html HTTP/1.1 and Host: bitroot.dev are ASCII bytes carried in TCP segments and sent as binary across the internet. ← see: networking
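The request lines quoted above can be assembled and inspected as bytes. This is only a sketch: `build_request` is a name invented here, and it produces the text of a request without opening any network connection.

```rust
/// Builds a minimal HTTP/1.1 request as text. Lines end in CR LF,
/// and an empty line marks the end of the headers.
/// `build_request` is a hypothetical helper for illustration.
fn build_request(path: &str, host: &str) -> String {
    format!("GET {} HTTP/1.1\r\nHost: {}\r\n\r\n", path, host)
}

fn main() {
    let req = build_request("/index.html", "bitroot.dev");
    println!("all ASCII: {}", req.is_ascii());

    // The first bytes on the wire: 47 45 54 20 = "GET ".
    for b in req.bytes().take(4) {
        print!("{:02X} ", b);
    }
    println!();
}
```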
Beyond ASCII: UTF-8 & the world's text
ASCII has 128 slots. The world has more than 100,000 characters in active use: Devanagari, Chinese, Arabic, emoji, math symbols, ancient scripts. Unicode is the modern standard that gives every character a unique number called a code point (e.g. U+0905 for अ). UTF-8 is one way to encode those code points as bytes.
The brilliance of UTF-8
UTF-8 was designed by Ken Thompson and Rob Pike on a placemat in a New Jersey diner in 1992. It's a variable-length encoding: 1 to 4 bytes per code point, with two crucial properties:
- ASCII compatibility. Any valid ASCII file is also a valid UTF-8 file. The first 128 code points encode as a single byte, identical to ASCII.
- Self-synchronizing. You can drop into any byte stream and immediately tell whether you're at the start of a character or in the middle of one, just by looking at the high bits.
| code point range | bytes | byte pattern |
|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
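The table translates almost directly into code. Below is a hand-rolled sketch of the encoding step, one match arm per table row. The name `encode_by_hand` is made up for this example; in real code you would use the standard library's `char::encode_utf8` instead.

```rust
/// Encodes one character into UTF-8 bytes, one match arm per table row.
/// Illustrative only: the standard library already does this for you.
fn encode_by_hand(c: char) -> Vec<u8> {
    let cp = c as u32; // the code point as a plain number
    match cp {
        0x0000..=0x007F => vec![cp as u8],
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    for c in ['A', '\u{00E9}', '\u{0905}', '\u{1F600}'] {
        let bytes = encode_by_hand(c);
        // Cross-check against the standard library's encoder.
        assert_eq!(bytes, c.to_string().into_bytes());
        println!("U+{:04X} -> {:02X?}", c as u32, bytes);
    }
}
```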
The leading bits act as a length tag. Continuation bytes always start with 10. That's the self-synchronization: if you see a byte starting with 10, you know you're mid-character; back up until you find a byte that doesn't.
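The "back up until you find a byte that doesn't" rule is a short loop. A sketch, assuming the input is valid UTF-8; `is_continuation` and `char_start` are names chosen here, not library functions.

```rust
/// Continuation bytes always look like 10xxxxxx.
fn is_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

/// Backs up from index `i` to where the enclosing character starts.
/// Assumes `bytes` is valid UTF-8 and `i < bytes.len()`.
fn char_start(bytes: &[u8], mut i: usize) -> usize {
    while i > 0 && is_continuation(bytes[i]) {
        i -= 1;
    }
    i
}

fn main() {
    let s = "a\u{0905}b"; // bytes: 61 | E0 A4 85 | 62
    let bytes = s.as_bytes();
    for i in 0..bytes.len() {
        println!(
            "byte {} ({:02X}) belongs to the character starting at {}",
            i,
            bytes[i],
            char_start(bytes, i)
        );
    }
}
```

Rust's standard library exposes the same check as `str::is_char_boundary`, which is what slicing a `&str` relies on to reject a cut through the middle of a character.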
"नमस्ते" in bytes
fn main() {
    let s = "नमस्ते";
    println!("chars: {}", s.chars().count()); // 6 code points (including combining marks)
    println!("bytes: {}", s.len()); // 18
    for b in s.bytes() {
        print!("{:02X} ", b);
    }
    // E0 A4 A8 E0 A4 AE E0 A4 B8 ...
    // each devanagari char = 3 bytes
}

#include <stdio.h>
#include <string.h>
int main(void) {
    // C strings are just byte arrays;
    // the compiler stores UTF-8 verbatim.
    const char *s = "नमस्ते";
    printf("bytes: %zu\n", strlen(s)); // 18
    for (size_t i = 0; s[i]; i++)
        printf("%02X ", (unsigned char)s[i]);
    putchar('\n');
    // strlen counts BYTES, not characters!
    return 0;
}
strlen("नमस्ते") returns 18, not 6. s[0] gives you a single byte, which is only one third of a character. Slicing UTF-8 strings naively will corrupt them. Rust's &str guarantees valid UTF-8 at the type level; that's one of the language's quiet superpowers.
ASCII in blockchain and networking
ASCII shows up everywhere in the infrastructure that runs Bitcoin. When your Bitcoin node connects to another node, it sends a handshake message. The command name in that message's header is ASCII text.
Bitcoin Core uses ASCII command names in its network protocol: version, verack, inv, tx, block. Each command is a 12-byte ASCII string padded with null bytes (0x00) to fill the field. NUL, the very first control code in ASCII, is still doing its job inside the Bitcoin network protocol sixty years after ASCII was invented.
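The padding rule is easy to sketch. The helpers below, `pad_command` and `command_name`, are hypothetical names written for this page; they illustrate the 12-byte field and are not taken from Bitcoin Core.

```rust
/// Pads an ASCII command name with NUL bytes to exactly 12 bytes.
/// Returns None if the name is not ASCII or is longer than 12 bytes.
fn pad_command(name: &str) -> Option<[u8; 12]> {
    if !name.is_ascii() || name.len() > 12 {
        return None;
    }
    let mut field = [0u8; 12]; // starts out as twelve NUL bytes
    field[..name.len()].copy_from_slice(name.as_bytes());
    Some(field)
}

/// Reads the name back by cutting the field at the first NUL byte.
fn command_name(field: &[u8; 12]) -> &str {
    let end = field.iter().position(|&b| b == 0).unwrap_or(12);
    std::str::from_utf8(&field[..end]).unwrap_or("")
}

fn main() {
    let field = pad_command("version").expect("short ASCII name");
    println!("{:02X?}", field);
    println!("{}", command_name(&field));
}
```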
// Same header, in Rust.
#[repr(C)]
struct MessageHeader {
    magic: u32,         // 0xD9B4BEF9 for mainnet
    command: [u8; 12],  // ASCII, NUL-padded
    length: u32,        // payload size
    checksum: [u8; 4],  // first 4 bytes of SHA256d
}
// "version" command name as a 12-byte ASCII literal.
const VERSION_COMMAND: [u8; 12] = *b"version\0\0\0\0\0";
// b"..." creates a byte array;
// each character is its ASCII value;
// \0 is the NUL control code as a padding byte.
// 76 65 72 73 69 6F 6E 00 00 00 00 00 = the bytes on the wire.

#include <stdint.h>
// A simplified view of the Bitcoin P2P network
// message header, as a plain C struct.
struct MessageHeader {
    uint32_t magic;     // 0xD9B4BEF9 for mainnet
    char command[12];   // ASCII, NUL-padded
    uint32_t length;    // payload size
    uint32_t checksum;  // first 4 bytes of SHA256d
};
// "version" command name as 12 ASCII bytes:
// 76 65 72 73 69 6F 6E 00 00 00 00 00
// v  e  r  s  i  o  n  \0 \0 \0 \0 \0
//
// NUL (0x00, the first control code in ASCII)
// pads the command name to fill the field.
And the checksum in that header? The first 4 bytes of SHA-256 applied twice to the payload. The same hash function built from AND gates and XOR gates that you will see on the hashing page.
ASCII named the commands. Binary carries the bytes. SHA-256 verifies the integrity. TCP/IP delivers the packet. All four concepts. One message header.
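Putting the four fields together gives 24 bytes on the wire. The sketch below lays them out by hand for a verack message, which has an empty payload. The function name `header_bytes` is invented here, and the checksum is passed in as a given value rather than computed, since SHA-256 is not in Rust's standard library.

```rust
/// Lays out the 24-byte header: magic, command, length, checksum.
/// Integers are written little-endian. `header_bytes` is a hypothetical name.
fn header_bytes(magic: u32, command: [u8; 12], length: u32, checksum: [u8; 4]) -> [u8; 24] {
    let mut out = [0u8; 24];
    out[0..4].copy_from_slice(&magic.to_le_bytes());
    out[4..16].copy_from_slice(&command);
    out[16..20].copy_from_slice(&length.to_le_bytes());
    out[20..24].copy_from_slice(&checksum);
    out
}

fn main() {
    // Checksum of an empty payload, taken as given (not computed here).
    let empty_payload_checksum = [0x5D, 0xF6, 0xE0, 0xE2];
    let header = header_bytes(
        0xD9B4BEF9,
        *b"verack\0\0\0\0\0\0",
        0,
        empty_payload_checksum,
    );
    for b in header.iter() {
        print!("{:02X} ", b);
    }
    println!();
}
```

Note that the magic value 0xD9B4BEF9 appears on the wire as F9 BE B4 D9, because the low byte goes first.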
Where ASCII appears in BitRoot
ASCII is the most quoted page on the site. Every later topic uses it for something.
Connecting back to bits
Step back and notice the layering. A character (अ) is a Unicode code point (U+0905). That code point gets encoded as bytes (E0 A4 85) by UTF-8. Each byte is 8 bits. Each bit is a voltage (high or low) sitting on a wire connected to a transistor. The next page is where we finally get to that wire.
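The first three layers can be printed side by side. A small sketch; `layers` is a name chosen for this example.

```rust
/// Shows one character at each layer: code point, UTF-8 bytes, bits.
/// `layers` is an illustrative helper name.
fn layers(c: char) -> String {
    let encoded = c.to_string(); // a String is always UTF-8 in Rust
    let hex: Vec<String> = encoded.bytes().map(|b| format!("{:02X}", b)).collect();
    let bits: Vec<String> = encoded.bytes().map(|b| format!("{:08b}", b)).collect();
    format!("U+{:04X} | {} | {}", c as u32, hex.join(" "), bits.join(" "))
}

fn main() {
    println!("{}", layers('A'));
    println!("{}", layers('\u{0905}'));
    // U+0041 | 41 | 01000001
    // U+0905 | E0 A4 85 | 11100000 10100100 10000101
}
```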