How every letter, digit, and emoji becomes a number — and then binary
Every text message you have ever sent was stored as a sequence of numbers. The word "Hello" — five characters — is stored as five numbers: 72, 101, 108, 108, 111. When Tim Berners-Lee proposed the World Wide Web at CERN in 1989, the first-ever web page was purely ASCII text — just 128 numbers, each mapped to a character. Today, a single WhatsApp message might span English letters, Arabic script, Chinese hanzi, and a 😊 emoji — all encoded in Unicode, which represents over 140,000 characters from every writing system on Earth. The story of text encoding is the story of computing going global. Before Unicode, Japanese computers used one encoding, Russian computers used another, Arabic computers a third — and sharing files between them was a nightmare. One standard fixed everything.
Computers can only store binary numbers. To store text, every character — letters, digits, punctuation, spaces — is assigned a unique character code: a whole number. That number is then stored in binary. Two major standards define these mappings.
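This two-step mapping — character to code, code to binary — can be sketched in a few lines of Python (`ord` returns a character's code; `format(..., "08b")` renders it as 8 binary digits):

```python
# Map each character of "Hello" to its character code, then to binary.
text = "Hello"
codes = [ord(ch) for ch in text]                 # character codes
bits = [format(code, "08b") for code in codes]   # 8-bit binary strings

print(codes)  # [72, 101, 108, 108, 111]
print(bits)   # ['01001000', '01100101', '01101100', '01101100', '01101111']
```

Note that the two 'l' characters produce identical codes and identical bit patterns — the mapping is fixed, not positional.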
| Feature | ASCII | Unicode |
|---|---|---|
| Full name | American Standard Code for Information Interchange | Universal Character Set — covers all world scripts |
| Bits per character | 7-bit original (128 chars) · 8-bit extended (256) | Variable: UTF-8 uses 8–32 bits · UTF-16 uses 16–32 |
| Characters supported | 128 standard · 256 extended | Over 140,000 from 150+ writing systems |
| Languages | English and basic Western European only | Every human writing system including emoji |
| File size impact | Smaller — fewer bits per character | Larger — more bits needed per character |
| Backward compatibility | — | First 128 Unicode values are identical to ASCII |
You do not need to memorise the whole ASCII table. You need these four anchor points and the relationships between them:
Capital 'A' = 65. All uppercase letters are sequential: B=66, C=67 … Z=90. If you know one letter's code, count forward or backward to find another.
Lowercase 'a' = 97. All lowercase letters are sequential: b=98, c=99 … z=122. The rule: lowercase = uppercase + 32. This is the single most important pattern in ASCII.
The digit character '0' = 48. '1'=49, '2'=50 … '9'=57. Critical: the character '5' has code 53, not 5. The digit characters are not the same as the numbers they represent.
Space character = 32. This is no accident — 32 is the gap between uppercase and lowercase, and the space character sits exactly at that gap value.
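All four anchor points, and the counting tricks built on them, can be checked directly in Python:

```python
# The four ASCII anchor points.
assert ord('A') == 65 and ord('Z') == 90    # uppercase run: 65..90
assert ord('a') == 97 and ord('z') == 122   # lowercase run: 97..122
assert ord('0') == 48 and ord('9') == 57    # digit characters: 48..57
assert ord(' ') == 32                       # space

# Counting from a known code: 'G' is 6 letters after 'A'.
assert chr(ord('A') + 6) == 'G'

# The +32 rule: lowercase = uppercase + 32.
assert ord('m') - ord('M') == 32
```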
In binary, the only bit-level difference between a capital letter and its lowercase equivalent is bit 5 (the 32-value column). Setting bit 5 to 1 converts uppercase → lowercase. Clearing it reverses the conversion.
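A short sketch of the bit-5 trick, using Python's bitwise operators (`|` sets a bit, `& ~` clears it):

```python
# Bit 5 has value 32 (2**5). Toggling it flips a letter's case.
code = ord('G')             # 71  = 0b1000111
lower = code | 0b100000     # set bit 5   -> 103, 'g'
upper = lower & ~0b100000   # clear bit 5 -> 71,  'G'

print(chr(lower), chr(upper))  # g G
```

This is exactly why the +32 rule works: adding 32 to an uppercase code is the same as setting bit 5, because that bit is always 0 for uppercase letters.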
ASCII was designed in the 1960s for English text on American teletype machines. As computers spread globally, the need for Chinese, Arabic, Japanese, Hindi and hundreds of other scripts became critical. ASCII's 128–256 character limit was completely inadequate. Unicode solves this by allocating more bits per character, allowing a vastly larger range of code points.
UTF-8 — most common format on the web. Uses 8–32 bits per character. ASCII characters (0–127) still use just 8 bits, so English text files are the same size as ASCII. Non-English characters use 16–32 bits.
UTF-16 — uses 16 or 32 bits per character. More efficient for East Asian scripts, where most characters need 2 bytes in UTF-16 but 3 in UTF-8. Used internally by Java, JavaScript, and Windows.
More bits = larger files. A document in UTF-32 will be roughly four times the size of the equivalent UTF-8 document (for English text). This is the key trade-off: Unicode can represent more characters, but costs more storage.
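The trade-off is easy to measure: Python's `str.encode` returns the raw bytes for any encoding, so you can compare sizes directly (the `-le` variants are used here only to skip the byte-order mark, so the counts are exact):

```python
# Same five English characters, three encodings.
english = "Hello"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(english.encode(enc)))   # 5, 10, 20 bytes

# Non-ASCII characters cost more bytes in UTF-8:
print(len("é".encode("utf-8")))    # 2 bytes
print(len("中".encode("utf-8")))   # 3 bytes
print(len("😊".encode("utf-8")))   # 4 bytes
```

For English text, UTF-32 really is four times the size of UTF-8 (20 bytes versus 5 here), while the ASCII characters in UTF-8 are still one byte each — the backward compatibility promised in the table above.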
Confusing the character '5' with the number 5. The character '5' has ASCII code 53 (because '0'=48 and 48+5=53). The number 5 stored as an integer would be 00000101. These are completely different binary patterns.
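You can see both bit patterns, and the two standard ways to convert between them, in Python:

```python
# The character '5' and the integer 5 are different binary patterns.
assert ord('5') == 53
assert format(ord('5'), "08b") == "00110101"   # character '5'
assert format(5, "08b") == "00000101"          # integer 5

# Converting a digit character to its numeric value:
assert int('5') == 5                  # parse the string
assert ord('5') - ord('0') == 5       # or subtract '0''s code (48)
```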
Saying "Unicode uses more memory" without explaining why. You must state it uses more bits per character to represent a larger set of characters. Vague answers like "Unicode is bigger" earn zero marks.
Getting the +32 rule direction wrong. Adding 32 converts UPPER → lower (A→a). Subtracting 32 converts lower → UPPER (a→A). Mixing these up in an exam is a very common error.
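A two-line check you can rehearse until the direction is automatic:

```python
# Add 32 to go UPPER -> lower; subtract 32 to go lower -> UPPER.
assert chr(ord('A') + 32) == 'a'
assert chr(ord('a') - 32) == 'A'
```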
Stating ASCII uses 8 bits. Original (standard) ASCII is 7 bits — 128 characters. Extended ASCII is 8 bits — 256 characters. Cambridge papers usually mean 7-bit unless they say "extended ASCII".
Cambridge almost always gives you one letter's code and asks you to derive another.