Character encoding: ASCII and Unicode
Computers can only store binary numbers, so character encoding systems assign a unique binary code to each character (letter, digit, punctuation mark or symbol), allowing text to be stored and transmitted digitally. OCR J277 requires you to understand ASCII and Unicode, their storage requirements, and how to convert between character codes and binary.
ASCII (American Standard Code for Information Interchange)
- Uses 7 bits to represent each character → 2⁷ = 128 possible characters (codes 0–127).
- Extended ASCII uses 8 bits → 256 characters, adding accented letters and special symbols.
- Covers: uppercase A–Z, lowercase a–z, digits 0–9, punctuation, control characters.
Key ASCII values to know
| Character | Decimal | Binary (8-bit) |
|---|---|---|
| 'A' | 65 | 01000001 |
| 'B' | 66 | 01000010 |
| 'Z' | 90 | 01011010 |
| 'a' | 97 | 01100001 |
| '0' | 48 | 00110000 |
| '9' | 57 | 00111001 |
Pattern: lowercase = uppercase + 32 (one bit difference in bit position 5).
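This off-by-32 pattern is easy to check for yourself. A short Python sketch (illustrative, not part of the exam specification) using `ord()` to get a character's code and XOR to flip bit 5:

```python
# ord() gives a character's code; chr() does the reverse.
print(ord('A'))             # 65
print(ord('a') - ord('A'))  # 32

# Flipping bit 5 (value 32) switches the case of a letter.
print(chr(ord('A') ^ 0b100000))  # 'a'
print(chr(ord('z') ^ 0b100000))  # 'Z'
```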
Calculating text file size (ASCII)
- Each character = 1 byte (8 bits).
- "Hello" = 5 characters = 5 bytes = 40 bits.
- A 1,000-character document ≈ 1,000 bytes ≈ 1 KB.
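The calculation above can be sketched as a small Python function (the function name is invented for illustration):

```python
def ascii_file_size(text):
    """Size of text stored in ASCII: 1 byte (8 bits) per character."""
    size_bytes = len(text)       # one byte per character
    size_bits = size_bytes * 8   # 8 bits per byte
    return size_bytes, size_bits

print(ascii_file_size("Hello"))  # (5, 40)
```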
Unicode
ASCII cannot represent characters from non-English languages (Chinese, Arabic, Hindi, emoji, etc.). Unicode was created to solve this.
- UTF-8: variable-length encoding. ASCII characters use 1 byte; other characters use 2–4 bytes. Most common on the web. Compatible with ASCII for the first 128 values.
- UTF-16: uses 2 bytes (16 bits) for most characters → 2¹⁶ = 65,536 base characters; surrogate pairs extend to 1,114,112 total.
- UTF-32: fixed 4 bytes per character → very large files but simple encoding.
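UTF-8's variable length can be demonstrated in Python by encoding individual characters and counting the bytes (the sample characters are chosen for illustration):

```python
# UTF-8 uses 1 byte for ASCII and 2-4 bytes for everything else.
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A 1  (ASCII)
# é 2  (Latin accented letter)
# € 3  (currency symbol)
# 😀 4  (emoji)
```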
How Unicode extends ASCII
- The first 128 Unicode code points (U+0000 to U+007F) are identical to ASCII.
- Unicode has over 1.1 million code points (1,114,112), covering all world scripts, emoji and mathematical symbols.
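The backwards compatibility can be shown in Python: any pure-ASCII byte sequence decodes to the same text whether it is treated as ASCII or UTF-8 (a quick sketch, not an exam requirement):

```python
# 'A' has code point U+0041 = 65 denary, same as its ASCII code.
print(hex(ord('A')))  # 0x41

# A pure-ASCII byte sequence decodes identically under both encodings.
data = b"Hello"
print(data.decode("ascii") == data.decode("utf-8"))  # True
```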
Storage comparison
- ASCII / UTF-8 (English): 1 byte per character.
- UTF-16: 2 bytes per character (for most characters).
- UTF-32: 4 bytes per character.
- A Unicode emoji (e.g. a smiley face) in UTF-8: 4 bytes.
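The comparison above can be confirmed by encoding the same English text in each encoding. Python's `"utf-16-le"` and `"utf-32-le"` codecs are used here because they omit the byte-order mark, keeping the counts clean:

```python
text = "Hello"  # 5 characters, all ASCII
print(len(text.encode("utf-8")))      # 5 bytes  (1 per character)
print(len(text.encode("utf-16-le")))  # 10 bytes (2 per character)
print(len(text.encode("utf-32-le")))  # 20 bytes (4 per character)
```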
Why does character encoding matter?
- Different systems using different encodings can produce garbled characters (mojibake).
- A file saved in ASCII and opened as UTF-8 displays correctly, because UTF-8's first 128 values match ASCII exactly.
- A file saved in UTF-8 with non-ASCII characters and opened in a system expecting Latin-1 may display incorrectly.
- Modern systems almost universally use UTF-8.
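Mojibake can be reproduced in a couple of lines of Python: encode text as UTF-8, then decode the bytes with the wrong encoding (Latin-1 here, as in the example above):

```python
text = "café"
utf8_bytes = text.encode("utf-8")     # 'é' becomes two bytes: 0xC3 0xA9

# Decoding UTF-8 bytes as Latin-1 misreads each byte as one character.
print(utf8_bytes.decode("latin-1"))   # cafÃ©  (mojibake)

# Pure ASCII text survives the same mismatch unharmed.
print("cafe".encode("ascii").decode("latin-1"))  # cafe
```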
Common OCR exam mistakes
- Saying ASCII uses 8 bits — standard ASCII is 7-bit (128 characters). Extended ASCII is 8-bit.
- Saying "Unicode uses 2 bytes" — Unicode has multiple encodings (UTF-8, UTF-16, UTF-32) with different byte sizes. UTF-8 is variable-length.
- Forgetting that Unicode is backwards-compatible with ASCII for the first 128 characters.
- Confusing the character code with the binary representation — 'A' has code 65 (denary) = 01000001 (binary).
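The last distinction is worth practising: the character code is a number, and the binary is just that number written in base 2. A quick Python check:

```python
code = ord('A')             # 65 — the denary character code
print(format(code, '08b'))  # 01000001 — the same code as 8-bit binary
print(chr(0b01000001))      # A — converting back from binary
```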