Character encoding: ASCII and Unicode
Computers can only store binary numbers, so character encoding systems assign a unique binary code to each character (letter, digit, punctuation mark or symbol), allowing text to be stored and transmitted digitally. OCR J277 requires you to understand ASCII and Unicode, their storage requirements, and how to convert between character codes and binary.
ASCII (American Standard Code for Information Interchange)
- Uses 7 bits to represent each character → 2⁷ = 128 possible characters (codes 0–127).
- Extended ASCII uses 8 bits → 256 characters, adding accented letters and special symbols.
- Covers: uppercase A–Z, lowercase a–z, digits 0–9, punctuation, control characters.
Key ASCII values to know
| Character | Decimal | Binary (8-bit) |
|---|---|---|
| 'A' | 65 | 01000001 |
| 'B' | 66 | 01000010 |
| 'Z' | 90 | 01011010 |
| 'a' | 97 | 01100001 |
| '0' | 48 | 00110000 |
| '9' | 57 | 00111001 |
Pattern: lowercase = uppercase + 32 (one bit difference in bit position 5).
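This off-by-32 pattern is easy to check for yourself. A short Python sketch (illustrative, not part of the exam specification) using `ord()` to get a character's code and XOR to flip bit 5:

```python
# ord() gives a character's code; chr() does the reverse.
print(ord('A'))             # 65
print(ord('a') - ord('A'))  # 32

# Flipping bit 5 (value 32) switches the case of a letter.
print(chr(ord('A') ^ 0b100000))  # 'a'
print(chr(ord('z') ^ 0b100000))  # 'Z'
```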
Calculating text file size (ASCII)
- Each character = 1 byte (8 bits).
- "Hello" = 5 characters = 5 bytes = 40 bits.
- A 1,000-character document ≈ 1,000 bytes ≈ 1 KB.
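The calculation above can be sketched as a small Python function (the function name is invented for illustration):

```python
def ascii_file_size(text):
    """Size of text stored in ASCII: 1 byte (8 bits) per character."""
    size_bytes = len(text)       # one byte per character
    size_bits = size_bytes * 8   # 8 bits per byte
    return size_bytes, size_bits

print(ascii_file_size("Hello"))  # (5, 40)
```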
Unicode
ASCII cannot represent characters from non-English languages (Chinese, Arabic, Hindi, emoji, etc.). Unicode was created to solve this.
- UTF-8: variable-length encoding. ASCII characters use 1 byte; other characters use 2–4 bytes. Most common on the web. Compatible with ASCII for the first 128 values.
- UTF-16: uses 2 bytes (16 bits) for most characters → 2¹⁶ = 65,536 base characters; surrogate pairs extend to 1,114,112 total.
- UTF-32: fixed 4 bytes per character → very large files but simple encoding.
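UTF-8's variable length can be demonstrated in Python by encoding individual characters and counting the bytes (the sample characters are chosen for illustration):

```python
# UTF-8 uses 1 byte for ASCII and 2-4 bytes for everything else.
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A 1  (ASCII)
# é 2  (Latin accented letter)
# € 3  (currency symbol)
# 😀 4  (emoji)
```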
How Unicode extends ASCII
- The first 128 Unicode code points (U+0000 to U+007F) are identical to ASCII.
- Unicode has over 1.1 million code points (1,114,112), covering all world scripts, emoji and mathematical symbols.
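The backwards compatibility can be shown in Python: any pure-ASCII byte sequence decodes to the same text whether it is treated as ASCII or UTF-8 (a quick sketch, not an exam requirement):

```python
# 'A' has code point U+0041 = 65 denary, same as its ASCII code.
print(hex(ord('A')))  # 0x41

# A pure-ASCII byte sequence decodes identically under both encodings.
data = b"Hello"
print(data.decode("ascii") == data.decode("utf-8"))  # True
```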
Storage comparison
- ASCII / UTF-8 (English): 1 byte per character.
- UTF-16: 2 bytes per character (for most characters).
- UTF-32: 4 bytes per character.
- A Unicode emoji (e.g. a smiley face) in UTF-8: 4 bytes.
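The comparison above can be confirmed by encoding the same English text in each encoding. Python's `"utf-16-le"` and `"utf-32-le"` codecs are used here because they omit the byte-order mark, keeping the counts clean:

```python
text = "Hello"  # 5 characters, all ASCII
print(len(text.encode("utf-8")))      # 5 bytes  (1 per character)
print(len(text.encode("utf-16-le")))  # 10 bytes (2 per character)
print(len(text.encode("utf-32-le")))  # 20 bytes (4 per character)
```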
Why does character encoding matter?
- Different systems using different encodings can produce garbled characters (mojibake).
- A file saved in ASCII and opened as UTF-8 displays correctly, because UTF-8's first 128 values match ASCII exactly.
- A file saved in UTF-8 with non-ASCII characters and opened in a system expecting Latin-1 may display incorrectly.
- Modern systems almost universally use UTF-8.
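Mojibake can be reproduced in a couple of lines of Python: encode text as UTF-8, then decode the bytes with the wrong encoding (Latin-1 here, as in the example above):

```python
text = "café"
utf8_bytes = text.encode("utf-8")     # 'é' becomes two bytes: 0xC3 0xA9

# Decoding UTF-8 bytes as Latin-1 misreads each byte as one character.
print(utf8_bytes.decode("latin-1"))   # cafÃ©  (mojibake)

# Pure ASCII text survives the same mismatch unharmed.
print("cafe".encode("ascii").decode("latin-1"))  # cafe
```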
Common OCR exam mistakes
- Saying ASCII uses 8 bits — standard ASCII is 7-bit (128 characters). Extended ASCII is 8-bit.
- Saying "Unicode uses 2 bytes" — Unicode has multiple encodings (UTF-8, UTF-16, UTF-32) with different byte sizes. UTF-8 is variable-length.
- Forgetting that Unicode is backwards-compatible with ASCII for the first 128 characters.
- Confusing the character code with the binary representation — 'A' has code 65 (denary) = 01000001 (binary).
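The last distinction is worth practising: the character code is a number, and the binary is just that number written in base 2. A quick Python check:

```python
code = ord('A')             # 65 — the denary character code
print(format(code, '08b'))  # 01000001 — the same code as 8-bit binary
print(chr(0b01000001))      # A — converting back from binary
```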