Data compression
Files take up storage and bandwidth — both finite. Compression shrinks file size by removing redundancy. The two main approaches at GCSE are lossless and lossy compression.
Why compress?
- Storage — fit more in the same space.
- Bandwidth — faster downloads / streaming.
- Cost — less hosting and traffic.
Lossless compression
The compressed file decompresses to an exact copy of the original; no bits are lost. Used when every bit matters: text, source code, executables, lossless image formats (PNG, GIF), ZIP archives.
How it works (in spirit):
- Run-length encoding (RLE). Replace runs of repeated values with the value and a count: AAAAAA→6A. Excellent for images with large flat areas.
- Dictionary encoding (LZ77, LZW). Build a dictionary of frequently occurring patterns and replace each with a short code. LZ77 is used in ZIP, GZIP and PNG.
- Huffman coding. Common characters get short codes; rare characters get long codes.
Typical ratio: roughly 30-70% size reduction for text; little or none for random or already-compressed data.
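The two properties above (exact reconstruction, and poor results on random data) can be checked with Python's standard `zlib` module. This is only a sketch: the sample sentence is made up and deliberately repetitive, so it shrinks far more than real text would, and `os.urandom` stands in for data with no redundancy.

```python
import os
import zlib

# Repetitive text: plenty of redundancy for a lossless compressor to exploit.
text = ("the cat sat on the mat. " * 200).encode("utf-8")
# Random bytes of the same length: no patterns to exploit.
noise = os.urandom(len(text))

packed_text = zlib.compress(text)
packed_noise = zlib.compress(noise)

# Lossless means decompression returns exactly the original bytes.
assert zlib.decompress(packed_text) == text
assert zlib.decompress(packed_noise) == noise

print(len(text), "->", len(packed_text))    # large saving
print(len(noise), "->", len(packed_noise))  # no real saving
```

The random data usually comes out a few bytes larger than it went in, because the compressor still has to add its own header.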
Lossy compression
Discards data the human eye or ear can't easily detect. Files are smaller at the cost of permanent quality loss: once a file has been compressed, the discarded data cannot be recovered.
- JPEG for photos: discards fine colour detail and subtle variation within small blocks of pixels.
- MP3, AAC, Opus for audio: remove frequencies most listeners can't hear (encoders often cut the very top of the range, above roughly 16 kHz) and quiet sounds masked by louder ones.
- MP4 video (H.264, H.265 codecs): combines lossy image and audio compression with motion prediction between frames.
Typical ratio: around 90% size reduction for photos and audio, depending on the quality setting chosen.
Which to use?
| Scenario | Best choice | Why |
|---|---|---|
| Source code, legal documents | Lossless | Every bit must be exact |
| Photo album for the web | Lossy | Small files; quality loss barely visible |
| CT scan medical images | Lossless | Detail must survive |
| Music streaming | Lossy | Bandwidth matters more than perfection |
| Logo / icon | Lossless | Sharp edges and colours preserved |
| Camera raw → social post | Lossy | Acceptable quality loss for size |
Worked example: RLE
Compress AAAABBBCCDAA using RLE.
Result: 4A3B2C1D2A. That is 12 characters → 10 characters: only a small saving, because the runs are short.
Now: AAAAAAAAAAAAAAAA (16 A's) → 16A — 3 characters from 16. Excellent.
RLE only pays off when there are long runs. Text with few repeats actually grows under RLE: ABCD becomes 1A1B1C1D.
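The worked example above can be reproduced with a few lines of Python. This is a minimal sketch using the count-then-character format from the example; the function names `rle_encode` and `rle_decode` are illustrative, and the decoder assumes the original text contains no digits.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Encode each run as its length followed by the character, e.g. AAAA -> 4A."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

def rle_decode(encoded: str) -> str:
    """Reverse rle_encode. Assumes the original text contained no digits."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                  # build up multi-digit counts such as 16
        else:
            out.append(ch * int(count))  # expand the run
            count = ""
    return "".join(out)

print(rle_encode("AAAABBBCCDAA"))  # 4A3B2C1D2A
print(rle_encode("A" * 16))        # 16A
print(rle_encode("ABCD"))          # 1A1B1C1D (longer than the input)
```

Decoding any of these outputs with `rle_decode` gives back the original string, which is what makes RLE lossless.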
Worked example: Huffman idea
Imagine a text where 'E' appears often, 'Z' rarely. Huffman might assign:
- 'E' → 10 (2 bits)
- 'Z' → 1110011 (7 bits)
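The average number of bits per character drops compared with a fixed-length code (e.g. 8 bits each), even though rare characters get longer codes than common ones.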
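To see where such codes could come from, here is a hedged sketch of the standard Huffman construction: repeatedly merge the two least frequent groups of characters. The function name `huffman_codes` and the sample letter counts are made up for illustration, so the exact codes it prints will differ from the ones shown above.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix-free code table: frequent symbols get shorter codes."""
    freq = Counter(text)
    if len(freq) == 1:                        # edge case: one distinct symbol
        return {sym: "0" for sym in freq}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # the two least frequent groups
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Made-up sample: 'E' is common, 'Z' is rare.
text = "E" * 40 + "T" * 20 + "A" * 15 + "Q" * 3 + "Z" * 1
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)

print(codes)
print(total_bits, "bits, versus", 8 * len(text), "bits at 8 bits per character")
```

No code in the table is the start of another code, which is why the decoder can split the bit stream back into characters without separators.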
Pitfalls
- Confusing lossy and lossless. Lossless = identical reconstruction; lossy = irreversible discard.
- Believing "compressed = smaller always". Already-compressed data (a JPEG, an MP3) doesn't compress further with general-purpose tools.
- Re-compressing lossy. Each save loses more quality (generation loss).
- Using lossy for the wrong content. A scanned legal document needs every pixel.
- Confusing compression with encryption. Compression makes files smaller; encryption makes them unreadable without a key.
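The second pitfall (compressed data does not compress again) can be demonstrated with `zlib` from the standard library. This is a sketch under assumptions: the word list and the fixed random seed are invented simply to produce some text-like sample data.

```python
import random
import zlib

random.seed(1)  # fixed seed so the made-up sample text is the same each run
words = ["data", "file", "image", "sound", "pixel", "byte", "code", "run"]
original = " ".join(random.choice(words) for _ in range(5000)).encode("utf-8")

once = zlib.compress(original)  # first pass removes most of the redundancy
twice = zlib.compress(once)     # second pass has almost nothing left to remove

print(len(original), "->", len(once), "->", len(twice))
```

The first pass gives a large saving; the second pass should give little or none, and can even add a few bytes of overhead.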
Visual: lossy is one-way
ORIGINAL ─→ compress ─→ SMALLER LOSSY FILE ─→ decompress ─→ APPROXIMATION
You can't recover the original from the compressed version.
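The one-way nature of lossy compression can be shown with a toy example. This is not how JPEG or MP3 actually work; it is a simplified sketch of one idea they rely on (quantisation, i.e. storing values less precisely). The sample values and the step size of 16 are assumptions chosen for illustration.

```python
# Hypothetical 8-bit brightness samples (0-255) from one row of an image.
original = [12, 13, 15, 14, 200, 203, 201, 202]

STEP = 16  # assumed quantisation step: a larger step saves more but loses more

# "Compress": keep only which 16-wide band each sample falls in (4 bits, not 8).
compressed = [value // STEP for value in original]

# "Decompress": rebuild each sample as the middle of its band.
restored = [level * STEP + STEP // 2 for level in compressed]

print(compressed)  # [0, 0, 0, 0, 12, 12, 12, 12]
print(restored)    # [8, 8, 8, 8, 200, 200, 200, 200]
```

The restored values are close to the originals but not equal to them, and nothing in `compressed` says whether a sample was 12, 13, 14 or 15. The quantised values also form long runs, which a lossless method such as RLE could then shrink further.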
Worked example: choose wisely
A web designer needs to put a 12 MP photo on a webpage. Should they use PNG (lossless) or JPEG (lossy)?
For a photo on a webpage:
- File size matters (faster page load, less data for visitors).
- Slight quality loss is invisible at typical viewing distances.
- Choose JPEG for an order-of-magnitude smaller file.
For a logo with sharp edges and few colours:
- Lossy compression creates ugly artefacts around edges.
- Lossless compression already makes the file small, because large areas of flat colour compress well.
- Choose PNG for crisp output.
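For the 12 MP photo, a rough size estimate shows why the choice matters. The sketch below assumes 24-bit colour and applies the typical 90% lossy reduction quoted earlier in these notes. It compares against the uncompressed size, since the size of a PNG depends heavily on the image content.

```python
MEGAPIXELS = 12
BYTES_PER_PIXEL = 3     # assumption: 24-bit colour (8 bits each for R, G, B)
LOSSY_REDUCTION = 0.90  # the typical lossy figure quoted in these notes

pixels = MEGAPIXELS * 1_000_000
raw_bytes = pixels * BYTES_PER_PIXEL
jpeg_bytes = raw_bytes * (1 - LOSSY_REDUCTION)

print(f"Uncompressed: {raw_bytes / 1_000_000:.0f} MB")            # 36 MB
print(f"After 90% reduction: {jpeg_bytes / 1_000_000:.1f} MB")    # 3.6 MB
```

Real JPEG sizes vary with the quality setting and the picture itself, so treat the second figure as an estimate only.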
Quick check
State whether each is lossless or lossy:
- ZIP archive — lossless.
- JPEG — lossy.
- MP3 — lossy.
- PNG — lossless.
- MP4 video — lossy.
- FLAC audio — lossless.