Sampling — populations, samples and bias
Statistics is the science of saying useful things about a population (everyone or everything you care about) using a sample (a subset you actually measure).
Population vs sample
- Population: the entire group you want to learn about (e.g. all GCSE students in the UK).
- Sample: the actual subset you collect data from (e.g. 200 students from one school).
You compute statistics (mean, median, etc.) on the sample and use them to estimate the corresponding parameters of the population.
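The distinction can be sketched in a few lines of Python — here with a hypothetical, randomly generated population of 600 shoe sizes (illustrative data only):

```python
import random

random.seed(0)

# Hypothetical population: shoe sizes of 600 students (illustrative data).
population = [random.randint(3, 12) for _ in range(600)]
population_mean = sum(population) / len(population)   # the parameter (mu)

# A simple random sample of 60 students.
sample = random.sample(population, 60)
sample_mean = sum(sample) / len(sample)               # the statistic (x-bar)

print(f"population mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

The sample mean is an estimate of the population mean; a different random draw would give a slightly different value.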
Why sample at all?
Studying the entire population is usually too expensive, slow or impossible (think nationwide surveys, infinite continuous data, destructive testing). Sampling lets you make defensible claims using a fraction of the data.
What makes a good sample?
A good sample is:
- Random — every member of the population has a known, non-zero chance of being chosen.
- Representative — reflects the structure of the population.
- Large enough — bigger samples give more reliable estimates (Law of Large Numbers).
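The "large enough" point is the Law of Large Numbers in action. A minimal sketch, using simulated rolls of a fair die (population mean 3.5):

```python
import random

random.seed(1)

# Law of Large Numbers sketch: as the sample grows, the sample mean
# settles towards the population mean (a fair die has mean 3.5).
for n in (10, 100, 10_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)
```

With 10 rolls the mean can be far from 3.5; with 10,000 it is reliably close.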
Sampling methods
- Simple random sampling: every member has equal probability of selection (e.g. names from a hat, random number generator).
- Systematic sampling: pick every k-th member from an ordered list (e.g. every 10th name on the register).
- Stratified sampling: divide the population into strata (e.g. year groups) and sample proportionally from each.
- Cluster sampling: pick whole clusters (classes, schools) and sample everyone in them.
- Convenience sampling: ask whoever happens to be around — quick but typically biased.
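Three of these methods can be compared directly on a toy register — the names and two-year-group split below are illustrative assumptions, not real data:

```python
import random

random.seed(2)

# Hypothetical register: 40 pupils across two year groups (illustrative).
register = [f"Y10-{i}" for i in range(20)] + [f"Y11-{i}" for i in range(20)]

# Simple random sampling: every pupil equally likely to be chosen.
simple = random.sample(register, 8)

# Systematic sampling: every k-th pupil from the ordered list.
k = len(register) // 8
systematic = register[::k][:8]

# Stratified sampling: sample proportionally within each year group.
y10 = [p for p in register if p.startswith("Y10")]
y11 = [p for p in register if p.startswith("Y11")]
stratified = random.sample(y10, 4) + random.sample(y11, 4)

print(simple, systematic, stratified, sep="\n")
```

Note how the stratified draw guarantees 4 pupils per year group, while the simple random draw merely tends to be balanced.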
Bias — when samples mislead
A biased sample systematically over- or under-represents some part of the population. Common sources:
- Selection bias: only certain people are reachable (e.g. online survey misses people without internet).
- Self-selection bias: only people who care strongly respond (e.g. complaint surveys).
- Survivorship bias: only "survivors" are visible (e.g. studying successful start-ups).
- Non-response bias: people who refuse differ from those who participate.
A biased sample can give a confident but wrong answer no matter how big it is.
✦Worked example— Comparing methods
To estimate the mean shoe size of a school of 600 students:
- Convenience: ask the football team. Will over-represent larger sizes — biased.
- Stratified: pick numbers proportional to year groups. Reflects the school structure — usually best.
- Simple random: random pick of 60 from the register. Fine, but a small chance of an unrepresentative draw.
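The convenience-sample trap can be simulated. The numbers below (shoe sizes for the wider school vs the football team) are made up purely to illustrate the effect:

```python
import random

random.seed(4)

# Illustrative simulation: the football team (hypothetically larger shoe
# sizes) vs a simple random sample of the whole school.
school = [random.gauss(6, 1.5) for _ in range(540)]
football_team = [random.gauss(9, 1.0) for _ in range(60)]
everyone = school + football_team

random_sample = random.sample(everyone, 60)
convenience_sample = football_team  # just ask whoever is on the pitch

print("population mean:", sum(everyone) / 600)
print("random sample:  ", sum(random_sample) / 60)
print("convenience:    ", sum(convenience_sample) / 60)
```

The convenience estimate overshoots the population mean, and no amount of extra footballers would fix that.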
Sample size
Bigger is better, but not without limits.
- For a yes/no proportion to within a few percent: ≈ 400 is usually plenty.
- Trade-offs: cost, time, response rate, and diminishing returns. Doubling the sample size cuts the random error by only about 30% (1/√2).
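The diminishing-returns claim follows from the fact that the random error of a sample mean scales like 1/√n. A quick check, with the error at n = 100 taken as the baseline:

```python
import math

# Random error of a sample mean scales like 1 / sqrt(n), so doubling
# the sample shrinks it only by a factor of sqrt(2), about 29%.
def relative_error(n, base_n=100):
    return math.sqrt(base_n / n)

for n in (100, 200, 400, 800):
    print(n, round(relative_error(n), 3))
```

Going from 100 to 200 drops the error to about 0.71 of its original size; to halve it you must quadruple the sample to 400.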
⚠Common mistakes— Examiner traps
- Equating "large sample" with "representative". A million convenience-sampled responses can still be biased.
- Confusing sample mean with population mean. Use x̄ for the sample, μ for the population.
- Ignoring non-response. Reporting only respondents distorts the picture.
- Sampling from the wrong population. A sample of GCSE pupils tells you nothing about A-Level students.
- Using a tiny pilot sample as the final answer.
➜Try this— Quick check
A school wants to estimate the average daily commute time of its 800 pupils. Suggest a stratified-sample plan that involves 40 pupils across 4 year groups.
The year groups are equal in size (200 each), so stratify by year: pick 10 pupils at random from each group. Total = 40, balanced across years.
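The plan above can be sketched in code — the year-group labels and pupil IDs are hypothetical placeholders:

```python
import random

random.seed(3)

# Stratified plan: 800 pupils in 4 equal year groups (200 each);
# pick 10 pupils at random from each group, 40 in total.
school = {year: [f"{year}-{i}" for i in range(200)]
          for year in ("Y7", "Y8", "Y9", "Y10")}
plan = {year: random.sample(pupils, 10) for year, pupils in school.items()}

total = sum(len(chosen) for chosen in plan.values())
print(total)  # 40
```

Because every stratum contributes exactly 10 pupils, the sample automatically mirrors the school's year-group structure.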