Scatter graphs, correlation and causation
When you have paired numerical data (e.g. each person's height and weight), a scatter graph shows the relationship between the two variables. From it you can describe the correlation and, with care, make predictions.
Plotting a scatter graph
Each data point is a single (x, y) pair. The x-variable goes on the horizontal axis (the "explanatory" or "input" variable), the y-variable on the vertical axis (the "response" or "output" variable).
Example: heights and weights of 8 students. Plot each student as one point.
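To see how each (x, y) pair becomes one point, here is a minimal Python sketch that draws a rough character-grid scatter. The heights and weights are made-up values for illustration only (they are not data from these notes), and `text_scatter` is a hypothetical helper name.

```python
# Hypothetical data for 8 students: (height in m, weight in kg).
# These values are invented for illustration only.
students = [
    (1.50, 45), (1.55, 50), (1.60, 54), (1.65, 58),
    (1.70, 63), (1.75, 68), (1.80, 72), (1.90, 80),
]


def text_scatter(points, width=40, height=12):
    """Return rows of text with one '*' per (x, y) point.

    x runs left to right (horizontal axis); y runs bottom to top (vertical axis).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to 1 so a constant variable does not divide by zero.
    x_span = (max(xs) - x_min) or 1
    y_span = (max(ys) - y_min) or 1

    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return ["".join(cells) for cells in grid]


for line in text_scatter(students):
    print(line)
```

The shortest, lightest student lands bottom-left and the tallest, heaviest top-right, which is the upward-sloping pattern you would expect to see on paper.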
Correlation — what to describe
When you "describe the correlation" you give:
- Direction: positive (y rises as x rises), negative (y falls as x rises), or none.
- Strength: strong (points cluster tightly around a line), moderate, or weak.
- Form: usually linear; occasionally curved.
Examples:
- Height & weight: strong positive correlation.
- Hours of TV & exam mark: weak to moderate negative correlation.
- Shoe size & maths mark: no correlation.
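Direction and strength are normally judged by eye, but they can also be put into numbers. The sketch below computes Pearson's correlation coefficient r (a standard measure of linear correlation, not something these notes require) and turns it into a description. The cut-off values 0.2, 0.4 and 0.7 are illustrative assumptions, not official boundaries, and the data is made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_correlation(r):
    """Turn r into words. The thresholds are assumed, illustrative cut-offs."""
    size = abs(r)
    if size < 0.2:
        return "no correlation"
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


# Hypothetical heights (m) and weights (kg), invented for illustration.
heights_m = [1.50, 1.55, 1.60, 1.65, 1.70, 1.75, 1.80, 1.90]
weights_kg = [45, 50, 54, 58, 63, 68, 72, 80]

print(describe_correlation(pearson_r(heights_m, weights_kg)))
```

Note that r only measures how close the points are to a straight line, so a clearly curved pattern can give a small r even when the relationship is obvious on the graph.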
Line of best fit
Draw a straight line that approximately balances the points above and below it. Use it to:
- Estimate y for a given x (or vice versa).
- Estimate a rough gradient and intercept (the gradient is the approximate change in y per unit change in x).
⚠ Don't extrapolate far beyond the data. The relationship may break down outside the observed range.
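In an exam the line is drawn by eye. As a point of comparison, the sketch below calculates a line of best fit using least squares, which is one common computed method and is swapped in here for the by-eye line. The data is made up and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares gradient and intercept for y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


# Hypothetical heights (m) and weights (kg), invented for illustration.
heights_m = [1.50, 1.55, 1.60, 1.65, 1.70, 1.75, 1.80, 1.90]
weights_kg = [45, 50, 54, 58, 63, 68, 72, 80]

gradient, intercept = fit_line(heights_m, weights_kg)

# Estimate y for a given x that lies inside the observed range.
estimate = gradient * 1.68 + intercept

print(f"gradient: {gradient:.1f} kg per m, intercept: {intercept:.1f} kg")
print(f"estimated weight at 1.68 m: {estimate:.1f} kg")
```

With these made-up values the intercept comes out negative, which has no physical meaning: x = 0 is far outside the data, so reading anything into it would be exactly the kind of extrapolation warned about above.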
Interpolation vs extrapolation
- Interpolation: estimating within the range of the data → usually safe.
- Extrapolation: estimating outside the range of the data → risky; the trend might not continue.
If the data covers heights 1.5–1.9 m, predicting weight at 2.5 m is extrapolation and is not justified.
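The check itself is simple enough to write down. This minimal sketch uses the 1.5–1.9 m range from the example above; `classify_estimate` is a hypothetical helper name.

```python
def classify_estimate(x, observed_xs):
    """Say whether estimating at x is interpolation or extrapolation."""
    low, high = min(observed_xs), max(observed_xs)
    if low <= x <= high:
        return "interpolation"
    return "extrapolation"


# Only the ends of the observed range matter for this check.
observed_heights_m = [1.5, 1.9]

print(classify_estimate(1.7, observed_heights_m))  # → interpolation
print(classify_estimate(2.5, observed_heights_m))  # → extrapolation
```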
Correlation ≠ causation
A correlation says two variables move together; it does NOT say one causes the other.
Reasons a correlation can appear without one variable directly causing the other:
- Common cause (third variable): ice-cream sales correlate with drowning rates because both depend on hot weather, not because ice cream causes drowning.
- Reverse causation: the cause may run in the opposite direction to the one you assume. A may cause B, or B may cause A; without further evidence we can't tell which.
- Coincidence: in small data sets, a correlation may be a fluke.
To establish causation you typically need a controlled experiment or strong domain knowledge.
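The common-cause idea can be illustrated with a small simulation. In the sketch below, temperature drives both ice-cream sales and the drowning rate, and neither of those two variables appears in the other's formula. Every number (the temperature range, the multipliers 20 and 0.3, the noise levels) is an invented assumption chosen only to make the effect visible.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hot weather is the common cause: it feeds into both variables.
temps_c = [rng.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [20 * t + rng.gauss(0, 30) for t in temps_c]
drowning_rate = [0.3 * t + rng.gauss(0, 1) for t in temps_c]

r_raw = pearson_r(ice_cream_sales, drowning_rate)

# Subtract the (known, simulated) temperature effect from each variable.
sales_adjusted = [s - 20 * t for s, t in zip(ice_cream_sales, temps_c)]
rate_adjusted = [d - 0.3 * t for d, t in zip(drowning_rate, temps_c)]
r_adjusted = pearson_r(sales_adjusted, rate_adjusted)

print(f"raw correlation: {r_raw:.2f}")
print(f"after removing the temperature effect: {r_adjusted:.2f}")
```

The raw correlation is strongly positive even though the code never links the two variables directly, and it largely disappears once the temperature effect is taken out. With real data the size of the third variable's effect is not known in advance, which is part of why a controlled experiment is usually needed.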
Examiner-style correlation phrasing
"There is a strong positive correlation between hours studied and test score, indicating that students who studied more tended to score higher. However, this does not prove that studying causes higher marks — there may be other factors (e.g. interest in the subject, prior knowledge) influencing both."
Outliers in scatter graphs
A point well away from the line of best fit may indicate:
- A measurement or recording error.
- A genuinely unusual case.
Comment on the outlier, but don't silently delete it.
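One possible way to flag candidates for comment, sketched below, is to fit a least-squares line and pick out points whose vertical distance from it (the residual) is unusually large. The rule "more than 2 times the typical residual" is an assumed, illustrative threshold, the data is made up, and `fit_line` and `flag_outliers` are hypothetical helper names. The function only reports points; it does not remove them.

```python
import math


def fit_line(xs, ys):
    """Least-squares gradient and intercept for y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    return gradient, mean_y - gradient * mean_x


def flag_outliers(points, k=2.0):
    """Return points whose residual exceeds k times the RMS residual.

    k = 2.0 is an assumed cut-off for illustration, not a standard rule.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    gradient, intercept = fit_line(xs, ys)
    residuals = [y - (gradient * x + intercept) for x, y in points]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [p for p, r in zip(points, residuals) if abs(r) > k * rms]


# Invented data: y = 2x everywhere except the point at x = 4.
points = [(1, 2), (2, 4), (3, 6), (4, 20), (5, 10), (6, 12), (7, 14), (8, 16)]

print(flag_outliers(points))  # → [(4, 20)]
```

An outlier also pulls the fitted line towards itself, so a check like this can miss points in small data sets; looking at the graph remains the first step.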
⚠ Common mistakes (examiner traps)
- Saying "correlation" when there's none — sometimes the answer really is "no correlation".
- Confusing direction with strength. Negative ≠ weak.
- Using the line of best fit far beyond the data.
- Inferring causation from correlation alone.
- Treating one outlier as the whole story — comment, but report the broader pattern.
➜ Try this: Quick check
A scatter of "ice-cream sales" vs "shark attacks per beach day" shows strong positive correlation. Does eating ice cream cause shark attacks? Why or why not?
Answer: No. The correlation is real, but both variables likely depend on a third: hot weather brings more swimmers AND more ice-cream consumption. This is a classic example of a confounding variable.