Statistics

KS4

MA-KS4-D006

Collecting, interpreting and comparing data using graphical representations, measures of average and spread, sampling methods, and correlation.

National Curriculum context

Statistics at KS4 develops pupils' statistical literacy: the ability to collect, represent, interpret and evaluate data in order to answer questions, make decisions and identify bias. Pupils work with a wider range of graphical representations than at KS3, including cumulative frequency graphs, box plots and histograms with unequal class widths, and apply formal measures of spread including interquartile range. The curriculum requires pupils to compare distributions using summary statistics and graphical displays, to understand correlation and use it to make predictions (while recognising the distinction from causation), and to critique statistical claims and the quality of data collection methods. Sampling, including random, systematic and stratified methods, is introduced so that pupils understand the relationship between sample and population. Higher tier pupils additionally estimate the mean from grouped frequency tables and interpolate the median from cumulative frequency graphs.

Concepts: 4
Clusters: 3
Prerequisites: 2
With difficulty levels: 4
AI Direct: 2
AI Facilitated: 2

Lesson Clusters

1

Understand sampling methods and collect data appropriately

introduction Curated

Data collection and sampling (random, systematic, stratified; bias identification) is the foundational GCSE statistics concept — understanding how data is gathered before it can be analysed.

1 concept Evidence and Argument
2

Construct and interpret statistical diagrams and calculate averages

practice Curated

Statistical diagrams (histograms, cumulative frequency, box plots) and averages/measures of spread (mean, median, mode, IQR) are the core GCSE statistical analysis cluster.

2 concepts Patterns
3

Describe and interpret bivariate data using scatter graphs and correlation

practice Curated

Scatter graphs, correlation (positive/negative/none) and lines of best fit represent the bivariate data and relationship strand, distinct from univariate distribution analysis.

1 concept Patterns

Prerequisites

Concepts from other domains that pupils should know before this domain.

Domain Vocabulary

46 terms across 4 concepts (46 domain-specific) (5 shared)

Domain-specific (46)
Concept
T3

association(noun)

A relationship or connection between two variables shown in statistical data.

T3

average(noun)

A single value that represents a typical or central value of a data set; usually refers to the mean.

T3

bar chart(noun)

A graph that uses rectangular bars of different heights to compare quantities across categories.

T3

bias(noun)

A systematic error in data collection that makes results unrepresentative of the population.

T3

bivariate(adjective)

Involving two variables; bivariate data has paired values for two different measurements.

T3

box plot(noun)

A diagram showing the distribution of data using the minimum, lower quartile, median, upper quartile, and maximum.

T3

causation(noun)

A direct cause-and-effect relationship where one variable actually produces a change in another.

T3

census(noun)

A survey that collects data from every member of a population, not just a sample.

T3

class width(noun)

The range covered by a single class interval, found by subtracting the lower boundary from the upper boundary.

T3

correlation(noun)

A statistical relationship between two variables shown on a scatter graph; can be positive, negative, or none.

T3

cumulative frequency(noun)

A running total of frequencies, showing how many data values fall at or below each class boundary.

T3

data collection(noun)

The process of gathering information in a structured way for statistical analysis.

T3

extrapolation(noun)

Estimating a value outside the range of known data by extending a trend, which is less reliable than interpolation.

T3

frequency density(noun)

Frequency divided by class width; used as the y-axis in histograms with unequal class widths.

T3

frequency polygon(noun)

A line graph drawn by connecting the midpoints of the tops of bars in a frequency diagram.

T3

frequency table(noun)

A table showing how often each value or range of values occurs in a data set.

T3

grouped data(noun)

Data organised into class intervals rather than listed as individual values.

T3

histogram(noun)

A graph for continuous data where bars have no gaps and the area of each bar represents the frequency.

T3

hypothesis(noun)

A proposed explanation or prediction to be tested using data or mathematical reasoning.

T3

interpolation(noun)

Estimating a value within the range of known data, which is more reliable than extrapolation.

T3

interquartile range(noun)

The difference between the upper quartile (Q3) and lower quartile (Q1); measures the spread of the middle 50% of data.

Shared by 2 concepts

T3

line of best fit(noun)

A straight line drawn through the middle of data points on a scatter graph, showing the general trend.

T3

mean(noun)

A type of average found by adding all values in a data set and dividing by the number of values.

T3

measure of central tendency(noun)

A single value representing the centre or typical value of a data set: mean, median, or mode.

T3

median(noun)

The middle value when all data values are arranged in order from smallest to largest.

Shared by 2 concepts

T3

midpoint(noun)

The exact middle point between two positions, values, or coordinates.

T3

mode(noun)

The value that appears most frequently in a data set.

T3

negative correlation(noun)

A relationship where one variable increases as the other decreases, shown by a downward trend on a scatter graph.

T3

no correlation(noun)

No apparent relationship between two variables; scattered points on a scatter graph with no trend.

T3

outlier(noun)

A data value that is significantly different from the rest of the data set.

Shared by 2 concepts

T3

pie chart(noun)

A circular chart divided into sectors where each sector represents a proportion of the whole data set.

T3

population(noun)

The entire group about which statistical conclusions are to be drawn.

T3

positive correlation(noun)

A relationship where both variables increase together, shown by an upward trend on a scatter graph.

T3

quartile(noun)

A value that divides ordered data into four equal parts: Q1 (25%), Q2/median (50%), Q3 (75%).

Shared by 2 concepts

T3

questionnaire(noun)

A structured set of questions used to collect data from respondents.

T3

random sample(noun)

A sample where every member of the population has an equal chance of being selected.

T3

range(noun)

The difference between the largest and smallest values in a data set, showing how spread out the data is.

Shared by 2 concepts

T3

representative sample(noun)

A sample that accurately reflects the characteristics of the whole population.

T3

sample(noun)

A subset of a population selected for study, used to make inferences about the whole group.

T3

scatter graph(noun)

A graph plotting paired data as individual points to show the relationship between two variables.

T3

skew(noun)

When data distribution is not symmetrical; values are concentrated more on one side.

T3

spread(noun)

How widely data values are distributed; a data set with a large range has a wide spread.

T3

strata(noun)

Distinct subgroups within a population (e.g. year groups, genders) used in stratified sampling.

T3

stratified sampling(noun)

A sampling method where the population is divided into strata and a proportional number is randomly selected from each.

T3

systematic sampling(noun)

A sampling method where items are selected at regular intervals from an ordered list (e.g. every 10th person).

T3

upper class boundary(noun)

The highest value that can belong to a class interval in grouped data, used when plotting cumulative frequency.

Concepts (4)

Data Collection and Sampling

knowledge AI Direct

MA-KS4-C030

Understanding and applying sampling methods (random, systematic, stratified); identifying sources of bias; designing data collection methods including questionnaires and observation.

Teaching guidance

Use real-world contexts (election polling, quality control, medical trials) to motivate proper sampling. The distinction between population and sample should be made explicit from the beginning. Stratified sampling requires proportional allocation — pupils should calculate sample sizes from each stratum using the population ratio. Questionnaire design flaws (leading questions, ambiguous response categories) are worth analysing critically.

Vocabulary (12 terms)
bias T3 new — A systematic error in data collection that makes results unrepresentative of the population.
census T3 new — A survey that collects data from every member of a population, not just a sample.
data collection T3 new — The process of gathering information in a structured way for statistical analysis.
hypothesis T3 — A proposed explanation or prediction to be tested using data or mathematical reasoning.
population T3 new — The entire group about which statistical conclusions are to be drawn.
questionnaire T3 new — A structured set of questions used to collect data from respondents.
random sample T3 new — A sample where every member of the population has an equal chance of being selected.
representative sample T3 new — A sample that accurately reflects the characteristics of the whole population.
sample T3 new — A subset of a population selected for study, used to make inferences about the whole group.
strata T3 new — Distinct subgroups within a population (e.g. year groups, genders) used in stratified sampling.
stratified sampling T3 new — A sampling method where the population is divided into strata and a proportional number is randomly selected from each.
systematic sampling T3 new — A sampling method where items are selected at regular intervals from an ordered list (e.g. every 10th person).
Common misconceptions

Pupils confuse stratified and systematic sampling: stratified sampling draws proportionally from each group, whereas systematic sampling takes every nth member of an ordered list. Sample bias is difficult to identify without seeing the sampling process; pupils tend to assess bias only from the data. Many pupils believe a larger sample automatically eliminates bias, not recognising that a biased method stays biased at any scale: a larger sample only gives a more precise estimate of the wrong value.
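The difference between simple random and systematic sampling can be made concrete with a short sketch. The Python below is illustrative only: the register of 600 pupil IDs and the function names are hypothetical, and choosing a random start for the systematic sample is one common convention rather than something specified here.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Simple random sample: every member has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def systematic_sample(population, k, seed=None):
    """Systematic sample: random start, then every nth member of the ordered list."""
    rng = random.Random(seed)
    step = len(population) // k      # the sampling interval n
    start = rng.randrange(step)      # random start within the first interval
    return population[start::step][:k]


# Hypothetical ordered register of 600 pupil IDs (an assumption for illustration).
register = list(range(1, 601))

random_ids = simple_random_sample(register, 50, seed=1)
systematic_ids = systematic_sample(register, 50, seed=1)

# Gaps between consecutive systematic picks are all equal to the interval (12 here).
gaps = {b - a for a, b in zip(systematic_ids, systematic_ids[1:])}
```

The systematic sample has a single fixed gap between consecutive picks, while the random sample has no such pattern. Neither method uses subgroups, which is what distinguishes both from stratified sampling.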

Difficulty levels

Emerging

Understands the difference between a population and a sample, and can identify potential sources of bias in data collection.

Example task

A school wants to find out students' favourite lunch option. They survey 30 students from the football team. Explain why this sample may be biased.

Model response: The football team may not be representative of the whole school — they might prefer higher-calorie meals. A better sample would be a random selection from all year groups.

Developing

Selects and applies appropriate sampling methods (random, systematic, stratified) and designs data collection tools including questionnaires.

Example task

A school has 600 students: 200 in Year 7, 180 in Year 8, 120 in Year 9, 100 in Year 10. A stratified sample of 50 is needed. How many from each year?

Model response: Year 7: (200/600) x 50 = 16.7, round to 17. Year 8: (180/600) x 50 = 15. Year 9: (120/600) x 50 = 10. Year 10: (100/600) x 50 = 8.3, round to 8. Total: 17 + 15 + 10 + 8 = 50.
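The proportional allocation in this model response can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and simple rounding is assumed, which happens to sum to 50 here but is not guaranteed to do so for every set of strata.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Return the exact and rounded number to sample from each stratum."""
    total = sum(strata_sizes.values())
    exact = {name: size * sample_size / total for name, size in strata_sizes.items()}
    rounded = {name: round(value) for name, value in exact.items()}
    return exact, rounded


year_groups = {"Year 7": 200, "Year 8": 180, "Year 9": 120, "Year 10": 100}
exact, allocation = stratified_allocation(year_groups, 50)

# The rounded allocation should be checked against the intended sample size,
# because rounding each stratum separately can leave the total one over or under.
allocation_total = sum(allocation.values())
```

For the school in the task, `allocation` matches the 17, 15, 10 and 8 given above and `allocation_total` is 50.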

Secure

Evaluates sampling methods, identifies limitations of data collection, and designs strategies to minimise bias and improve validity.

Example task

A researcher wants to find out how much time teenagers spend on social media. Compare using an online survey versus face-to-face interviews. Which is better?

Model response: Online survey: larger sample, cheaper, but self-selection bias (only those online will respond, and they may use social media more). Respondents may also exaggerate or underestimate. Face-to-face: more accurate responses (can clarify questions), but smaller sample, more expensive, and social desirability bias (respondents may understate usage). A better approach might be combining a stratified random online survey with follow-up interviews for a subsample.

Mastery

Designs complete statistical investigations with appropriate hypotheses, sampling strategies, and evaluates the reliability and validity of conclusions drawn from data.

Example task

Design a statistical investigation to test whether students who eat breakfast perform better in morning tests. Include hypothesis, sampling, data collection and potential confounders.

Model response: Hypothesis: Students who eat breakfast score higher on morning tests than those who do not. Sampling: Stratified random sample across year groups, aiming for 100+ students to ensure adequate power. Data collection: Anonymous questionnaire on breakfast habits (eaten/not eaten, type), plus morning test scores from school records. Confounders: sleep quality, prior attainment, socioeconomic status (linked to both breakfast and attainment). These should be recorded and controlled for in analysis. Limitation: This is observational, not experimental — correlation does not imply causation. Students who eat breakfast may also have other advantages.

Delivery rationale

Secondary maths concept — abstract, procedural, and objectively assessable.

Statistical Diagrams

process AI Facilitated

MA-KS4-C031

Constructing and interpreting bar charts, pie charts, histograms (equal and unequal class widths), cumulative frequency graphs, and box plots.

Teaching guidance

Frequency density = frequency / class width is the key concept for histograms with unequal class widths — pupils must understand that it is the area, not the height, of each bar that represents frequency. Cumulative frequency graphs should always be plotted at the upper class boundary. Box plots represent the five-number summary (min, Q1, median, Q3, max) and allow distribution comparison visually.

Vocabulary (13 terms)
bar chart T3 — A graph that uses rectangular bars of different heights to compare quantities across categories.
box plot T3 new — A diagram showing the distribution of data using the minimum, lower quartile, median, upper quartile, and maximum.
class width T3 — The range covered by a single class interval, found by subtracting the lower boundary from the upper boundary.
cumulative frequency T3 new — A running total of frequencies, showing how many data values fall at or below each class boundary.
frequency density T3 — Frequency divided by class width; used as the y-axis in histograms with unequal class widths.
frequency polygon T3 new — A line graph drawn by connecting the midpoints of the tops of bars in a frequency diagram.
histogram T3 — A graph for continuous data where bars have no gaps and the area of each bar represents the frequency.
interquartile range T3 — The difference between the upper quartile (Q3) and lower quartile (Q1); measures the spread of the middle 50% of data.
median T3 — The middle value when all data values are arranged in order from smallest to largest.
pie chart T3 — A circular chart divided into sectors where each sector represents a proportion of the whole data set.
quartile T3 — A value that divides ordered data into four equal parts: Q1 (25%), Q2/median (50%), Q3 (75%).
range T3 — The difference between the largest and smallest values in a data set, showing how spread out the data is.
upper class boundary T3 new — The highest value that can belong to a class interval in grouped data, used when plotting cumulative frequency.
Common misconceptions

Pupils draw histograms with equal-width bars and label the y-axis 'frequency' even when class widths are unequal; frequency density is not intuitive. Cumulative frequency is plotted at midpoints rather than upper class boundaries. Box plots are confused with bar charts; the thickness of the box (perpendicular to the scale) has no meaning, and only the positions of the lines along the scale matter.

Difficulty levels

Emerging

Constructs and interprets bar charts, pie charts and pictograms accurately, including reading scales and comparing categories.

Example task

A pie chart shows exam results: A* = 36 degrees, A = 72 degrees, B = 108 degrees, C = 90 degrees, D = 54 degrees. There are 200 students. How many got a B?

Model response: B sector = 108 degrees. Fraction = 108/360 = 3/10. Number of students = 3/10 x 200 = 60.
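The same angle-to-frequency conversion can be applied to every sector as a check that the pie chart is consistent. The sketch below assumes only the angles and the total of 200 given in the task; the variable names are illustrative.

```python
sector_angles = {"A*": 36, "A": 72, "B": 108, "C": 90, "D": 54}
total_students = 200

# A valid pie chart has sector angles that sum to a full turn.
angle_sum = sum(sector_angles.values())

# Frequency = (angle / 360) x total, applied to each grade.
students = {grade: angle * total_students / 360 for grade, angle in sector_angles.items()}
```

Here the angles sum to 360 and the five frequencies sum to 200, which confirms that the answer of 60 for grade B is consistent with the rest of the chart.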

Developing

Constructs and interprets frequency polygons, cumulative frequency graphs (plotting at upper class boundaries), and reads quartiles from cumulative frequency.

Example task

Draw a cumulative frequency graph from: 0-10 (freq 4), 10-20 (freq 12), 20-30 (freq 18), 30-40 (freq 6). Estimate the median.

Model response: Cumulative frequencies: 4, 16, 34, 40. Plot at upper boundaries: (10, 4), (20, 16), (30, 34), (40, 40), starting the curve from (0, 0). Total = 40, so the median is at the 20th value. Reading across from 20 on the cumulative frequency axis gives a median of approximately 22.
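Reading a value from the cumulative frequency graph can be approximated numerically. The sketch below assumes the plotted points are joined by straight lines, starting from (0, 0); a hand-drawn smooth curve may give a slightly different reading. The helper name `read_across` is hypothetical.

```python
from itertools import accumulate

upper_bounds = [10, 20, 30, 40]
frequencies = [4, 12, 18, 6]
cumulative = list(accumulate(frequencies))  # running totals

# Points to plot: (upper class boundary, cumulative frequency), from (0, 0).
points = [(0, 0)] + list(zip(upper_bounds, cumulative))


def read_across(points, target):
    """Linearly interpolate the x-value where cumulative frequency reaches target."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 < c1 and c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the cumulative frequency range")


total = cumulative[-1]
median_estimate = read_across(points, total / 2)
lower_quartile = read_across(points, total / 4)
upper_quartile = read_across(points, 3 * total / 4)
```

Under the straight-line assumption the median estimate is about 22.2, and the same function gives the quartiles needed for the interquartile range.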

Secure

Constructs and interprets histograms with unequal class widths using frequency density, and compares distributions using box plots.

Example task

A histogram has these bars: class 0-5 with frequency density 2, class 5-15 with frequency density 3.5, class 15-20 with frequency density 4. Find the frequency for each class.

Model response: Frequency = frequency density x class width. Class 0-5: 2 x 5 = 10. Class 5-15: 3.5 x 10 = 35. Class 15-20: 4 x 5 = 20. Total = 65.
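The relationship frequency = frequency density x class width can be expressed directly in code. This is a minimal sketch using the three bars from the task; the data layout is an illustrative choice.

```python
# Each bar: ((lower boundary, upper boundary), frequency density)
bars = [((0, 5), 2.0), ((5, 15), 3.5), ((15, 20), 4.0)]

# The area of each bar (width x height) is the frequency of that class.
frequencies = [(upper - lower) * density for (lower, upper), density in bars]
total_frequency = sum(frequencies)
```

The widest bar has the largest frequency even though it is not the tallest, which is the point pupils most often miss.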

Mastery

Interprets and compares complex statistical diagrams critically, including population pyramids, misleading graphs, and composite representations. Constructs histograms from raw grouped data and uses area to estimate probabilities.

Example task

A histogram shows journey times. The class 10-20 has frequency density 3.2 and the class 20-25 has frequency density 4.0. Estimate the probability that a randomly chosen journey takes between 15 and 25 minutes.

Model response: Area for 15-20 (half of the 10-20 bar): 3.2 x 5 = 16. Area for 20-25: 4.0 x 5 = 20. Estimated frequency for 15-25: 16 + 20 = 36. The total frequency (the sum of all bar areas) is needed to compute the probability. If, for example, the total is 120, then P(15 to 25) = 36/120 = 0.3. The key insight is that area represents frequency in a histogram, so probabilities can be estimated from areas.
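The area argument generalises to any interval that cuts across bars. The sketch below sums overlap width x frequency density for each bar. The function name is hypothetical, and the total frequency of 120 is the assumed figure from the model response rather than something the histogram data provides.

```python
def estimated_frequency(bars, a, b):
    """Estimate the frequency between a and b as the histogram area over [a, b]."""
    area = 0.0
    for lower, upper, density in bars:
        overlap = max(0.0, min(b, upper) - max(a, lower))  # width of bar inside [a, b]
        area += overlap * density
    return area


# Only the two bars given in the task: (lower, upper, frequency density).
bars = [(10, 20, 3.2), (20, 25, 4.0)]

frequency_15_to_25 = estimated_frequency(bars, 15, 25)

total_frequency = 120  # assumed total, as in the model response
probability = frequency_15_to_25 / total_frequency
```

Splitting the 10-20 bar in half assumes that journey times are spread evenly within the class, which is why the result is an estimate.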

Delivery rationale

Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.

Averages and Measures of Spread

process AI Facilitated

MA-KS4-C032

Calculating mean, median, mode and range for ungrouped data; estimating mean from grouped frequency tables; finding quartiles and interquartile range from cumulative frequency.

Teaching guidance

Mean from grouped data requires finding midpoints and computing Σfm/Σf — the mean is an estimate because we do not know exact values within classes. Interquartile range (Q3 − Q1) measures spread while being resistant to outliers, unlike range — this statistical property is worth emphasising. Pupils should understand when each average is appropriate: mode for categorical data, median for skewed distributions, mean for symmetric data.

Vocabulary (14 terms)
average T3 — A single value that represents a typical or central value of a data set; usually refers to the mean.
frequency table T3 — A table showing how often each value or range of values occurs in a data set.
grouped data T3 — Data organised into class intervals rather than listed as individual values.
interquartile range T3 — The difference between the upper quartile (Q3) and lower quartile (Q1); measures the spread of the middle 50% of data.
mean T3 — A type of average found by adding all values in a data set and dividing by the number of values.
measure of central tendency T3 new — A single value representing the centre or typical value of a data set: mean, median, or mode.
median T3 — The middle value when all data values are arranged in order from smallest to largest.
midpoint T3 — The exact middle point between two positions, values, or coordinates.
mode T3 — The value that appears most frequently in a data set.
outlier T3 — A data value that is significantly different from the rest of the data set.
quartile T3 — A value that divides ordered data into four equal parts: Q1 (25%), Q2/median (50%), Q3 (75%).
range T3 — The difference between the largest and smallest values in a data set, showing how spread out the data is.
skew T3 — When data distribution is not symmetrical; values are concentrated more on one side.
spread T3 — How widely data values are distributed; a data set with a large range has a wide spread.
Common misconceptions

The mean from grouped data is always an estimate, but pupils often state it as an exact value. Finding the median from a frequency table requires using cumulative frequency to locate the middle value; pupils instead halve the total frequency and report that number as the median. Interquartile range is sometimes computed as Q2 − Q1 (not Q3 − Q1) by pupils who confuse quartile numbering.

Difficulty levels

Emerging

Calculates mean, median, mode and range for small ungrouped data sets and knows what each measure represents.

Example task

Find the mean, median and mode of: 3, 5, 5, 6, 8, 9, 12.

Model response: Mean = (3+5+5+6+8+9+12)/7 = 48/7 = 6.86 (2 d.p.). Median = 6 (middle of 7 ordered values). Mode = 5 (appears twice).
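These values can be verified with Python's standard `statistics` module. The sketch below also computes the range, which the level descriptor mentions although the task does not ask for it.

```python
from statistics import mean, median, mode

data = [3, 5, 5, 6, 8, 9, 12]

data_mean = mean(data)              # sum of values divided by the number of values
data_median = median(data)          # middle value of the ordered data
data_mode = mode(data)              # most frequent value
data_range = max(data) - min(data)  # largest minus smallest
```

Rounding `data_mean` to two decimal places gives the 6.86 in the model response.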

Developing

Calculates the mean from a frequency table using sum of fx / sum of f, and chooses the most appropriate average for a given data set.

Example task

Find the mean from: Score 1 (freq 5), Score 2 (freq 8), Score 3 (freq 12), Score 4 (freq 3), Score 5 (freq 2).

Model response: Sum of fx = 1(5) + 2(8) + 3(12) + 4(3) + 5(2) = 5 + 16 + 36 + 12 + 10 = 79. Sum of f = 30. Mean = 79/30 = 2.63 (2 d.p.).
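The sum of fx divided by sum of f calculation is sketched below. As an extension beyond what the task asks, the same table is used to locate the median through cumulative frequency; the helper `value_at` is a hypothetical name.

```python
from itertools import accumulate

scores = [1, 2, 3, 4, 5]
frequencies = [5, 8, 12, 3, 2]

sum_fx = sum(score * freq for score, freq in zip(scores, frequencies))
sum_f = sum(frequencies)
table_mean = sum_fx / sum_f

# Running totals show which positions in the ordered data each score occupies.
cumulative = list(accumulate(frequencies))


def value_at(position):
    """Return the score at a 1-based position in the ordered data."""
    for score, running_total in zip(scores, cumulative):
        if position <= running_total:
            return score
    raise ValueError("position is beyond the end of the data")


# With an even number of values (30), the median is the mean of the 15th and 16th.
table_median = (value_at(sum_f // 2) + value_at(sum_f // 2 + 1)) / 2
```

Half the total frequency (15) is a position in the ordered data, not the median itself; the median is the score found at that position, which is 3 here.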

Secure

Estimates the mean from a grouped frequency table using midpoints, finds the modal class, and identifies the class containing the median. Understands why these are estimates.

Example task

Estimate the mean from: 0 < x <= 10 (freq 6), 10 < x <= 20 (freq 14), 20 < x <= 30 (freq 8), 30 < x <= 50 (freq 2).

Model response: Midpoints: 5, 15, 25, 40. Sum of fm = 6(5) + 14(15) + 8(25) + 2(40) = 30 + 210 + 200 + 80 = 520. Sum of f = 30. Estimated mean = 520/30 = 17.3 (1 d.p.). This is an estimate because we assume all values in each class equal the midpoint.
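Why the result is only an estimate can be shown by bounding the true mean. The sketch below computes the midpoint estimate and then the smallest and largest values the mean could take if every item sat at the lower or upper boundary of its class. The bounds are an illustrative addition, not part of the task.

```python
# Each class: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 6), (10, 20, 14), (20, 30, 8), (30, 50, 2)]

sum_f = sum(freq for _, _, freq in classes)

# Midpoint estimate: treat every value in a class as the class midpoint.
sum_fm = sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
estimated_mean = sum_fm / sum_f

# Limits on the true mean: every value at the bottom, or the top, of its class.
lowest_possible = sum(lower * freq for lower, _, freq in classes) / sum_f
highest_possible = sum(upper * freq for _, upper, freq in classes) / sum_f
```

The true mean lies somewhere between 12 and about 22.7; the midpoint estimate of 17.3 sits in the middle of that interval.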

Mastery

Uses cumulative frequency to estimate quartiles, interquartile range and percentiles. Compares distributions using summary statistics and evaluates which measures are most appropriate.

Example task

Two factories produce bolts. Factory A: mean diameter 10.02 mm, IQR 0.03 mm. Factory B: mean diameter 10.00 mm, IQR 0.12 mm. The target is 10.00 mm. Which factory is more reliable?

Model response: Factory B has the mean closer to target (10.00 vs 10.02) but much larger variability (IQR 0.12 vs 0.03). Factory A is slightly off-target but very consistent: the middle 50% of its bolts lie within 0.03 mm of each other. Factory B is centred correctly but produces bolts with wide variation, so some could be far from target. Factory A is more reliable for quality control because consistency matters more than being exactly centred (an offset can be corrected by recalibration). The small systematic error in A is preferable to the large random error in B.

Delivery rationale

Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.

Scatter Graphs and Correlation

knowledge AI Direct

MA-KS4-C033

Plotting and interpreting scatter graphs of bivariate data; describing correlation (positive, negative, none); drawing and using lines of best fit; distinguishing correlation from causation.

Teaching guidance

Require pupils to describe correlation in context, not just label it positive/negative — 'as height increases, weight tends to increase' is more statistically meaningful than 'positive correlation'. Lines of best fit should pass through the mean point (x̄, ȳ) and pupils should use this property to check their line. The causation-correlation distinction is critical for statistical literacy and should be reinforced with counter-intuitive examples.

Vocabulary (12 terms)
association T3 new — A relationship or connection between two variables shown in statistical data.
bivariate T3 new — Involving two variables; bivariate data has paired values for two different measurements.
causation T3 new — A direct cause-and-effect relationship where one variable actually produces a change in another.
correlation T3 — A statistical relationship between two variables shown on a scatter graph; can be positive, negative, or none.
extrapolation T3 new — Estimating a value outside the range of known data by extending a trend, which is less reliable than interpolation.
interpolation T3 new — Estimating a value within the range of known data, which is more reliable than extrapolation.
line of best fit T3 — A straight line drawn through the middle of data points on a scatter graph, showing the general trend.
negative correlation T3 — A relationship where one variable increases as the other decreases, shown by a downward trend on a scatter graph.
no correlation T3 — No apparent relationship between two variables; scattered points on a scatter graph with no trend.
outlier T3 — A data value that is significantly different from the rest of the data set.
positive correlation T3 — A relationship where both variables increase together, shown by an upward trend on a scatter graph.
scatter graph T3 — A graph plotting paired data as individual points to show the relationship between two variables.
Common misconceptions

Pupils believe strong correlation implies causation — the most important misconception in statistics. Lines of best fit are frequently drawn from (0,0) or through the most extreme points rather than balancing the data. Extrapolation beyond the data range is treated as equally reliable as interpolation within it.

Difficulty levels

Emerging

Plots scatter graphs from paired data and describes the overall trend informally (going up, going down, no pattern).

Example task

Plot these points (height in cm, weight in kg) on a scatter graph: (150, 45), (155, 50), (160, 55), (165, 52), (170, 60), (175, 58), (180, 65). Describe the trend.

Model response: The points show an upward trend: as height increases, weight tends to increase. The relationship is not perfect — not all points lie on a straight line.
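The informal description can be backed by a calculation. The sketch below goes beyond the by-eye approach expected at this level: it computes Pearson's correlation coefficient and a least-squares line for the seven points, treating the first coordinate as height and the second as weight. It also checks that the least-squares line passes through the mean point. The variable names are illustrative.

```python
from math import sqrt

heights = [150, 155, 160, 165, 170, 175, 180]
weights = [45, 50, 55, 52, 60, 58, 65]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n

# Sums of products of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_yy = sum((y - y_bar) ** 2 for y in weights)

# Pearson's r: close to +1 indicates a strong positive linear correlation.
r = s_xy / sqrt(s_xx * s_yy)

# Least-squares line y = gradient * x + intercept.
gradient = s_xy / s_xx
intercept = y_bar - gradient * x_bar

# The fitted line passes through the mean point (x_bar, y_bar) by construction.
value_at_mean = gradient * x_bar + intercept
```

The coefficient is positive and close to 1, consistent with an upward trend that is strong but not perfect. A line drawn by eye will not match the least-squares line exactly, but passing it through the mean point (165, 55) is a useful check.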

Developing

Identifies and describes the type and strength of correlation (strong/weak positive, strong/weak negative, none) and draws a line of best fit by eye.

Example task

A scatter graph shows that as the age of a car increases, its value decreases. There is a clear downward trend with points close to a line. Describe the correlation.

Model response: There is a strong negative correlation: as the age of the car increases, its value tends to decrease. A line of best fit would slope downward. The strong correlation means the points are close to the line, so age is a good predictor of value.

Secure

Uses a line of best fit to make predictions, distinguishing between interpolation (reliable, within data range) and extrapolation (unreliable, outside data range).

Example task

A line of best fit for height (cm) vs shoe size has equation y = 0.1x - 7, valid for heights 150-190 cm. Estimate the shoe size for (a) height 170 cm, (b) height 210 cm.

Model response: (a) y = 0.1(170) - 7 = 17 - 7 = size 10. This is interpolation (170 is within the data range 150-190) so the estimate is reliable. (b) y = 0.1(210) - 7 = 21 - 7 = size 14. This is extrapolation (210 is outside the data range) so the estimate is unreliable — the linear relationship may not hold for very tall people.
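The distinction between the two predictions can be built into the calculation itself. The sketch below uses the equation and valid range from the task; the function name and the returned labels are hypothetical.

```python
def predict_shoe_size(height_cm, valid_range=(150, 190)):
    """Predict shoe size from y = 0.1x - 7 and label the kind of estimate."""
    low, high = valid_range
    size = 0.1 * height_cm - 7
    # Inside the data range the estimate is interpolation; outside, extrapolation.
    kind = "interpolation" if low <= height_cm <= high else "extrapolation"
    return size, kind


size_a, kind_a = predict_shoe_size(170)
size_b, kind_b = predict_shoe_size(210)
```

The arithmetic is identical in both cases; only the label changes, which reflects that the reliability of a prediction depends on where the input sits relative to the data, not on the formula.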

Mastery

Critically evaluates bivariate data, distinguishes correlation from causation, identifies lurking variables, and interprets the equation and gradient of a line of best fit in context.

Example task

A study finds a strong positive correlation between ice cream sales and drowning incidents. A newspaper headline says 'Ice cream causes drowning'. Evaluate this claim.

Model response: The claim is false — it confuses correlation with causation. The lurking (confounding) variable is temperature/season: hot weather causes both increased ice cream sales and increased swimming (leading to more drowning incidents). The two variables are associated because they share a common cause, not because one causes the other. To establish causation, you would need a controlled experiment (impossible here for ethical reasons). This is a classic example of a spurious correlation driven by a confounding variable.

Delivery rationale

Secondary maths concept — abstract, procedural, and objectively assessable.