Student Achievement

"Stats are like bikinis, they don't reveal everything." Bobby Knight

"People like to take international results ... and focus on high performers and pick out areas of policy that support the policies that they support ... I never expect tests like these to tell us what works in education. That’s like taking a thermometer to explain why it’s cold outside." - Jack Buckley, Commissioner of the U.S. Department of Education’s National Center for Education Statistics.

What is student achievement?

Hattie regularly cites the New Zealand government's extensive study by Alton-Lee (2003), which states (p8),

"Quality teaching cannot be defined without reference to outcomes for students"

Hattie agrees; in his 2005 ACER lecture, he states, "The latest craze, begun by the OECD is to include key competencies or ‘essence’ statements and this seems, at long last, to get closer to the core of what students need. Key competencies include thinking, making meaning, managing self, relation to others, and participating and contributing. Indeed, such powerful discussions must ensue around the nature of what are ‘student outcomes’ as this should inform what kinds of data need to be collected to thence enhance teaching and learning" (p14).

Compare this to Alton-Lee (2003) (p7), "The term 'achievement' encompasses achievement in the essential learning areas, the essential skills, including social and co-operative skills, the commonly held values including the development of respect for others, tolerance (rangimärie), non-racist behaviour, fairness, caring or compassion (aroha), diligence and hospitality or generosity (manaakitanga). Educational outcomes include attitudes to learning, and behaviours and other outcomes demonstrating the shared values. Educational outcomes include cultural identity, well-being, whanau spirit and preparation for democratic and global citizenship... Along with ... the outcome goals for Mäori students:

Goal 1: to live as Mäori. 
Goal 2: to actively participate as citizens of the world.
Goal 3: to enjoy good health and a high standard of living."

I'm not aware that these 'powerful discussions' have come to an agreement on the data that needs to be collected. Student outcomes still seem to differ greatly from one country to another, one jurisdiction to another, one school to another, one class to another, even one student to another (differentiated learning). Yet Hattie's work is mostly based on 'standardised tests'.


Hattie introduced his book with this limitation, "Of course, there are many outcomes of schooling, such as attitudes, physical outcomes, belongingness, respect, citizenship, and the love of learning. This book focuses on student achievement, and that is a limitation of this review" (p6).

Breakspear (2014) in his analysis 'How does PISA shape education policy making?' states the importance of what we measure, "If the educational narrative is dominated by the performance of 15-year-olds in PISA, other important educational goals such as social and emotional development, interpersonal and intrapersonal skills, civics, health and wellbeing, will be held at the margins of the debate. In the end, systems will focus on optimising what they measure" (p14).

Also, on page 11, he warns, "Narrow indicators should not be equated with the end-goals of education."

We are a long way from agreeing on student outcomes and consistently defining what achievement is, let alone measuring it.

In his collaboration with Clifton (2004), Identifying Accomplished Teachers, Hattie states,

"An obvious and simple method would be to investigate the effects of passing and failing NBPTS teachers on student test scores. Despite the simplicity of this notion, we do not support this approach" (p319).

"... student test scores depend on multiple factors, many of which are out of the control of the teacher.

... if test scores are to be used in teacher effectiveness studies, it is important to recognise that:

a. student achievement must be related to a particular teacher instruction, as students can achieve as a consequence of many other forms of instruction (e.g., via homework, television, reading books at home, interacting with parents and peers, lessons from other teachers);

b. a teacher can impart much information that may be unrelated to the curriculum; 

c. students are likely to differ in their initial levels of knowledge and skills, and the prescribed curricula may be better matched to the needs of some than others; 

d. differential achievement may be related to varying levels of out-of-school support for learning; 

e. teachers can be provided with different levels and types of instructional support, teaching loads, quality of materials, school climates, and peer cultures; 

f. instructional support from other teachers can take a variety of forms (particularly in middle and high schools where students constantly move between teachers); and 

g. teaching artefacts, such as the match of test content to prescribed curricula, levels of student motivation, and test-wiseness can distort the effects of teacher effects. 

Such a list is daunting and makes the task of relating teaching effectiveness to student outcomes most difficult. The present study does not, therefore, use student test scores to assess the impact of the NBCTs on student learning" (p320).

In 2000, Hattie was part of a team of four academics who ran a validity study for the US National Board Certification System, in which he rejected the use of student test scores as a measure of teacher performance, claiming,

“It is not too much of an exaggeration to state that such measures have been cited as a cause of all of the nation’s considerable problems in educating our youth. . . . It is in their uses as measures of individual teacher effectiveness and quality that such measures are particularly inappropriate.”

In a remarkable turnaround from his earlier work, Visible Learning DOES USE the simplistic approach of student tests and ignores his own principles (as outlined above), thus ignoring the impact of other significant effects on student achievement.

How is Student Achievement Measured?

The focus of VL is student achievement, and in his 2005 ACER presentation Hattie warned against using 'generalised surrogates for ability measures'. However, he has not rigorously applied this to his synthesis: the meta-analyses use a wide range of different measures of student achievement, ranging from standardised or teacher-made tests, through rallying a tennis ball against a wall for 30 seconds, to a mother's rating of her child on a scale of 1 to 5.

Worse, many measure something else, e.g., hyperactivity, behaviour, attention, concentration, engagement, or IQ. This raises significant questions about the validity of averaging and comparing studies that measure different things. This is the classic 'comparing apples with oranges' problem, which many scholars state is a major weakness of meta-analyses.

Lipsey et al. (2007) warn, "Different ways of measuring the same outcome construct – for example, achievement – may result in different effect size estimates even when the interventions and samples are similar" (p10).

Standardised Tests:

Once again, in Identifying Accomplished Teachers (2004), Hattie and Clifton point out the issues with standardised testing:

"The standardised tests that are usually administered often do not reflect the classroom curriculum and learning objectives of the teachers at a particular time" (p320).

Also, many of the researchers Hattie cites identify major issues with using standardised or norm-referenced tests. For example, Hacke (2010) states, "The results of this meta-analysis are based on standardised achievement scores, which may provide a misleading estimate of National Board Certified Teacher effectiveness because of validity issues. Validity refers to how appropriate and meaningful the inferences are that can be drawn from assessment results based on their intended use" (p100).

Hacke also illustrates the inconsistency of using different types of tests: criterion-referenced tests are intended to measure how well a person has learned a specific body of knowledge and skills, whereas norm-referenced tests are developed to compare students with each other and are designed to produce a variance in scores. She cites a unique study by Harris and Sass (2007), who examined the influence of National Board Certification (NBC) using two different types of assessment data from the state of Florida, which administers both norm-referenced and criterion-referenced tests. Harris and Sass compared the results, which revealed that the effect of NBC was negative for both reading and mathematics on the norm-referenced test, whereas for the criterion-referenced assessments it was positive (p109).

Professor Dylan Wiliam also comments on this problem,
"Effect sizes depend on how sensitive the assessments used to measure student progress are to the things that teachers are changing. In one study, Maria Ruiz-Primo and Min Li found that when the effects of feedback in STEM subjects were measured with tests of what the students had actually been learning, the effect sizes were five times greater than when achievement was measured with standardised tests. Which of these is the “correct” figure is obviously a matter of debate. The important point is that assessments differ in their sensitivity to the effects of teaching, which therefore affects effect sizes."

A good overview of the difference between PISA (standardised tests) versus TIMSS (curriculum based tests) - click here.

Competency based assessment:

I live in the State of Victoria, Australia (the same as Hattie). The final year of schooling, Year 12, is an interesting case study in comparing different methods of assessment. 

Students can take a range of programs from the traditional 'academic' subjects through to more practical areas based on competency - trades, dance, sports, music, hairdressing, fashion, etc. 

The academic subjects are assessed via standardised-type testing, designed to RANK students and to produce a variance in scores, mostly for the purpose of university entry. The major problem is: how do you compare students in one subject with students in another? It is akin to comparing Kevin Lisch, who was ranked No. 1 in the Australian basketball competition, with Stephen Curry, who was MVP in the USA competition. Is it meaningful to do so?

Competency-based subjects, by contrast, are assessed by whether a student CAN or CAN'T satisfy a particular criterion, e.g., for the carpentry trade: 'can construct a mortise and tenon joint.'

Standardised and competency-based assessments are NEVER compared.

Published problems with the meta-analysis methodology:

The Encyclopedia of Measurement and Statistics outlines the problems with meta-analyses (p595), ALL of which apply to Hattie's synthesis:

(a) mixes apples and oranges in combining the findings of studies with varying methodological quality.

(b) aggregates the findings of poor studies, thus setting low standards of judgment for the quality of outcome studies.

(c) is problematic in light of shortcomings and flaws in the published literature (e.g., selective reporting, bias, insufficient reporting of primary data).

(d) uses "lumpy," nonindependent data (i.e., multiple uses of the same data in several studies).

(e) gives equal weighting to all studies, regardless of methodological rigour.
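Point (e) is worth making concrete. Standard meta-analytic practice weights each study by the inverse of its variance, so large, precise studies count more; a simple unweighted average treats a tiny study the same as a huge one. A minimal sketch, with hypothetical effect sizes and variances:

```python
# Hypothetical studies: (effect size d, approximate variance of d).
# Small studies have large variances; large studies have small ones.
studies = [
    (0.90, 0.20),  # tiny study, imprecise
    (0.60, 0.10),  # medium study
    (0.15, 0.01),  # large study, precise
]

# Unweighted mean: every study counts equally, as in point (e).
unweighted = sum(d for d, _ in studies) / len(studies)  # 0.55

# Fixed-effect inverse-variance weighted mean: precision determines weight.
weights = [1 / v for _, v in studies]
weighted = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

print(f"Unweighted mean d:           {unweighted:.2f}")  # 0.55
print(f"Inverse-variance weighted d: {weighted:.2f}")    # 0.22
```

With these (made-up) numbers, equal weighting more than doubles the apparent effect, because the imprecise small studies happen to report the largest effects, a pattern that selective publication makes common in practice.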

Dr Jonathan Becker states, "In the field of measurement, trustworthiness is operationalized in terms of validity and reliability... validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?).  I won’t write a whole treatise here about measurement validity and reliability... Suffice it to say that none of the measures of student achievement ... are either valid or reliable."

Dr Neil Hooley's review casts doubt on the validity of Hattie's research: "Have problems of validity been discussed concerning original testing procedures on which meta-analysis is based? ... It has been assumed that test procedures are accurate and measure what they are said to measure. This is a major assumption ... As with all research, the quality of the findings depends on the quality of the research design - including the instruments (measures that assess students) and the sampling design (criteria for selection, size and representativeness of the sample, etc.)" (p43).


Tom Loveless has written an excellent comparison of the two international testing regimes for maths here.

"TIMSS and Program for International Student Assessment (PISA) are quite different. TIMSS is curriculum-based, reflecting the skills and knowledge taught in schools. PISA assesses whether students can apply what they’ve learned to solve “real world” problems. PISA tests an age-based sample (15 year olds). PIRLS and TIMSS are grade-based (4th and 8th graders)."

"On PISA, New Zealand scores within 27 points of Korea (519 vs. 546). On TIMSS, New Zealand and Korea are separated by a whopping 125 points (488 vs. 613), a difference of nearly one full standard deviation between the two tests!  Chinese Taipei outscores Finland by only 2 points on PISA (543 vs.541, the scores are statistically indistinguishable)—but by 95 points on TIMSS (609 vs.514)."

More questions about PISA rankings

"A new education research brief from public school advocates, Save Our Schools (SOS), says PISA international league table rankings are misleading due to a large proportion of the 15-year-old population in many countries not being in school.

The brief pointed to Vietnam – one of the standout performers in the results from PISA 2015 – as an example.

It said that while Vietnam achieved a ranking of 8th in science, half of its 15-year-old population was not covered because they were not in school.

The OECD’s PISA 2015 report shows that the sample of Vietnam students participating in PISA was taken from only 48.5% of Vietnam’s 15-year-olds. This was the lowest coverage of 15-year-olds in all the countries participating in PISA."
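This coverage problem is a selection effect: if the out-of-school 15-year-olds would, on average, score lower, then a sample drawn only from the enrolled 48.5% overstates the whole cohort's mean. A toy calculation (the two group means below are purely hypothetical assumptions, not real data):

```python
coverage = 0.485     # share of Vietnam's 15-year-olds covered by PISA 2015 (OECD figure)

# Hypothetical group means, purely for illustration:
mean_enrolled = 525  # assumed mean of the tested, in-school population
mean_excluded = 430  # assumed mean of the out-of-school 15-year-olds

# The whole cohort's mean is the coverage-weighted average of the two groups.
cohort_mean = coverage * mean_enrolled + (1 - coverage) * mean_excluded

print(f"Reported (in-school) mean: {mean_enrolled}")
print(f"Whole-cohort mean under these assumptions: {cohort_mean:.0f}")
```

Under these assumptions the published score would overstate the cohort by nearly 50 points, which is why low coverage makes league-table rankings hard to interpret.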

Another Elephant walks into the Achievement Room

Hattie promotes Bereiter’s model of learning, “Knowledge building includes thinking of alternatives, thinking of criticisms, proposing experimental tests, deriving one object from another, proposing a problem, proposing a solution, and criticising the solution … “ (VL p27).

"There needs to be a major shift, therefore, from an over-reliance on surface information (the first world) and a misplaced assumption that the goal of education is deep understanding or development of thinking skills (the second world), towards a balance of surface and deep learning leading to students more successfully constructing defensible theories of knowing and reality (the third world)” (p28).

"... the focus of this book is on achievement outcomes. Now this notion has been expanded to achievement outcomes across the three worlds of understanding” (p29). [Note: I have not found a study in Hattie's synthesis that measures outcomes in the so-called 'second world', let alone the 'third world'!]

The most popular TED talk of all, Ken Robinson's "Do schools kill creativity?", emphasises that 'creativity' should be an important outcome of education, while others emphasise 'problem-solving', 'teamwork' or something else. For example, in his 2012 VL summary, Hattie emphasises 'critical skills' (p4).

As discussed above, there are major problems in Hattie's synthesis with comparing disparate achievement measures. Yet he has not even begun to address the outcomes of his own learning model, like 'critical thinking' or 'managing self', let alone all the other outcomes like 'creativity' or the Mäori cultural values such as 'fairness', 'compassion' or 'tolerance'.

I'm not aware of anyone who can meaningfully measure any of these things. Certainly, 'creativity' assessment is a major issue. Even the skill of 'problem-solving', which PISA has recently focused on, is very difficult to measure.

An Example of the difficulties of assessing and ranking:

Tom Brady has become what many say is the best quarterback (QB) in US sports history. QB skills are more precisely defined and measured than academic skills: they are assessed from high school through to college on accuracy of throwing, reaction time, peripheral vision, and performance history across a range of game measures. Yet with all this assessment data, many bad assessments are still made. Brady is one example: he was picked at 199 in the draft, and got his opportunity largely through chance, when his more highly ranked team-mates were injured.