Student Achievement

Rømer (2016), in Criticism of Hattie's theory about Visible Learning, writes,
"On the whole, Visible Learning is not a theory of learning in its own right, nor is it an educational theory. Visible learning, on the other hand, is what happens when pedagogy and learning are exposed to a relatively unexplained evaluation theory" (p. 1, translated from Danish).
Larsen (2014), in Know thy impact – blind spots in John Hattie’s evidence credo, writes,
"The first among several flaws in Hattie's book: what is an effect?  
John Hattie never explains what the substance of an effect is. What is an effect’s ontology, its way of being in the world? Does it consist of something as simple as a correct answer on a multiple-choice task, the absence of arithmetic and spelling errors? And may all the power of teaching and learning processes (including abstract and imaginative thinking, history of ideas and concepts, historical knowledge, dedicated experiments, hands-on insights, sudden lucidity, social and language criticism, profound existential discussions, social bonding, and personal, social, and cultural challenges) all translate into an effect score without loss? Such basic and highly important philosophical and methodological questions do not seem to concern the evidence preaching practitioner and missionary Hattie" (p. 3).
Jack Buckley, Commissioner of the U.S. Department of Education’s National Center for Education Statistics,
"People like to take international results ... and focus on high performers and pick out areas of policy that support the policies that they support ... I never expect tests like these to tell us what works in education. That’s like taking a thermometer to explain why it’s cold outside."
Dwayne Sacker summarising Noel Wilson's seminal 1997 dissertation “Educational Standards and the Problem of Error”, where Wilson delves into the nature of assessment,
"Now since there is no agreement on a standard unit of learning, there is no exemplar of that standard unit and there is no measuring device calibrated against the said non-existent standard unit, how is it possible to “measure the non-observable”? 
THE TESTS MEASURE NOTHING for how is it possible to “measure” the non-observable with a non-existing measuring device that is not calibrated against a non-existing standard unit of learning??"

What is student achievement?

Hattie regularly cites the New Zealand Government's extensive study by Alton-Lee (2003, p. 8) which states,
"Quality teaching cannot be defined without reference to outcomes for students."
Hattie agrees. In his 2005 ACER lecture, he states,
"The latest craze, begun by the OECD is to include key competencies or ‘essence’ statements and this seems, at long last, to get closer to the core of what students need. Key competencies include thinking, making meaning, managing self, relation to others, and participating and contributing. Indeed, such powerful discussions must ensue around the nature of what are ‘student outcomes’ as this should inform what kinds of data need to be collected to thence enhance teaching and learning" (p. 14).
Compare this to Alton-Lee (2003, p. 7),
"The term 'achievement' encompasses achievement in the essential learning areas, the essential skills, including social and co-operative skills, the commonly held values including the development of respect for others, tolerance (rangimärie), non-racist behaviour, fairness, caring or compassion (aroha), diligence and hospitality or generosity (manaakitanga). Educational outcomes include attitudes to learning, and behaviours and other outcomes demonstrating the shared values. Educational outcomes include cultural identity, well-being, whanau spirit and preparation for democratic and global citizenship... Along with ... the outcome goals for Mäori students:
Goal 1: to live as Mäori. 
Goal 2: to actively participate as citizens of the world. 
Goal 3: to enjoy good health and a high standard of living."
Pasi Sahlberg, Director-General of the high-achieving Finnish system, outlines Finnish values in Finnish Lessons 2.0 (p. 101),
"one purpose of formal schooling is to transfer cultural heritage, values, and aspirations from one generation to another. Teachers are, according to their own opinions, essential players in building the Finnish welfare society. As in countries around the world, teachers in Finland have served as critical transmitters of culture ..."
Then in the Australian Teacher Magazine March 2019, p. 13, Sahlberg says,
"...policy makers the world over have become incarcerated by their own narrow assessment of what true student 'growth' entails... 
If you look at the Gonski 2.0, it's all about measuring growth... for most people 'growth' means how you progress academically, and how you are measuring that academic growth. 
It's not about health or wellbeing, or developing children's identity or social skills or other areas... politicians and bureaucrats are kind of imprisoned by this...
They don't realise that Singapore, for example, really envies Finland because of this play [approach], ... because they realise that what they do is unsustainable."
Popular blogger Greg Ashman opens up the whole debate about the purpose of education:
"Education is about cultural enrichment. It is about knowing the world you inhabit. It’s about political engagement and performing your civic role."
Back to Hattie: I'm not aware that these powerful discussions have produced any agreement on the data that needs to be collected. Student outcomes still seem to differ greatly from one country to another, one jurisdiction to another, one school to another, one class to another, and even one student to another (differentiated learning).

How is Student Achievement Measured?

The focus of VL is student achievement and in Hattie's 2005 ACER slide presentation, he warned NOT to use surrogates for ability measures.

Hattie has not followed his own advice in VL. The meta-analyses used a wide range of surrogates for student achievement, from standardised tests through to rallying a tennis ball against a wall in 30 seconds, to a mother's rating of her child on a scale of 1 to 5.

Wecker et al. (2016, p. 28) confirm this, saying that Hattie mistakenly included studies that do not measure academic performance.

Effect Size & Different tests

Even when achievement is measured it is done in many different ways. 
DuPaul & Eckert (2012), 
"It is difficult to compare effect size estimates across research design types. Not only are effect size estimates calculated differently for each research design, but there appear to be differences in the types of outcome measures used across designs" (p. 408).
Professor Dylan Wiliam also comments on this problem,
"Effect sizes depend on how sensitive the assessments used to measure student progress are to the things that teachers are changing. In one study ... the effects of feedback in STEM subjects was measured with tests that measured what the students had actually been learning, the effect sizes were five times greater than when achievement was measured with standardised tests. 
Which of these is the 'correct' figure is obviously a matter of debate. The important point is that assessments differ in their sensitivity to the effects of teaching, which therefore affects effect sizes."
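
Wiliam's point can be illustrated with a small calculation. The sketch below computes Cohen's d (the standardised mean difference using a pooled standard deviation) for the same hypothetical intervention measured on two different tests. All scores are invented for illustration and are not taken from any study cited here; the function name `cohens_d` is just a label for this sketch.

```python
import math
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_variance)


# Hypothetical scores, invented for illustration only.
# A researcher-designed test closely aligned with what was taught:
aligned_control = [10, 12, 11, 13, 9, 12, 11, 10]
aligned_treatment = [11, 13, 12, 14, 10, 13, 12, 11]

# A broad standardised test covering far more than what was taught:
broad_control = [48, 55, 51, 60, 44, 57, 52, 49]
broad_treatment = [49, 56, 52, 61, 45, 58, 53, 50]

d_aligned = cohens_d(aligned_treatment, aligned_control)
d_broad = cohens_d(broad_treatment, broad_control)

print(f"Aligned test:      d = {d_aligned:.2f}")
print(f"Standardised test: d = {d_broad:.2f}")
```

With these invented numbers the raw gain is one point on both tests, yet the aligned test yields a much larger effect size because its scores are far less spread out.
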
Prof Wiliam displays the findings of a study by Ruiz-Primo et al. (2002) showing the differences in effect size versus the type of test used to measure it. 

Ruiz-Primo et al. (2002) group the tests into categories (figure not reproduced here).

Lipsey et al. (2007, p. 10) confirm this analysis - 
"Different ways of measuring the same outcome construct – for example, achievement – may result in different effect size estimates even when the interventions and samples are similar."
Poulsen (2014) concurs,
"Only if the learning outcomes are measured with digitized multiple-choice test of exactly the same version with the same built-in calculation models, the data can be directly compared" (p. 4, translated from Danish).

Example From Problem Based Learning (PBL)

Hattie used Gijbels et al. (2005), but the study was actually looking at the different effect sizes derived from different tests. They propose that, since PBL develops problem-solving skills, tests which measure these skills will show higher effect sizes than tests of factual knowledge (p. 32).

They summarise their results in a table (p. 43, not reproduced here).

They conclude,
"...students studying in PBL classes demonstrated better understanding of the principles that link concepts (weighted average ES = 0.795) than did students who were exposed to conventional instruction... students in PBL were better at the third level of the knowledge structure (weighted average ES = 0.339) than were students in conventional classes. It is important to note that the weighted average ES of 0.795, belonging to the second level of the knowledge structure, was the only statistically significant result." (p. 44).

Other Peer Reviews

Blichfeldt (2011), on Hattie's comparison of disparate studies,
"We also get no information about how "learning outcomes" are defined or measured in the studies at different levels, what tests are used, which subjects are tested and how."
Simpson (2017) and Bergeron (2017) also confirm Wiliam's analysis. Simpson goes into more detail about tests and gives the example of an intervention in algebra teaching: if achievement is measured by an algebra test designed by the teacher or researcher, the effect size will be much greater than if a general standardised test is used.

Simpson (2017, p. 11) also shows that the effect size is very different if derived from tests of different lengths. For example, the effect size from a 20 question test will be double that from a 4 question test.
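
One way to see why test length matters is through classical test theory, which is not necessarily the approach Simpson takes. Under that theory, random measurement error inflates the observed spread of scores, so the observed effect size equals the 'true' effect size multiplied by the square root of the test's reliability, and the Spearman-Brown formula predicts how reliability rises as a test is lengthened. The sketch below assumes a hypothetical reliability of 0.30 for the 4-question test and a hypothetical true effect size of 0.5; both values are invented.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def observed_effect_size(true_effect_size, reliability):
    """Effect size attenuated by measurement error (classical test theory)."""
    return true_effect_size * math.sqrt(reliability)


TRUE_EFFECT_SIZE = 0.5       # hypothetical value
RELIABILITY_4_ITEMS = 0.30   # hypothetical value

# A 20-question test is five times as long as a 4-question test.
reliability_20_items = spearman_brown(RELIABILITY_4_ITEMS, 20 / 4)

d_4_items = observed_effect_size(TRUE_EFFECT_SIZE, RELIABILITY_4_ITEMS)
d_20_items = observed_effect_size(TRUE_EFFECT_SIZE, reliability_20_items)

print(f"4 questions:  reliability {RELIABILITY_4_ITEMS:.2f}, d = {d_4_items:.2f}")
print(f"20 questions: reliability {reliability_20_items:.2f}, d = {d_20_items:.2f}")
```

Under these assumptions the longer test gives an effect size roughly one and a half times that of the shorter one. The size of the ratio depends on the assumed reliability, so this shows the direction of the effect rather than reproducing Simpson's figure.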

The Impact of Test Length on Effect Size

Simpson (2018b, pp. 4-5) even shows the problems of comparing effect sizes from two different standardised tests,
'... an RCT was designed involving over 500 pupils across over 40 schools. The programme involved an intensive one-to one mathematics intervention intended for children in Key Stage 1 (ages 6–7) performing at the lowest 5% level nationally...
Part of the analysis involved two different tests. The primary test was the Progress in Mathematics 6 (PIM) test conducted in January 2010. The secondary test was the Sandwell Early Numeracy Test – Revised Test B (SENT-R-B) conducted in December 2009.
The effect size... for the PIM test was 0.33 and for the SENT-R-B test was 1.11.'
Professor Robert Slavin has also written extensively on this issue and confirms Simpson's analysis.

So, once again, the same intervention measured with a different test leads to a MASSIVE difference in effect size!
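
To put a number on that difference, the two effect sizes quoted from Simpson (2018b) can be compared directly. The variable names are just labels for this sketch.

```python
# Effect sizes for the same intervention, as quoted above from Simpson (2018b).
pim_effect_size = 0.33   # Progress in Mathematics 6 (PIM)
sent_effect_size = 1.11  # Sandwell Early Numeracy Test - Revised Test B (SENT-R-B)

ratio = sent_effect_size / pim_effect_size
print(f"The SENT-R-B effect size is {ratio:.1f} times the PIM effect size.")
```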

Hattie does not control for whether tests are standardised or researcher-designed, let alone for test length.

But worse, many of the studies Hattie used measured something else, e.g., hyperactivity, behaviour, attention, concentration, engagement, and IQ. 

This raises significant questions regarding the validity of averaging and comparing studies that are measuring different things. This is the classic 'apples with oranges' problem, which many scholars state is a major weakness of meta-analyses.

Professor Peter Blatchford also raises this issue about Hattie,
"it is odd that so much weight is attached to studies that don't directly address the topic on which the conclusions are made" (p. 13).
Wecker et al. (2016) also note that Hattie mistakenly included studies that do not measure academic performance (p. 28).

Another Elephant walks into the Achievement Room

Hattie promotes Bereiter’s model of learning, 
"Knowledge building includes thinking of alternatives, thinking of criticisms, proposing experimental tests, deriving one object from another, proposing a problem, proposing a solution, and criticising the solution …" (VL p. 27).
"There needs to be a major shift, therefore, from an over-reliance on surface information (the first world) and a misplaced assumption that the goal of education is deep understanding or development of thinking skills (the second world), towards a balance of surface and deep learning leading to students more successfully constructing defensible theories of knowing and reality (the third world)" (p. 28).
"the focus of this book is on achievement outcomes. Now this notion has been expanded to achievement outcomes across the three worlds of understanding" (p. 29).
Note: I have not found a study in Hattie's synthesis that measures outcomes in the so-called 'second world', let alone the 'third world'!

Prof Proulx (2017), Critical essay on the work of John Hattie for teaching mathematics: Entrance from the Mathematics Education, also identifies the inherent problem here,
"ironically, Hattie self-criticizes implicitly if we rely on his book beginning affirmations, then that it affirms the importance of the three types learning in education."
He quotes Hattie from VL,
"But the task of teaching and learning best comes together when we attend to all three levels: ideas, thinking, and constructing." (VL, p. 26)
"It is critical to note that the claim is not that surface knowledge is necessarily bad and that deep knowledge is essentially good. Instead, the claim is that it is important to have the right balance: you need surface to have deep; and you need to have surface and deep knowledge and understanding in a context or set of domain knowledge. The process of learning is a journey from ideas to understanding to constructing and onwards." (VL, p. 29)
From this quote, Prof Proulx goes on to say,
"So with this comment, Hattie discredits his own work on which it bases itself to decide on what represents the good ways to teach. Indeed, since the studies he has synthesized to draw his conclusions are not going in the sense of what he himself says represent a good teaching, how can he rely on it to draw conclusions about the teaching itself?"
Nielsen & Klitmøller (2017) in 'Blind spots in Visible Learning - Critical comments on "Hattie revolution"', also discuss Hattie's promotion of Bereiter’s model of learning and the inconsistency with the way achievement is measured.

They state that Hattie's discussion of Bereiter's model is,
"a limitation of his entire project" (p. 13). 
Rømer (2016),
"The theoretical uncertainties are amplified in the empirical analyses, which are deeply colored by the lack of comprehension. No clue as far as I can see how "achievement" is operationalized, whether it is surface learning, deep learning or construction learning or, for that matter, something completely fourth that is measured?" (p. 5, translated from Danish). 
David Didau writes a clever piece on this: 'Deeper learning: it’s like learning, but deeper'.


Hattie introduced his book with this limitation,
"Of course, there are many outcomes of schooling, such as attitudes, physical outcomes, belongingness, respect, citizenship, and the love of learning. This book focuses on student achievement, and that is a limitation of this review"(p. 6).
Breakspear (2014), in his analysis 'How does PISA shape education policy making?', states the importance of what we measure,
"If the educational narrative is dominated by the performance of 15-year-olds in PISA, other important educational goals such as social and emotional development, interpersonal and intrapersonal skills, civics, health and wellbeing, will be held at the margins of the debate. In the end, systems will focus on optimising what they measure" (p. 14).
In Identifying Accomplished Teachers (2004, p. 319), a collaboration with his wife, Clinton, Hattie states,
"An obvious and simple method would be to investigate the effects of passing and failing NBPTS teachers on student test scores. Despite the simplicity of this notion, we do not support this approach.
... student test scores depend on multiple factors, many of which are out of the control of the teacher. 
... if test scores are to be used in teacher effectiveness studies, it is important to recognise that:
a. student achievement must be related to a particular teacher instruction, as students can achieve as a consequence of many other forms of instruction (e.g., via homework, television, reading books at home, interacting with parents and peers, lessons from other teachers);
b. a teacher can impart much information that may be unrelated to the curriculum; 
c. students are likely to differ in their initial levels of knowledge and skills, and the prescribed curricula may be better matched to the needs of some than others;
d. differential achievement may be related to varying levels of out-of-school support for learning; 
e. teachers can be provided with different levels and types of instructional support, teaching loads, quality of materials, school climates, and peer cultures; 
f. instructional support from other teachers can take a variety of forms (particularly in middle and high schools where students constantly move between teachers); and 
g. teaching artefacts, such as the match of test content to prescribed curricula, levels of student motivation, and test-wiseness can distort the effects of teacher effects.
Such a list is daunting and makes the task of relating teaching effectiveness to student outcomes most difficult. The present study does not, therefore, use student test scores to assess the impact of the NBCTs on student learning."
In 2000, Hattie was part of a team of four academics that ran a validity study for the US National Board Certification System. He rejected the use of student test scores as a measure of teacher performance, claiming,
"It is not too much of an exaggeration to state that such measures have been cited as a cause of all of the nation’s considerable problems in educating our youth. . . . It is in their uses as measures of individual teacher effectiveness and quality that such measures are particularly inappropriate."
In a remarkable turnaround from his earlier work, Visible Learning and his software 'asTTle' do use the simplistic approach of student test scores and ignore his own principles (as outlined above), thus disregarding the other significant influences on student achievement.

Standardised Tests:

Once again, in Identifying Accomplished Teachers (2004, p. 320), Hattie & Clinton point out the problems with standardised testing,
"The standardised tests that are usually administered often do not reflect the classroom curriculum and learning objectives of the teachers at a particular time."
Also, many of the researchers Hattie cites identify major issues with using standardised or norm-referenced tests, e.g., Hacke (2010, p. 100),
"The results of this meta-analysis are based on standardised achievement scores, which may provide a misleading estimate of National Board Certified Teacher effectiveness because of validity issues. Validity refers to how appropriate and meaningful the inferences are that can be drawn from assessment results based on their intended use."
Hacke also illustrates the inconsistency of using different types of tests: criterion-referenced tests are intended to measure how well a person has learned a specific body of knowledge and skills, whereas norm-referenced tests are developed to compare students with each other and are designed to produce a variance in scores. She cites a unique study by Harris and Sass (2007, p. 109), who examined the influence of teacher certification (NBC) using two different types of assessment data from the state of Florida, which administers both norm-referenced and criterion-referenced tests. Their comparison revealed that the effect of NBC was negative for both reading and mathematics on the norm-referenced test, whereas on the criterion-referenced assessments it was positive.


Competency-based assessment:

I live in the State of Victoria, Australia (the same as Hattie). The final year of schooling, Year 12, is an interesting case study in comparing different methods of assessment. 

Students can take a range of programs from the traditional 'academic' subjects through to more practical areas based on competency - trades, dance, sports, music, hairdressing, fashion, etc. 

The academic subjects are assessed via standardised-type testing, designed to RANK students and to produce a variance in scores. This is mostly for the purpose of university entry. The major problem is how to compare students in one subject with students in another. This is akin to comparing Kevin Lisch, who was ranked No. 1 in the Australian basketball competition, with Stephen Curry, who was MVP in the USA competition. Is it meaningful to do so?

By contrast, competency-based subjects are assessed by whether a student CAN or CAN'T satisfy a particular criterion, e.g., for the carpentry trade: 'can construct a mortise and tenon joint.'

Standardised and competency-based assessments are NEVER compared.

Published problems with the meta-analysis methodology:

The Encyclopedia of Measurement and Statistics outlines the problems with meta-analyses (p. 595) which ALL apply to Hattie's synthesis:

(a)  mixes apples and oranges in combining the findings of studies with varying methodological quality.

(b) aggregates the findings of poor studies, thus setting low standards of judgement for the quality of outcome study.

(c) is problematic in light of shortcomings and flaws in the published literature (e.g., selective reporting, bias, insufficient reporting of primary data).

(d) has used 'lumpy', nonindependent data (i.e., multiple uses of the same data in several studies).

(e) gives equal weighting to all studies regardless of methodological rigour.
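
Point (e) can be illustrated numerically. The sketch below contrasts a simple unweighted average of effect sizes with an inverse-variance weighted average, a standard fixed-effect technique that weights each study by its precision (note that this weights by precision, not by methodological rigour as such). The four studies and their standard errors are invented for illustration.

```python
# Hypothetical studies as (effect size, standard error); values are invented.
studies = [
    (0.90, 0.40),  # small, imprecise study
    (0.70, 0.35),  # small, imprecise study
    (0.20, 0.08),  # large, precise study
    (0.15, 0.10),  # large, precise study
]


def unweighted_mean(studies):
    """Every study counts equally, whatever its precision."""
    return sum(effect for effect, _ in studies) / len(studies)


def inverse_variance_mean(studies):
    """Each study is weighted by 1 / (standard error squared)."""
    weights = [1 / se ** 2 for _, se in studies]
    weighted_sum = sum(w * effect for w, (effect, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


equal_weight_average = unweighted_mean(studies)
precision_weighted_average = inverse_variance_mean(studies)

print(f"Equal weighting:            {equal_weight_average:.2f}")
print(f"Inverse-variance weighting: {precision_weighted_average:.2f}")
```

With these invented numbers, equal weighting gives an average more than double the precision-weighted one, because the two small studies with large effects count as much as the two large studies with small effects.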

Dr. Jonathan Becker states, 
"In the field of measurement, trustworthiness is operationalized in terms of validity and reliability... validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?). I won’t write a whole treatise here about measurement validity and reliability... Suffice it to say that none of the measures of student achievement ... are either valid or reliable."
Dr. Neil Hooley's review casts doubt on the validity of Hattie's research, 
"Have problems of validity been discussed concerning original testing procedures on which meta-analysis is based? ... It has been assumed that test procedures are accurate and measure what they are said to measure. This is a major assumption ... As with all research the quality of the findings depend on the quality of the research design - including the instruments (measures that assess students) and the sampling design; (criteria for selection, size and representativeness of the sample, etc)" (p. 43).

Tom Loveless has written an excellent comparison of the two international testing regimes for maths,
"TIMSS and Program for International Student Assessment (PISA) are quite different. TIMSS is curriculum-based, reflecting the skills and knowledge taught in schools. PISA assesses whether students can apply what they’ve learned to solve “real world” problems. PISA tests an age-based sample (15-year-olds). PIRLS and TIMSS are grade-based (4th and 8th graders)."
"On PISA, New Zealand scores within 27 points of Korea (519 vs. 546). On TIMSS, New Zealand and Korea are separated by a whopping 125 points (488 vs. 613), a difference of nearly one full standard deviation between the two tests!  Chinese Taipei outscores Finland by only 2 points on PISA (543 vs.541, the scores are statistically indistinguishable)—but by 95 points on TIMSS (609 vs.514)."
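
The arithmetic behind Loveless's comparison can be checked directly from the scores he quotes. The sketch below assumes that both scales have a standard deviation of roughly 100 points, which is how the PISA and TIMSS scales were originally set; the dictionary layout and the function name `gap` are just conveniences for this example.

```python
# Mathematics scores as quoted above from Loveless.
scores = {
    "PISA": {"New Zealand": 519, "Korea": 546, "Chinese Taipei": 543, "Finland": 541},
    "TIMSS": {"New Zealand": 488, "Korea": 613, "Chinese Taipei": 609, "Finland": 514},
}

SCALE_SD = 100  # assumed standard deviation of both scales


def gap(test, country_a, country_b):
    """Score difference between two countries on one test."""
    return scores[test][country_a] - scores[test][country_b]


for country_a, country_b in [("Korea", "New Zealand"), ("Chinese Taipei", "Finland")]:
    pisa_gap = gap("PISA", country_a, country_b)
    timss_gap = gap("TIMSS", country_a, country_b)
    change_in_sd = (timss_gap - pisa_gap) / SCALE_SD
    print(
        f"{country_a} vs {country_b}: PISA gap {pisa_gap}, "
        f"TIMSS gap {timss_gap}, change {change_in_sd:.2f} SD"
    )
```

The Korea and New Zealand gap grows by 98 points between the two tests, and the Chinese Taipei and Finland gap by 93 points, each close to one standard deviation under the assumed scale.
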
More questions about PISA rankings
"A new education research brief from public school advocates, Save Our Schools (SOS), says PISA international league table rankings are misleading due to a large proportion of the 15-year-old population in many countries not being in school. 
The brief pointed to Vietnam – one of the standout performers in the results from PISA 2015 – as an example.
It said that while Vietnam achieved a ranking of 8th in science, half of its 15-year-old population was not covered because they were not in school.
The OECD’s PISA 2015 report shows that the sample of Vietnam students participating in PISA was taken from only 48.5% of Vietnam’s 15-year-olds. This was the lowest coverage of 15-year-olds in all the countries participating in PISA."
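
Why low coverage matters can be sketched with a weighted average. If the 48.5% of 15-year-olds who were sampled have one mean score, and the remainder would have had another had they been tested, the mean for the whole age group is a blend of the two. The scores below (520 for the covered group and a range of values for the uncovered group) are hypothetical; nobody knows what the out-of-school group would have scored.

```python
COVERAGE = 0.485  # share of 15-year-olds covered, as quoted above


def whole_cohort_mean(covered_mean, uncovered_mean, coverage=COVERAGE):
    """Mean for the full age group as a coverage-weighted blend of two groups."""
    return coverage * covered_mean + (1 - coverage) * uncovered_mean


covered_mean = 520  # hypothetical score for the sampled students

# Hypothetical scenarios for the students who were not in school.
for uncovered_mean in (520, 480, 440, 400):
    blended = whole_cohort_mean(covered_mean, uncovered_mean)
    print(f"Uncovered group at {uncovered_mean}: whole-cohort mean {blended:.0f}")
```

The further the uncovered group's score falls below that of the sampled students, the more the published figure overstates the performance of the age group as a whole.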

Examples of Other Student Outcomes:

The most popular TED talk, Ken Robinson's "Do schools kill creativity?", emphasises that 'creativity' should be an important outcome of education, while others emphasise 'problem-solving' or 'teamwork' or something else. For example, in his 2012 summary of VL, Hattie emphasises 'critical skills' (p. 4).

As discussed above, Hattie's synthesis has major problems with comparing disparate achievement measures. Yet he has not even begun to address the outcomes of his own learning model, like 'critical thinking' or 'managing self', let alone all the other outcomes like 'creativity' or the Mäori cultural values such as 'fairness', 'compassion' or 'tolerance'.

I'm not aware of anyone who can meaningfully measure any of these things. Certainly, 'creativity' assessment is a major issue. Even the skill of 'problem-solving', which PISA has recently focused on, is very difficult to measure.

An Example of the difficulties of assessing and ranking:

Tom Brady has become what many say is the best quarterback (QB) in US sports history. QB skills are more precisely defined and measured than academic skills. They are measured from high school through to college: the accuracy of throwing, reaction time, peripheral vision and previous history on a range of game measures. Even with all this assessment data, many bad assessments are still made. Brady is one example: he was picked (ranked) at 199, yet, really by chance, he got his opportunity when his more highly ranked team-mates were injured.

Teacher Performance Pay

Hattie has been a strong advocate of teacher performance pay and has promoted his software 'asTTle' for this purpose. Based on the discussion above, 'asTTle' uses a very narrow measure of student achievement, which is concerning to many academics, for example,

Professor Ewald Terhart (2011, p. 434),
"A part of the criticism on Hattie condemns his close links to the New Zealand Government and is suspicious of his own economic interests in the spread of his assessment and training programme (asTTle). 
Similarly, he is accused of advertising elements of performance-related pay of teachers and he is being criticised for the use of asTTle as the administrative tool for scaling teacher performance. His neglect of social backgrounds, inequality, racism, etc., and issues of school structure is also held against him... 
However, there is also criticism concerning Hattie’s conception of teaching and teachers. Hattie is accused of propagating a teacher-centred, highly directive form of classroom teaching, which is characterised essentially by constant performance assessments directed to the students and to teachers."
We are a long way from agreeing on student outcomes and consistently defining what achievement is, let alone measuring it!

Breakspear (2014, p. 11) warns,

"Narrow indicators should not be equated with the end-goals of education."
In his excellent paper 'School Leadership and the cult of the guru: the neo-Taylorism of Hattie', Professor Scott Eacott also warns (p. 9),
"Hattie’s work has provided school administrators with a science of teaching. The teaching and learning process is no longer hidden in the minds of learners, but made visible. This sensory experience can be used as the data for the generation of data – therefore making it measurable – and evidence informed decisions on what to do. The data is an extension of educational administration, in the era of data, if there is no evidence of learning then it did not happen. Furthermore, if there is no evidence of learning, then teaching did not happen and this is a performance issue to be managed by administrators."
