Validity and Reliability

"Garbage in, Gospel out” Dr Gary Smith (2014)

Dr Jonathan Becker, writing about Marzano's methods in terms that apply equally to Hattie, states that,
"trustworthiness is operationalized in terms of validity and reliability... validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?). I won’t write a whole treatise here ... Suffice it to say that none of the measures of student achievement ... are either valid or reliable."
Hattie is well aware of the issues of validity and reliability, as he writes about them in other publications, e.g., Assessing Teachers for Professional Certification, Volume 11, Chapter 4 (pp. 94 ff). However, he says very little about these issues as they apply to his own work in Visible Learning.

VALIDITY

We've already seen that each study measures achievement differently and many do not measure achievement at all. Hattie's over-reliance on correlation studies rather than true experiments exacerbates this problem.


Kulik and Kulik (1984), whom Hattie cites a number of times (see ability grouping), comment on validity,
'With poorly calibrated measuring instruments, investigators can hardly expect to achieve exact agreement in their results' (p. 89).
Rømer (2016),
'The relevant methodological question is, in my view, as follows: How does Hattie define "achievement outcome/learning outcomes/learning", which is his central effect concept...
We get a form of learning-theoretic definition of learning outcomes in Chapter 3. Hattie believes that what he calls "surface" and "deep" learning are contradictory to each other...
The theoretical uncertainties are amplified in the empirical analyses, which are deeply coloured by this lack of clarity. There is no clue, as far as I can see, as to how "achievement" is operationalised: is it surface learning, deep learning, constructed learning or, for that matter, something else entirely that is measured?' (p. 5, translated from Danish).
RELIABILITY

Hattie's rankings and effect sizes are not consistent: he does not find the same result when the studies are replicated. A key tenet of scientific research is that results need to be replicated for them to be reliable. See also - Other Researchers. Note: many think there is a major problem with the lack of replicability in modern research - see here.


Nilholm, Claes (2017), 'Is John Hattie in hot water?' (translated from Swedish)
"Hattie provides very scarce information about his approach. This makes it very difficult to replicate his analyses. The ability to replicate an analysis is considered by many to be a crucial hallmark of scientific work" (p. 3).
Hattie's rankings conflict with those of other significant organisations:

Influences Hattie categorises as a 'disaster' (class size, problem-based learning, ability grouping, welfare and time in school) are treated by other organisations as priorities. Likewise, influences Hattie ranks very low (teachers' content knowledge, student control, problem-solving, teacher professional development and individualised learning) are ranked highly by these organisations.

An interesting interview appears below, in which Hattie interviews the Finnish educational leader Pasi Sahlberg, who contradicts virtually every tenet Hattie proposes.


Meta-analyses do not replicate results! 

A consistent pattern in Hattie's work is that the meta-analyses he cites have different and contradictory results. Hattie hides the inconsistency and contradictions by averaging all the disparate studies into one effect size.

For example, Hattie used 3 different meta-analyses for his influence 'Reducing Disruptive Behaviour'. Reid et al (2004) report an effect size of -0.69, totally contradicting the other 2 meta-analyses.

So Reid et al (2004) apparently report that 'reducing disruptive behaviour' DECREASES student achievement by a sizeable amount, whereas the other 2 studies report it IMPROVES achievement by a sizeable amount.

Hattie averages the 3 results of 0.93, 0.78 and -0.69 into one effect size of 0.34, thus hiding the inconsistency and contradiction. The -0.69 result warrants more detailed analysis here.
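As a rough illustration (this is not taken from Hattie's text, just the arithmetic implied by the figures above), a plain unweighted mean of the three reported effect sizes reproduces the 0.34 figure while giving no hint that one of the results points the other way:

```python
# Illustrative sketch only: an unweighted mean of the three effect sizes
# quoted above for 'Reducing Disruptive Behaviour'.
effects = [0.93, 0.78, -0.69]  # two positive meta-analyses and Reid et al (2004)

mean_d = sum(effects) / len(effects)
print(f"Averaged effect size d = {mean_d:.2f}")  # 0.34

# The single averaged figure erases the fact that one of the three
# meta-analyses reports a large effect in the opposite direction.
```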


Another example is Hattie's influence 'Problem-Based Learning', where he reports effect sizes ranging from -0.30 (Newman, 2004) to 0.30 (Smith, 2003) and up to 1.13 (Mellinger, 1991).

Different Averaging Methods Yield Totally Different Results

Some researchers use weighted averages which adjust for the number of students in each study.


Hattie does not report the total number of students, even though he often has the data. For example, the Newman (2004) study above used 51 nurses (p. 6), while the Smith (2003) study used 12,979 medical students, yet both receive the same weight!

So the negative effect of the Newman study cancels out the positive effect of the Smith study. However, if weighting is used (although another issue is why one study, Newman (2004), is being compared with a meta-analysis at all), we get Newman d = 0.001 while Smith d = 0.31. QUITE A DIFFERENCE!
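A minimal sketch of the difference, assuming the student counts quoted above are used as weights (illustrative only, not Hattie's actual method):

```python
# Illustrative sketch: unweighted vs sample-size-weighted mean for the
# Newman (2004) and Smith (2003) effect sizes quoted above.
studies = {
    "Newman (2004)": {"d": -0.30, "n": 51},      # 51 nurses
    "Smith (2003)":  {"d": 0.30,  "n": 12_979},  # 12,979 medical students
}

# Unweighted: each study counts equally, so the two effects cancel out.
unweighted = sum(s["d"] for s in studies.values()) / len(studies)

# Weighted by sample size: the tiny Newman study barely moves the result.
total_n = sum(s["n"] for s in studies.values())
weighted = sum(s["d"] * s["n"] for s in studies.values()) / total_n

print(f"Unweighted mean d = {unweighted:.2f}")  # 0.00
print(f"Weighted mean d   = {weighted:.2f}")    # ~0.30
```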

See Effect Size for more examples.

Also, some researchers use the median, not the mean, e.g., Slavin (1990) (see ability grouping). 
"In pooling findings across studies, medians rather than means were used, principally to avoid giving too much weight to outliers. However, any measure of central tendency ... should be interpreted in light of the quality and consistency of the studies from which it was derived, not a finding in its own right" (p. 477).
Different selection criteria for studies: 

Different scholars use different selection or quality-control criteria, e.g., the 'Problem-Based Learning' studies again: most of the studies are of university medical students, and many authors would reject those studies as not relevant to primary/high schools.

The only study involving primary/high school students is the Hass (2005) algebra study (d = 0.52); as such, you could legitimately argue that this was the only study relevant to primary/high schools.

You will find this issue with most of Hattie's work; another example is Feedback.

Standardised Tests

Hattie & Clinton (2008, p. 320) point out the issues of standardised testing.
"The standardised tests that are usually administered often do not reflect the classroom curriculum and learning objectives of the teachers at a particular time."
Also, many of Hattie's researchers identify major issues with using standardised or norm-referenced tests, e.g., Hacke (2010, p. 100),
"The results of this meta-analysis are based on standardised achievement scores, which may provide a misleading estimate of National Board Certified Teacher effectiveness because of validity issues. Validity refers to how appropriate and meaningful the inferences are that can be drawn from assessment results based on their intended use."
A good overview of the difference between PISA (standardised tests) and TIMSS (curriculum-based tests) - click here.

PISA versus TIMSS

Tom Loveless has written an excellent comparison of the two international testing regimes for maths here.
"TIMSS and Program for International Student Assessment (PISA) are quite different. TIMSS is curriculum-based, reflecting the skills and knowledge taught in schools. PISA assesses whether students can apply what they’ve learned to solve “real world” problems. PISA tests an age-based sample (15-year-olds). PIRLS and TIMSS are grade-based (4th and 8th graders)."
"On PISA, New Zealand scores within 27 points of Korea (519 vs. 546). On TIMSS, New Zealand and Korea are separated by a whopping 125 points (488 vs. 613), a difference of nearly one full standard deviation between the two tests!  Chinese Taipei outscores Finland by only 2 points on PISA (543 vs.541, the scores are statistically indistinguishable)—but by 95 points on TIMSS (609 vs.514)."
More questions about PISA rankings
"A new education research brief from public school advocates, Save Our Schools (SOS), says PISA international league table rankings are misleading due to a large proportion of the 15-year-old population in many countries not being in school. 
The brief pointed to Vietnam – one of the standout performers in the results from PISA 2015 – as an example.
It said that while Vietnam achieved a ranking of 8th in science, half of its 15-year-old population was not covered because they were not in school.
 
The OECD’s PISA 2015 report shows that the sample of Vietnam students participating in PISA was taken from only 48.5% of Vietnam’s 15-year-olds. This was the lowest coverage of 15-year-olds in all the countries participating in PISA."
Breakspear (2014), in his analysis 'How does PISA shape education policy making', states the importance of what we measure,
"If the educational narrative is dominated by the performance of 15-year-olds in PISA, other important educational goals such as social and emotional development, interpersonal and intrapersonal skills, civics, health and wellbeing, will be held at the margins of the debate. In the end, systems will focus on optimising what they measure" (p. 14).
VIDEOS

1. Harvard Education Review - watch in YouTube mode and start at 50 minutes


2. Pasi Sahlberg, educational guru of the high-achieving Finnish system, in a recent interview by Hattie (thanks to Kelvin Smythe)



Pasi seems to contradict many of Hattie's findings - "Talk about a clash of two worlds. Market forces, individuality, parent choice, and competition versus a community-based system based on well-paid and trained teachers; a system strong on equity, a system that values children’s health, well-being, and happiness. An inclusive system with no streaming – where the first choice comes in at 16 when students choose between general and vocational education.

Finland is based on local community control. Schooling is decentralised. Schools have lots of autonomy, responsibility, and initiative. But no parent choice until 16! Finland focuses on equity, not individuality and competition. All special education is made inclusive but helped according to need.

 For accountability, Finland relies on its schools and teachers plus NEMP-like sampling." Kelvin Smythe.
