Prof Adrian Simpson

Prof Adrian Simpson has written a number of critiques of Hattie's work: 

Higgins & Simpson (2011). A Review of Visible Learning.

Simpson (2017). The misdirection of public policy: comparing and combining standardised effect sizes.

Simpson (2018a). Princesses are bigger than elephants: Effect size as a category error in evidence-based education.

Simpson (2018b). Unmasking the unasked: correcting the record about assessor masking as an explanation for effect size differences.

Simpson (2019). Separating arguments from conclusions: the mistaken role of effect size in educational policy research.

Princesses are Bigger than Elephants?

His 2018 analysis uses a memorable example: comparing the sizes of elephants and princesses by how much of a photograph each fills, as a proxy for their actual size, and then concluding that princesses are bigger! It is a clever analogy for the practice of comparing effect sizes across different studies:
"Adam photographs an elephant. The elephant’s image covers 0.2 of the area of the photograph. 
Belinda photographs another elephant. Her elephant’s image covers 0.3 of the area of her photograph. 
Simon compares these numbers, concluding that the second elephant must be larger. 
Catherine collects many photographs of elephants. For each she works out the ‘photo-size’ of each elephant (the proportion of the photograph filled by the elephant’s image) and, averaged across the collection, finds a photo-size of 0.18. 
Douglas collects photographs of princesses. The average photo-size of princesses in his collection is 0.24. 
Tabitha compares these numbers, concluding that princesses are bigger than elephants. She cautions that this does not mean that a particular princess is bigger than a particular elephant: this is about averages. 
Uri draws together Catherine’s work (with other categories). He produces a league table: politicians and microbes towards the top, ants and white rhinos towards the bottom. This league table is promoted as the ‘best bet’ for indicating which creatures are really bigger or smaller, with caveats about assumptions needed to interpret it. 
This story is clearly designed to expose the argument’s absurdity. An object’s physical size plays only a partial role in photo-size, so it is a category error to treat relative photo-size as a proxy for relative physical size. 
Only if Simon had reason to believe Adam and Belinda used the same camera and lens and stood the same distance away, might he legitimately argue that Belinda’s elephant is larger. 
Similarly, Tabitha’s conclusion that relative averaged photo-size can act as a proxy for relative averaged actual size relies on strong assumptions: that other elements affecting photo-size are distributed equally for photographs of princesses and for photographs of elephants. 
The same assumptions are needed for Uri in comparing actual average sizes of classes of creatures on the basis of relative averaged photo-sizes. Not only are these heroically strong assumptions (which Uri leaves unchecked), but it is also clear that they cannot be met. They require the design decisions of photographers to be distributed equally across areas, but photographers do not use the same cameras for microbes and politicians; nor do they stand at the same distance when photographing white rhinos and ants. 
Design decisions vary systematically between areas, so the argument for using relative photo-size as relative actual size is invalid. 
This article will show that identical, fundamentally flawed arguments underpin much of the ‘evidence-based education’ movement. First, effect size does not measure the effectiveness of an intervention (nor its educational importance or influence), since the intervention plays only a partial role in the calculation of effect size. Second, when comparing studies, relative effect size can be a proxy for the relative effectiveness of interventions only in the highly restricted circumstances where all other factors impacting on effect size are equal. Third, when comparing groups of studies, relative averaged effect size can be a proxy for relative average effectiveness for types of intervention only in the highly restricted circumstances where all other factors impacting on effect size are distributed identically across those groups of studies. While meta-meta-analysts may assume those circumstances hold, they do not: instead, these factors vary systematically between types of intervention" (p. 897).
Hattie has argued that these elements are factored out by averaging across studies. Simpson replies,
"While one may argue that, all other things being equal, these other elements are factored out by averaging across studies, it should be clear that all other things are not equal: tests, samples and comparison activities vary systematically between educational areas" (p. 903).
Simpson then gives specific examples of each of these variations:

1. different comparison treatments - the control group.
2. different samples of students - the standard deviation.
3. different tests - how is achievement measured?

Simpson then shows how the calculated effect size varies enormously with the SAME intervention!
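To see where each of these enters the calculation, it helps to write out how a standardised effect size (Cohen's d) is computed. The sketch below is not from Simpson's paper; it is a minimal illustration with invented numbers:

# Cohen's d: the standardised effect size used throughout Hattie's syntheses.
# d = (intervention mean - comparison mean) / pooled standard deviation
# Every term on the right depends on a design choice: WHICH comparison group,
# WHICH sample of students (their spread), and WHICH test produces the scores.

def cohens_d(mean_intervention, mean_comparison, sd_intervention, sd_comparison):
    # Simple pooled SD, assuming equal group sizes.
    pooled_sd = ((sd_intervention ** 2 + sd_comparison ** 2) / 2) ** 0.5
    return (mean_intervention - mean_comparison) / pooled_sd

# Invented example: intervention group averages 60 on some test, the
# comparison group averages 55, and both groups have a standard deviation of 10.
print(cohens_d(60, 55, 10, 10))  # 0.5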

1. The Comparison Treatment - the control group.

Simpson shows the comparison treatment chosen by the researcher has a huge impact on the effect size. 

"meta-analyses sometimes include studies with starkly inactive comparisons: not teaching the topic at all. For example, Steenbergen-Hu and Cooper’s (2014) meta-analysis of intelligent tutorial system interventions includes studies where the comparison activity was human tutoring (average d = -0.25) and studies where the comparison was reading computerised material (average d = 0.25). Most interestingly, a group of studies had ‘no treatment’ comparisons (average d = 0.90), including a study with a sample with no previous economics teaching, using an intervention treatment teaching economics using an intelligent tutoring system, while the comparison group were not taught economics at all; the outcome was measured with an economics test—unsurprisingly, the effect size was rather large, around d = 1.5 (Shute & Glaser, 1990).  
While some may argue that researchers should know that effect sizes are relative to comparison treatments, meta-analysts show no qualms in combining very different comparison treatments in a single average (here, d = 0.35), with meta-meta-analysts using that summary value to rank order interventions (e.g. Schneider & Preckel, 2017). 
The choice of comparison activity is neither arbitrary nor random; researchers select it to meet their intentions, within restrictions laid down by convention and practicality. Where possible, having a less active comparison allows increased power (the chance of detecting a group difference) and therefore effect size, without altering the intervention. 
Thus, it is a category error to read relative effect sizes as relative effectiveness of interventions, let alone to assume they signal educational importance, relevance or influence" (p. 901).
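The arithmetic behind this point is easy to sketch (the numbers below are invented, not Simpson's or Steenbergen-Hu and Cooper's): the intervention group's scores never change, yet the effect size swings from negative to very large depending solely on what the comparison group did:

# Same intervention group (mean 60, SD 10 on the same test), three different
# comparison treatments. All numbers are invented purely for illustration.
pooled_sd = 10  # assume both groups have SD 10, so the pooled SD is 10

comparisons = {
    "human tutoring": 63,               # active comparison outperforms -> negative d
    "reading computerised material": 57,
    "not taught the topic at all": 45,  # inert comparison -> large d
}

for label, comparison_mean in comparisons.items():
    d = (60 - comparison_mean) / pooled_sd
    print(f"{label}: d = {d:+.2f}")
# human tutoring: d = -0.30; reading: d = +0.30; not taught at all: d = +1.50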
2. The Sample of Students - the standard deviation.
"The choice of sample can impact on effect size for a study in two interacting ways, independent of the intervention, control treatment or test. The first follows from the definition of effect size: the more homogeneous the sample on the outcome measure, the higher the effect size. The second relates to the mechanism through which the intervention treatment works. 
The first issue has long been known (e.g. FitzGibbon, 1984), albeit meta-analysts rarely address it. The second issue with the sample is that an intervention may be expected to work differently with different people—in particular, effectiveness may vary with pre-existing ability. A computer-based activity aimed at improving test-taking techniques may be effective with pupils struggling with such techniques; so, a sample consisting of these pupils may show a larger (raw) difference in mean scores on a suitable test. The same activity may be ineffective with confident test takers; so, a sample of those may show little difference in mean scores (Martindale et al., 2005). 
The closer the researcher can match choice of sample to the people for whom the intervention’s underlying mechanism is effective, the larger the mean difference. The more general the sample, including people for whom the intervention is ineffective, the smaller the mean difference (and also the wider the spread)...
These issues also interact: if one study conducted with lower-achieving groups has a higher effect size than an otherwise identical study with wider-ability groups, without adjusting for range restriction it is difficult to tell whether the intervention is better targeted at lower-achieving pupils or whether restricted range has inflated effect size.  
However, the key issue is that studies with the same intervention (and the same comparison treatment and outcome measure) can have very different effect sizes with different samples. Again, this demonstrates that identifying relative effect sizes with the relative effectiveness of interventions is a category error" (p. 902).
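A minimal numeric sketch of this restriction-of-range point (again with invented numbers): the raw gain is identical in both scenarios, and only the spread of the sample changes:

# Identical intervention, identical 5-point raw gain on the same test.
# Only the sample changes: a broad-ability sample vs a narrow (homogeneous) one.
raw_gain = 5

broad_sample_sd = 15   # wide range of prior ability
narrow_sample_sd = 5   # restricted range, e.g. only lower-achieving pupils

print(raw_gain / broad_sample_sd)   # d ≈ 0.33
print(raw_gain / narrow_sample_sd)  # d = 1.00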
3. The Test Used

Simpson provides a thought experiment: a simple regime for learning one Hungarian word, assessed with different possible tests of that learning. Depending on the test chosen, he can generate an effect size anywhere from 0 to infinity!
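The exact details are in Simpson's paper; the sketch below is only a rough, hypothetical illustration of the logic, with invented scores:

from statistics import mean, pstdev

def cohens_d(intervention_scores, comparison_scores):
    pooled_sd = ((pstdev(intervention_scores) ** 2 +
                  pstdev(comparison_scores) ** 2) / 2) ** 0.5
    if pooled_sd == 0:
        return float("inf")  # no spread at all, so d is unbounded
    return (mean(intervention_scores) - mean(comparison_scores)) / pooled_sd

# Test A: one item asking for exactly the taught word -- everyone taught it
# scores 1, everyone else scores 0, there is no spread, and d blows up.
print(cohens_d([1, 1, 1, 1], [0, 0, 0, 0]))          # inf

# Test B: a broad vocabulary test on which the single taught word barely
# registers against the spread of general vocabulary knowledge.
print(cohens_d([25, 15, 20, 22], [24, 16, 20, 20]))  # ≈ 0.15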

Then Simpson provides specific examples,
"In their evaluation of the ‘Response to Intervention’ literacy programme, Gorard et al. (2014) report d = +0.19 for the ‘New Group Reading Test’ and -0.09 for the ‘Progress in English’ test. In their evaluation of the Nuffield Early Language intervention, Sibieta et al. (2016) used a variety of different outcome measures, including a grammar test (d = 0.29), an expressive vocabulary test (0.25), a letter sound knowledge test (0.12) and a word reading test (0.01). The same intervention, sample and comparison treatment results in very different effect sizes depending on the test" (p. 903).
Simpson then cites other researchers with the same findings,
"Cheung and Slavin (2016) looked at 645 studies across 12 meta-analyses across a wide range of topics. They separated studies using tests designed by the researchers from studies with ‘standardised tests’ (tests designed by others to cover a particular area of the curriculum, often having been norm referenced, designed for particular age ranges, etc.). Across the studies, effect sizes for standardised tests were around half those of researcher-designed tests... 
Ruiz-Primo et al. (2002)... found effect sizes between two and five times larger for close assessments than distal ones" (p. 903).
Simpson summarises,
"A test focused on what it is that the intervention does to the pupils (compared to the comparison) will lead to a larger effect size. However, even in an intervention with a very narrow outcome (such as improving procedural fraction addition), researchers may be constrained to use a standardised test. But they can still select to maximise power (and increase effect size) by choosing, say, a numeracy test rather than a more general mathematics test. 
Again, there are other subtleties with test choice. Depending on the consistency of the test items, increasing the length of a test may increase the effect size (though equally, adding irrelevant items will tend to decrease the effect size by adding noise). 
This is bound up with design decisions and researchers’ freedoms and constraints: a long test may be impractical, piloting a test may have led to changes which increase its reliability (and therefore effect size), and so on. 
Selecting a test is a design decision which, for fixed intervention, sample and comparison treatment, can result in very different effect sizes. So, again, taking relative effect size as a measure of the relative effectiveness of interventions is a category error" (p. 904).
Simpson's conclusion
"Recall that the average photo-size of one set of creatures being larger than the average photo-size of another warrants a valid conclusion that the average real size of the set of creatures is larger only if an ‘all other things being equal’ (in distribution) assumption holds. Only if the distribution of lenses and distances for elephant photographers and princess photographers were the same—either systematically orrandomly—in Catherine’s and Douglas’s collections, might the argument work. 
The same requirements apply to meta-analysis. To make a valid comparison between averaged effect sizes stand as a valid comparison of the effectiveness of interventions, the other elements impacting on effect size have to be distributed equally across the meta-analyses (or meta-meta-analyses)" (p. 906). 
"‘Evidence based education’ proponents often claim inspiration from pharmacology, yet we would dismiss arguments about ‘the effect of aspirin’ that drew on studies of aspirin against a placebo, aspirin against paracetamol, aspirin against an antacid and aspirin against warfarin on measures as diverse as blood pressure, wound healing, headache pain and heart attack survival. Yet we are asked to believe in an ‘effect for instructional technology’ based on comparing interactive tutorials to human tutoring, to reading text, to ‘traditional instruction’ and to not being taught the topic at all, on outcomes as diverse as understanding the central limit theorem, writing simple computer programs, completing spatial transformations and filling out account books.
The assumptions for comparing meta-analyses and meta-meta-analyses require that design factors are distributed equally across different areas of education, yet clearly researchers in different areas make systematically different decisions. In Berk’s (2011) deconstruction of meta-analysis he notes that even small deviations from assumptions invalidate the inferential argument, leaving ‘statistical malpractice disguised as statistical razzle-dazzle’" (p. 910).
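A toy illustration of this central point about averaging (all numbers invented): two intervention types are constructed to be equally effective, but researchers in one area habitually use narrow researcher-designed tests and inert comparison groups, while researchers in the other use broad standardised tests and active comparisons. The averaged effect sizes then differ for reasons that have nothing to do with effectiveness:

from statistics import mean

true_gain = 5  # identical raw gain for both intervention types, by construction

# Area A studies: narrow researcher-designed tests (small SDs) and inert
# comparison groups that add a further 5 'free' points of mean difference.
area_a_effect_sizes = [(true_gain + 5) / sd for sd in (5, 6, 7)]

# Area B studies: broad standardised tests (large SDs) and active comparisons.
area_b_effect_sizes = [true_gain / sd for sd in (14, 15, 16)]

print(mean(area_a_effect_sizes))  # ≈ 1.70
print(mean(area_b_effect_sizes))  # ≈ 0.33
# A league table of averaged effect sizes ranks Area A far above Area B, even
# though the interventions are, by construction, equally effective.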
Hattie's Response

In an interview with Ollie Lovell in 2018, Hattie responds to Simpson's 2017 paper.

But, in my view, Hattie uses red herrings and tangents to talk around the issues and never deals with the specific examples Simpson provides.

Hattie's typical response to the issues Simpson raises is,

Yes, it can be an issue.

But not in my work,

And I'm sensitive to these issues.

I've just added a review of Collective Teacher Efficacy, which in the last couple of years has replaced Self-Reported Grades as the influence with the largest effect size in Hattie's rankings.

The research Hattie cites suffers from all the problems detailed by Simpson: the study is correlational, so there is NO control group; the sample is small; and a range of different tests is used. In reality, Hattie does not appear to be sensitive to these issues.

Also, the Evidence for Learning Organisation (an offshoot of the Education Endowment Foundation (EEF)) posted a response to Simpson's paper here.

They also mostly agree with Simpson but, like Hattie, they do not respond to Simpson's specific examples; they simply state that they are careful not to let these issues interfere with their work.

Hattie and the EEF just Ignore the Problem

Many scholars concur with Simpson.

The interviewer, Ollie Lovell, also posted a detailed review of Hattie's answers here and summarises:
'...I came to the conclusion that combining effect sizes from multiple studies, then using these aggregated effect sizes to try to determine ‘what works best’, equates to a category error... I asked myself the following, ‘Has an effect size ever helped me to be a better teacher?’ I honestly couldn’t think of an example that would enable me to answer ‘yes’ to this question... But if for you, like me, the answer is ‘no’, then let’s agree to stop basing policy decisions, teacher professional development, or anything else in education upon comparisons of effect sizes. As both John and Adrian suggest, let’s focus on stories and mechanisms from now on.'
Profs Snook, Clark, Harker, Anne-Marie O’Neill and John O’Neill respond to Hattie's 2010 defense in 'Critic and Conscience of Society: A Reply to John Hattie' (p. 97),
'In our view, John Hattie’s article has not satisfactorily addressed the concerns we raised about the use of meta-analyses to guide educational policy and practice.'
Prof Arne Kare Topphol responds to Hattie's defense,
'Hattie has now given a response to the criticism I made. What he writes in his comment makes me even more worried, rather than reassured.'
Darcy Moore posts,
'Hattie’s [2017] reply to Eacott’s paper does not even remotely grapple with the issues raised.'
Prof Eacott also responded to Hattie's defense,
'Disappointed that SLAM declined my offer to write a response to Hattie's reply to my paper. Dialogue & debate is not encouraged/supported.'
Eacott (2018) was able to publish a response in a different journal,
'Hattie did produce an antithesis to my original paper. His response made a counter argument to my claims. To do so, he did not need to engage with my ideas on anything but a superficial level (to think with Ryle, this is the wink all over – a focus on a few words without grasping their underlying generative meaning). There was no refutation of my claims. There was no testing of my ideas to destruction and no public accountability for his analysis. If anything, he simply dismissed my ideas with minimal reference to any evidence' (p. 6).
Professor Pierre-Jérôme Bergeron, in his voicEd interview, also talks about Hattie's conflict of interest and his reluctance to address the detailed arguments of his critics. Listen here - at 17min 46sec.

