Effect Size

"If you torture the data long enough, it will confess." Ronald Coase.

A short video summary of Hattie's methods here and another short video summary of how effect size is calculated here.

The effect size statistic is the cornerstone of Hattie's work. He claimed that the larger the effect size, the greater the impact on student learning (hence his slogan "know thy impact"). This enabled him to rank educational influences and list the top-ranked influences as "what works best".

However, in 2018, in an interview with Ollie Lovell, he admitted his rankings were misleading and that he does not rank anymore - listen @ 1hr 21minutes here.

Example of an Effect Size Calculation

Clark & Mayer (2016) give a simple example,

The control group received a basic multimedia lesson that explains content with graphics and audio narration. We call this the no-music group. The experimental group received the same lesson with background music added to the narration. We call this the music group. 

Suppose the no-music group averaged 90 percent correct on a test of the material and the music group averaged 80 percent on the same test. 

Also suppose the scores were not very spread out, so most of the no-music students scored close to 90 and most of the music students scored close to 80. 

The standard deviation tells you how spread out the scores are, or how much variation there is in the results.

Clark & Mayer (2016) provide the following summary and Effect Size calculation.

The effect size is equivalent to a 'Z-score' of a standard normal distribution. For example, an effect size of 1 means that the score of the average person in the experimental (treatment) group is 1 standard deviation above the average person in the control group (no treatment).
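As a minimal sketch of the calculation described above, the following Python snippet computes a standardised mean difference for the music example. The group means (80 and 90) come from the example; the standard deviation of 10 is an assumed value chosen only for illustration, and the function names are hypothetical. The second function shows the 'Z-score' reading, which holds only if scores are roughly normally distributed.

```python
from statistics import NormalDist


def effect_size(mean_experimental, mean_control, sd):
    """Standardised mean difference: (experimental mean - control mean) / SD."""
    return (mean_experimental - mean_control) / sd


def share_of_control_below(es):
    """Share of the control group scoring below the average experimental
    student, assuming normally distributed scores."""
    return NormalDist().cdf(es)


ASSUMED_SD = 10.0  # assumption: the SD is not given in the text above

# Music group (experimental) averaged 80, no-music group (control) averaged 90.
es_music = effect_size(80.0, 90.0, ASSUMED_SD)
print(es_music)  # negative, because the music group did worse

# An effect size of 1 places the average treated student above roughly 84%
# of the control group.
print(round(share_of_control_below(1.0), 2))
```

With a different assumed SD the same 10-point gap would give a different effect size, which is the theme of the problems discussed below.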

But, there are many OTHER ways to calculate an effect size.

Hattie claims his 2 Major Ways to Calculate Effect Size are:

Hattie calls the first equation the Random Method, which is the same as Clark & Mayer (2016) above, and the second the Fixed Method (VL, p. 8).

Hattie stated he used the fixed method,
"Given that the majority of meta-analyses so far published have used the fixed effect model, then this fixed model has been used in this book." (VL, p. 12)

Problem 1 - Can You Directly Compare Effect Sizes Calculated by Different Methods?

The peer review below shows there is significant doubt about Hattie's foundational premise that effect sizes from disparate studies can be compared to determine "what works best".

Hattie even admits if you mix the above two methods up you have significant problems interpreting your data,
"combining or comparing the effects generated from the two models may differ solely because different models are used and not as a function of the topic of interest" (VL, p. 12). 
Slavin (2015) and Bergeron & Rivard (2017) also identified this issue with Hattie's work, with Bergeron & Rivard (2017) stating,
"These two types of effects are not equivalent and cannot be directly compared... A statistician would already be asking many questions and would have an enormous doubt towards the entire methodology in Visible Learning and its derivatives."
UPDATE: Hattie finally admits this is a major problem in his work. 

In Wisniewski B, Zierer K, Hattie J. (2020), The Power of Feedback Revisited: A Meta-Analysis of Educational Feedback Research, Hattie reverses the view he stated in VL on the 'fixed method',
"the use of a fixed-effect model may not be appropriate. A meaningful interpretation of the mean of integrated effect with this model is only possible if these effects are homogenous (Hedges and Olkin, 1985). Because previous research on feedback includes studies that differ in variants of treatment, age of participants, school type, etc., it is highly likely that the effect size varies from study to study, which is not taken into account by a fixed-effect model. By contrast, under the random-effects model, we do not assume one true effect but try to estimate the mean of a distribution of effects. The effect sizes of the studies are assumed to represent a random sample from a particular distribution of these effect sizes (Borenstein et al., 2010)." (p. 2)

Hattie's Inconsistency

Hattie's inconsistency is concerning as he continues to compare studies using both the fixed and random methods with his commercial partner Corwin - list of studies here.

A clear example is with Feedback. Originally the effect size for feedback was 0.73, but using this random effects method, Wisniewski B, Zierer K, Hattie J. (2020) publish a significantly reduced result of 0.48.
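To illustrate why the choice of model matters, here is a hedged sketch that pools the same set of study effects under a fixed-effect model and under a random-effects model. The study effects and variances are hypothetical, and the DerSimonian-Laird estimator of between-study variance is an assumption: it is one common choice, and the text above does not say which estimator Wisniewski, Zierer & Hattie used.

```python
def fixed_effect_mean(effects, variances):
    """Inverse-variance weighted mean (fixed-effect model)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def dersimonian_laird_tau2(effects, variances):
    """Between-study variance estimate (DerSimonian-Laird), floored at 0."""
    weights = [1.0 / v for v in variances]
    mean_fe = fixed_effect_mean(effects, variances)
    q = sum(w * (e - mean_fe) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    return max(0.0, (q - df) / c)


def random_effects_mean(effects, variances):
    """Weighted mean where each weight also includes between-study variance."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    weights = [1.0 / (v + tau2) for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


# Hypothetical data: one large, precise study with a small effect and
# three small, imprecise studies with large effects.
effects = [0.10, 0.80, 0.90, 0.70]
variances = [0.005, 0.08, 0.10, 0.09]

fe = fixed_effect_mean(effects, variances)
re = random_effects_mean(effects, variances)
print(f"fixed-effect mean:   {fe:.2f}")
print(f"random-effects mean: {re:.2f}")
```

In this made-up case the fixed-effect mean is dominated by the one large study, while the random-effects mean sits much closer to the simple average of the four studies. The direction and size of the gap depend entirely on the data, so this only shows that the two models can disagree, not how they disagree for feedback.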

However, with Corwin, Hattie continues to combine studies using any method of effect size calculation to get 0.62 (Corwin, October, 2021).

Once again, this indicates that Hattie's original premise, that effect sizes from disparate studies can be compared, is neither reliable nor valid.

Worse, Hattie continues to use a 3rd method: converting a correlation to an effect size. Hattie's top 3 ranked influences are correlation studies (Collective Teacher Efficacy, Self-Report Grades and Piagetian Programs) where he uses this conversion, which is also highly criticised in the peer review - see below.

A 3rd Method - Correlation Converted to an Effect size

Despite Hattie's claim above of using the 'fixed' method in VL, a number of scholars, Bergeron & Rivard (2017), Blatchford (2016), Wrigley (2018), Bakker et al. (2019) & Kraft (2020) have identified that Hattie most often uses a 3rd method of converting a correlation to an effect size, but there are major problems with this - see Correlation.
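For readers unfamiliar with this kind of conversion, the sketch below uses one widely taught textbook formula, d = 2r / sqrt(1 - r^2). This is an assumption for illustration: the text above does not state which formula Hattie applied, and the function name is hypothetical.

```python
from math import sqrt


def correlation_to_d(r):
    """Convert a correlation r to a standardised mean difference d
    using the common textbook formula d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / sqrt(1.0 - r * r)


for r in (0.1, 0.3, 0.5, 0.7):
    print(f"r = {r:.1f}  ->  d = {correlation_to_d(r):.2f}")
```

Note how quickly d grows: a moderate correlation of 0.5 already converts to a d above 1, even though a correlation by itself says nothing about whether an intervention caused the difference.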

In his updated version of VL 2012 (summary) Hattie once again emphasises he mostly uses the 'fixed' method. Again, he makes no mention of using the weaker methodology of correlation (p. 10).

Yet, having neither revealed nor justified his use of correlation studies, and having admitted there are issues when comparing methods, Hattie ignores the problem and directly compares effect sizes from the 3 different methods without comment or adjustment.

But, Studies Use Even More Ways to Calculate the Effect Size!

Simpson (2017, p. 452) explains, 
"...while calculating an effect size may be simple enough for a first course in statistics, there are considerable subtleties in understanding it sufficiently well to ensure that the processes of combining effect sizes in meta-analyses allow valid conclusions to be drawn."
There are further variations in the effect size calculation that different researchers use, e.g., Cohen's d, Hedges' g and Glass’s Δ. Each of these methods uses a different Standard Deviation. 

This creates more problems when comparing studies - see the section Standard Deviation below.

This is best summarised by a report from John Mandrola,
"The Year’s Most Important Study Adds to Uncertainty in Science."
Mandrola summarises a large study by Nosek et al (2018), who recruited 29 teams comprising 61 researchers to analyse the SAME data. The teams came up with 29 totally DIFFERENT effect sizes!

Randomisation Also Greatly Influences the Effect Size

The Random Method insists on random assignment of students to a control & experimental group. Note the medical method also insists on "double blindness". That is, neither the control group, the experimental group nor the staff know who is getting the treatment. This is done to remove the effect of confounding variables.

Few of the studies that Hattie cites use random allocation.

Cheung & Slavin (2016) support the concern of which method to use to calculate the effect size,
"...effect sizes are significantly higher in quasi-experiments than in randomized experiments."
Slavin (2015) details the difference,
"Matched quasi-experiments did produce inflated effect sizes (ES=+0.23 for quasi-experiments, +0.16 for randomized). This difference is not nearly as large as other factors we looked at, such as sample size (small studies greatly exaggerate outcomes), use of experimenter-made measures, and published vs. unpublished sources (experimenter-made tests and published sources exaggerate impacts). But our findings about matched vs. randomized studies are reason for caution about putting too much faith in quasi-experiments."
Berk (2011) concurs with Slavin and Bergeron & Rivard,
"when the studies are not randomized experiments, there is a strong likelihood that a collection of biased treatment effect estimates is being combined. How is one then better off? Biased estimates are not random errors and do not cancel out. The result can be just a more precise causal estimate that has the wrong sign and is systematically far too large or far too small." (p. 199)
DuPaul & Eckert (2012) detail their concern,
"randomised control trials are considered the scientific gold standard for evaluating treatment effects ... the lack of such studies in the school-based intervention literature is a significant concern" (p. 408).
Note that the Education Endowment Foundation (EEF), as part of their quality control, only accept studies that use the randomised method (they would disregard MOST of Hattie's 1400+ studies).

Similarly, the What Works Clearinghouse (WWC) reserve their highest rating for studies that use the randomised method (they would also disregard MOST of Hattie's studies).

Many scholars are critical of Hattie's lack of quality control, e.g., 

Hattie cites many meta-analyses from Prof Bob Slavin, but Slavin (2017) is very critical of Hattie's method,
"Hattie includes literally everything in his meta-meta analyses, including studies with no control groups, studies in which the control group never saw the content assessed by the post-test, and so on."
Slavin (2017) explains the use of some sort of quality control would remove,  
"...a lot of the awful research that gives Hattie the false impression that everything works, and fabulously."
As a result of all these issues, Slavin (2018) posted that John Hattie is Wrong!

Hattie's lack of quality control, in part, explains the vast difference in the conclusions of Hattie compared to EEF, WWC and others - see Other Researchers.

Problem 2. Student Achievement is measured in different ways or not at all - A VALIDITY problem!

Some examples of different tests used:

Standardised tests, specific tests, physical tests (the number of times a ball can be caught off the wall), a mother's rating of their child out of 5, IQ and many examples of measuring something else like hyperactivity & engagement.

Hattie claims all of the studies he used focused on student achievement. But this is clearly NOT the case as many studies measured something else like hyperactivity (see Behavior).

Wecker et al. (2017, p. 28) confirm this, saying Hattie mistakenly included studies that do not measure academic performance.

Fletcher-Wood (2021) commenting on Hattie's prime Feedback study,
"Kluger and DeNisi focused on the way feedback affects behaviour – not how it affects learning."
Even if all the studies did measure Student Achievement, there is a growing body of evidence showing that the test used can determine the effect size, e.g., standardised tests generate lower effect sizes than specific tests.

Simpson (2017, p. 461) gives examples of specific tests designed for a particular influence, e.g., improving algebra skills, resulting in a 40% higher effect size than a standardised test on the same students!

Kraft (2019) confirms this,
"Even among measures of student achievement, effect sizes for researcher-designed and specialized topic tests aligned with the treatment are often two to four times larger than effects on broad standardized state tests (Lipsey et al., 2012; Cheung & Slavin, 2016)" (p. 8).
Simpson (2017, p. 462) details more problems with tests. He shows tests with more questions give 400% higher effect sizes.

Simpson (2018b, p. 5) also shows problems with different standardised tests for the SAME maths intervention,
"The effect size... for the PIM test was 0.33 and for the SENT-R-B test was 1.11."
Slavin (2019) has also written extensively on this issue and confirms Simpson's analysis.

Hattie just ignores these issues and jumbles any test together.

So comparing these effect sizes is the classic 'apples versus oranges' problem.

The page Student Achievement discusses in more detail the HUGE question of what is student achievement? There is NO consensus of what it is. So, how does one measure it???

Blichfeldt (2011),
"We also get no information about how 'learning outcomes' are defined or measured in the studies at different levels, what tests are used, which subjects are tested and how."
Many of the scholars that Hattie used also comment on this problem,

DuPaul & Eckert (2012),

"It is difficult to compare effect size estimates across research design types. Not only are effect size estimates calculated differently for each research design, but there appear to be differences in the types of outcome measures used across designs." (p. 408).
Kelley & Camilli (2007), 
"methodological variations across the studies make it problematic to draw coherent generalisations. These summaries illustrate the diversity in study characteristics including child samples, research designs, measurement, independent and dependent variables, and modes of analysis." (p. 7).
Problem 3. The Control Group (or lack of one):

Simpson (2017) & Bergeron & Rivard (2017) give examples of how the same influence, depending on how you define the control and experimental groups, can give effect sizes ranging from 0 to infinity!
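A small worked example may help show how the choice of comparison changes the result. The sketch below uses entirely hypothetical means and a hypothetical SD: the same treated group yields one effect size when compared with its own pre-test and a much smaller one when compared with a control group that also improved over the same period.

```python
def standardised_difference(mean_a, mean_b, sd):
    """(mean_a - mean_b) / sd, the basic effect size calculation."""
    return (mean_a - mean_b) / sd


# Hypothetical numbers, chosen only for illustration.
PRE_MEAN = 50.0            # both groups before the intervention
TREATED_POST_MEAN = 60.0   # treated group afterwards
CONTROL_POST_MEAN = 57.0   # control group afterwards (improved anyway)
SD = 10.0

# Comparison 1: same students before vs after (no control group).
d_pre_post = standardised_difference(TREATED_POST_MEAN, PRE_MEAN, SD)

# Comparison 2: treated group vs control group at post-test.
d_vs_control = standardised_difference(TREATED_POST_MEAN, CONTROL_POST_MEAN, SD)

print(f"pre/post effect size:   {d_pre_post:.2f}")
print(f"vs control effect size: {d_vs_control:.2f}")
```

In this made-up case most of the pre/post gain would have happened anyway, so the two numbers describe very different things even though both are called an effect size.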

As a result, Bergeron & Rivard and Simpson call into question Hattie's entire use of effect size comparisons, e.g., Simpson (2017, p. 463),
"standardised effect size is a research tool for individual studies, not a policy tool for directing whole educational areas. These meta-meta-analyses which order areas on the basis of effect size are thus poor selection mechanisms for driving educational policy and should not be used for directing large portions of a country’s education budget."
Bakker et al. (2019) confirmed these problems (detailed below) with Hattie's work and conclude,
"...his lists of effect sizes ignore these points and are therefore misleading."
Also, Simpson (2017, p. 455) details other problems,
"the experimental condition in some studies and meta-analyses is the comparison condition in others."
You can listen to podcasts - Bergeron (2018) and Simpson (2018p).

Many other scholars also warn of this problem,

Wrigley (2018) also discusses the problem of control groups with regard to Hattie's work,
"should the control group experience the absence of the practice being trialled, or simply ‘business as usual’?
This ambiguity concerning the control group can seriously distort attempts to calculate an ‘effect size’.
We do not learn whether teachers and teaching assistants in the control group had any access to training comparable to that of the treatment group, whether they also taught small classes, or what ‘business as usual’ actually involved" (p. 363).
"Sometimes Hattie uses ‘effect size’ to mean ‘as compared to a control group’ and at other times to mean ‘as compared to the same students before the study started’" (p. 368).
Poulsen (2014, p. 3) identifies that Hattie often uses studies that do not have control groups,
"It does not appear if the many effects studies were in general investigations control groups. Control groups mentioned, but in what sense were they actually compatible with the trial groups? If not, much cannot be concluded about learning outcomes" (translated from Danish).
Nielsen and Klitmøller (2017, p. 4) concur,
"The meta-analyses... do not have uniform standards for, how they measure the effect. In many meta-analyses, studies involving the effect are not related to the use of control groups" (translated from Danish).
Also, Lervåg &  Melby-Lervåg (2014),
"If you do not have a control group, the effect size will be calculated only on the basis of performance on the mapping before and after the action. The effect size will then be artificially high without this being a correct image. An example of this from Hattie's book is that vocabulary programs come out with a very high effect size."
Hattie's flagship study on Feedback, Kluger & DeNisi (1989), also warn generally of the problem of lack of control groups in educational studies,
"Without control groups, we may know more about the relative merits of several types of FI messages, but we have no idea if they are better, equal, or inferior to no intervention. This state of affairs is alarming." (p. 276)
Becker (2012) in his critique of Marzano (but relevant for Hattie) states,
"Marzano and his research team had a dependent variable problem. That is, there was no single, comparable measure of 'student achievement' (his stated outcome of interest) that they could use as a dependent variable across all participants. I should note that they were forced into this problem by choosing a lazy research design [a meta-analysis]. A tighter, more focused design could have alleviated this problem."
Problem 4. Hattie Combines Studies that have Totally Different Definitions of Influences, i.e., the Apples vs Oranges problem

This is a major problem with all of Hattie's work. Examples include -

Self Report - Hattie combines peer assessment with self report. 

Feedback - Hattie combines background music on an assembly line with monetary rewards, with feedback to teachers and feedback to students.

Snook et al. (2009) was one of the earliest critiques of Hattie's Visible Learning (VL).

Hattie responded to some of their critiques, however, Snook et al. (2010) reply that they were surprised Hattie did not respond to what they consider to be the major problem with Visible Learning, i.e., the lack of consistency in defining variables & carefully defined concepts. They give this example,
'In education, however, the variables being studied are often poorly conceptualised and the studies often far from rigorous. How does one clearly distinguish for research purposes between a classroom that is “teacher centred” and one which is “student centred”' (p. 96)
Later, Yelle et al. (2016) also summarise the problem with Hattie's combining of studies,
"In education, if a researcher distinguishes, for example, project-based teaching, co-operative work and teamwork, while other researchers do not distinguish or delimit them otherwise, comparing these results will be difficult. It will also be difficult to locate and rigorously filter the results that must be included (or not included) in the meta-analysis. Finally, it will be impossible to know what the averages would be. 
It is therefore necessary to define theoretically the main concepts under study and to ensure that precise and unambiguous criteria for inclusion and exclusion are established. The same thing happens when you try to understand how the author chose the studies on e.g., problem-based learning. The word we find is general, because it compiles a large number of researches, dealing with different school subjects. It should be noted that Hattie notes variances between the different school subjects, which calls for even greater circumspection in the evaluation of the indicators attributed to the different approaches. 
This is why it is crucial to know from which criteria Hattie chose and classified the meta-analyses retained and how they were constituted. How do the authors of the 800 meta-analyses compiled in Hattie (2009) define, for example, the different approaches by problem? In other words, what are the labels that they attach to the concepts they mobilize?
As for the concepts of desirability and efficiency from which these approaches must be located, they themselves are marked by epistemological and ideological issues. What do they mean? According to what types of knowledge is a method desirable? In what way is it effective? What does it achieve?
Hattie's book does not contain information on these important factors, or when it does, it does so too broadly. This vagueness prevents readers from judging for themselves the stability of so-called important variables, their variance or the criteria and methods of their selection. The lack of clarity in the criteria used for the selection of studies is therefore a problem."
Pant (2014, p. 85) is also critical of Hattie aggregating a wide variety of interventions under one label - 
"which calls into question the theoretical relevance of the analysis."
A great example of this is in the studies on class size.

A comparison of the studies shows different definitions for small and normal classes, e.g. one study defines 23 as a small class but another study defines 23 as a normal class. So comparing the effect size is not comparing the same thing!

Schulmeister & Loviscach (2014),
"Even where he has grouped meta-analyses correctly by their independent variables such as instructional interventions, Hattie has in many cases mixed apples and oranges concerning the dependent variables. In some groupings, however, both the independent and the dependent variables do not match easily. For instance, in the group “feedback”, a meta-analysis using music to reinforce behavior is grouped with other studies using instructional interventions that are intended to elicit effects on cognitive processes."
"Many of the meta-analyses do not really match the same effect group (i.e., the influence) in which Hattie refers to them. For instance, in the group 'feedback', studies investigating the effect of student feedback on teachers are mixed with studies that examine the effect of teacher feedback on students."
Nielsen & Klitmøller (2017) discuss in detail the many problems of different definitions of feedback and large versus small class sizes - see feedback.

Blatchford (2016b) also raises this issue about Hattie,
"it is odd that so much weight is attached to studies that don't directly address the topic on which the conclusions are made" (p. 13).

Hattie's Defense of Apples vs Oranges

Hattie's defense in Visible Learning (VL) was,

"A common criticism is that it combines 'apples with oranges' and such combining of many seemingly disparate studies is fraught with difficulties. It is the case, however, that in the study of fruit nothing else is sensible" (p. 10).

In his latest defense, Hattie & Hamilton (2020), "Real Gold Vs Fool's Gold", Hattie continues with this "fruit" response and once again does not address the significant issues of disparate studies,

"Any literature review involves making balanced judgements about diverse studies. A major reason for the development of meta-analysis was to find a more systematic way to join studies, in a similar way that apples and oranges can make fruit salad. Meta-analysis can be considered to ask about “fruit” and then assess the implications of combining apples and oranges, and the appropriate weighting of this combination. Unlike traditional reviews, meta-analyses provide systematic methods to evaluate the quality of combinations, allow for evaluation of various moderators, and provide excellent data for others to replicate or recombine the results. The key in all cases is the quality of the interpretation of the combined analyses. Further, as noted above, the individual studies can be evaluated for methodological quality." (p. 3-4)

Finally, Hattie has admitted that these differences, or heterogeneity of definitions, are a major problem. In his recent analysis, Wisniewski, Zierer & Hattie (2020), "The Power of Feedback Revisited", he changed to Method 1 (the Random Model),
"...the significant heterogeneity in the data shows that feedback cannot be understood as a single consistent form of treatment." (p. 1)

Problem 5. Effect Size Calculation Can Vary Significantly Depending on the Standard Deviation (SD) chosen.

The 3 common effect size calculations are Cohen's d, Hedges' g and Glass’s Δ. The difference is in the SD they use.

Cohen uses the pooled SD, Hedges also uses the pooled SD but adjusts for sample size and Glass uses the control group SD.
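The three calculations can be sketched in a few lines of Python. The scores below are hypothetical, the function names are illustrative, and the small-sample correction used for Hedges' g is the usual approximation 1 - 3 / (4(n1 + n2) - 9), which is an assumption about which variant a given study used.

```python
from math import sqrt
from statistics import mean, stdev, variance


def glass_delta(experimental, control):
    """Mean difference divided by the control group's SD."""
    return (mean(experimental) - mean(control)) / stdev(control)


def cohens_d(experimental, control):
    """Mean difference divided by the pooled SD of both groups."""
    n_e, n_c = len(experimental), len(control)
    pooled_var = (
        (n_e - 1) * variance(experimental) + (n_c - 1) * variance(control)
    ) / (n_e + n_c - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)


def hedges_g(experimental, control):
    """Cohen's d shrunk by an approximate small-sample correction."""
    n_total = len(experimental) + len(control)
    correction = 1.0 - 3.0 / (4.0 * n_total - 9.0)
    return cohens_d(experimental, control) * correction


# Hypothetical scores: the experimental group is tightly clustered,
# the control group is widely spread.
experimental_scores = [78, 80, 82, 79, 81]
control_scores = [60, 70, 75, 80, 90]

print(f"Glass's delta: {glass_delta(experimental_scores, control_scores):.2f}")
print(f"Cohen's d:     {cohens_d(experimental_scores, control_scores):.2f}")
print(f"Hedges' g:     {hedges_g(experimental_scores, control_scores):.2f}")
```

The same two sets of scores give three different numbers, and the gap widens as the two groups' spreads diverge or the samples shrink.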

Prof Gene Glass, the inventor of meta-analysis, warned of SD problems in his seminal paper, Integrating Findings: The Meta-Analysis of Research (1977).

Glass shows that since the effect size is calculated by dividing by the standard deviation (see formulas above) the standard deviation that is chosen can change the effect size in a significant way!

Glass gives this example (p. 370):
"The definition of ES appears uncomplicated, but heterogeneous group variances cause substantial difficulties. Suppose that experimental and control groups have means and standard deviations as follows:
The measure of experimental effect could be calculated either by use of Se or Sc or some combination of the two, such as an average or the square root of the average of their squares or whatever. The differences in effect sizes ensuing from such choices are huge:
The third basis of standardization—the average standard deviation—probably should be eliminated as merely a mindless statistical reaction to a perplexing choice. It must be acknowledged that both the remaining 1.00 and 0.20 are correct; neither can be ruled out as false... However, the control group mean is only one-fifth standard deviation below the mean of the experimental group when measured in control group standard deviations; thus, the average experimental group subject exceeds 58 percent of the subjects in the control group. These facts are neither contradictory nor inconsistent; rather they are two distinct features of a finding which cannot be captured by one number."
Note: A few years after Gene Glass wrote this, Cohen (1988) added another method of calculating the standard deviation, the 'pooled standard deviation', which averages the variances first and then takes the square root. This seems to be the accepted method now; with equal group sizes it would give d ≈ 0.28.

As can be seen in this example, the effect size can be 0.20, 0.28, 0.33 or 1.00 for the same data!
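The arithmetic behind Glass's example can be sketched as follows. The table of group means and standard deviations is not reproduced above, so the values here are assumptions: a mean difference of 2, an experimental SD of 2 and a control SD of 10, chosen only because they reproduce the 1.00 and 0.20 in the quotation. The pooled figure additionally assumes equal group sizes; under that assumption it works out to about 0.28.

```python
from math import sqrt

# Assumed values (not taken from the text above): they reproduce the
# 1.00 and 0.20 that Glass reports.
MEAN_DIFFERENCE = 2.0
SD_EXPERIMENTAL = 2.0
SD_CONTROL = 10.0


def effect_size(mean_difference, sd):
    """Mean difference divided by whichever SD is chosen."""
    return mean_difference / sd


average_sd = (SD_EXPERIMENTAL + SD_CONTROL) / 2
# Pooled SD with equal group sizes: square root of the average variance.
pooled_sd = sqrt((SD_EXPERIMENTAL ** 2 + SD_CONTROL ** 2) / 2)

results = {
    "experimental group SD": effect_size(MEAN_DIFFERENCE, SD_EXPERIMENTAL),
    "control group SD": effect_size(MEAN_DIFFERENCE, SD_CONTROL),
    "average of the two SDs": effect_size(MEAN_DIFFERENCE, average_sd),
    "pooled SD (equal group sizes)": effect_size(MEAN_DIFFERENCE, pooled_sd),
}

for label, value in results.items():
    print(f"{label}: {value:.2f}")
```

Only the ratio between the two SDs and the mean difference matters here, so any other values in the same proportions would give the same four effect sizes.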

If comparing effect sizes across studies, as Hattie does, then Gene Glass warns,
"If some attempt is not made to deal with this problem, a source of inexplicable and annoying variance will be left in a group of effect-size measures" (p. 372).
Hattie references this seminal paper from Glass in VL, but once again ignores the problem.

As a general rule, older studies used Glass’s Δ while newer studies used Cohen's d or Hedges' g. Note that the huge WWC uses Hedges' g.

For example, in the studies Hattie used for feedback, different standard deviations were used. The other studies used Cohen's d, while Standley (1996, p. 109) used Glass’s Δ,
"The effect sizes of experimental results in this analysis were estimated by contrasting the means of experimental/treatment conditions (Exp) and control/base-line conditions (Con) divided by the standard deviation (SD) of the control/baseline condition, as in the formula below:"
In Problem Based Learning, Gijbels et al. (2005) use Glass’s Δ.

Topphol (2011) also discusses a slight variation of this problem with Hattie's work,
"...in these two cases, the difference between the mean values is the same, D = 20 point. The distribution to the control group is drawn with a solid curve while dotted curve is used for the treatment group. Standard deviations are different, 5 to the left and 17 to the right. This gives different effect sizes, d = 4 and d = 1.18." (p. 464)

Sampling students from small or abnormal populations:

What Topphol displays above is a well-known issue for meta-analyses for a number of reasons: effect sizes are erroneously larger (due to a smaller standard deviation) and moderating variables are exacerbated.

Using such samples makes it invalid to generalise influences to the broader student population.

Professor Dylan Wiliam explains:

Simpson (2017) details this problem,
"Researchers can make legitimate design decisions which alter the standard deviation and thus report very different effect sizes for identical interventions. One such design decision is range restriction" (p. 456).
Simpson then insightfully explains that sampling from smaller populations is a major reason why effects for influences such as feedback, meta-cognition, etc are high while effects for whole school influences - class size, summer school, etc are low.
"One cannot compare standardised mean differences between sets of studies which tend to use restricted ranges of participants with researcher designed, tightly focussed measures and sets of studies which tend to use a wide range of participants and use standardised tests as measures" (p. 463).
Allerup (2015) also identifies this problem: if one distribution has very little spread and, moreover, lies entirely within the second, sharing outer boundaries, then an effect size is almost impossible to calculate (p. 6).

Kraft (2019) and Bakker et al. (2019) confirm this problem with SD.
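The range restriction effect can be illustrated with a small simulation. Everything in this sketch is hypothetical: scores are drawn from a normal distribution with mean 100 and SD 15, the intervention adds a flat 5 points, and the "restricted" study only samples students whose baseline lies between 90 and 110. The seed is fixed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, variance


def cohens_d(experimental, control):
    """Mean difference divided by the pooled SD of both groups."""
    n_e, n_c = len(experimental), len(control)
    pooled_var = (
        (n_e - 1) * variance(experimental) + (n_c - 1) * variance(control)
    ) / (n_e + n_c - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)


def simulate_study(low, high, gain=5.0, n=5000, seed=0):
    """Effect size for an identical gain when only students with a
    baseline score in [low, high] are sampled."""
    rng = random.Random(seed)

    def draw_group(shift):
        scores = []
        while len(scores) < n:
            baseline = rng.gauss(100.0, 15.0)
            if low <= baseline <= high:
                scores.append(baseline + shift)
        return scores

    control = draw_group(0.0)
    experimental = draw_group(gain)
    return cohens_d(experimental, control)


d_full_range = simulate_study(float("-inf"), float("inf"))
d_restricted = simulate_study(90.0, 110.0)

print(f"full range of students: d = {d_full_range:.2f}")
print(f"restricted range:       d = {d_restricted:.2f}")
```

The intervention is identical in both runs; only the spread of the sampled students differs, yet the restricted sample reports a much larger effect size because the same 5-point gain is divided by a much smaller standard deviation.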

But, Hattie just ignores these issues and uses meta-analyses from abnormal student populations, e.g., ADHD, hyperactive, emotional/behavioural disturbed and English Second Language students. 

Also, he uses abnormal subjects from NON-student populations, e.g., doctors, tradesmen, nurses, athletes, sports teams and military groups.

Professor John O'Neill (2012b) wrote a letter to the NZ Education Minister regarding major issues with Hattie's research. One of the issues he emphasises is Hattie's use of students from abnormal populations.

One example from the research Hattie used is Standley (1996), which Hattie used in Feedback. Standley reported effect sizes up to 35.44 and noted that these were based on very small sample sizes (p. 109).

Problem 6. Use of the same data in different meta-analyses:
Shanahan (2017, p. 751) provides a detailed example,
"What Hattie seems to have done is just take an average of the original effects reported in the various meta-analyses. That sometimes is all right, but it can create a lot of double counting and weighting problems that play havoc with the results. 
For example, Hattie combined two meta-analyses of studies on repeated reading. He indicated that these meta-analyses together included 36 studies. I took a close look myself, and it appears that there were only 35 studies, not 36, but more importantly, four of these studies were double counted. Thus, we have two analyses of 31 studies, not 36, and the effects reported for repeated reading are based on counting four of the studies twice each!
Students who received this intervention outperform those who didn't by 25 percentiles, a sizeable difference in learning. However, because of the double counting, I can't be sure whether this is an over- or underestimate of the actual effects of repeated reading that were found in the studies. Of course, the more meta-analyses that are combined, and the more studies that are double and triple and quadruple counted, the bigger the problem becomes."
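The double-counting problem described here can be sketched with two small, hypothetical meta-analyses that share some primary studies. The study identifiers and effect sizes are invented for illustration and are not the repeated reading data.

```python
from statistics import mean

# Hypothetical meta-analyses: study id -> effect size reported for that study.
meta_analysis_a = {"s01": 0.2, "s02": 0.4, "s03": 0.9, "s04": 1.1}
meta_analysis_b = {"s03": 0.9, "s04": 1.1, "s05": 0.3}

# Naive approach: treat every entry as an independent study.
naive_effects = list(meta_analysis_a.values()) + list(meta_analysis_b.values())

# De-duplicated approach: each primary study counts once.
unique_studies = {**meta_analysis_a, **meta_analysis_b}
double_counted = sorted(set(meta_analysis_a) & set(meta_analysis_b))

naive_mean = mean(naive_effects)
deduplicated_mean = mean(list(unique_studies.values()))

print(f"studies claimed: {len(naive_effects)}, unique: {len(unique_studies)}")
print(f"double counted:  {double_counted}")
print(f"naive mean:      {naive_mean:.2f}")
print(f"unique mean:     {deduplicated_mean:.2f}")
```

In this invented case the double-counted studies happen to have large effects, so the naive mean is inflated. With different overlaps the bias could run the other way, which matches Shanahan's point that the direction of the error cannot be known without checking.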
Shanahan (2017, p. 752) provides another detailed example,
"this is (also) evident with Hattie's combination of six vocabulary meta-analyses, each reporting positive learning outcomes from explicit vocabulary teaching. I couldn't find all of the original papers, so I couldn't thoroughly analyze the problems. However, my comparison of only two of the vocabulary meta-analyses revealed 18 studies that weren't there. Hattie claimed that one of the meta-analyses synthesized 33 studies, but it only included 15, and four of those 15 studies were also included in Stahl and Fairbanks's (1986) meta-analysis, whittling these 33 studies down to only 11. One wonders how many more double counts there were in the rest of the vocabulary meta-analyses. 
This problem gets especially egregious when the meta-analyses themselves are counted twice! The National Reading Panel (National Institute of Child Health and Human Development, 2000) reviewed research on several topics, including phonics teaching and phonemic awareness training, finding that teaching phonics and phonemic awareness was beneficial to young readers and to older struggling readers who lacked these particular skills. Later, some of these National Reading Panel meta-analyses were republished, with minor updating, in refereed journals (e.g., Ehri et al., 2001; Ehri, Nunes, Stahl, & Willows, 2002). Hattie managed to count both the originals and the republications and lump them all together under the label Phonics Instruction—ignoring the important distinction between phonemic awareness (children's ability to hear and manipulate the sounds within words) and phonics (children's ability to use letter–sound relationships and spelling patterns to read words). That error both double counted 86 studies in the phonics section of Visible Learning and overestimated the amount of research on phonics instruction by more than 100 studies, because the phonemic awareness research is another kettle of fish. Those kinds of errors can only lead educators to believe that there is more evidence than there is and may result in misleading effect estimates."
Wecker et al. (2017, p. 30) also detail examples,
"In the case of papers summarizing the results of several reviews on the same topic, the problem usually arises that a large part of the primary studies has been included in several of the reviews to be summarized (see Cooper and Koenka 2012, p. 450 ff.). In the few meta-analyses available so far, entire first-stage meta-analyses have often been excluded because of overlaps in the primary studies involved (Lipsey and Wilson 1993, p. 1197; Peterson 2001, p. 454), in some cases for overlaps of as little as 25% (Wilson and Lipsey 2001, p. 416) or of three or more primary studies (Sipe and Curlette 1997, p. 624). 
Hattie, on the other hand, completely ignores the duplication problem despite sometimes significantly greater overlaps. 
For example, on the subject of web-based learning, 14 of the 15 primary studies from the meta-analysis by Olson and Wisher (2002, p. 11), whose mean effect size of 0.24 is significantly different from the results of the other two meta-analyses on the same topic (0.14 or 0.15), were already covered by one of the two other meta-analyses (Sitzmann et al., 2006, pp. 654 ff.)."
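The overlap check described by Wecker et al. can be sketched in a few lines of Python. This is only an illustration: the study identifiers and the size of the second meta-analysis are hypothetical placeholders, chosen so that 14 of 15 studies overlap as in the web-based learning example, and the 25% cut-off is the threshold they attribute to Wilson and Lipsey.

```python
def overlap_fraction(studies_a, studies_b):
    """Share of the primary studies in A that also appear in B."""
    a, b = set(studies_a), set(studies_b)
    if not a:
        raise ValueError("studies_a must not be empty")
    return len(a & b) / len(a)


# Hypothetical study IDs: 15 studies in the smaller meta-analysis,
# 14 of which also appear in a (hypothetically sized) larger one.
smaller_meta = [f"study_{i}" for i in range(1, 16)]
larger_meta = [f"study_{i}" for i in range(1, 15)] + [f"other_{i}" for i in range(1, 40)]

frac = overlap_fraction(smaller_meta, larger_meta)
exceeds_threshold = frac > 0.25  # 25% cut-off cited by Wecker et al.

print(f"overlap: {frac:.0%}, exceeds 25% threshold: {exceeds_threshold}")
```

Under these assumptions the overlap is 14/15, far above the point at which other syntheses reportedly dropped a whole meta-analysis.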
Kelley & Camilli (2007, p. 25) note that many studies use the same data sets. To maintain the statistical independence of the data, only one set of data points from each data set should be included in the meta-analysis.

Hacke (2010, p. 83),
"Independence is the statistical assumption that groups, samples, or other studies in the meta-analyses are unaffected by each other."
This is a major problem in Hattie's synthesis as many of the meta-analyses that Hattie averages use the same data-sets - e.g., much of the same data is used in Teacher Training as is used in Teacher Subject Knowledge.

Problem 7. Inappropriate Averaging:
Hattie's averaging hides much of the complexity. For example, Snook et al. (2009) write on homework:
"There is also the difficulty which arises amalgamating a large number of disparate studies. When results of many studies are averaged, the complexity of education is ignored: variables such as age, ability, gender, and subject studied are set aside. An example of this problem can be seen in Hattie’s treatment of homework: does homework improve learning or not?

Overall, Hattie finds that the effect size of homework is 0.29. Thus a media commentator, reading a summary might justifiably report: “Hattie finds that homework does not make a difference.” When, however, we turn to the section on homework we find that, for example, the effect sizes for elementary (primary in our terms) and high schools students are 0.15 and 0.64 respectively.

Putting it crudely, the figures suggest that homework is very important for high school students but relatively unimportant for primary school students.

There were also significant differences in the effects of homework in mathematics (high effects) and science and social studies (both low effects). Results were high for low ability students and low for high ability students. The nature of the homework set was also influential. (pp 234-236). All these complexities are lost in an average effect size of 0.29" (p. 4).
Schulmeister & Loviscach (2014),
'The effect size given per influence is the mean value of a very broad distribution. For instance, in “Inductive Teaching” Hattie combines two meta-analyses with effect sizes of d = 0.06 and d = 0.59 to a mean effect size of d = 0.33 with a standard error of 0.035. This is like saying ”this six-sided dice does not produce numbers from 1 to 6; rather, it produces the number 3.5 in the mean, and we are pretty sure about the first decimal place of this mean value.”'
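A minimal Python sketch of the dice analogy, using only the two effect sizes quoted above and an ordinary six-sided die. It is an illustration of the point about spread, not a reconstruction of how Hattie computed his standard error.

```python
import statistics

# The two inductive-teaching effect sizes quoted by Schulmeister & Loviscach
effect_sizes = [0.06, 0.59]

mean_d = statistics.mean(effect_sizes)               # about 0.325, reported as 0.33
spread = max(effect_sizes) - min(effect_sizes)       # distance between the two results
sd_between = statistics.stdev(effect_sizes)          # sample SD of the two results

# The die: its mean is a value that no single roll ever produces
die_faces = [1, 2, 3, 4, 5, 6]
die_mean = statistics.mean(die_faces)

print(f"mean d = {mean_d:.3f}, range = {spread:.2f}, SD = {sd_between:.3f}")
print(f"die mean = {die_mean}, is a possible roll: {die_mean in die_faces}")
```

The spread between the two results (about 0.53) is more than ten times the quoted standard error of 0.035, which is why the mean alone says little about either meta-analysis.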
Dr. Jim Thornton (2018), Professor of Obstetrics and Gynaecology at Nottingham University, said,
"To a medical researcher, it seems bonkers that Hattie combines all studies of the same intervention into a single effect size... In medicine it would be like combining trials of steroids to treat rheumatoid arthritis, effective, with trials of steroids to treat pneumonia, harmful, and concluding that steroids have no effect! I keep expecting someone to tell me I’ve misread Hattie."
Another example, on inductive teaching, comes from Nilholm (2013), 'It's time to critically review John Hattie',

"Hattie reports two meta-analyzes. One is from 2008 and includes 73 studies related to 'inductive teaching', it shows that the work method generally gives a relatively strong effect. According to a meta-analysis from 1983, which includes 24 studies of inductive teaching in natural sciences, the work method gives a weak effect. 
Hattie simply takes the mean of these two meta-analyses and thus "inductive teaching" can be dismissed. A more reasonable conclusion would be that "inductive teaching" in science subjects has weak support but that generally it seems to be a good way of working. Alternatively, it did not appear to work before, but later research gives a much more positive picture" (p. 2).
Nilholm (2013) details another example using "problem-based learning".

This problem is widespread in Hattie's work; other examples include class size, feedback and ability grouping. Also, many of the researchers Hattie cites warn about averaging:

Mabe and West (1982),
"considerable information would be lost by averaging the often widely discrepant correlations within studies" (p. 291).
Wrigley (2018),
"What now stands proxy for a breadth of evidence is statistical averaging. This mathematical abstraction neglects the contribution of the practitioner’s accumulated experience, a sense of the students’ needs and wishes, and an understanding of social and cultural context" (p. 359).
Wrigley (2018) then goes into detail about inappropriate averaging by Hattie and the EEF, 
"... quite dissimilar studies are thrown together and an aggregate mean of effect sizes calculated. Although some tolerance is acceptable in meta-analysis, since no two research studies are exactly alike, serious problems can arise from aggregating and averaging studies using different definitions of an issue, and based on different curriculum areas, ages and attainment levels of students, types of school, education systems, and so on... 
Indeed, Gene Glass, who originated the idea of meta-analysis, issued this sharp warning about heterogeneity: 'Our biggest challenge is to tame the wild variation in our findings not by decreeing this or that set of standard protocols but by describing and accounting for the variability in our findings. The result of a meta-analysis should never be an average; it should be a graph.'(Robinson, 2004: 29, my italics)" (p. 367).
Wrigley (2018) then quotes Coe,
"One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless...
In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes... One should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations... It is also important to compare only like with like in terms of the treatments used to create the differences being measured. In the education literature, the same name is often given to interventions that are actually very different. It could also be that... the actual implementation differed, or that the same treatment may have had different levels of intensity in different studies. In any of these cases, it makes no sense to average out their effects. (Coe, 2002, my italics)" (p. 367)
Problem 8. Equal Weightings:

Prof Gene Glass (1977), the inventor of the meta-analysis, whom Hattie quotes regularly, warned of this problem in his seminal paper, Integrating Findings: The Meta-Analysis of Research.
"Precisely what weight to assign to each study in an aggregation is an extremely complex question, one that is not answered adequately by suggestions to pool the raw data (which are rarely available) or to give each study equal weight, regardless of sample size. If one is aggregating arithmetic means, a weighting of results from each study according to √N might make sense" (p. 358).
Scholars who use the Fixed Method recommend weighting (Pigott, 2010, p. 9), so that larger studies count for more. If Hattie had done this, all of his reported effect sizes would be affected and his rankings would change completely.
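To see how much the choice of weights can matter, here is a hedged Python sketch using the two inductive-teaching meta-analyses quoted under Problem 7. Pairing d = 0.59 with the 73-study meta-analysis and d = 0.06 with the 24-study one is an assumption based on Nilholm's description of "strong" and "weak" effects. Weighting by the number of primary studies, and by its square root as a stand-in for Glass's suggestion, is also an assumption, since Glass's N refers to sample size, which is not given here.

```python
import math


def weighted_mean(values, weights):
    """Weighted arithmetic mean; equal weights give the plain average."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Assumed pairing of effect sizes with study counts (see lead-in)
effect_sizes = [0.59, 0.06]
n_studies = [73, 24]

unweighted = weighted_mean(effect_sizes, [1, 1])
by_studies = weighted_mean(effect_sizes, n_studies)
by_sqrt = weighted_mean(effect_sizes, [math.sqrt(n) for n in n_studies])

print(f"unweighted:            {unweighted:.3f}")
print(f"weighted by studies:   {by_studies:.3f}")
print(f"weighted by sqrt(n):   {by_sqrt:.3f}")
```

Under these assumptions the same two results give roughly 0.33, 0.40 or 0.46 depending on the weighting, which is enough to move an influence across Hattie's 0.40 hinge point.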

The range of student numbers in the studies that Hattie used is enormous. In the influence 'Comprehensive teaching reforms' Hattie cites Borman & D'Agostino (1996), with nearly 42 million students, while in the 'gender - attitudes' influence he cites Cooper, Burger & Good (1980), with 219 students. These have equal weight in Hattie's work.

Shanahan (2017, p. 752) gives more detailed examples,
"when meta-analyses of very different scopes are combined - what if one of the meta-analyses being averaged has many more studies than the others? Simply averaging the results of a meta-analysis based on 1,077 studies with a meta-analysis based on six studies would be very misleading. Hattie combined data from 17 meta-analyses of studies that looked at the effects of students’ prior knowledge or prior achievement levels on later learning. Two of these meta-analyses focused on more than a thousand studies each; others focused on fewer than 50 studies, and one as few as six. Hattie treated them all as equal. Again, potentially misleading."
Pant (2014, p. 95) verifies Shanahan's analysis and provides another detailed example:
"Hattie (2009) aggregates the mean effect sizes of the original meta-analyses without weighting them by the number of studies included. Meta-analyses that are based on many hundreds of individual studies enter the d-barometer with the same weight as meta-analyses with only five primary studies. The consequences of this approach for the substantive conclusions will be briefly demonstrated by a numerical example from Hattie's (2009) data. The effect of the teaching method of direct instruction, determined from four meta-analyses, is according to Hattie (2009, p. 205) d = 0.59 and thus falls into the 'desired zone' (d > 0.4). 
Direct instruction is a by no means undisputed, highly structured and teacher-centered form of teaching. Looking at the processed meta-analyses one by one, it is striking that the analysis that is by far the largest, with 232 primary studies (Borman et al., 2003), is the one with the smallest effect size (d = 0.21). If the three meta-analyses for which information on the standard error was presented were weighted according to their number of primary studies (Hill et al. 2007, Shadish and Haddock 2009), the resulting effect size would be d = 0.39 and thus no longer in the 'desired' zone of action defined by Hattie."
Wecker et al. (2017, p. 31) give an example of using weighted averages:
"This would mean a descent from 26th place to 98th in his ranking."
Professor Peter Blatchford (2016) also warns of this problem,
"unfortunately many reviews and meta-analyses have given them equal weighting" (p. 15).
See (2017) emphasises the issue of quality of evidence & averaging by Hattie, Marzano, and others, 
"there are studies which involved only one participant, some had no comparator groups and some involved children with specific learning difficulties or had huge attrition as large as 70%. These may form the majority of studies reporting huge positive effects. On the other hand, the few good quality studies may report small effects. 
Averaging effect sizes from across studies of different quality giving equal weights to all can lead to misleading conclusions" (p. 10).  
Arnold (2011), 
"I was surprised that Hattie has chosen to summarise the effect sizes of the 800 meta-analyses using unweighted averages. Small and large meta-analyses have equal weight, while I would assume that the number of studies on which a meta-analysis is based indicates its validity and importance. Instead I would have opted for weighted averaging by number of studies, students or effect sizes. At a minimum, it would be interesting to see whether the results are robust to the choice of averaging."
Proulx (2017) and Thibault (2017) also question Hattie's averaging.

Example - Visual Perception Programs:

Hattie's effect size is d = 0.55. But if we weight according to the number of students (assuming that studies reporting no student numbers are assigned the lowest number of students, 4,400, highlighted yellow), we get a weighted effect size of d = 0.79, shooting this influence up from #35 to #7.

Nielsen & Klitmøller (2017) also show this problem in their detailed analysis of Hattie's use of feedback studies- see feedback.

In his latest (2020) defense, Real Gold Vs Fool's Gold, Hattie does not address the detailed issues presented by all the peer reviews above and simply dismisses them (p. 2).

Problem 9. Confounding Variables:

Related to problem 1: research designers usually put a lot of thought into controlling other variables. Random assignment and double blinding are the major strategies used. Unfortunately, most of the studies Hattie cites do not use these strategies, which introduces major moderating variables into the studies. Class size is a good example: many studies compare the achievement of small versus large classes, but many schools assign lower-achieving students to smaller classes rather than using random assignment.

Thibault (2017) gives other examples (English translation),

"a goal of the mega-analyses is to relativize the factors of variation that have not been identified in a study, balancing in some way the extreme data influenced by uncontrolled variables. But by combining all the data as well as the particular context that is associated with each study, we eliminate the specificities of each context, which for many give meaning to the study itself! We then lose the richness of the data and the meaning of what we try to measure.
It even happens that results are brought together that are deeply different, even contradictory, in their nature.
For example, the source of the feedback remains risky, as explained by Proulx (2017), given that Hattie (2009) claims to have realized that the feedback comes from the student and not from the teacher, but it is no less certain that his analysis focused on feedback from the teacher. It is right to question this way of doing things since the studies quantitatively seek to control variables to isolate the effect of each. When combining data from different studies, the attempt to control the variables is annihilated. Indeed, all these studies have not necessarily sought to control the same variables in the same way; they have probably used different instruments and been carried out with populations that are difficult to compare. So these combinations are not just uninformative, but they significantly skew the meaning."
Nielsen & Klitmøller (2017) discuss the problem of Hattie not addressing moderating factors, the interaction of factors and the disparate operational definitions of different studies, 
"it is our assessment that in four of the five "heaviest" surveys mentioned in connection with Hattie's coverage of feedback, it is conceptually unclear whether they operate with a feedback concept that is identical to Hattie's" (p. 11, translated from Danish).
Blichfeldt (2011),
"to validly put more blurred variables into accurate calculations seems problematic...
...he allows a very low degree of precision as to what variables are included in the calculations as to what may be expected and how results can be understood. At the same time, he uses calculations and statistics that should require precision and control that it is hard to find coverage for. Which does not prevent him from producing results as very precise with two decimal places...
What he studies is summarized statistical relationships between unclear variables and skill tests."
Nilholm (2013) confirms this problem,
"Hattie's major failure is to report summative measurements of meta-analysis without taking into account so-called moderating factors. Working methods can work better for a particular subject, a certain grade, some students and so on. Hattie believes that the significance of such moderating factors is less than one can think. I would argue that they are often very noticeable, as in the examples I reported [see problem-based learning and inductive teaching] Unless such moderating factors are taken into account, direct generalizations will be made directly" (p. 3).
Allerup (2015) in 'Hattie's use of effect size as the ranking of educational efforts', calls for a more sophisticated multivariate analysis,
"it is well known that analyses in the educational world often require the involvement of more dimensional (multivariate) analyses" (p. 8).
Hattie rarely acknowledges this problem now, but in earlier work Hattie & Clinton (2008, p. 320) stated:
"student test scores depend on multiple factors, many of which are out of the control of the teacher."
Another pertinent example is from Kulik and Kulik (1992) - see ability grouping:

Two different methods produced distinctly different results. Each of the 11 studies with same-age control groups showed greater achievement for the accelerated students; the average effect size in these studies was 0.87.

However, if the older (usually 1 year older) students are used as the control group, the average effect size in the 12 studies was 0.02. Hattie uses this figure in the category 'ability grouping for gifted students'.

Hattie does not include the d = 0.87. I think a strong argument can be made that d = 0.87 should be reported instead of d = 0.02, as the accelerated students should be compared to the student group they came from (same-age students) rather than the older group they are accelerating into.

The Combination of Influences:

In addition, a study may be measuring the combination of many influences. Taking class size as an example, how do you remove other influences from the study, such as time on task, motivation, behaviour, teacher subject knowledge, feedback, home life and welfare?

Nielsen & Klitmøller (2017) discuss this problem in detail.

But Hattie wavers on this major issue. In his commentary on 'within-class grouping' about Lou et al. (1996, p. 94), Hattie does report some degree of additivity,
"this analysis shows that the effect of grouping depends on class size. In large classes (more than 35 students) the mean effect of grouping is d = 0.35, whereas in small classes (less than 26 students) the mean effect is d = 0.22."
But in his summary, he states, 
"It is unlikely that many of the effects reported in this book are additive" (p. 256).
Problem 10. Quality of Studies:
"Extraordinary claims require extraordinary evidence." Carl Sagan
Hattie constantly proclaims (VL 2012 summary, p. 3),
"it is the interpretations that are critical, rather than data itself."
This is the opposite of the scientific method, as Snook et al. (2009, p. 2) explain:
"Hattie says that he is not concerned with the quality of the research... of course, quality is everything. Any meta-analysis that does not exclude poor or inadequate studies is misleading, and potentially damaging if it leads to ill-advised policy developments. He also needs to be sure that restricting his data base to meta-analyses did not lead to the omission of significant studies of the variables he is interested in."
Professor John O'Neill (2012a) wrote a significant letter to the NZ Education Minister and Hattie regarding the poor quality of Hattie's research, in particular the overuse of studies about university, graduate or preschool students and the danger of making classroom policy decisions without consulting other forms of evidence, e.g., case and naturalistic studies. 
"The method of the synthesis and, consequently, the rank ordering are highly problematic" (p. 7).
Hattie ignored O'Neill's critique and continues to proclaim,
"Almost all of this data is based on what happens in real schools with real kids..."

See (2017) emphasises the lack of quality in the evidence used by Hattie, 
"there are several problems with relying on such evidence taken from meta-analyses of meta-analyses for policy and practice. 
First, much of it is not particularly robust (small-scale, involving non-randomisation of participants, based on summaries of effects across a wide range of subjects and age groups). 
Second, no consideration was taken of the quality of research in the synthesis of existing evidence. For example, there are studies which involved only one participant, some had no comparative groups and some involved children with specific learning difficulties or had huge attrition as large as 70%. These may form the majority of studies reporting huge positive effects. On the other hand, the few good quality studies may report small effects. Averaging effect sizes from across studies of different quality giving equal weights to all can lead to misleading conclusions" (p. 10). 
Schulmeister & Loviscach (2014),
"Many of the meta-analyses used by Hattie are dubious in terms of methodology. Hattie obviously did not look into the individual empirical studies that form the bases of the meta-analyses, but used the latter in good faith."
Nielsen & Klitmøller (2017) also discuss the problems of quality using examples from VL, p. 75 and 196 - see feedback.
"Hattie does not deal with the potential problems in his own investigation but instead refers to others who have to deal with problems in connection with meta-analyses generally. In other words, Hattie is not directly concerned about the quality of his own investigation. 
In some selected contexts nevertheless, Hattie does throw out studies based on quality, but this is neither consistent nor systematic" (p. 10, translated from Danish).
Nielsen & Klitmøller's criticism is based on Hattie sometimes using the following protocols to justify exclusion of meta-analyses,
"mainly based on doctoral dissertations, ..., with mostly attitudinal outcomes, many were based on adult samples ... and some of the sample sizes were tiny" (VL, p. 196).
Lind (2013) confirms this and also uses more examples from VL, pp. 196 ff., where he accuses Hattie of disregarding studies that do not suit him, e.g. kinesthetic learning.

The Encyclopedia of Measurement and Statistics outlines the problem of quality: 
"many experts agree that a useful research synthesis should be based on findings from high-quality studies with methodological rigour. Relaxed inclusion standards for studies in a meta-analysis may lead to a problem that Hans J. Eysenck in 1978 labelled as garbage in, garbage out."
Or, in modern terms, Dr. Gary Smith (2014, p. 25),
"garbage in, gospel out."
Most researchers that Hattie used warn about the quality of studies, e.g., Slavin (1990, p. 477)
"any measure of central tendency in a meta-analysis... should be interpreted in light of the quality and consistency of the studies from which it was derived, not as a finding in its own right. 
'best evidence synthesis' of any education policy should encourage decision makers to favour results from studies with high internal and external validity—that is, randomised field trials involving large numbers of students, schools, and districts."
Janson (2018),
"Hattie compiles large numbers of meta-analyses of all kinds for his meta-meta-analyses, without paying too much attention to the meaning or quality of the original studies."
The U.S. Department of Education has set up the National Center for Education Research whose focus is to investigate the quality of educational research. Their results are published in the What Works Clearinghouse. They also publish a Teacher Practice Guide which differs markedly from Hattie's results - see Other Researchers.

Importantly they focus on the QUALITY of the research and reserve their highest ratings for research that uses randomised division of students into a control and an experimental group. Where students are non-randomly divided into a control and experimental group for what they term a quasi-experiment, a moderate rating is used. However, the two groups must have some sort of equivalence measure before the intervention. A low rating is used for other research design methods - e.g., correlation studies.

However, once again, Hattie ignores these issues and makes an astonishing caveat, there is, 
"no reason to throw out studies automatically because of lower quality" (p. 11).
Problem 11. Time over which each study ran:

Given Hattie interprets an effect size of 0.40 as equivalent to 1 year of schooling, and his polemic related to this figure:
"I would go further and claim that those students who do not achieve at least a 0.40 improvement in a year are going backwards..." (p. 250).
In terms of teacher performance, he takes this one step further by declaring that teachers who don't attain an effect size of 0.40 are 'below average' (Hattie, 2010, p. 87).

This means, as Professor Dylan Wiliam points out, that studies need to be controlled for the time over which they run, otherwise legitimate comparisons cannot be made.

Professor Wiliam, who also produced the seminal research, 'Inside the black box', also reflects on his own research and cautions,
"it is only within the last few years that I have become aware of just how many problems there are. Many published studies on feedback, for example, are conducted by psychology professors, on their own students, in experimental sessions that last a single day. The generalizability of such studies to school classrooms is highly questionable. 
In retrospect, therefore, it may well have been a mistake to use effect sizes in our booklet 'Inside the black box' to indicate the sorts of impact that formative assessment might have.

I do still think that effect sizes are useful... If the effect sizes are based on experiments of similar duration, on similar populations, using outcome measures that are similar in their sensitivity to the effects of teaching, then I think comparisons are reasonable. Otherwise, I think effect sizes are extremely difficult to interpret."
Hattie (2015) finally admitted this was an issue:
"Yes, the time over which any intervention is conducted can matter (we find that calculations over less than 10-12 weeks can be unstable, the time is too short to engender change, and you end up doing too much assessment relative to teaching). These are critical moderators of the overall effect-sizes and any use of hinge=.4 should, of course, take these into account."
Yet this has not affected his public pronouncements, nor has it led him to add or remove studies from his database. He has not made any adjustment to his section on feedback, even though Professor Wiliam states that many of the studies are on university students over 1 DAY. Hattie does not appear to take TIME into account!

The section A YEAR'S PROGRESS? goes into more detail about this issue.

These issues have been known for a long time and many researchers, e.g., Berk (2011), recommend a focus on high quality INDIVIDUAL studies (as does the What Works Clearinghouse), 
"One should applaud the view that public policy is to be based on evidence. However, what qualifies as evidence, let alone strong evidence, is too often left unspecified. Into this vacuum has been drawn a mix of evaluations ranging from excellent to terrible. 
...the importance of meta-analysis for estimating causal effects has been grossly overrated. A conventional literature review will often do better. At the very least, readers will not be swayed by statistical malpractice disguised as statistical razzle-dazzle" (p. 199).

Problem 12. Researcher Bias:

Wolf et al. (2020) conclude that effect sizes from evaluations conducted by a program's developers are 80% larger than those from evaluations done by independent evaluators (0.31 vs 0.14), with ~66% of the difference attributable to publication bias.

Problem 13. The assumption of Normality:

Allerup (2015), in 'Hattie's use of effect size as the ranking of educational efforts', shows that deviations from this assumption, in the form of skewed or Cauchy distributions (which have heavier tails than normal distributions), give very different effect size measures, which makes it difficult to interpret effect sizes appropriately (p. 10). 

Allerup gives the examples that international evaluations under the OECD (PISA) and the IEA (TIMSS) are not normally distributed (p. 7).
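A small Python sketch of why heavy tails matter for a standardised mean difference. The scores are hypothetical and this is not Allerup's analysis. It uses the common pooled-standard-deviation form of Cohen's d and shows how a single extreme score, the kind a heavy-tailed distribution produces far more often than a normal one, changes the result.

```python
import statistics


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Hypothetical test scores, tightly clustered in both groups
control = [48, 49, 50, 51, 52]
treatment = [50, 51, 52, 53, 54]

# Same data plus one extreme treatment score (hypothetical heavy-tail draw)
treatment_heavy = treatment + [90]

d_clean = cohens_d(treatment, control)
d_heavy = cohens_d(treatment_heavy, control)

print(f"d without the extreme score: {d_clean:.2f}")
print(f"d with the extreme score:    {d_heavy:.2f}")
```

In this toy case the extreme score raises the treatment mean but inflates the pooled standard deviation even more, so d falls from about 1.26 to about 0.71. The effect size is then describing the tail as much as the typical student.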

Problem 14. meta-analysis vs META-meta-analysis

Finally, Hattie has admitted that there are significant problems with his synthesis of meta-analyses, or META-meta-analysis, method in Wisniewski, Zierer & Hattie (2020), "The Power of Feedback Revisited", 

"The question arises, whether synthesizing research on feedback on different levels, from different perspectives and in different directions and compressing this research in a single effect size value leads to interpretable results. In contrast to a synthesis approach, the meta-analysis of primary studies allows to weigh study effects, consider the issues of systematic variation of effect sizes, remove duplets, and search for moderator variables based on study characteristics.  

Therefore, a meta-analysis is likely to produce more precise results." (p. 2-3)
To understand this monumental change it is REALLY important to understand the subtle difference between Hattie's and the EEF's approach of a META-meta-analysis versus a simpler meta-analysis approach.

This all leads to significant criticism of VL:

Snook et al. (2009): 
"Any meta-analysis that does not exclude poor or inadequate studies is misleading and potentially damaging" (p. 2).
Terhart (2011):
"It is striking that Hattie does not supply the reader with exact information on the issue of the quality standards he uses when he has to decide whether a certain research study meta-analysis is integrated into his meta-meta-analysis or not. Usually, the authors of meta-analyses devote much energy and effort to discussing this problem because the value or persuasiveness of the results obtained are dependent on the strictness of the eligibility criteria" (p. 429).
Rømer (2016)
"...I have demonstrated that there are problems with the dependent variable, the learning yield, i.e., the effect of the intervention. It is weakly understood and there is an unpredictable contradiction between the theory of learning theory and the theory of education theory" (p. 15, translated from Danish).
David Didau gives an excellent overview of Hattie's effect sizes, cleverly using the classic clip from the movie Spinal Tap, where Nigel tries to explain why his guitar amp goes up to 11.

Hooley (2013), in his review of Hattie, talks about the complexity of classrooms and the difficulty of controlling variables, 
"Under these circumstances, the measure of effect size is highly dubious" (p. 44).
Neil Brown: https://academiccomputing.wordpress.com/2013/08/05/book-review-visible-learning/
"My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions."
The USA Government Funded Study on Educational Effect Size Bench Marks - https://ies.ed.gov/ncser/pubs/20133000/
"The usefulness of these empirical benchmarks depends on the degree to which they are drawn from high-quality studies and the degree to which they summarise effect sizes with regard to similar types of interventions, target populations, and outcome measures."
The study also defined the criteria for accepting a research study, i.e., the quality needed (p. 33):
Search for published and unpublished research dated 1995 or later. 
Specialised groups such as special education students, etc. were not included. 
Studies were restricted to those using random assignment designs (that is, Method 1) with practice-as-usual control groups and attrition rates no higher than 20%.
NOTE: using these criteria, virtually NONE of the 800+ meta-analyses in VL would pass the quality test!
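
The acceptance criteria listed above can be sketched as a simple screening filter. This is only an illustration: the Study record, its field names and the four example studies are all hypothetical and invented here, not taken from the benchmark study or from VL.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record of the design details needed for screening."""
    name: str
    year: int
    random_assignment: bool
    practice_as_usual_control: bool
    special_population: bool   # e.g. special education students only
    attrition: float           # fraction of participants lost, 0.0 to 1.0


def meets_criteria(study: Study) -> bool:
    """Apply the four criteria quoted above (dated 1995 or later, no
    specialised groups, random assignment with a practice-as-usual
    control group, attrition no higher than 20%)."""
    return (
        study.year >= 1995
        and not study.special_population
        and study.random_assignment
        and study.practice_as_usual_control
        and study.attrition <= 0.20
    )


# Invented examples, for illustration only.
studies = [
    Study("A", 2004, True, True, False, 0.12),   # passes every check
    Study("B", 1988, True, True, False, 0.05),   # too old
    Study("C", 2010, False, True, False, 0.10),  # no random assignment
    Study("D", 2001, True, True, False, 0.35),   # attrition too high
]

accepted = [s.name for s in studies if meets_criteria(s)]
print(accepted)
```

A study has to pass every check at once, which is why criteria like these exclude so many studies.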

The U.S. Department of Education standards:

The intervention must be systematically manipulated by the researcher, not passively observed.

The dependent variable must be measured repeatedly over a series of assessment points and demonstrate high reliability.

Method 1 (random allocation) is the gold standard. 

Method 2 is accepted but with a number of caveats. They use the phrase quasi-experimental design, which compares outcomes for students, classrooms, or schools who had access to the intervention with those who did not but were similar in observable characteristics. In this design, the study MUST demonstrate baseline equivalence.

In other words, the students can be broken into a control and experimental group (without randomization), but the two groups must display equivalence at the beginning of the study.
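
A minimal sketch of what a baseline equivalence check could look like: the difference between the two groups' pre-test means, expressed in pooled standard deviation units. The pre-test scores are invented, and the 0.05 and 0.25 cut-offs are an assumption based on the thresholds commonly cited from the WWC handbook; they are not quoted in this post and should be checked against the current WWC standards.

```python
from math import sqrt
from statistics import mean, stdev


def pooled_sd(a, b):
    """Pooled sample standard deviation of two groups."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))


def baseline_difference(treatment_pre, control_pre):
    """Standardised difference between the groups BEFORE the intervention."""
    return (mean(treatment_pre) - mean(control_pre)) / pooled_sd(treatment_pre, control_pre)


def classify(diff, adjust_above=0.05, reject_above=0.25):
    """Assumed cut-offs (see the note above); not taken from this post."""
    size = abs(diff)
    if size > reject_above:
        return "not equivalent"
    if size > adjust_above:
        return "equivalent only with statistical adjustment"
    return "equivalent"


# Invented pre-test scores, for illustration only.
treatment_pre = [52, 55, 49, 61, 58, 50]
control_pre = [51, 54, 50, 60, 57, 52]

diff = baseline_difference(treatment_pre, control_pre)
print(round(diff, 3), classify(diff))
```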

However, the rating of these types of studies is 'Meets WWC Group Design Standards with Reservations.'

So at BEST most of the studies used by Hattie would be classified by The U.S. Department of Education as 'Meets WWC Group Design Standards with Reservations.'

But Hattie uses Millions of students!

A large number of students used in the synthesis seems to excuse Hattie from the usual validity and reliability requirements. For example, Kuncel (2005) has over 56,000 students and reports the highest effect size of d=3.10, but it does not measure what Hattie says it measures, a self-report grade in the future; rather, it measures student honesty with regard to their GPA a year ago. So this meta-analysis is not a valid or reliable study of the influence of self-report grades. The 56,000 students are totally irrelevant.

Note, many of the controversial influences have only 1 or 2 meta-analyses as evidence.

Bergeron & Rivard (2017) - on Hattie's huge numbers:
"We cannot allow ourselves to simply be impressed by the quantity of numbers and the sample sizes; we must be concerned with the quality of the study plan and the validity of collected data."
Larsen (2014),
"the megalomaniac additive annexation of all sorts of meta-analyses is not concerned with methodologically critical self-reflections, nor with validity claims, i.e., it does not specify the limits to what can be said and made commensurable. The risk is that knowledge in the collected empirical data piles disappears when it is formalised in a second-, third-, and-fourth-order perspective" (p. 6).
Sjøberg (2012) also argues that, instead of supporting a hypothesis, Hattie uses the rhetorical strategy of presenting an overwhelming number of meta-analyses to heighten their public impact.

Wrigley (2015) in Bullying by Numbers, gives a detailed analysis of this problem.

David Weston gives a good summary of issues with Effect Sizes:
2min - contradictory results of studies are lost by averaging
4min 30sec - Reports of studies are too simplified and detail lost
5min - What does effect size mean?
6min 15 sec - Hattie's use of effect size
7min - Issues with effect size
8min 40sec - problems with spread of scores (standard deviation)
9min 30sec - need to check details of Hattie's studies
10min 30sec - problem with Hattie's hinge point d=0.40 (see A Year's Progress)
16min 50sec - Prof Dylan Wiliam's seminal work, 'Inside the Black Box', is an example of research that has been oversimplified by educationalists, e.g., reduced to 'writing objectives on the board', while other more important findings have been lost.
18min - Context is king
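
The first point in the list above, that contradictory results are lost by averaging, can be illustrated with a few made-up numbers. The two sets of effect sizes below are hypothetical and chosen only so that their averages match; they are not taken from any study in VL.

```python
from statistics import mean, stdev

# Hypothetical effect sizes from two imaginary sets of studies.
consistent = [0.20, 0.20, 0.20, 0.20]        # every study agrees
contradictory = [0.90, -0.50, 0.80, -0.40]   # studies point in opposite directions


def summarise(effects):
    """Return the average effect size together with the detail the average hides."""
    return {
        "mean": mean(effects),
        "sd": stdev(effects),
        "min": min(effects),
        "max": max(effects),
    }


for label, effects in [("consistent", consistent), ("contradictory", contradictory)]:
    s = summarise(effects)
    print(f"{label}: mean d = {s['mean']:.2f}, SD = {s['sd']:.2f}, "
          f"range = {s['min']:.2f} to {s['max']:.2f}")
```

Both sets report the same average effect size, so a ranking built on the average alone cannot tell them apart, even though the second set contains studies where the intervention appeared to do harm.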

Professor Robert Coe's detailed paper on effect size calculations is here.

David Weston uses a great analogy of a chef with teaching (5min onwards).

John Oliver gives a funny overview of the problems with Scientific Studies:

Another overview of the issues with published studies:

A short video on the issues with Social Science Research


  1. What do you do if you are using data with no definitive parameters? Our major assessment is NWEA Map and students do not score a 0-100 result and there isn't really a cap they could attain. It appears these calculations result in skewed effect sizes. Is there a solution to this?

  2. The students don't have to score between 0 and 100. My example just used a score out of 100 for simplicity. You can use any test with any total score, as long as the two groups of students, or the before and after scores, are from the same test.

    In your case, it looks as though NWEA compares each individual student with a normed result, so effect size = (student score - normed score) / pooled standard deviation.
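
As the reply above notes, the total score of the test does not matter. A minimal sketch of why: the effect size divides a difference in means by a standard deviation, so rescaling every score by the same factor cancels out. The scores below are invented, and the pooled standard deviation used here is one common convention for the denominator, assumed for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def effect_size(experimental, control):
    """Standardised mean difference: (mean difference) / pooled SD."""
    n1, n2 = len(experimental), len(control)
    pooled = sqrt(((n1 - 1) * stdev(experimental) ** 2 + (n2 - 1) * stdev(control) ** 2)
                  / (n1 + n2 - 2))
    return (mean(experimental) - mean(control)) / pooled


# Invented scores on a test marked out of 10.
experimental_10 = [6, 7, 8, 5, 7]
control_10 = [5, 6, 6, 4, 6]

# The same results expressed as a score out of 100.
experimental_100 = [score * 10 for score in experimental_10]
control_100 = [score * 10 for score in control_10]

d_out_of_10 = effect_size(experimental_10, control_10)
d_out_of_100 = effect_size(experimental_100, control_100)

print(round(d_out_of_10, 3), round(d_out_of_100, 3))
```

The two values agree because the scale factor appears in both the numerator and the denominator. This only holds when both groups sit the same test; it says nothing about comparing effect sizes across different tests.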

    Another, slightly different, use of effect sizes is the set of USA benchmark effect sizes for their standardised tests - see here & scroll down - https://visablelearning.blogspot.com/p/a-years-progress.html

    and here's an example of students who did a test out of 10 marks -