Ability Grouping

The meta-analyses divide ability grouping into the categories listed below. However, most people seem to concentrate on "general" ability grouping, with its low effect size, and conclude that ability grouping has no effect on achievement. But as you read the studies, a much richer and more complex picture emerges.

"General" Ability grouping - effect size d = 0.12 (cross-grade grouping d = 0.30 but Hattie does not include this in this category, I’m not sure why).

Within-class grouping - d = 0.16 (Hattie did not include the Kulik study detailed below which gave d = 0.25). Here's an example of a school in my state which has significantly improved students' maths scores by within-class ability grouping.

Ability grouping for gifted students - d = 0.30 (Hattie did not include the result from Kulik below, where d = 0.87; if this were used, d = 0.47).

Acceleration - d = 0.88 (Hattie’s 5th ranked influence, but overlaps with gifted students and research is used selectively).

Enhancement - d = 0.39

Dylan Wiliam gives a quick overview of the problems of Ability Grouping research-

Goldacre (2008) comments about meta-analysis in education:
"I think you’ll find it’s a bit more complicated than that".
Andrew Old does an excellent analysis of the EEF toolkit on ability grouping here.

Emma De Heaume completes an excellent systematic review of Homogeneous grouping here,
"As a result of the adoption of data-driven practices, HOW2Learn and Hattie’s effect sizes, schools in my local region are choosing to disband within-school ability grouping. Hattie’s meta-analysis (NSWDEC, 2015) claims that ability grouping has an effect size of 0.12, approximately a quarter of the expected growth over a 12 month period. According to this publication, ability grouping does not allow for impressive student achievement...
I was interested to see the consistent influence of Hattie’s research in schools and the disbanding of ability grouping as a result. I am interested to know the impact that removing ability grouping in the forms of whole school streamed Mathematics" (p. 4).
De Heaume concludes,
"It could be predicted that, the removal of clustered Mathematics groups will have a detrimental effect on gifted mathematicians academically. This will also be reflected in an increase in the prevalence of boredom with gifted learners" (p. 20).
Hattie's Use of the Ability Grouping Research:

There is much overlap from one category to another, for example, there is virtually no difference between the categories 'ability grouping for gifted' and 'acceleration' and also, 'ability grouping' and 'cross-grade grouping'.

The researchers who dominate these categories in Hattie’s book are Kulik & Kulik. We will look at their 1992 study as Hattie includes it in many of the different influences (except 'within-class grouping').

Hattie is selective with which studies he includes in each category. For example, the acceleration group is constructed by grouping all the gifted kids together. Kulik & Kulik have some very high effect sizes of around 0.90 which could be placed in the category 'ability grouping for gifted' which would raise the effect size significantly, but Hattie does not do this.

Technical Note: most of the meta-analyses include many of the same studies and this can lead to bias. This issue was looked at in detail in Effect Size - Problem 6 (multiple uses of the same data in several studies).

Study 1 (used in ability grouping and ability grouping for gifted):

Kulik and Kulik (1992) - Meta-analytic Findings on Grouping Programs.

They summarise:
'Meta-analytic reviews have focused on five distinct instructional programs that separate students by ability: multilevel classes, cross-grade programs, within-class grouping, enriched classes for the gifted and talented, and accelerated classes. 

The reviews show that effects are a function of program type. 

Multilevel classes, which entail only minor adjustment of course content for ability groups, usually have little or no effect on student achievement. Programs that entail more substantial adjustment of curriculum to ability, such as cross-grade and within class programs, produce clear positive effects. Programs of enrichment and acceleration, which usually involve the greatest amount of curricular adjustment, have the largest effects on student learning. 

These results do not support recent claims that no one benefits from grouping or that students in the lower groups are harmed academically and emotionally by grouping' (p. 73).

Multi-level classes-
A total of 36 of the 51 studies examined results separately by ability level. Effects varied slightly with aptitude. The average effect size was 0.10 for higher aptitude, -0.02 for middle aptitude, and -0.01 for lower aptitude students.

The average effect size in all multi-level class programs was 0.03. (This is the figure Hattie reports, but as you can see it hides the detail).

Cross-grade grouping-

The average effect size in the 14 studies was 0.30.

Cross-grade grouping is like multi-level grouping in that students of different ability levels are taught in separate classrooms. But in cross-grade plans, there are typically more levels.

Comment: Hattie does not include this category under ability grouping and does not explain why.

Within Class grouping-
The average overall effect of grouping in the studies was d = 0.25. The average effect size was 0.30 for the higher ability students; 0.18 for the middle ability students; and 0.16 for the low-ability students. Once again an average would hide the detail.

Comment: Hattie does not include this result, but rather an earlier Kulik study (1985) where d = 0.15.

Gifted Talented-

The average effect in the studies was 0.41.

Accelerated classes for gifted talented-

Two types of studies produced distinctly different results. Each of the studies with same-age control groups showed greater achievement; the average effect size in these studies was 0.87.

Note: controlling for age is what I was arguing for in the section "A Year's Progress?".

However, if you use the (usually 1 year older) students as the control group, the average effect size in the studies was -0.02. Hattie mistakenly reports d = 0.02 in the category 'ability grouping for gifted students'.

NOTE: Hattie does not include the d = 0.87. I think a strong argument can be made that the result d = 0.87 should be reported instead of the d = 0.02 as the accelerated students should be compared to the student group they came from (same age students) rather than the older group they are accelerating into.
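The d = 0.47 figure quoted at the top of this page for 'ability grouping for gifted students' can be checked with simple arithmetic. This is a minimal sketch, assuming Hattie's category figure is an unweighted mean of the meta-analysis effect sizes. That is my assumption, and the number of meta-analyses is inferred from the two means rather than taken from Hattie.

```python
# Assumption: the category effect size is a simple (unweighted) mean.
reported_mean = 0.30   # Hattie's figure for 'ability grouping for gifted students'
used_d = 0.02          # value Hattie took from Kulik & Kulik (1992)
same_age_d = 0.87      # same-age control result from the same study
adjusted_mean = 0.47   # figure quoted in this post after the swap

# Replacing one value shifts an unweighted mean by (new - old) / n,
# so the two means imply how many values were averaged.
implied_n = (same_age_d - used_d) / (adjusted_mean - reported_mean)
print("implied number of meta-analyses:", round(implied_n))


def swap_mean(mean, n, old, new):
    """Unweighted mean after replacing one of n averaged values."""
    return mean + (new - old) / n


print(round(swap_mean(reported_mean, round(implied_n), used_d, same_age_d), 2))
```

Under that assumption, the two figures are consistent with each other: swapping a single value accounts for the whole change.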

Kulik and Kulik conclude: 'talented youngsters who were accelerated into higher grades performed as well as the talented older students already in those grades. Second, in the subjects in which they were accelerated, talented accelerates showed almost a year's advancement over talented same-age nonaccelerates' (p. 89).

Study 2 (acceleration):

Kulik and Kulik (1984) - Synthesis of Research on Effects of Accelerated Instruction.

Kulik and Kulik highlight the importance of what you are comparing in your study. They state that acceleration studies are divided into 2 groups: "The first type of study compares the performance of accelerated students with the performance of same age non-accelerates... The second type of study compares accelerates with same grade non-accelerates of equal intelligence. Comparison groups in this type of study are equivalent in grade and IQ" (p. 86).

With same age control groups (13 studies) d = 0.88 (Hattie uses this result, although he incorrectly states there were 26 studies).

With the year older control group (13 studies) d = 0.05.

Note: in the 1992 Kulik study, above, they get similar results, but Hattie uses the lower d value.

Kulik and Kulik note an issue with confounding variables, "The variation seemed great enough to lead us to suspect that factors other than acceleration were playing a role in determining study outcomes" (p. 87).

On the issue of how achievement is measured they say, "With poorly calibrated measuring instruments, investigators can hardly expect to achieve exact agreement in their results" (p. 89).

Regarding quality control, they state that this meta-analysis included only proper experimental studies and no results from correlational studies (p. 89).

Study 3 (within class grouping):

Lou et al (1996) - Within-Class Grouping: A Meta-Analysis

Hattie reports d = 0.17, but Lou et al. conclude: "We caution the reader that this meta-analysis, like others, does not allow one to make strong causal inferences, particularly with regard to explanatory features.

Not only were we unable to extract information from every study about the existence of particular factors, which reduces the sensitivity of the analyses, but the study features were often intercorrelated while the heterogeneity within categories of study features were not resolved in many cases, which makes unambiguous interpretation impossible and untempered conclusions unwise."

Interestingly, they analyse ability grouping by class size: in classes of more than 35 students d = 0.35, whereas in classes of fewer than 26 students d = 0.22.

Lou et al.'s Conclusions and Recommendations:

"The practice of within-class grouping is supported by the results of this review. Within-class grouping appears to be a useful means to facilitate student learning, particularly in large classes and especially in math and science courses" (p. 446).

Study 4 (ability grouping):

Neber et al (2001) - Cooperative Learning with Gifted and High-achieving Students: A review and meta-analyses of 12 studies.

Hattie reports d = 0.33 but, as the title suggests, this is a study of gifted students, so Hattie should have included it in the 'gifted' category, not the general 'ability grouping' category.

Neber et al. report, "few methodologically sound studies can be found at present" (p. 199).

They then summarise, "... high achievers' performance improve if they learn in homogeneous groups with other high-achieving students" (p. 210).

Study 5 (ability grouping):

Slavin (1990) - Achievement effects of Ability Grouping in Secondary Schools.

Hattie reports one of the lowest effect sizes d = -0.02.

Slavin used 29 studies (mostly from the 1960s): 6 randomised, 14 correlational and 9 matched experiments. Effect sizes ranged widely, from d = 0.28 down to d = -0.48. He then used the median (not the mean) of d = -0.02. He states, "There are few consistent patterns in the study findings" (p. 484).

Slavin defends the use of the median, "In pooling findings across studies, medians rather than means were used, principally to avoid giving too much weight to outliers. However, any measure of central tendency  ... should be interpreted in light of the quality and consistency of the studies from which it was derived, not a finding in its own right" (p. 477).

Note that 9 studies were not statistically significant and Slavin assigns d = 0.00 to these studies (p. 484). Other meta-analyses instead use the d values of non-significant results as reported; very few dismiss them completely. These different strategies profoundly affect the mean or median d values obtained.
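To see how much the handling of non-significant results matters, here is a minimal sketch. The effect sizes and significance flags are hypothetical, chosen only for illustration; they are not Slavin's data.

```python
from statistics import mean, median

# Hypothetical (effect size, statistically significant?) pairs, for illustration only.
studies = [(0.28, True), (0.15, False), (0.10, False), (0.08, False),
           (-0.05, False), (-0.20, True), (-0.48, True)]

# Strategy 1: use every d value as reported.
as_reported = [d for d, _ in studies]
# Strategy 2: set non-significant results to 0.00 (the approach described above).
zeroed = [d if significant else 0.0 for d, significant in studies]
# Strategy 3: discard non-significant results altogether.
dropped = [d for d, significant in studies if significant]

for label, values in [("as reported", as_reported),
                      ("zeroed", zeroed),
                      ("dropped", dropped)]:
    print(f"{label:12s} mean={mean(values):+.3f} median={median(values):+.3f}")
```

The same seven studies give a positive, zero or negative median depending only on which strategy is chosen.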

Similar to many other researchers, Slavin cautions the reader that many of the studies use approximation techniques to derive an effect size and these should be interpreted with even more caution than usual (p. 477).

Slavin also talks about the problem of varying sample size. "All other things being equal", studies with more students provide better evidence (p. 484). Note: Hattie has been criticised heavily for ignoring this issue.
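The sample size issue can be sketched as follows. The studies below are hypothetical, and weighting by the number of students is a simplification I have chosen for illustration (meta-analyses more commonly weight by inverse variance).

```python
# Hypothetical (effect size, number of students) pairs, for illustration only.
studies = [(0.60, 30), (0.40, 45), (0.05, 1200), (-0.02, 2500)]

# Every study counts equally, however small.
unweighted = sum(d for d, _ in studies) / len(studies)

# Each study counts in proportion to its number of students.
weighted = sum(d * n for d, n in studies) / sum(n for _, n in studies)

print(f"unweighted mean d = {unweighted:.3f}")
print(f"weighted mean d   = {weighted:.3f}")
```

Two small studies with large effects dominate the unweighted average, while the weighted average stays close to the results of the large studies.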

Slavin also notes other major confounding variables:

-the problem of dropouts becomes serious in senior high school as those who are most likely to drop out are in the low tracks. This could reduce the differences between high and low track students (p. 488).

-students in higher tracks are also likely to be higher in such attributes as motivation, internal locus of control, self-esteem, and effort, factors that are not likely to be controlled in correlational studies (p. 489).

-high and low track students usually differ in pretests or IQ by 1-2 standard deviations, an enormous systematic difference for which no statistical procedure can adequately control (p. 489).

Slavin concludes, "The present review cannot provide definitive answers" (p. 491).

He recommends a move to more rigorous experimental designs, using randomised controlled experiments rather than correlational studies (p. 490).

Other Commentary:

Professor Maureen Hallinan (1990) - The Effects of Ability Grouping in Secondary Schools: A Response to Slavin's Best-Evidence Synthesis.

Slavin is used by Hattie for a variety of influences, including ability grouping. Hallinan is very critical of Slavin's research. Her critique is instructive for all research on ability grouping and, in many ways, also points out issues with much of the research that Hattie uses.

The Problem with Averaging:
"The fact that the studies Slavin examines show no direct effect of ability grouping on student achievement is not surprising. The studies compare mean achievement scores of classes that are ability grouped to those that are not. Since means are averages, they reveal nothing about the distribution of scores in the two kinds of classes. Ability grouping may increase the spread of test scores while leaving the mean unchanged. 

This would occur if the practice had a differential impact on students with different abilities. Since teachers generally gear instruction to the ability level of the students being taught, students in a high ability group are likely to receive more and faster instruction and those in low ability groups less and slower instruction than pupils in an ungrouped class where instruction is geared to the average of the class. If greater gains of high achievers balance lesser gains of slow students in a grouped class, there should be no overall impact on the mean achievement of the class, compared to a heterogeneous class, even though the variance of the test scores in the two classes may differ markedly. 

Studies comparing only mean would show no direct effect of grouping on achievement" (p. 501).
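Hallinan's point can be illustrated with a small numerical sketch. The gains below are hypothetical, made up only to show two classes with the same mean gain but very different spreads.

```python
from statistics import mean, pstdev

# Hypothetical achievement gains (test points) for six students, illustration only.
ungrouped = [6, 5, 5, 5, 5, 4]   # instruction geared to the class average
grouped = [9, 8, 7, 3, 2, 1]     # high group gains more, low group gains less

for label, gains in [("ungrouped", ungrouped), ("grouped", grouped)]:
    print(f"{label:10s} mean gain={mean(gains):.2f} spread (SD)={pstdev(gains):.2f}")

# A comparison based only on means sees no difference at all.
print("difference in means:", mean(grouped) - mean(ungrouped))
```

A study that compares only the two means would report no effect of grouping, even though the high and low achievers in the grouped class had very different outcomes.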

Example, using Kelley & Camilli (2007) on Teacher Training comparing teachers.

The Complexity of Classroom Instruction:

"None of the studies referred to by Slavin takes into account either the content and pace of instruction or the pedagogical practice. The research systematically ignores instructional and curricular differences across classes. This is a fatal flaw in studies aimed to evaluate grouping effects" (p. 502).

The Issue of What is Used to Measure Achievement:
"Slavin's studies rely almost solely on standardised test scores to measure achievement. This outcome measure has well-known limitations. Standardised tests are not adequate measures of what students are taught in school. They can be viewed more accurately as tests of general ability or intelligence rather than of mastery of the curriculum. Failure to observe differences in standardised test scores across students is a poor measure of grouping effects... In general, Slavin's conclusions are based on a limited and flawed measure of student learning" (p. 502).

The Bias of Experimental Studies versus Case Studies:
"Finally, Slavin's selection of studies is skewed heavily in favour of experimental research. There are only a few surveys in his sample, and there is a complete absence of case studies... In so doing, he disregards important field work, such as that by Oakes (1985), Rosenbaum (1976), and others. Their work shows distinct differences in instructional techniques, teacher interactions, reward systems, student motivation, effort and self-esteem, student behaviour, disciplinary measures, administrative load, role modelling, and peer influences by level of ability group. It is difficult to believe that these dramatic findings are not related to differential learning patterns across ability groups. It may be that the design of some of the experimental studies Slavin examines hides the richness of the learning process—a complexity that is better detected by more in-depth studies" (p. 503).

Demirsky's (1991) review of Kulik's and Slavin's work picks up most of the issues that have already been discussed:

"If educators are to make informed decisions based on the findings ... they must study the original research and be sure the questions they're asking are the same ones posed by the researchers" (p. 60).

There are major issues depending on the tests used to measure achievement. For example, the scores of gifted students usually approach the ceiling on standardised tests, making it difficult to show significant academic improvement on their part (p. 61).
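The ceiling problem can be sketched with a toy example. The scores and the test maximum are hypothetical, and the assumption that the test simply records its maximum for any student above it is my simplification.

```python
from statistics import mean

TEST_MAX = 100  # hypothetical maximum score on the standardised test


def observed(true_scores, ceiling=TEST_MAX):
    """Scores as the test records them: nothing above the ceiling."""
    return [min(score, ceiling) for score in true_scores]


# Hypothetical 'true' attainment of five gifted students, illustration only.
before = [88, 92, 95, 97, 99]
after = [score + 8 for score in before]  # every student really improves by 8 points

true_gain = mean(after) - mean(before)
seen_gain = mean(observed(after)) - mean(observed(before))

print(f"true mean gain     = {true_gain:.1f}")
print(f"observed mean gain = {seen_gain:.1f}")
```

Because most of the students reach the maximum score after instruction, the test records a smaller gain than actually occurred.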

Another major criticism by teachers is that the standardised tests do not evaluate what they are teaching (p. 61).

"The most destructive aspect of the controversy over ability grouping is the misrepresentation of the findings." (p. 62).

There has been a great deal of misrepresentation and misinterpretation of the research. Educators need to be critical consumers (p. 64).
