Effect sizes and meta-analyses: How to interpret the “evidence” in evidence-based

Paul A. Kirschner & Mirjam Neelen

Kripa Sundar and Pooja Agarwal have published a guide to understanding meta-analyses and meta-meta-analyses.

Wikipedia defines a meta-analysis as:

a statistical analysis that combines the results of multiple scientific studies. Meta-analyses can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived. Meta-analytic results are considered the most trustworthy source of evidence by the evidence-based medicine literature.

Not only can a meta-analysis provide an estimate of the unknown effect size, it can also contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light with multiple studies.
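To make the idea of a "pooled estimate" concrete, here is a minimal sketch of one common way of combining studies: the fixed-effect, inverse-variance method. It is an illustration only, not the procedure used in the guide or by any particular author. The function name pooled_effect and all the numbers are hypothetical, and the sketch assumes every study estimates the same underlying effect.

```python
import math


def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error.

    effects:   one effect size per study (e.g. a standardized mean difference)
    variances: the sampling variance of each effect size
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("need one variance per effect size")
    # More precise studies (smaller variance) get more weight.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, effects)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    return estimate, std_error


# Hypothetical data: three studies, the middle one much more precise.
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.01, 0.04]

estimate, std_error = pooled_effect(effects, variances)
# Approximate 95% confidence interval around the pooled estimate.
ci_low = estimate - 1.96 * std_error
ci_high = estimate + 1.96 * std_error
print(f"pooled effect = {estimate:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the middle study is the most precise, it dominates the pooled value. Many published meta-analyses use a random-effects model instead, which relaxes the assumption of a single common effect.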

And a meta-meta-analysis is a meta-analysis of meta-analyses! A prime example of this is John Hattie’s Visible Learning, in which he ranked 138 influences that are related to learning outcomes from very positive effects to very negative effects. He followed this up with Visible Learning for Teachers (2011), which discussed 150 effects, and finally with The Applicability of Visible Learning to Higher Education (2015), which covered 195 effects. For more information on Hattie’s works see this.

It’s also the case that more and more meta-analytical articles are being written about many different educational interventions. But meta-analysis is tricky and its interpretation is sometimes even trickier. Why? Because:

  • Not all meta-analyses are trustworthy. When it comes to learning outcomes measured in a meta-analysis, researchers may be comparing apples to oranges.
  • The learning strategies being researched in a meta-analysis are often not consistent, either in how they are defined or in the research methods used to study them.
  • Often, the meta-analysis does not include recent studies. A meta-analysis published today was probably submitted at least a year ago (that’s how long it takes to get an article published). The analysis of the studies found and the writing of the article in umpteen versions probably took about two years. That means the most recent articles studied might be at least four years old. Results can change as more studies are conducted on the topic, and research that continues to be published is not included in the meta-analysis published today!
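One way to see the "apples and oranges" problem in numbers is to check how much the studies in a meta-analysis disagree with each other. The sketch below computes two standard heterogeneity statistics, Cochran's Q and I², for two made-up sets of studies. The function name heterogeneity and all values are hypothetical and are not taken from the guide or from any real meta-analysis.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q and the I-squared statistic (as a percentage)."""
    if len(effects) < 2 or len(effects) != len(variances):
        raise ValueError("need at least two studies, one variance per effect")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    # Q: weighted squared distance of each study from the pooled estimate.
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of the variation beyond what chance alone would give.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i_squared


# Hypothetical set 1: three studies of similar interventions.
q_similar, i2_similar = heterogeneity([0.4, 0.5, 0.6], [0.01, 0.01, 0.01])

# Hypothetical set 2: three very different interventions under one label.
q_mixed, i2_mixed = heterogeneity([0.1, 0.5, 1.5], [0.01, 0.01, 0.01])

print(f"similar studies: Q = {q_similar:.1f}, I^2 = {i2_similar:.0f}%")
print(f"mixed studies:   Q = {q_mixed:.1f}, I^2 = {i2_mixed:.0f}%")
```

A high I² does not by itself make a meta-analysis wrong, but it is a signal that a single averaged effect size may be hiding very different underlying results.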

The guide consists of three parts, namely:

  • an overview of meta-analyses,
  • an introduction to meta-meta-analyses, and
  • effect size statistics, tables, and more

With this guide, Sundar and Agarwal hope, in their words, “to empower you to question the ‘evidence’ in the evidence-based practices you encounter. Specifically, we want to equip you with the tools to assess whether summative evidence that’s presented based on effect sizes is trustworthy [in] meta-analyses and meta-meta-analyses.”


7 thoughts on “Effect sizes and meta-analyses: How to interpret the “evidence” in evidence-based

  1. Richard Phelps says:

    In my view, the “apples and oranges” argument among meta-analysts is irresolvable. Some years ago, I conducted a meta-analysis of the effects of testing on student achievement. I included any study that included any kind of variation in testing and a student achievement outcome measure. Some criticized it for “apples and oranges” comparisons: it included classroom testing as well as large-scale testing, testing in math as well as testing in reading, testing of adults as well as testing of kindergartners, and so on.
    I was interested in the overall effect of any kind of testing on student achievement. I believe that’s a valid research question. Those who wish to isolate the unique effects of classroom, math, or kindergarten testing are free to conduct those relevant meta-analyses.
    To argue, however, that there exists some threshold beyond which the category of testing is too broad to be considered valid for a meta-analysis is, in my opinion, a fool’s errand. I could just as well argue that a meta-analysis of classroom testing in reading on student achievement is too broad. Any old classroom anywhere? Oral as well as written tests? Adults as well as kindergartners? Tests administered on Wednesdays as well as tests administered on Fridays? They could all be different. There’s no end to such considerations.


  2. George LILLEY says:

    The distinction between the meta-meta-analysis used by Hattie & the EEF and a meta-analysis is important, given that Hattie and the EEF dominate education policy in my country. The article identifies a lot of significant issues & I think the simplest to understand is “The learning strategies being researched in a meta-analysis are not consistent”. The example given is Hattie’s collection of feedback studies: “If learning strategies included in the meta-analysis are not consistent and logical, then beware! For example, if you find a meta-analysis that groups together ‘feedback’ strategies including teacher praise, computer instruction, oral negative feedback, timing of feedback, and music as reinforcement, does that sound consistent to you?”

    If we go further into those ‘feedback’ studies, Hattie includes Standley (1996), which is about background music in a variety of settings, such as production lines & nursing homes. It gives almost the largest effect size in his book: a whopping 2.87! That should raise some concerns about its relevance in a classroom setting. If the details of other studies are analysed, even more concerns are raised. If you look at other influences, e.g. Self Report Grades, Behaviour & many others, you will see largely disparate studies grouped together to get one effect size. Many peer reviewers say as much, e.g., Thibault (2017) & Bergeron & Rivard (2017): “Hattie computes averages that do not make any sense.”
    more details here – https://visablelearning.blogspot.com/


  3. Richard Phelps says:

    “if you find a meta-analysis that groups together “feedback” strategies including teacher praise, computer instruction, oral negative feedback, timing of feedback, and music as reinforcement, does that sound consistent to you?”
    Don’t know if this question from George was directed at my comment. Nonetheless, my response would be “yes”. If they are all types of feedback, then they fit under the category of “feedback.”
    Hattie did a huge amount of work. Moreover, he’s open about what he includes, making it easier for others to build off of his work. He’s not stopping anyone from doing more. Anyone is welcome to do their own studies with narrower or different categorizations.
    It’s fine that Kripa and Pooja form their own categorization for retrieval practice.
    I just don’t think that someone is wrong because they categorize studies at a higher level of aggregation or otherwise differently than one would have liked them to. Including both apples and oranges is fine if one is studying “fruit.”
    If one doesn’t like someone else’s categorization scheme, they can make their own, as Kripa and Pooja did.


  4. George LILLEY says:

    I can’t make my own classifications, as I’m a teacher in an educational system where Hattie’s strategies are mandated & I’m held accountable for using them. I can’t raise objections and point out issues with his research in that system. Also, I don’t think Hattie’s argument that we are studying “fruit” justifies his use of totally disparate studies. The notion of “feedback” & all other influences that Hattie presents to school teachers is that most of his studies are in classroom settings. What relevance does the Standley study have to teachers in a classroom? Also, Hattie’s claim is that he presents in a league table “What Works Best” in schools. How do effect sizes, averaged from totally disparate studies, determine ‘what works best’? We are not talking about ‘fruit’ here; Hattie claims he has ranked specific influences that affect schools.

