Making a SMOCKery of Science

Paul A. Kirschner

In 2017, Mirjam Neelen and I wrote a blog entitled Truth or Truthiness in which we discussed whether a study is trustworthy or whether it only seems to be trustworthy. We used the term truthiness to describe the latter. Truthiness is a word that Stephen Colbert – American comedian – came up with. Roughly it means: something that sounds plausible and therefore people prefer to believe it and hold on to it, without taking facts, logic, or any contradictory evidence into consideration. Truthiness shouldn’t be confused with trustworthiness because the latter means that you can actually rely on something as being honest or truthful (i.e., you can trust it).

This morning I found an email in my inbox from Richard Clark, with whom I wrote a few articles, including the 2006 article[1] on why inquiry learning and its synonymous approaches to learning don’t work. He alerted me to a recent article by friend and colleague Dan Robinson (I tried to get him to apply for my vacated position at the Open Universiteit, which he unfortunately didn’t do), who is inclined to debunk inflated and inaccurate claims made by researchers. Dick wrote: “His latest, just published effort is a great example of how to systematically debunk a poor study that has received a huge amount of attention”.

The article written by Dan, ‘A Complete SMOCkery: Daily Online Testing Did Not Boost College Performance’, was published in Educational Psychology Review[2] and discusses an earlier article by Pennebaker, Gosling, and Ferrell (2013)[3] published in PLoS One. Here’s the abstract of that article:

An in-class computer-based system, that included daily online testing, was introduced to two large university classes. We examined subsequent improvements in academic performance and reductions in the achievement gaps between lower- and upper-middle class students in academic performance. Students (N = 901) brought laptop computers to classes and took daily quizzes that provided immediate and personalized feedback. Student performance was compared with the same data for traditional classes taught previously by the same instructors (N = 935). Exam performance was approximately half a letter grade above previous semesters, based on comparisons of identical questions asked from earlier years. Students in the experimental classes performed better in other classes, both in the semester they took the course and in subsequent semester classes. The new system resulted in a 50% reduction in the achievement gap as measured by grades among students of different social classes. These findings suggest that frequent consequential quizzing should be used routinely in large lecture courses to improve performance in class and in other concurrent and subsequent courses.

Robinson writes that Pennebaker et al. reported that “an innovative computer-based system that included daily online testing resulted in better student performance in other concurrent courses and a reduction in achievement gaps between lower and upper middle-class students” (p. 1). But is this so? Robinson takes a closer look at the data that Pennebaker and his colleagues used and essentially applies part of Gorard’s Sieve to it. To help you, here’s an excerpt from the aforementioned blog that Mirjam and I wrote:

In his 2014 article[4], ‘A proposal for judging the trustworthiness of research findings’, Stephen Gorard explains what to look for in order to determine if a study is trustworthy or not…[He] created a ‘sieve’ with six categories through which research can be filtered to help estimate the trustworthiness of a study. The categories are: design, scale, dropout, outcomes, fidelity, and validity. He has given each scale five quality levels, from 0 to 4 stars. The lowest rating in any one column determines the overall trustworthiness of a study. The reason why the lowest rating determines the overall trustworthiness of a study is that even when a study is honest and large-scale with hardly any dropout and with standardised results, if the intervention is described in a wishy-washy manner (i.e., you really don’t know or understand what the intervention exactly was) or if the intervention is not equivalent (e.g., the intervention group spent twice as much time working on the learning experience than the control group), that study, overall, has a low trustworthiness and still only gets 2 stars.
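The “lowest rating wins” rule in that excerpt is easy to express in a few lines of Python. What follows is only a minimal sketch of that one rule, not Gorard’s full procedure: the six category names come from the excerpt, while the function name and the example ratings are hypothetical and invented for illustration.

```python
# Sketch of the "lowest rating determines the overall rating" rule.
# Category names are taken from the excerpt; the ratings below are invented.
CATEGORIES = ("design", "scale", "dropout", "outcomes", "fidelity", "validity")


def overall_trustworthiness(ratings: dict[str, int]) -> int:
    """Return the overall star rating: the minimum across the six categories."""
    missing = [c for c in CATEGORIES if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for category in CATEGORIES:
        if not 0 <= ratings[category] <= 4:
            raise ValueError(f"{category} must be rated from 0 to 4 stars")
    return min(ratings[c] for c in CATEGORIES)


# Invented example: strong on most categories, but a poorly described intervention.
example_study = {
    "design": 4, "scale": 4, "dropout": 4,
    "outcomes": 4, "fidelity": 2, "validity": 3,
}
print(overall_trustworthiness(example_study))  # 2
```

However good the other five ratings are, the single 2-star category caps the overall result at 2 stars.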

Robinson writes: “As in many cases of false claims, threats to internal validity were not adequately addressed. Student performance increases in other courses can be explained entirely by selection bias, whereas achievement gap reductions may be explained by differential attrition.”

As the study did not randomly assign students to experimental conditions, he first looked for possible pre-existing student differences that could explain any subsequent performance differences. In other words, was there selection bias? One of the differences that he looked at was the students’ major. He writes: “As many people know, there exist, at most universities, differences in GPA among various majors. For example, it is well known that education majors typically have higher GPAs than do engineering majors. Thus, if one of the groups in a comparison study has more students from an “easier” or “harder” major than the other group, this pre-existing difference could surface in any outcome variables that use the same measure or a similar one.” And if this is the case, all or a great deal of the effect could be ascribed to this difference. What he found was:

This “major” effect size for the online testing group over the traditional group (3.29 − 3.18 = 0.11) is almost identical to the reported advantages reported by Pennebaker et al. (2013) for both the concurrent semester (3.07 − 2.96 = 0.11) and the subsequent semester (3.10 − 2.98 = 0.12). Thus, the student performance increases can be fully explained by selection bias: there were different proportions of students from majors that naturally tend to have higher or lower grades in those major courses. With regard to internal validity, when an alternative explanation exists that can account for an “experimental” effect, then that experimental effect becomes bogus. (p. 4)
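To see how closely these numbers line up, the three subtractions can be redone directly. This is just the arithmetic from the quote above; the dictionary labels are my own wording, and the rounding to two decimals is there only to avoid floating-point noise.

```python
# Group means quoted from Robinson (p. 4): (online testing, traditional).
comparisons = {
    "expected from majors alone": (3.29, 3.18),
    "concurrent semester": (3.07, 2.96),
    "subsequent semester": (3.10, 2.98),
}

# Difference per comparison, rounded to two decimals.
differences = {
    label: round(online - traditional, 2)
    for label, (online, traditional) in comparisons.items()
}

for label, diff in differences.items():
    print(f"{label}: {diff:.2f}")
# expected from majors alone: 0.11
# concurrent semester: 0.11
# subsequent semester: 0.12
```

The difference expected from the mix of majors alone is as large as the differences that were attributed to daily online testing.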

Dan didn’t stop there. With respect to the reduction in achievement gaps found by Pennebaker et al. (2013), the authors themselves noted that the online testing courses were more rigorous than the traditional courses due to daily quizzes. Why is this important? Because, typically, with increased rigor come increased drop rates. Because of this, he also examined whether differential attrition (i.e., that one group lost more participants than the other due to the intervention itself) might explain the reduction in achievement gaps. To understand this he gives the following example. Let’s say you’re comparing weight loss between participants and non-participants in a fitness bootcamp and find that the average bootcamp participant loses 15 pounds by the end of the four-week camp (and the non-participant loses no weight or even gains weight). The problem here is that out of every 100 participants that show up for the bootcamp, 80 drop out (i.e., don’t finish it) due to its extreme rigour or the effort that participants need to put into it, while of the 100 in the control group who didn’t participate, zero drop out (i.e., no rigour) and thus remain until the end of the four weeks. Weight loss comparisons are made between the 20 who finished the bootcamp and the 100 control group participants. Thus, the bootcamp’s claim is exaggerated due to differential attrition.
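The bootcamp example can be worked through in a few lines. The 100 enrolled, 80 dropouts, 20 finishers, and 15 pounds come from the example; the weight loss of the dropouts is not given, so the sketch below assumes, purely for illustration, that they lost nothing, and it sets the control group’s loss to zero.

```python
# Numbers from the bootcamp example.
enrolled = 100
finishers = 20
dropouts = enrolled - finishers   # 80
loss_finishers = 15.0             # average pounds lost by those who finished
control_loss = 0.0                # control group loses no weight

# ASSUMPTION (not in the example): dropouts lost no weight at all.
loss_dropouts = 0.0

# What the bootcamp reports: finishers versus the control group.
completers_only_effect = loss_finishers - control_loss

# What happens if everyone who showed up is counted.
mean_loss_all_enrolled = (
    finishers * loss_finishers + dropouts * loss_dropouts
) / enrolled
all_enrolled_effect = mean_loss_all_enrolled - control_loss

print(completers_only_effect)  # 15.0
print(all_enrolled_effect)     # 3.0
```

Under that assumption, the advertised 15-pound advantage shrinks to 3 pounds once the 80 people who dropped out are counted as well.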

And what did Robinson find?

…in 2008 when the psychology course was less rigorous with no daily quizzes, only 32 students dropped the course. Comparatively, in 2011 when the rigor was increased, almost twice as many students (58) dropped. Students from lower SES families unfortunately tend to drop courses at higher rates than do their richer counterparts. It is certainly possible that many of these students who dropped were from the low middle class. Thus, any analysis would show a reduction in the performance differences between the low and high middle-class students. (p. 5)
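A toy calculation shows how selective dropping alone can shrink a measured gap. Only the two drop counts below come from Robinson; the exam scores are invented, and the sketch assumes that the students who drop are the lowest-scoring students in the lower-SES group, which is an illustration and not a finding from either article.

```python
from statistics import mean

# Drop counts quoted from Robinson (p. 5).
drops_2008, drops_2011 = 32, 58
drop_ratio = drops_2011 / drops_2008
print(f"{drop_ratio:.2f}")  # 1.81, i.e. "almost twice as many"

# INVENTED toy exam scores (percent); not data from either article.
higher_ses = [90, 85, 80, 75]
lower_ses_all = [85, 75, 65, 55]

# ASSUMPTION: the two lowest-scoring lower-SES students drop the course.
lower_ses_remaining = sorted(lower_ses_all)[2:]  # [75, 85]

gap_before = mean(higher_ses) - mean(lower_ses_all)
gap_after = mean(higher_ses) - mean(lower_ses_remaining)

print(gap_before)  # 12.5
print(gap_after)   # 2.5
```

Nobody in this toy class learned anything more, yet the gap between the groups fell from 12.5 to 2.5 points simply because the weakest students were no longer there to be measured.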

Robinson concludes with two things to think about. First, spurred on by the shift to online courses during the current pandemic, many are arguing about whether online instruction is just as effective as face-to-face instruction (never waste a good crisis). Pennebaker et al.’s (2013) findings not only allowed some to conclude that online instruction may be equally effective, but “the suggestion that online may be more effective than face-to-face spurred efforts to shift more and more instruction to online environments. But, as the present findings suggest, such enthusiasm for online instruction may not be supported by the data” (p. 5). Second, and this is more about research and journals, he writes: “All members of the scientific community need to consider using the strongest possible methods and carefully note study limitations.” Pennebaker et al. (2013) didn’t. Also, after this one-shot piece of research, it would have been really easy to design and carry out a randomized experiment to test the effectiveness claims. Robinson writes: “With almost one thousand students enrolling in the introductory psychology course each semester, it would have been easy to randomly assign half of them to either a SMOC or control, face-to-face section.” If they had done this, then we would know whether the results were truthful or just truthy.
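The random assignment Robinson describes really is simple to carry out. Here is a minimal sketch, assuming a list of 1,000 student IDs; the function name, the ID range, and the seed are hypothetical, and a real study would of course also need consent, scheduling, and so on.

```python
import random


def assign_conditions(student_ids, seed=None):
    """Shuffle the students and split them into two equally sized sections."""
    rng = random.Random(seed)   # a seeded generator keeps the split reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "SMOC": sorted(shuffled[:half]),
        "face-to-face": sorted(shuffled[half:]),
    }


# Hypothetical cohort of 1,000 students with IDs 1 to 1000.
groups = assign_conditions(range(1, 1001), seed=42)
print(len(groups["SMOC"]), len(groups["face-to-face"]))  # 500 500
```

Because chance alone decides who ends up in which section, pre-existing differences such as a student’s major are balanced across the two groups on average, which is exactly what the original comparison lacked.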

We, both academics and practitioners, need to be careful to constantly be critical of what we read, even if it has been published in a refereed journal. We need to analyse what others say, putting it through Gorard’s sieve so as not to accept or repeat bogus claims. And if they do find their way into journals, we need more people like Dan Robinson to call them out.

[1] Kirschner, P. A., Sweller, J., & Clark, R. E. (2006). Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational Psychologist, 41(2), 75-86. doi:10.1207/s15326985ep4102_1

[2] Robinson, D. H. (2021, Online). A complete SMOCkery: Daily online testing did not boost college performance. Educational Psychology Review. doi:10.1007/s10648-020-09588-0

[3] Pennebaker, J. W., Gosling, S. D., & Ferrell, J. D. (2013). Daily online testing in large classes: boosting college performance while reducing achievement gaps. PLoS One, 8(11), e79774. doi:10.1371/journal.pone.0079774.

[4] Gorard, S. (2014). A proposal for judging the trustworthiness of research findings. Radical Statistics, 110, 47-60. http://www.radstats.org.uk/no110/Gorard110.pdf
