Paul A. Kirschner & Mirjam Neelen
Is a study really trustworthy or does it only seem to be? This is an exploration of truth versus the beautiful word that Stephen Colbert – American comedian – came up with: ‘truthiness’ (funny short video on restoring truthiness here), which means roughly: something that sounds plausible and therefore people prefer to believe it and to hold on to it, without taking facts, logic, or any contradictory evidence into consideration. Truthiness shouldn’t be confused with trustworthiness because the latter means that you can actually rely on something as being honest or truthful (i.e., you can trust it).
This blog is, unfortunately, about an almost daily experience: the conclusions that scientists draw from their findings – and sometimes even the studies themselves – are actually based on truthiness and not on truth or trustworthiness. And even worse, the conclusions then drawn by newspapers (and other media), administrators, teachers, and eduquacks have this same non-basis.
Judging truth versus truthiness
In his 2014 article, ‘A proposal for judging the trustworthiness of research findings’, Stephen Gorard explains what to look for in order to determine if a study is trustworthy or not. To put it simply, “a poorly described study cannot be trusted” (p. 49).
An example is the study ‘Mobile Devices in Early Learning’, recently discussed by Robbie Meredith on BBC News. The article starts with the following sexy sentence: “Young children’s maths, English and communication skills improve if they use iPads in schools on a regular basis”. And in the press release from Stranmillis University College, where the researchers work, we read:
The study’s findings showed that, in the five participating schools, all of which were located in catchment areas of high social deprivation and academic under-achievement, the introduction of digital technology has had a positive impact on the development of pupil literacy and numeracy skills. And, contrary to initial expectations, principals and teachers also reported that their use had enhanced children’s communication skills, acting as a stimulus for peer to peer and pupil to teacher discussion.
Wow! Fantastic! We can just hear school boards all around the world thinking “Let’s buy iPads for all the kids in our school district”.
As researchers, you can hear us thinking, “Let’s see the study so that we can determine its value”. Thanks to the detective work of Tom Bennett, we can all now get our hands on the study and what do we find?
- All of the results are completely based on subjective self-reporting. Nothing was objectively measured!
- The response rate to the questionnaire was 27% (after a second push); the first round yielded only 8%. In other words, 73% never responded.
- There are possible important design biases as the schools were selected based on their commitment to the project, pre-existing use of ICT and iPads in the school, and commitment to use iPads in the future.
- There was variable usage in that the schools used them at different times, with different apps (and we have no idea what those magical apps were), in different ways, and with different children.
- There was no control group. What is this intervention better than?
What we’re trying to get at is that if you want to use research to truly provide evidence for a point you’re trying to make (which in itself is praiseworthy, we’d say) you need to be absolutely sure that the research you’re referring to is trustworthy. In order to do so, you should consider the following aspects:
Research design – You need to ask yourself whether what the researcher has done in his or her study has actually caused the results obtained. In other words, you need to be sure that there is an actual causal relationship. This means that the presented findings or results are indeed the direct result of the experiment (see validity later on in this blog). Correlations won’t get you very far. After all, it’s very likely that people who died by becoming tangled in their bedsheets have eaten cheese (as the graphic below shows, 94.7% correlation). However, how likely is it that eating cheese actually causes your death by becoming tangled in your bedsheets? Just saying[1].
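To see how little a high correlation by itself proves, here is a minimal sketch with invented numbers (not the actual cheese/bedsheet data behind the 94.7% figure): two series that both drift upwards over ten years correlate strongly, even though no one would claim that one causes the other.

```python
# Minimal sketch: two invented series that both drift upwards over ten years.
# The numbers are made up for illustration only.
import numpy as np

cheese_kg_per_capita = np.array([13.3, 13.5, 13.7, 13.9, 14.2, 14.5, 14.7, 15.0, 15.2, 15.4])
bedsheet_deaths = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717])

r = np.corrcoef(cheese_kg_per_capita, bedsheet_deaths)[0, 1]
print(f"Pearson r = {r:.2f}")  # high, yet nobody thinks cheese strangles sleepers in their sheets
```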
The ‘gold standard’ in research is a design in which the participants are randomly selected and randomly placed in groups where everything that happens within the groups is exactly the same, except for the intervention (this totality is called a Randomised Controlled Trial (RCT)). Such a set-up is even better when the participants also don’t know whether they’re in the intervention group or in the control group, so that their behaviour cannot be subconsciously influenced (this is called a blind or blinded experiment[2]). It is better still when the researcher also doesn’t know who is in which group, so that her/his observations and decisions cannot be subconsciously influenced (this is called a double-blind experiment[3]). An RCT is simply the best because, when you keep all variables (except the one that you want to test) exactly the same, you can be quite sure that if your findings confirm that your intervention (e.g., a teacher’s teaching preference or an employee’s received peer feedback) works, the intervention is indeed causing the results.
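As a purely illustrative sketch (the participant IDs and group codes below are hypothetical, not taken from any real study), random assignment plus a simple form of blinding might look like this:

```python
# Minimal sketch of random assignment for an RCT with hypothetical participants.
# The neutral code names "A" and "B" mimic blinding: neither participants nor
# raters can tell from the label which group received the intervention.
import random

participants = [f"P{i:03d}" for i in range(1, 61)]  # 60 hypothetical participants
random.seed(42)                                      # seed only to make the example reproducible
random.shuffle(participants)

groups = {"A": participants[:30], "B": participants[30:]}
# In a double-blind set-up, the key linking "A"/"B" to intervention/control is
# held by a third party until the analysis is finished.
print(len(groups["A"]), "in group A,", len(groups["B"]), "in group B")
```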
Scale – In this case, generally speaking, the adage ‘the more the better’ usually holds. In other words, the more participants that take part in a research project, the more trustworthy the research. If a researcher states that her/his intervention causes amazing results based on an experiment with 10 participants (N=10), you should surely feel a bit suspicious. And beware when you come across a study that has divided 30 schools (with a total of 9,000 students and thus, on average, 300 students per school) into two groups, an intervention group and a control group, and then has collected and compared all of the students’ scores to make its claims: this study has an N of 30 (and only 15 per treatment) and NOT an N of 9,000 or 300. After all, it’s not the scores of each individual student that are being compared, but only the average results of the schools, with, for example, a possible conclusion being that schools that use a certain innovative method have a significantly higher average test score than schools that used an old or different method.
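To make the ‘N of 30, not 9,000’ point concrete, here is a minimal sketch with simulated scores for 30 hypothetical schools of roughly 300 students each; whatever statistical test follows, it only ever sees 30 school means.

```python
# Minimal sketch: the unit of analysis is the school, not the student.
# All scores are simulated; no real data are involved.
import random
import statistics

random.seed(1)
school_means = []
for school in range(30):
    students = [random.gauss(60, 12) for _ in range(300)]  # ~300 simulated test scores
    school_means.append(statistics.mean(students))         # one number per school

intervention_schools = school_means[:15]  # 15 schools with the new method
control_schools = school_means[15:]       # 15 schools without it
# Any comparison now runs on 15 + 15 school means: N = 30, not 9,000.
print(len(intervention_schools) + len(control_schools), "data points enter the comparison")
```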
Dropout – Does the researcher tell you how many participants stuck with the experiment all the way? In other words, does the study describe the number of participants that it started and ended with, and does it also analyse the dropouts? The number of participants who drop out (also known as experimental mortality[4] or differential attrition) and their distribution amongst groups can have substantial consequences for the conclusions that you’re able to draw. For example, if there was a lot of dropout in the ‘intervention group’, it might mean that only the very motivated or smartest ones persisted, and therefore it remains to be seen whether the intervention is actually causing the findings (or whether it’s perhaps motivation, intelligence, or something else).
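The sketch below uses simulated data to show how differential attrition alone can manufacture an apparent effect: the ‘intervention’ changes nothing, but because the least motivated participants quit the intervention group, its remaining members outscore the control group on average.

```python
# Minimal sketch of differential attrition with simulated data.
# Both groups are drawn from the same population, so the intervention itself
# has zero effect; only the selective dropout creates the difference.
import random
import statistics

random.seed(7)

def simulated_participant():
    motivation = random.random()                   # 0 = unmotivated, 1 = very motivated
    score = random.gauss(50 + 30 * motivation, 8)  # more motivated people score higher anyway
    return motivation, score

intervention = [simulated_participant() for _ in range(100)]
control = [simulated_participant() for _ in range(100)]

# The demanding intervention drives out the unmotivated; nobody leaves the control group.
survivors = [score for motivation, score in intervention if motivation > 0.4]

print("intervention survivors:", round(statistics.mean(survivors), 1))
print("control group:         ", round(statistics.mean([s for _, s in control]), 1))
```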
Data quality – Without a doubt, the data must be both reliable and valid. Reliable means that 1) the experiment must be repeatable by others with the same or highly similar results (replicability), 2) if the experimenter carries out the same experiment or introduces the same intervention again, the results should be the same that time too (test-retest reliability), and 3) if various researchers are carrying out the same experiment, they must all do it in exactly the same way and score it in the same way (inter-rater or inter-researcher reliability).
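As a small illustration with made-up scores, the sketch below checks two of these: test-retest reliability via a simple correlation, and inter-rater reliability via plain percent agreement (a fuller analysis would use something like Cohen’s kappa to correct for chance agreement).

```python
# Minimal sketch of two reliability checks; all scores and ratings are made up.
import numpy as np

# Test-retest reliability: the same 8 people take the same test twice.
# A high correlation means the instrument gives stable results over time.
test_1 = np.array([55, 62, 70, 48, 81, 66, 59, 74])
test_2 = np.array([57, 60, 72, 50, 79, 68, 61, 73])
print("test-retest r =", round(np.corrcoef(test_1, test_2)[0, 1], 2))

# Inter-rater reliability: two raters score the same 8 essays with the same rubric.
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print("inter-rater agreement =", agreement)  # 0.875 here; kappa would be lower
```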
Validity is something different. Simply stated, it means that you are measuring what you intend to measure. Here we can speak of a whole group of sometimes interrelated validities:
- Face validity is a measure of how representative a research project is ‘at face value,’ and whether it appears to be a good project. This is the weakest form of validity.
- Internal validity is about whether what is observed in a study is due to the intervention carried out and not something else. In other words, there’s a causal relationship between what you do (the independent variable) and what occurs (the dependent variable). Research on media is especially bad in this respect: not only are the media used different, but so is the pedagogy / teaching method. Dick Clark wrote a masterpiece on this in 1983.
- External validity is about generalisation: Can an effect that is found be generalised, and if so to what extent, to other populations (population validity), other settings (ecological validity), other treatment variables, and other measurement variables?
- Test validity is an indicator of how much meaning can be placed upon a set of test results. Is there evidence and theory – and if so how strong is it – to support the interpretations of the test scores of a specific test? Many research studies suffer here in that they measure something different than they purport to measure.
If a researcher claims that, for example, “the intervention shows that course X was more effective because of gamification Y”, then this isn’t tenable if what is actually measured is that “learners say that course X was more effective because of gamification Y.” The latter is just based on opinion and in this case virtually meaningless.
- Criterion validity assesses whether a test reflects a certain set of abilities; the extent to which a measure really is related to an outcome, either in comparison with a currently existing criterion (concurrent validity) or in whether and to what extent it accurately predicts something that will occur in the future (predictive validity).
- Content validity is the estimate of how much a measure represents every single element of a construct; the extent to which a measure represents all facets of a given construct and not just one aspect.
- Construct validity defines how well a test or experiment measures up to its claims. A test designed to measure depression must only measure that particular construct, not closely related constructs such as anxiety or stress.
Note: If data are valid, they must be reliable. If people receive very different scores on a test every time they take it, the test is not likely to predict anything. However, if a test is reliable, that does not mean that it is valid. For example, we can measure the length of someone’s foot reliably, but that does not make it a valid measure of intelligence or even of running speed. Reliability is a necessary, but not sufficient, condition for validity.
Finally, Gorard mentions a couple of ‘threats’ that can cause the research to be dodgy. The major ones are that the researchers (a) aren’t independent (e.g., they themselves carry out the study and know what they’re trying to accomplish) and (b) are aware of the group in which the results were produced (in other words, they’re not ‘blind’ to the intervention).
And there are also biases within the researcher that can cause both the research and its conclusions to be dodgy. Take, for example, the Nine types of bias here, the Varieties of bias to guard against here, or the Twenty cognitive biases that screw up your decisions here.
Gorard has created a ‘sieve’ with six categories through which research can be filtered to help estimate the trustworthiness of a study. The categories are: design, scale, dropout, outcomes, fidelity, and validity. He has given each category five quality levels, from 0 to 4 stars. The lowest rating in any one column determines the overall trustworthiness of a study.
Table 1: A ‘sieve’ to assist in the estimation of trustworthiness (Gorard, 2014)
The reason why the lowest rating determines the overall trustworthiness of a study is that even when a study is honest and large-scale, with hardly any dropout and with standardised results, if the intervention is described in a wishy-washy manner (i.e., you really don’t know or understand what the intervention exactly was) or if the intervention is not equivalent (e.g., the intervention group spent twice as much time working on the learning experience as the control group), then that study, overall, has low trustworthiness and still only gets 2 stars. And returning to the iPad study described earlier, we can give it a big fat 0.
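The ‘weakest link’ logic of the sieve can be captured in a few lines. The star ratings below are hypothetical, chosen only to mirror the example above; they are not Gorard’s scoring of any real study.

```python
# Minimal sketch of the sieve's 'weakest link' rule: the overall trustworthiness
# rating is the minimum of the six category ratings (0-4 stars).
ratings = {           # hypothetical ratings for an otherwise solid study
    "design": 4,
    "scale": 4,
    "dropout": 3,
    "outcomes": 3,
    "fidelity": 2,    # the intervention is only described in a wishy-washy way
    "validity": 3,
}

overall = min(ratings.values())
weakest = min(ratings, key=ratings.get)
print(f"overall trustworthiness: {overall} stars, capped by '{weakest}'")
```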
And if this has gotten you thinking, then you might also enjoy this: Understanding Types of Evidence: A Guide for Educators from Mathematica Policy Research. This short ‘paper’ describes four key types of evidence educators are likely to encounter and explains how to tell whether these types of evidence can provide strong support for claims about an educational technology’s effectiveness.
DO try this at home next time you read an article or publication. It will help you to determine if you’re dealing with something that is trustworthy or is just simply truthy!
References
Bennett, T. (2017, May 24). iPads: game changers or money paperweights? New study tells us little [Blog post]. Retrieved from http://behaviourguru.blogspot.ie/2017/05/ipads-game-changers-or-money.html
Chojnacki, G., Resch, A., Vigil, A., Martinez, I., & Bates, S. (2016). Types of evidence: A guide for educators. Retrieved from https://www.mathematica-mpr.com/download-media?MediaItemId=%7b8049287B-2D53-4150-9F26-72C7084E5236%7d
Clark, R. E. (1983). Reconsidering research on learning from media. Review of Educational Research, 53, 445-459. Retrieved from http://www.uky.edu/~gmswan3/609/Clark_1983.pdf
Gorard, S. (2014). A proposal for judging the trustworthiness of research findings. Radical Statistics, 110, 47-60. Retrieved from http://www.radstats.org.uk/no110/Gorard110.pdf
[1] By the way: If you want to laugh at some great spurious correlations, go to Tyler Vigen’s webpage (http://tylervigen.com/spurious-correlations) and even play with it yourself here (http://tylervigen.com/discover).
[2] A blind – or blinded – experiment is an experiment in which information about the test is masked (kept) from the participant, to reduce or eliminate bias, until after a trial outcome is known. It is understood that bias may be intentional or unconscious; thus, no dishonesty is implied by blinding.
[3] Double-blind describes an especially stringent way of conducting an experiment which attempts to eliminate subjective, unrecognized biases carried by an experiment’s subjects (usually human) and conductors. Here both researcher and respondent are blinded.