Truth or Truthiness? Analysing a VR Study Using Gorard’s Sieve

Mirjam Neelen & Paul A. Kirschner

It’s probably no news to anyone that we’re big on evidence-informed learning design (our whole blog is dedicated to it, mind you, and we published a book on the topic as well). One crucial part of evidence-informed practice is judging the veracity of information and/or research on how people learn, how one can best teach or train others, and which media/methods work and which don’t.

Asking ourselves whether certain information is truly truthful or only seems to be is not just nice to do, it’s a must. We need to be able to solidly/definitively differentiate between truth and its wicked stepsister ‘truthiness’[1] (see our blog here), a term coined by Stephen Colbert. Only if we’re able to determine which information we should dismiss and which we should trust can we make well-informed decisions on how to design effective, efficient, and enjoyable learning experiences.

In this blog, we use Stephen Gorard’s sieve (2014) (see Table 1) to try to determine whether a study we ran across is trustworthy or not. The case is a recent publication by PwC, in collaboration with Oculus for Business (a producer of Virtual Reality (VR) goggles) and Talespin (a company that makes VR content), of a study titled ‘The Effectiveness of Virtual Reality Soft Skills Training in the Enterprise’.

While we may think that it’s great that organisations go the extra mile to conduct research in the learning space, at the same time we need to be critical, especially when there are commercial interests involved. If we’re going to invest time and money – sometimes lots of both – we want the research to be impartial and transparent and not to be tantamount to learners filling in their own grades or letting the fox guard the hen house!

We didn’t go looking for studies with chips on our shoulders to prove a point, but decided to analyse the PwC study because the Head of VR/AR at PwC posted it on LinkedIn and it received many very positive responses (e.g., ‘these figures are compelling evidence’, ‘pleased to find evidence about learning’, ‘this is fantastic’, and so forth). Seeing this, we wondered whether this was truly a kind of holy grail – which would be great – or not. We felt that putting the study through Gorard’s sieve would make a good case study.

Before we dive in, first a brief description of (the purpose of) the PwC study.

Summary of the Study

For this research, the PwC Emerging Technology team collaborated with a team of internal learning scientists (part of their own L&D group). Having people who know tech working with people who know learning science is a good combination in itself. However, it might have been better to employ more neutral parties to avoid both conflicts of interest and confirmation bias, especially because the two PwC teams also collaborated with two external commercial companies (Oculus and Talespin) on the research. This is risky to say the least. After all, these two external partners weren’t very independent either (both Oculus and Talespin would benefit from selling more VR). Anyone who conducts a study should be very aware of conflicts of interest and confirmation bias and should do everything possible to avoid both (or at least note in the report that there are potential conflicts of interest).

According to PwC, the purpose of the study is to determine whether VR is effective for training leadership, soft skills, or other human-to-human interactions. In other words, does VR have advantages over traditional[2] classroom or e-learning methods? The study wanted to answer two specific questions:

  1. Is VR soft skills training more effective than traditional training methods?
  2. Is VR soft skills training more cost-effective to deploy than traditional training methods?

The VR pilot studied “the impact of using VR to train new managers on inclusive leadership, …training our leaders about diversity and inclusion” (p 6). Note that research that studies training people on inclusive leadership can’t answer the much broader questions about the impact of VR on soft skills development overall. After all, even if you determined that VR is effective for learning and/or is cost-effective for ‘inclusive leadership training’, you can’t just conclude that those results generalise to all soft skills. So, we need to proceed with some caution there.

The hypotheses (they called them the outcomes that were focussed on) were:

H1. Employee satisfaction – Learners will enjoy VR learning more than traditional learning methods.

H2. Learner flexibility – VR learning will provide a more flexible remote learning experience than traditional classroom methods (e.g., ability to take the training away from the workspace, easier to fit into personal schedules).

H3. Comfortable learning environment – VR learning will provide a more comfortable and less stressful environment to practice new skills than traditional methods (e.g., being awkward, making mistakes, course-correcting, etc.).

H4. Improved attention – VR learning will result in fewer instances of being distracted while learning (e.g., less multitasking).

H5. Higher information retention – VR learning will evoke deeper emotional connections than traditional methods, which significantly improves retention of the information learned.
[Note from us: The authors refer to a study by Chai et al. (2017). However, a) it’s not clear that VR actually evokes deeper ‘emotional connections’ and b) we can’t assume that any emotion (if emotions are triggered at all) improves information retention.]

H6. Confidence-building – VR learning will help build more confidence in the execution of learning outcomes than traditional learning methods. [Note from us: Confidence is also discussed in the ‘comfortable learning environment’ hypothesis. It’s not clear how H6 differs from H3.]

Time to put the study and its results through Gorard’s sieve.

Analysis of the PwC Study, Using Gorard’s Sieve

Stephen Gorard’s sieve, as depicted in the table below, gives an excellent overview of the levels of ‘design quality’ in research. What you need to know before we kick off the analysis is that the lowest rating (the ‘stars’ in the final column) determines the overall trustworthiness of a study. In Gorard’s (and our) thinking, a chain is only as strong as its weakest link. Even if, for example, a study describes the intervention clearly but there was a lot of dropout, or the reliability and/or validity is low, the study overall still has a low level of trustworthiness.
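To make that ‘weakest link’ rule concrete, here’s a minimal sketch (our own illustration, not something from Gorard’s paper) of how the overall rating follows from the category ratings. The category names and star values are simply the ones we arrive at in the analysis below.

```python
def overall_rating(ratings: dict[str, int]) -> int:
    """Overall trustworthiness is the lowest rating across all of the sieve's categories."""
    return min(ratings.values())

# Illustrative values: the ratings we give the PwC study later in this post.
pwc_ratings = {
    "Design": 0,
    "Scale": 0,
    "Dropout": 0,
    "Outcomes": 0,
    "Fidelity": 3,
    "Validity": 2,
}

print(overall_rating(pwc_ratings))  # prints 0: one weak link drags the whole study down
```

In other words, a 3-star ‘Fidelity’ or a 2-star ‘Validity’ can’t compensate for a 0-star ‘Design’; the minimum wins.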

We’ll work from left to right, so we start with ‘Design’. For each category, we provide a rating and then explain our rationale for giving that rating.

Design: 0

If you want to determine if your intervention actually makes a difference compared to other interventions, you need to make sure that the intervention is the ONLY differentiator. In this case, if the study wants to determine whether VR technology leads to significantly better outcomes than other training modalities (as listed in the hypotheses), then the VR technology should be the only thing that’s different. All other variables need to be the same.

In this case, the study compares three ‘modality groups’: one group completed classroom training, one an eLearning course, and one the VR experience. So far, so good. [As a side note, the report explains that participants were divided into three groups, but unfortunately it doesn’t explain how this was done (e.g., was the assignment randomised? blind?).]

However, when it comes to the design within each modality, the researchers made a fundamental mistake.

The classroom intervention and the eLearning intervention were very similar, but not 100% the same. For both, the researchers used an existing, linear design in which the learner goes through “a series of videos, reflection activities and discussion topics” (p 15). The difference is that the classroom intervention took 2 hours and the eLearning intervention only 45 minutes. This means that they’re not really comparable with respect to learning time / time on task.

But the key problem of the study lies with the VR intervention. The authors leveraged the existing content from the classroom and eLearning interventions (e.g., the scenarios – who to hire, who to staff, and who got the performance differentiator) but ‘employed it in a significantly more interactive simulation’ (p 16). So, instead of replicating the linear approach used in the classroom and eLearning interventions, the VR intervention was non-linear and more interactive. For example, the learner was one of the team members participating in the discussions and had to act and make choices based on situations and questions (which wasn’t the case in the other interventions). If the VR intervention proved to be better (or worse), was it the VR or the non-linear, interactive training method that was the reason? That’s the problem when intervention variables are confounded!

The authors argue that, if they had followed a linear approach similar to the classroom and eLearning interventions, “they would not leverage any advantages of the VR modality” (p 16). This is a fundamental mistake. It’s NEVER about the modality, it’s always about the design. As Clark and Feldon (2005) state: “There is no credible evidence of learning benefits from any medium or combination of media that cannot be explained by other, non-multimedia factors” (p 6).

If you want to compare interventions, the design needs to be exactly the same! Instead, the researchers should have designed non-linear experiences with interactive scenarios for ALL modalities, in which the learner would play the role of a team member and actively participate in the discussions on who to hire, who to staff, and who to award the performance differentiator. Only then could it have been established whether it was actually the VR technology making the difference. As it stands, the researchers have compared apples and oranges. Although we assume that this mistake was unintended, or the result of the researchers simply being unaware that they were confounding variables, we can’t give more than 0 stars as there is no fair design for comparison.

It’s hopefully clear that it’s easy to ‘earn’ more stars quickly, simply by being very careful to make sure that the intervention you’re trying to ‘test’ is truly the only differentiating variable.

Note: It’s understandable that commercial organisations target a wide audience and don’t want to end up with many pages of ‘dry’ content. However, it would be easy to add an optional appendix that provides the oh-so-needed transparency and clarity, while keeping the main report relatively short and concise for the wider audience.

Scale: 0

When it comes to participants, ‘the more the better’ usually holds in research. In this case, the report says that “the total potential study participants could be greater than 1600” (p 19), but unfortunately it doesn’t say anywhere what the actual number was. All we know is that they “identified our test group and divided them into three groups”. We’d love to give the researchers the benefit of the doubt and assume that they had scores or even hundreds of participants in each group, but the truth of the matter is that it might just as well have been five participants in each group. The only ‘clue’ we have is that the 2018 classroom version had 60 participants per classroom and that each year PwC promotes 700 senior associates to manager (these new managers were part of the intended test group), but that was for a real company training and not for an experiment. For this study, N is unclear and, thus, zero stars. It’s up to the authors to write up the study in such a way that it is possible for us as readers to determine its trustworthiness.

Dropout: 0

The authors should have described not only the initial number of participants, but also the number that actually completed the interventions (for all three groups) and the number that didn’t. How many dropped out before the study was completed? What were the reasons for dropout? Was the dropout evenly distributed across all three groups or was it skewed? Unfortunately, this information is also missing.

Outcomes: 0

There were six hypotheses, or expected outcomes:

H1 (satisfaction), H3 (comfortable learning environment), H4 (improved attention), and H6 (confidence) were all subjective, self-report measures (surveys). That a person says that (s)he was more confident or paid better attention doesn’t mean that this was actually the case. Self-report, especially with respect to learning, is notoriously unreliable.

H2 (learner flexibility) wasn’t tested, for several reasons, but these reasons were clearly outlined by the researchers.

H5 (information retention) is a tricky one. First of all, as a reader, you need to understand what the ‘inclusive leadership’ training objectives are in the first place. This is what the report says [bold type added by us]:

In 2018, PwC mandated that every manager and above take classroom training on inclusive leadership. This training, designed and built by PwC, focused on how familiarity, comfort and trust (FCT) influence hiring, staffing and performance reviews. During the training, learners are asked to understand personal and team member behaviour that could potentially be caused by unconscious bias. The goal of the training is for learners to commit to using only objective criteria in decision-making (p 14).

On page 15, it says that learners need to ‘identify undesirable behaviours and highlight which inclusive behaviours should have been employed instead’. It also says that ‘the goal was to… self-identify where they could better employ inclusive leadership behaviours’.

There was a pre-assessment, a post-assessment, and a retention assessment. According to the report, all evaluated the ability to make inclusive leadership decisions.

However, it’s not clear how the decision-making was assessed, and it’s also not clear what exactly the ‘inclusive behaviours’ are that the learner needs to make decisions about. As they say, the proof of the pudding is in the eating. The only true way to determine whether ‘more inclusive leadership decisions’ were made is to follow the participants after the training and evaluate their actual leadership decisions. Asking participants what they might do, asking them to judge whether the decisions in a set of cases were inclusive, or asking them what they would do in a particular situation doesn’t tell us whether they actually make more inclusive decisions! We don’t know how the learning was measured, but we think we can safely assume that the follow-up method wasn’t used. Although, again, the authors of the report didn’t intend to write an academic paper, understanding loud and clear what the objectives are and how they are assessed is of course critical. Without that, as readers, we have no way to determine whether the objectives have been achieved. The researchers themselves conclude that “retention scores were inconclusive” (p 44).

We must conclude that the measures were weak (not clearly described or subjective) and therefore we end up with zero stars.

Fidelity: 3

The interventions are clearly described but are problematic (see Design), because the way they’re designed automatically means that there’s variation in the delivery (after all, the researchers are comparing 3 different designs within 3 different modalities). In addition, it’s unclear what kind of instructions the classroom facilitators received to ensure as much ‘delivery consistency’ across modalities as possible. The researchers themselves write on page 46 that, for example, the success of a classroom experience depends heavily on the skills of the instructor. This is true, and in a corporate learning environment in particular, the model is often that subject-matter experts act as facilitators. However, when you’re running a classroom session for test purposes, you need to find a way to keep things as controlled and consistent as possible.

We give the researchers the benefit of the doubt here and assume that the variation in delivery was unintended. Therefore ‘Fidelity’ gets three stars.

Validity: 2

The two stars are based on the principle of ‘benefit of the doubt’. We assume that the researchers really intended to investigate whether VR can make a difference. However, the researchers clearly assumed that VR WILL be better (which suggests that they’re at least somewhat biased), and they collaborated with Oculus and Talespin, which is also problematic for obvious reasons; this conflict of interest hasn’t been disclosed in the study. We also have no idea whether the people who evaluated the outcomes knew which respondents were in which group. If the evaluators weren’t ‘blind’, then the ‘validity’ aspect of the study would receive zero stars.

Conclusion

First off, as this chain has a number of very weak links (several categories received zero stars), we can only conclude that the study cannot be considered very trustworthy; at least not trustworthy enough to buy the VR goggles and invest all of the time and money needed to develop the VR training. While this wasn’t an academic study, any research upon which decisions are meant to be based should follow a rigorous and fully transparent research process. Without that, the results can’t be trusted. It’s as simple as that.

It’s also the case that it’s not incredibly difficult to provide more transparency and clarity around the research process and the decisions made along the way (e.g., provide an additional appendix with all the ‘boring’ but necessary information either on paper or online).

To be completely clear: We’re not saying that the VR version wasn’t much better than the other two versions or that the training programme wasn’t great. Maybe it was incredibly effective and maybe companies should seriously consider investing in this type of training. All we are saying is that we can’t conclude anything based upon the published report. We’re also not saying that the research/researchers was/were unethical. We assume that the research was carried out in good faith and that the researchers were ethical, honest people. We wrote this blog to provide a detailed example of how you as a learning professional can and should analyse all research and research claims, independent of whether the research is academic or not. Based on what we found, we’d like to say two things.

  • To practitioners: Our analysis shows that those who responded so positively on LinkedIn might have been a bit too enthusiastic and didn’t look at the study through a critical lens. We’d like to urge all practitioners not to just blindly accept lovely-looking results[3]. Read with an open and critical eye, especially if the research is carried out or paid for by the companies who also stand to profit from possibly positive results.
  • To (commercial) researchers: Assuming that you’re really interested in studying learning phenomena to either provide better learning experiences or move the profession forward, find ways to become more neutral/independent and provide 100% clarity and transparency. We know from experience how difficult it is to fully control a study in a commercial and/or organisational environment, but you need to give insight into all the ins and outs for others – practitioners in particular – to be able to critically evaluate your study and its results.

References

Tyng, C. M., Amin, H. U., Saad, M., & Malik, A. S. (2017). The influences of emotion on learning and memory. Frontiers in Psychology, 8, 1454. https://doi.org/10.3389/fpsyg.2017.01454

Clark, R., & Feldon, D. (2005). Five common but questionable principles of multimedia learning. In The Cambridge Handbook of Multimedia Learning (pp. 97-115). Retrieved from https://www.researchgate.net/publication/281023153_Five_common_but_questionable_principles_of_multimedia_Learning

Gorard, S. (2014). A proposal for judging the trustworthiness of research findings. Radical Statistics, 110, 47-60. http://www.radstats.org.uk/no110/Gorard110.pdf

PwC (2020). The Effectiveness of Virtual Reality Soft Skills Training in the Enterprise: A Study. Retrieved from https://www.pwc.com/us/en/services/consulting/technology/emerging-technology/assets/pwc-understanding-the-effectiveness-of-soft-skills-training-in-the-enterprise-a-study.pdf

[1] Something that sounds plausible and therefore people prefer to believe it and to hold on to it, without taking facts, logic, or any contradictory evidence into consideration.
The Merriam Webster dictionary defines truthiness as a “seemingly truthful quality that is claimed for something not because of supporting facts or evidence but because of a feeling that it is true or a desire for it to be true”.

[2] ‘Traditional training methods’ are defined as classroom or any non-VR digital experience (e-learning).

[3] Note that we haven’t discussed the results because based on our analysis we can only conclude that the claimed results need to be interpreted with huge caution.
