Despite being emulated in a growing number of similar exercises around the world, the UK’s Research Excellence Framework (REF) has been widely criticised at home. Research funded by Research England itself showed that the majority (54%) of researchers harbour negative attitudes towards the exercise, and a recent survey by the Wellcome Trust showed that 54% of researchers felt pressured to meet REF and funding targets and 75% felt their creativity was stifled by the impact agenda.
In response, four UK funding bodies have come together to explore alternative approaches to the assessment of UK higher education research performance. The programme of work, called the Future Research Assessment Programme (FRAP), reached a major milestone this week with the publication of “Harnessing the Metric Tide”. Among its ten recommendations is the suggestion that peer review of case studies be retained for the evaluation of impact, to ensure that the breadth and richness of impacts are captured by the exercise. Here’s an excerpt from the report (my emphasis):
Recommendation 5: Avoid all-metric approaches to REF
Whatever the ultimate conclusions of the FRAP process on REF purposes and design, it is unlikely that an all-metric approach will deliver what the research community, government and stakeholders need from the exercise. This pertains with particular force to the assessment of research impacts, where (despite some developments with potential) available indicators or infrastructure cannot approximate the richness of the current case study format.
The case study peer review model clearly has a lot in its favour, but some researchers point to the potential biases of this approach, arguing that a well-crafted case study scores better than a poorly written one describing equal impact (e.g. Watermeyer 2019; Gow and Redwood 2020; McKenna 2021). However, for presentation to compromise scoring, there needs to be a difference in presentation between scoring brackets in the first place. Differences between high- and low-scoring impact case studies might be expected in three areas:
content – different impact, significance and reach;
language – for example, grammar or word choice;
other aspects of presentation – from the emphasis on different types of content down to formatting.
The first of these differences is what should be assessed, so we expect to find differences in content of this kind. Differences in language, or in the way that narratives are crafted, should have no influence on the assessment; if substantial differences exist here, those arguing that presentation took precedence over content may have a point.
A superficial reading of our comparative study of high-scoring and low-scoring case studies from REF2014 could lead to this conclusion. In the paper, my colleagues and I emphasised the characteristics of high-scoring case studies in order to level the playing field and avoid “false negatives” in the 2021 REF where far-reaching and significant impact would not be given the credit it deserved because it was not made visible or contextualised appropriately. Overall, though, the different types of analysis in my PhD research indicate that the measurable difference between high- and low-scoring case studies does not suggest a substantial influence of presentation on scoring in REF2014. There were of course differences in content between high- and low-scoring case studies, and it is important to distinguish between these factors, which were likely to have an effect on scores, and factors related to presentation, which were less likely to influence scores.
1. There are differences in content between high- and low-scoring case studies.
High-scoring case studies describe impact, but not all low-scoring ones do.
If your text does not include details of impact (even if impact exists), readers cannot see it.
84% of high-scoring cases articulated benefits to specific groups and provided evidence of their significance and reach, compared to 32% of low-scoring cases, which typically focused instead on the pathway to impact, for example describing dissemination of research findings and engagement with stakeholders and publics without citing the benefits arising from that dissemination or engagement. This finding is based on a collaborative thematic analysis of 85 high- and 90 low-scoring case studies across Main Panels. For a sub-sample of 76 case studies, I classified all the material in Section 1 (“Summary of the impact”) as relating to research, impact, or pathway. Most texts included material on all three, but seven of the low-scoring case studies did not include any impact claims in the summary. Both findings could reflect a problem of presentation, if the impact was there but was not articulated, or a problem of content, if the impact (by REF definitions) did not exist. While it is of course impossible to conjure up additional impact content, hopefully our work has helped writers across UK universities to avoid this becoming a problem of presentation for existing impact in 2021.
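For readers who want a feel for how decisive a gap like 84% vs 32% is, a standard two-proportion z-test will do. The sketch below is mine, not part of the study, and the counts (roughly 71 of 85 high-scoring, 29 of 90 low-scoring) are back-calculated from the percentages and sample sizes above, so treat them as illustrative only.

```python
from statistics import NormalDist

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)            # pooled proportion under H0
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative counts back-calculated from the reported percentages:
# ~71 of 85 high-scoring vs ~29 of 90 low-scoring case studies.
z, p = two_proportion_z(71, 85, 29, 90)
```

With these counts the test gives z of roughly 6.9, far beyond any conventional significance threshold, which is consistent with treating this as a genuine difference in content rather than noise.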
2. There is a marginal difference in readability.
High-scoring case studies are easier to read, but not by much.
Focus on clarity and explicit connection, but don’t stress over readability scores.
High-scoring case studies were easier to read based on common readability measures, but only very marginally. Compared with both general English texts and research articles, the mean readability scores of high- and low-scoring case studies (n = 124 and 93 respectively) are very close to each other. Moreover, the tool that we used gives eight different measures of cohesion, and six of these do not differ significantly between high- and low-scoring case studies. The two where a difference can be found are causal links (e.g. “because”, “so”) and logical connectivity (e.g. “and”, “but”), with high-scoring case studies better connected. The difference is statistically significant but with only a moderate effect size. Set against the differences there could have been, the similarities are far stronger.
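“Significant but with a moderate effect size” usually refers to a standardised measure such as Cohen’s d, which scales the difference in means by the pooled standard deviation. Here is a minimal sketch of that calculation; the connectivity scores are invented for illustration and are not the study’s data.

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Invented logical-connectivity scores, for illustration only.
high = [0.62, 0.51, 0.70, 0.57, 0.60]
low = [0.58, 0.49, 0.66, 0.52, 0.61]
d = cohens_d(high, low)  # ≈ 0.4: small-to-moderate by Cohen's rule of thumb
```

The point of the effect size is exactly the one made above: a difference can clear the significance bar while still being modest in practical terms.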
3. There is a marginal difference in evaluative language.
High-scoring case studies are more specific, but on most measures there is no general difference in word choice.
Give specific detail or context, as fits the impact you describe.
More specifically, I looked at the type of evaluative language in Section 1 using the Appraisal framework. I tagged the 76 texts in the sample for 47 different features and found measurable differences in only five of them; the other 42 features were used fairly evenly across scoring brackets. Again, set against the differences there could have been, the measurable differences are probably not enough to have influenced scoring. Moreover, the features where a difference was statistically significant related either to content (where the writer has no influence) or to the level of specificity, with high-scoring case studies using more specific details about location or timelines than low-scoring ones. This may have cost some high-impact case studies a better score, but is unlikely to have inflated the scores of less-deserving impacts.
The differences in language between high- and low-scoring case studies in REF2014 can mostly be explained by the fact that the genre was new to everyone, without the database of example texts that was available in 2021. These differences were fairly straightforward to bring into the open, and many can hopefully be avoided in later iterations. Other differences are symptoms of the content they represent; for example, if a case study reports on research and dissemination activities but not impact, the lack of assessable impact is also reflected in the lack of impact-related material in the text. In this case, a low score is fair (within the REF framework) and not based on assessor bias.
Overall, compared to the number of indicators where there could have been differences, the statistically significant differences between high- and low-scoring case studies are not enough to assume that language choices had an undue influence on scores. This assumption is usually based on self-reports by assessors, either directly (e.g. McKenna 2021) or based on observations or interview data (e.g. Manville 2015; Watermeyer and Hedgecoe 2016). I hope that my analysis of textual data adds further evidence to this picture and allows decision makers to have more confidence in the integrity of the exercise.
Thanks to Mark Reed for extremely helpful suggestions for writing this post.
As this is my PhD research, there are many more details that I could add. If you would like to know more about any of the aspects introduced above, please contact me.
Gow, J and Redwood, H (2020), Impact in International Affairs: The quest for world-leading research (London: Routledge).
Manville, C, et al. (2015), 'Assessing impact submissions for REF2014: An evaluation' (Santa Monica: RAND).
McKenna, HP (2021), Research Impact: Guidance on advancement, achievement and assessment (Cham: Springer).
Watermeyer, R (2019), Competitive Accountability in Academic Life: The Struggle for Social Impact and Public Legitimacy (Cheltenham: Edward Elgar).
Watermeyer, R and Hedgecoe, A (2016), 'Selling 'impact': peer reviewer projections of what is needed and what counts in REF impact case studies. A retrospective analysis', Journal of Education Policy, 31 (5), 651-65.