Page 19 of the Strategy Evaluation Protocol 2021-2027 identifies benchmarking as a potential way to generate robust data: ‘Other sources of robust data may include benchmarking against peer research units […]’. In other words: in addition to the indicators and case studies selected by the unit, benchmarking can be used to support the narrative arguments in the self-evaluation.
What exactly is meant by benchmarking is left unclear, but the suggestion is that some hybrid of quantitative indicators and benchmarking can provide effective substantiation.
This, however, is very problematic. To start with, a technique is being recommended without any explanation of how to operationalise it. The problem becomes even more entrenched when we start from the common definition of benchmarking: comparing business processes and performance statistics against industry best practices and those of other organisations, typically along the dimensions of quality, time and cost. A performance comparison of this kind, based on purely quantitative criteria (time, cost, yield), is at odds with the spirit and purpose of the SEP.
But that is not the only problem.
The fact that the SEP fails to clarify what exactly is meant by benchmarking leads to unclear, problematic practices, such as making comparisons (i.e., benchmarking) based on quantitative indicators, on the assumption that it is relatively straightforward to compare two or more units in this way. That assumption is too convenient. Indeed, the problems and objections associated with this kind of quantitative benchmarking are so weighty that it is inadvisable on both substantive and ethical grounds.
The first problem stems from the fact that there are no clear criteria for benchmarking in the context of research evaluations, and thus no guidelines on what constitutes good material for comparison. This also applies to the selection of units for comparison: there is no clarity on what such a choice should be based on or which criteria a comparison unit should meet, raising the prospect of chance, arbitrariness and opportunism. To put it cynically: an unambitious selection of benchmark units will produce a good outcome in the evaluation process, but not necessarily the best outcome in the longer term. Conversely, an overly ambitious choice of benchmark units may cause a unit to come out of the comparison badly when that need not have been the case. In short, the selection of units for comparison is an extremely delicate issue.
The next problem that arises when operationalising benchmarking is the asymmetry in how the underlying comparison data are collected. For your own research unit, you will usually have access to carefully collected material, typically drawn from a local research information system (such as Pure, Metis or Converis), but you will rarely have access to comparable information for the units that feature in the benchmark. For those units you are reliant on material collected in other ways, which often yields far less accurate results. This seriously undermines the validity of the final comparison, something of which people are often unaware.
Another problem is that a comparison based on quantitative indicators assumes that these indicators represent the units in a similar way. An example from bibliometrics illustrates why this assumption breaks down. Normalisation plays a major role in bibliometric studies: because publication and reference cultures differ across disciplines, bibliometricians devised field normalisation to compensate for these differences in citation behaviour. Its main purpose is to equalise differences in citation counts, so that one can compare, say, a cardiologist with an oncologist. So far, so good. However, when diverse disciplines or even entire universities are analysed in a single study, field normalisation no longer suffices. One could still compare field-normalised citation counts, but such a comparison takes no account of the underlying differences in publication culture.

In physics, for example, between 80% and 85% of all publications may appear in international journals, giving a pretty good picture of that unit’s output, and the same may well hold for some domains in the humanities. For historians, however, the percentages might be very different: say around 30% of the unit’s publications appear in international journals, while the other 70% consist of books, chapters in edited volumes, and publications in languages other than English that do not appear in systems such as Scopus and Web of Science. These are only examples, but academia is full of such differences in publication culture; think of publication channels such as conference proceedings in engineering and the information sciences.
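To make the mechanics concrete, one common operationalisation of field normalisation is the mean normalised citation score (MNCS), in which each publication’s citation count is divided by the expected citation count for publications of the same field, publication year and document type; the figures in the worked example below are purely illustrative.

\[
\mathrm{MNCS} \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{c_i}{e_i},
\]

where \(c_i\) is the number of citations received by publication \(i\) and \(e_i\) is the average number of citations of publications in the same field, year and document type.

On this scale, a physics article with 20 citations in a field averaging 10 and a history article with 4 citations in a field averaging 2 both score 2.0 and appear perfectly comparable; yet the historian’s score is computed over only the roughly 30% of output that the citation index covers at all, so the apparent equivalence conceals exactly the coverage differences described above.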
A final point concerns the ethics of this process. All being well, the data on one’s own unit have been checked and validated for use in the evaluation. The same cannot be said of the material used for the benchmark units, which may have been collected in a different way or for very different purposes. It is also questionable whether these data should be used without the knowledge of the units concerned and without the necessary quality checks. In other words, one can question the ethics of the entire process, certainly in the light of the fact that the results of the evaluation are meant to be made public, something that could unfairly damage the reputation and image of the benchmark units.