The abuses and perverse effects of quantitative evaluation in the academy

The world of academic research is scored according to so-called “objective” measures, with an emphasis on publications and citations. But the very foundations of this approach are flawed. Is it time to abandon these simplistic ranking schemes?

Since the 1990s, when the neoliberal ideology of the “new public management” introduced rankings into academia, researchers and administrators have become increasingly familiar with the terms “evaluation,” “impact factor,” and “h index.” Over the same period, the worlds of research and higher education have fallen prey to a dangerous evaluation fever. It seems that we want to assess everything: teachers, faculty, researchers, training programs, and universities. “Excellence” and “quality” indicators have proliferated without anyone really understanding precisely what these terms mean or how they are determined.

Bibliometrics, a research method that treats scientific publications and their citations as indicators of scientific production and its uses, is one of the primary tools informing the many “excellence indicators” that this administrative vision of higher education and research is attempting to impose on everyone. Whether the object being ranked is a university, a laboratory, or a researcher, counting publications and the citations they receive often serves as an “objective” measure of research quality.

It is therefore important to understand the many dangers associated with the growing use of oversimplified bibliometric indicators, which are supposed to objectively measure researchers’ productivity and scientific impact. This paper focuses on analyzing two key indicators used extensively by both researchers and research administrators. It also examines the perverse effects that the oversimplified use of bad indicators has upon the dynamics of scientific research, specifically in the areas of social and human sciences.

The impact factor: Corrupting intellectual output

A journal’s impact factor (IF) is a simple, mathematical average of the number of citations received in a given year (e.g., 2016) for articles published by a journal during the previous two years (in this case, 2014 and 2015). The IF has been calculated and published every year since 1975 in the Web of Science Journal Citation Reports. As early as the mid-1990s, experts in bibliometrics were drawing attention to the absurdity of confusing articles and journals. However, this did not stop decision-makers—who themselves are supposedly rational researchers—from using a journal’s IF to assess researchers and establish financial bonuses based directly on the numerical value of the IF.
For example, as the journal Nature reported in 2006, the Pakistan Ministry of Science and Technology calculates the total IF of articles over a year to help it establish bonuses ranging between $1,000 and $20,000. The Beijing Institute of Biophysics established a similar system: An IF of 3 to 5 brings in 2,000 yuan ($375) per point and an IF above 10 brings in 7,000 yuan ($1,400) per point.
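
To make the definition of the impact factor concrete, here is a minimal sketch in Python. The function name `impact_factor` and all the counts are invented for illustration; the arithmetic follows the two-year definition given above, that is, citations received in a given year by articles from the two previous years, divided by the number of articles published in those two years. The second part uses equally invented numbers to suggest why an average computed for a journal says little about any single article in it.

```python
from statistics import median


def impact_factor(citations_by_pub_year, articles_by_pub_year, year):
    """Two-year impact factor for `year`.

    citations_by_pub_year: citations received during `year`, keyed by the
        publication year of the cited articles.
    articles_by_pub_year: number of articles the journal published, keyed
        by publication year.
    """
    window = (year - 1, year - 2)
    citations = sum(citations_by_pub_year.get(y, 0) for y in window)
    articles = sum(articles_by_pub_year.get(y, 0) for y in window)
    if articles == 0:
        raise ValueError("no articles published in the two-year window")
    return citations / articles


# Hypothetical journal: citations received in 2016 by its 2014 and 2015 articles.
citations_in_2016 = {2014: 300, 2015: 200}
articles_published = {2014: 120, 2015: 80}
journal_if = impact_factor(citations_in_2016, articles_published, 2016)
print(journal_if)  # 2.5

# Hypothetical citation counts for five articles in one journal:
# a single heavily cited article dominates the average.
per_article = [95, 3, 1, 1, 0]
mean_citations = sum(per_article) / len(per_article)
print(mean_citations, median(per_article))  # 20.0 1
```

In this toy case the average is 20 citations per article while the typical article has one, which is the sense in which a journal-level figure cannot stand in for an assessment of an individual article.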

However, in an editorial in the same issue, Nature criticized this system, noting that it is impossible for a mathematical journal to score an IF value as high as a biomedical research journal due to the substantially larger number of potential citers in the biomedical sciences. No sensible person believes that biomedical articles are superior to math articles, nor can they believe that this scoring system justifies granting one group of authors a larger bonus than another group. And, in another more recent (and ugly) example of the kind of intellectual corruption generated by taking the ranking race seriously, universities have contacted cited researchers who are working for other institutions and offered these researchers compensation for including the university as an affiliated body in the individual’s next article.1 These fictitious affiliations, without real teaching or research duties, allow marginal institutions to enhance their position in university rankings without having to maintain real laboratories.

These extreme cases should be enough to warn university managers and their communications departments away from the use or promotion of such inaccurate rankings. In short, it is important to scrutinize the ranking system’s “black box,” rather than accepting its results without question.

The exploitation of these false rankings and indicators to promote institutional and individual achievement is a behaviour that reveals an ignorance of the system’s flaws. Only the institutions that benefit from association with these rankings, researchers who profit from incorrectly computed bonuses based on invalid indicators, and journals that benefit from the evaluative use of impact factors, can believe—or feign to believe—that such a system is fair, ethical, and rational.

The h index epidemic

In the mid-2000s, when scientific communities started devising bibliometric indices to make individual evaluations more objective, American physicist Jorge E. Hirsch, from the University of California, San Diego, came up with a proposition: the h index. This index is defined as the largest number N such that N of a researcher’s articles have each received at least N citations since their publication. For example, if an author has published 20 articles, 10 of which were cited at least 10 times each and the rest fewer than 10 times, the author will have an h index of 10. It is now common to see researchers cite their h index on their Facebook pages or in their curricula vitae.
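
The definition can be sketched in a few lines of Python. The function name `h_index` and the citation counts are assumptions made for illustration; the counts are chosen to match the 20-article example in the text.

```python
def h_index(citations):
    """Largest N such that N articles have at least N citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` articles all have at least `rank` citations
        else:
            break
    return h


# Invented record: 20 articles, 10 of them cited at least 10 times each.
author = [25, 18, 15, 14, 12, 12, 11, 10, 10, 10,
          9, 7, 5, 4, 3, 2, 1, 1, 0, 0]
print(len(author))      # 20
print(h_index(author))  # 10
```

Note that the index stays at 10 however many citations the top articles accumulate: replacing the 25 above with 2,500 changes nothing.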

The problematic nature of the h index is reflected in the very title of Hirsch’s article, published in a journal that is usually considered prestigious, the Proceedings of the National Academy of Sciences of the United States of America: “An index to quantify an individual’s scientific research output.” In fact, this index is neither a measure of quantity (output) nor a measure of quality or impact: It is a combination of both. It arbitrarily combines the number of articles published and the number of citations received. In the eyes of its creator, this index was meant to counter the use of the total number of articles published, a metric that does not take their quality into account. The problem is that the h index is itself strongly correlated with the total number of articles published, and is therefore redundant.

Furthermore, the h index has none of the basic properties of a good indicator. As Waltman and van Eck demonstrated, the h index is incoherent in the way it ranks researchers whose number of citations increases proportionally, and it therefore “cannot be considered an appropriate indicator of a scientist’s overall scientific impact.”2

This poorly constructed index also causes harm when it is used as an aid in the decision-making process. Let us compare two scenarios: A young researcher has published five articles, which were cited 60 times each (for a given period); a second researcher, of the same age, is twice as prolific and wrote 10 articles, which were cited 11 times each. The second researcher has an h index of 10, while the first researcher only has an h index of 5. Should we conclude that the second researcher is twice as “good” as the first one and should therefore be hired or promoted ahead of the first researcher? Of course not, because the h index does not really measure the relative quality of two researchers and is therefore not a technically valid indicator.
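
The two scenarios can be checked with a short Python sketch. The helper `h_index` is a hypothetical name; the publication records are exactly those described in the paragraph above, and the extra columns (total and mean citations) are added only to show what the single number leaves out.

```python
def h_index(citations):
    """Count the ranks at which an article has at least `rank` citations."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


first = [60] * 5    # five articles, cited 60 times each
second = [11] * 10  # ten articles, cited 11 times each

for name, record in (("first", first), ("second", second)):
    total = sum(record)
    print(name, h_index(record), total, total / len(record))
# first 5 300 60.0
# second 10 110 11.0
```

The researcher with the lower h index has almost three times as many citations in total and more than five times as many per article, so the ranking produced by the index is reversed by either of these other counts.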

Despite these fundamental technical flaws, use of the h index has become widespread in many scientific disciplines. It seems as though it was created primarily to satisfy the ego of some researchers. Let us not forget that its rapid dissemination has been facilitated by the fact that it is calculated automatically within journal databases, making it quite easy to obtain. It is unfortunate to see scientists, who have presumably studied mathematics, lose all critical sense when presented with this flawed and oversimplified number. It confirms an old English saying, “Any number beats no number.” In other words, it is better to have an incorrect number than no number at all.

A multidimensional universe

What is most frustrating in the debates around research evaluation is the tendency to try to summarize complex results with a single number. The oversimplification of such an approach becomes obvious when one realizes that it means transforming a space with many dimensions into a one-dimensional space, thus realizing Herbert Marcuse’s prediction of the advent of a One-Dimensional Man. In fact, by combining various weighted indicators to get a single number, we lose the information on each axis (indicators) within the multidimensional space. Everything is reduced to a single dimension.
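
The loss of information can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the three indicators, their weights, the two institutions, and their scores are invented, and no actual ranking is being reproduced. The point is only that a weighted sum maps very different profiles onto the same number.

```python
# Hypothetical weights, in percentage points, for three invented indicators.
WEIGHTS = {"publications": 50, "citations": 30, "teaching": 20}


def composite_score(scores, weights=WEIGHTS):
    """Weighted average of the indicator scores (a single number)."""
    weighted = sum(weights[name] * scores[name] for name in weights)
    return weighted / sum(weights.values())


# Two invented institutions with clearly different profiles (scores out of 100).
university_a = {"publications": 90, "citations": 50, "teaching": 50}
university_b = {"publications": 50, "citations": 90, "teaching": 90}

print(composite_score(university_a))  # 70.0
print(composite_score(university_b))  # 70.0
```

Both institutions receive the same composite score, and nothing in that score allows one to recover which of them is stronger on which axis. Changing the weights, which are themselves arbitrary, would separate the two again in either direction.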

Only by considering the many different initial indicators individually can we determine the dimensions of concepts such as research quality and impact. While postsecondary institutions and researchers are primarily interested in the academic and scientific impact of these publications, we should not ignore other impacts for which valid indicators are easily accessible. Think of the economic, societal, cultural, environmental, and political impacts of scientific research, for example.

In the case of universities, research is not the only mission and the quality of education cannot be measured solely by bibliometric indicators that ignore the environment in which students live and study, including the quality of buildings, library resources, or students’ demographic backgrounds. For these dimensions to emerge, we must avoid the “lamppost syndrome,” which leads us to only look for our keys in brightly lit places rather than in the specific (but dark) places where they are actually to be found. It is therefore necessary to go beyond readily accessible indicators and to conduct case studies that assess the impacts for each of the major indicators. It is a costly and time-consuming qualitative operation, but it is essential for measuring the many impacts that research can have.

The simplistic nature of rankings culminates in annual attempts to identify the world’s “best” universities, as if the massive inertia of a university could change significantly every year! This in itself should suffice to show that the only aim of these rankings is to sell the journals that print them.

Quantifying as a way to control

The heated arguments around the use of bibliometric indicators for assessing individual researchers often neglect a fundamental aspect of this kind of evaluation: the role of peers in the evaluation process. Peer review is a very old and dependable system that requires reviewers to have first-hand knowledge of the assessed researcher’s field of study. However, in an attempt to assert more control over the evaluation process, some managers in universities and granting agencies are pushing forward with a new concept of “expert review” in which an individual, often from outside the field of research being considered, is responsible for evaluating its merits. A standardized quantitative evaluation, such as the h index, makes this shift easier by providing supposedly objective data that can be used by anyone. It is in this context that we need to understand the creation of journal rankings as a means to facilitate, if not to mechanize, the evaluation of individuals. This constitutes a de facto Taylorization of the evaluation process: the use of a scientific method to de-specialize the expertise needed for evaluation.

Thus surfaces a paradox. The evaluation of a researcher requires appointing a committee of peers who know the researcher’s field very well. These experts would already be familiar with the best journals in their field and do not need a list concocted by some unknown group of experts ranking them according to different criteria. On the other hand, these rankings allow people who don’t know anything about a field to pretend to make an expert judgment just by looking at a ranked list without having to read a single paper. These individuals simply do not belong on an evaluation committee. Therefore, the proliferation of poorly built indicators serves the process of bypassing peer review, which does consider productivity indices but interprets them within the specific context of the researcher being evaluated. That some researchers contribute to the implementation of these rankings and the use of invalid indicators does not change the fact that these methods minimize the role of the qualitative evaluation of research by replacing it with flawed mechanical evaluations.

Pseudo-internationalization and the decline of local research

A seldom-discussed aspect of the importance given to impact factors and journal rankings is that they indirectly divert researchers from the study of local, marginal, or less popular topics. This is particularly risky in the human and social sciences, in which research topics are, by nature, more local than those of the natural sciences (there are no “Canadian” electrons). Needless to say, some topics are less “exportable” than others.

Since the most frequently cited journals are in English, the likelihood of being published in them depends on the interest these journals have in the topics being studied. A researcher who wants to publish in the most visible journals would be well advised to study the United States’ economy rather than the Bank of Canada’s uniqueness or Quebec’s regional economy, topics that are of little interest to an American journal. Sociologists whose topic is international or who put forward more general theories are more likely to have their articles exported than those who propose an empirical analysis of their own society. If you want to study Northern Ontario’s economy, for example, you are likely to encounter difficulty “internationalizing” your findings.

Yet is it really less important to reflect on this topic than it is to study the variations of the New York Stock Exchange? As a result, there is a real risk that local but sociologically important topics lose their value and become neglected if citation indicators are mechanically used without taking into account the social interest of research topics in the human and social sciences.

Conclusion: Numbers cannot replace judgement

It is often said, without supporting arguments, that rankings are unavoidable and that we therefore have to live with them. This is, I believe, false: through resistance, researchers can bring such ill-advised schemes to a halt. For example, in Australia, researchers’ fierce reaction to journal rankings has succeeded in compelling the government to abandon the use of this simplistic approach to research evaluation.

In summary, the world of research does not have to yield to requirements that have no scientific value and that run against academic values. Indeed, French-language journals and local research topics that play an invaluable role in helping us better understand our society have often been the hardest hit by these ill-advised evaluation methods, and so fighting back against this corruption is becoming more important every day.

Yves Gingras is a Professor in the History Department and Canada Research Chair in History and Sociology of Science at the Université du Québec à Montréal.

This article is a translation of a revised and shorter version of the essay « Dérives et effets pervers de l’évaluation quantitative de la recherche : sur les mauvais usages de la bibliométrie », in Revue internationale PME 28, no. 2 (2015): 7-14. For a more in-depth analysis, see: Yves Gingras, Bibliometrics and Research Evaluation: Uses and Abuses, Cambridge: MIT Press.

1. Yves Gingras, “How to boost your university up the rankings,” University World News 329 (July 18, 2014). See also the many responses in Science 335 (March 2, 2012): 1040-1042.
2. L Waltman and NJ van Eck, “The inconsistency of the h-index,” 2011,