
Monday 21 May 2018

Evaluating the effectiveness of evaluations

Peer review is a waste of everyone's time, study finds

That's the misleading clickbaity headline I'd give to this mostly dry, technical, very detailed paper on whether reviewers are able to evaluate proposals accurately or not. Alternatively :

Yo Dawg, We Herd You Like Reviews, So We Reviewed Your Review Of Reviews So You Can Review While You Review

Aaaannnyway.....

This paper looks at the review process used by the European Southern Observatory to allocate telescope time. There's some good stuff in here, but my goodness it's buried in a lot of (I think unnecessarily complex) statistics. They investigate what they call the "True Grade Hypothesis" (TGH), which is something familiarly Platonic :

For any given proposal a true grade does exist. The true grade can be determined as the average of the grades given by a very large number of referees.

Which is basically what everyone is assuming throughout the whole process.
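To make that concrete, here's a minimal toy version of the TGH in Python (my own sketch, not from the paper; the grade scale, noise level and referee numbers are all invented). Each referee reports the true grade plus some random scatter, and the average over enough referees converges on the true value :

import numpy as np

rng = np.random.default_rng(42)

true_grade = 2.3        # hypothetical "true" grade of one proposal, arbitrary scale
referee_scatter = 1.0   # assumed per-referee noise, not a number from the paper

for n_referees in (3, 10, 100, 10000):
    # Each referee's grade is the true grade plus independent random scatter.
    grades = true_grade + rng.normal(0.0, referee_scatter, size=n_referees)
    print(f"{n_referees:6d} referees: mean grade = {grades.mean():.3f}")

The question the paper is really asking is how quickly, if at all, real panels approach that limit.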

The first part of the hypothesis is obviously debatable, as it implicitly assumes that an absolute and infinitely objective scientific value can be attached to a given science case. It does not take into account, for instance, that a proposal is to be considered within a certain context and cannot be judged in a vacuum. Most likely, a proposal requesting time to measure the positions of stars during a total solar eclipse would have been ranked very highly in 1916, but nowadays it would probably score rather poorly. The science case is still very valuable in absolute terms, but is deprived of almost any interest in the context of modern physics.

The second part of the hypothesis is also subject to criticism, because it implicitly assumes that referees behave like objective measurement instruments. This is most likely not the case. For instance, although the referees are proactively instructed to focus their judgement [no-one will ever convince me you can spell it "judgment", that is plainly ludicrous] on the mere scientific merits, it is unavoidable that they (consciously or unconsciously) take into account (or are influenced by) other aspects. Among these are the previous history of the proposing team, its productivity, its public visibility, the inclusion of certain individuals, personal preferences for specific topics, and so on.

What they find is that the TGH does seem to work, but the scatter is very high. Even for the supposedly best and worst proposals, the refereeing process is only a bit more consistent than random, and for the average proposals it isn't really any better at all, at least in terms of grading; it does do quite a lot better in terms of ranking.
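One plausible toy explanation for why ranking holds up better than grading (this is my own illustration, not the paper's analysis, and all the numbers are invented) is that each referee carries a personal calibration offset : a harsh referee and a generous referee will disagree substantially on the absolute grades while still putting the proposals in much the same order.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_proposals = 50

true_grades = rng.normal(3.0, 0.7, n_proposals)   # hypothetical true grades, arbitrary scale

def referee(offset, noise=0.3):
    # One referee: true grade + personal calibration offset + random scatter.
    return true_grades + offset + rng.normal(0.0, noise, n_proposals)

ref_a = referee(offset=-0.5)   # a generous referee
ref_b = referee(offset=+0.5)   # a harsh one

print("mean absolute grade difference:", np.mean(np.abs(ref_a - ref_b)))
print("rank correlation (Spearman):   ", spearmanr(ref_a, ref_b)[0])

In this toy setup the two referees typically differ by about a full grade, yet their rankings correlate strongly, because the calibration offsets cancel out as soon as you only care about the order.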

Even so, their results agree with previous findings that the success of an observing proposal is about half down to the quality of the proposal and half due to luck. However, if you add more referees, results do converge. There might be an optimum number of referees to balance the benefits against the extra resources needed.
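The convergence is basically the familiar square-root law : if each referee adds roughly the same amount of independent noise, the noise on the panel average goes down as one over the square root of the number of referees, so doubling a small panel helps a lot while doubling a big one barely matters, which is presumably why an optimum exists. A quick sketch (again my numbers, not the paper's) :

import numpy as np

rng = np.random.default_rng(7)
referee_scatter = 1.0   # assumed scatter of a single referee's grade around the true grade

for n in (1, 3, 6, 12, 24, 48):
    # Simulate many panels of n referees and measure the spread of the panel-average grade.
    panel_means = rng.normal(0.0, referee_scatter, size=(100000, n)).mean(axis=1)
    print(f"{n:3d} referees: scatter of panel mean = {panel_means.std():.3f}"
          f"  (theory: {referee_scatter / np.sqrt(n):.3f})")

Going from three referees to six shrinks the scatter by about 30 per cent; going from 24 to 48 buys the same fractional gain for eight times as much extra reviewing.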

Of course a major caveat here is that observers write proposals knowing that they will be reviewed. I'd bet any sum of money you like that if you were to peer review proposals whose authors had been told they weren't being peer reviewed, the peer review system would perform twenty bajillion times better than chance alone.

This is the first paper of two, focusing on the review stage in which referees individually analyse their assigned proposals. The second paper will look at the effects of the committee meeting, where each reviewer explains their reasoning to the others, on the final ranking.


https://arxiv.org/abs/1805.06981
