Sister blog of Physicists of the Caribbean in which I babble about non-astronomy stuff, because everyone needs a hobby

Tuesday 20 November 2018

I'm cautiously skeptical of a "replication crisis"

In astronomy we often have to do repeat observations of potential detections to confirm they're real. A good confirmation rate is about 50%. Much less than this and we'd be wasting telescope time, and we'd start to worry that some of the sources we thought were real might not be so secure. Conversely, a much higher fraction would also be a waste of time, and would imply that we hadn't been as careful in our search as we thought - there'd still be other interesting things hidden in the data that we hadn't seen.

I suggest that this is also true to some extent in psychology. There seems to be a science-wide call for more risky, controversial research. Well, risky, controversial research requires a certain failure rate : if every finding was replicated, that would suggest the research wasn't risky enough; if none of them were, that would imply lousy research practices. The actual replication rate turns out to be, by happy coincidence, about 50%.

But likewise, in astronomy we don't write a paper in which we consider sources we haven't confirmed yet (or at least it's a very bad idea to do so). We wait until we've got those repeat observations before drawing any conclusions. Risky, preliminary pilot studies ought to have a failure rate by definition, otherwise they wouldn't be risky at all. The big "end-result" studies, on the other hand - the ones that are actually used to draw secure conclusions and, in the case of psychology, to influence social policy - are the ones whose basic results you'd want to be on a secure footing.

The Many Labs 2 project was specifically designed to address these criticisms. With 15,305 participants in total, the new experiments had, on average, 60 times as many volunteers as the studies they were attempting to replicate. The researchers involved worked with the scientists behind the original studies to vet and check every detail of the experiments beforehand. And they repeated those experiments many times over, with volunteers from 36 different countries, to see if the studies would replicate in some cultures and contexts but not others.

Despite the large sample sizes and the blessings of the original teams, the team failed to replicate half of the studies it focused on. It couldn’t, for example, show that people subconsciously exposed to the concept of heat were more likely to believe in global warming, or that moral transgressions create a need for physical cleanliness in the style of Lady Macbeth, or that people who grow up with more siblings are more altruistic. And as in previous big projects, online bettors were surprisingly good at predicting beforehand which studies would ultimately replicate. Somehow, they could intuit which studies were reliable.

Maybe anecdotes are evidence, after all... :P

Many Labs 2 “was explicitly designed to examine how much effects varied from place to place, from culture to culture,” says Katie Corker, the chair of the Society for the Improvement of Psychological Science. “And here’s the surprising result: The results do not show much variability at all.” If one of the participating teams successfully replicated a study, others did, too. If a study failed to replicate, it tended to fail everywhere.

Many researchers have noted that volunteers from Western, educated, industrialized, rich, and democratic countries—WEIRD nations—are an unusual slice of humanity who think differently than those from other parts of the world. In the majority of the Many Labs 2 experiments, the team found very few differences between WEIRD volunteers and those from other countries. But Miyamoto notes that its analysis was a little crude—in considering “non-WEIRD countries” together, it’s lumping together people from cultures as diverse as Mexico, Japan, and South Africa. “Cross-cultural research,” she writes, “must be informed with thorough analyses of each and all of the cultural contexts involved.”

Sanjay Srivastava from the University of Oregon says the lack of variation in Many Labs 2 is actually a positive thing. Sure, it suggests that the large number of failed replications really might be due to sloppy science. But it also hints that the fundamental business of psychology—creating careful lab experiments to study the tricky, slippery, complicated world of the human mind—works pretty well. “Outside the lab, real-world phenomena can and probably do vary by context,” he says. “But within our carefully designed studies and experiments, the results are not chaotic or unpredictable. That means we can do valid social-science research.”

Originally shared by Eli Fennell

Latest Big Social Psych Replication Program Reaffirms and Refines 'Crisis'

There has been much talk in recent years about a 'Replication Crisis' in social psychology, and, it should be added, in some of the other sciences as well (even medicine). Given the greater lack of replicable foundational findings compared with most fields, however, and the real-world policy implications of its alleged findings, such a crisis in social psychology is especially daunting.

Previous replication projects designed to assess the replicability of social psychology experiments have typically found a replication rate of about half, i.e. half of studies are successfully replicated. Previous projects, however, have had numerous limitations that undercut their basic assertions: low sample sizes (meaning insufficient power for replication), critical sampling differences (e.g. using a different sample type, but not using many different sample types that include the original type), imperfect replication protocols, few international replications, or the selection of weak or low prominence findings for testing.
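As a quick aside on that power point, here's a minimal sketch of why small samples are such a problem - my own illustration, not something from the shared article, assuming a simple two-group comparison, a ballpark social-psychology effect size of d = 0.3, and Python's statsmodels library:

# Rough power calculation for a two-sample t-test, using statsmodels.
# The effect size d = 0.3 is an assumed, plausible ballpark value for a
# social-psychology finding, not a number taken from Many Labs 2.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3   # assumed standardised mean difference (Cohen's d)
alpha = 0.05        # conventional significance threshold

# Power of a replication that recruits only 40 people per group:
power_small = analysis.power(effect_size=effect_size, nobs1=40, alpha=alpha)

# Sample size per group needed to detect the same effect 80% of the time:
n_needed = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=0.8)

print(f"Power with 40 per group: {power_small:.2f}")   # roughly 0.25-0.30
print(f"n per group for 80% power: {n_needed:.0f}")    # roughly 175

In other words, an underpowered replication of a modest effect will "fail" most of the time even if the effect is real, whereas with samples sixty times larger, as in Many Labs 2, a genuine effect of that size would be detected essentially every time - so a failure there is much harder to blame on bad luck.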

The latest replication project, by a group called Many Labs 2, is helping resolve these issues by focusing on prominent, high impact findings; using samples generally larger than the original studies; carefully replicating the original studies, and even consulting with original researchers to ensure this; and testing many types of participant samples, including international samples.

Surprisingly, their results are in line with previous efforts: about half of all findings replicated successfully, making this an ironically very replicable result. Given the complexity of human behaviour, this is in one way not as terrible as it sounds, but it is worse in another, since many of the failures concern highly foundational and influential 'findings', such as the Marshmallow Effect.

On the bright side, their findings also suggest that sample differences may matter far less than supposed: if a result was replicated, it tended to be replicated across different samples, even across very different cultures, genders, and ages. The question of whether a given finding generalises beyond a given sample has long haunted the social sciences, especially given their reliance on undergraduate participant samples for much of their experimental research. In fact, those results which were replicated by Many Labs 2 tended to replicate even in groups where the original researchers specifically expected them not to.

They also replicated previous findings suggesting that online bettors are surprisingly accurate at predicting which results will be replicated in follow-up studies and which won't. This suggests that intuition may be able to play a more important role in designing and prioritising research than normally supposed, though with the caveat that those directly involved in a piece of research are the least likely to possess objective intuitions about its likelihood of success.

Moreover, it is worth remembering that the fact that an experiment replicated does not mean it necessarily tells us anything about the real world, or that it tells us what it claims to tell us about it. As an example, experimental social psychologists consistently find that younger samples have better time management skills than elderly samples; yet studies of real-world time management consistently show that elderly samples miss fewer deadlines and appointments in practice, and easily so.

In my opinion, what all of this points to is that the experimental psychologies fell prey to a desire to quickly 'catch up' with very basic reductive operational sciences, like basic physics and chemistry, and were at times too willing to accept highly theatrical demonstrations of effects later shown to be overstated at best (e.g. the Stanford Prison Experiment) in order to get there. What they ignored is that replicability can be tricky even in other operational sciences dealing with complex materials, such as biochemistry or quantum physics.

And in the case of studying humans, controlling all variables is especially tricky, and there is little motivation for most researchers to pursue slow, methodical programs of study when what they want is to name a theory of love after themselves or discover a new therapy for ASD children.

It is important for the social sciences to continue to evolve into full-fledged operational sciences, and to embrace reductive approaches wherever appropriate, but a degree of patience, humility, and sweating-the-details has perhaps been absent in much of the research to date. Hence the replication crisis. Or at least, I suspect this is part of the answer.

https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/

2 comments:

  1. I am not sure I completely follow the reasoning here. Whilst I understand that not every attempt to confirm some result will be successful, I don't understand why repetition would imply that previous attempts were not careful. Just because someone wants to revisit your results does not mean they would necessarily doubt them, and if they did, that's how science is supposed to work. Besides, tools and methodologies improve, so revisiting older studies armed with the latest techniques and new information can make a difference.

    Human studies are not necessarily directly comparable, especially when they are not so much about instruments, computation, and/or technical progress. Physical observations tend to be far more technical by nature compared to human interviews or behavioral studies. I wouldn't be surprised if some of the example hypotheses given above were simply misplaced, and the failures to reproduce the results were because of weak hypotheses. Some of them sound like pretty simplistic or even naïve generalizations. In psychology few things have simple reasons. People are quite multivariate subjects and can't be easily defined by single-variable explanations.

  2. It would definitely be helpful to have a few case studies here. There's just not enough information here to say why the replication failed - it could be that some research was done in an obviously flawed way, or it could be something else entirely.


