Sister blog of Physicists of the Caribbean in which I babble about non-astronomy stuff, because everyone needs a hobby

Monday 28 January 2019

Is Big Data causing a replication crisis?

I'm a bit suspicious that any kind of "crisis" exists:
https://plus.google.com/u/0/+RhysTaylorRhysy/posts/aXz7V8xZMQn

While we can always improve on methods and statistics, the basic premise here, that lots of data means improbable events will happen by chance, is not exactly obscure or difficult to guess. It's obvious as soon as you learn about Gaussian statistics, or even earlier.

Suppose there are 100 ladies who cannot tell the difference between the teas, but take a guess after tasting all eight cups. There’s actually a 75.6 percent chance that at least one lady would luckily guess all of the orders correctly.

Now, if a scientist saw some lady with a surprising outcome of all correct cups and ran a statistical analysis for her with the same hypergeometric distribution above, then he might conclude that this lady had the ability to tell the difference between each cup. But this result isn’t reproducible. If the same lady did the experiment again she would very likely sort the cups wrongly – not getting as lucky as her first time – since she couldn’t really tell the difference between them.

This small example illustrates how scientists can “luckily” see interesting but spurious signals in a dataset. They may formulate hypotheses after seeing these signals, then use the same dataset to draw conclusions, claiming the signals are real. It may be a while before they discover that their conclusions are not reproducible. This problem is particularly common in big data analysis: with such a large amount of data, some spurious signals will “luckily” occur just by chance.
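For what it's worth, that 75.6% figure is easy enough to check. Below is a minimal sketch (mine, not the article's), assuming the classic Fisher setup: eight cups, four milk-first, and a guessing lady simply picks four at random, so her chance of a perfect sort is 1/C(8,4) = 1/70. The exact answer comes out at about 76%; rounding the single-lady probability to 0.014 reproduces the article's 75.6%.

```python
# Chance that at least one of 100 purely-guessing ladies sorts all eight cups
# correctly, under the standard lady-tasting-tea setup (pick 4 of 8 at random).
from math import comb

p_single = 1 / comb(8, 4)                     # 1/70, roughly 0.0143 per lady
p_at_least_one = 1 - (1 - p_single) ** 100    # 100 independent guessers

print(f"P(one lady all correct)        = {p_single:.4f}")
print(f"P(at least one of 100 correct) = {p_at_least_one:.3f}")   # ~0.763
```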

We do this in radio astronomy all the time. With >100 million data points per cube, the chance of getting at least one interesting-but-spurious detection is close to 1.0, especially since the noise isn't perfectly Gaussian. We get around this by the simple expedient of repeat observations; I find it hard to believe that anyone is seriously unaware that correlation <> causation at this point. Charitably, the article may be over-simplifying. While there are certainly plenty of weird, non-intuitive statistical effects at work, I don't believe "sheer size of data set" is causing anyone to panic.
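To put rough numbers on that: a quick sketch, assuming purely Gaussian noise and independent data points (both generous assumptions, so real cubes are worse), shows that even a 5-sigma threshold should produce a few dozen spurious peaks in a cube of 10^8 points. Hence the repeat observations.

```python
# Expected number of spurious noise peaks above a given threshold in a cube of
# ~1e8 samples, assuming independent, perfectly Gaussian noise (an idealisation).
from math import erfc, sqrt

n_points = 1e8                                # order of magnitude per data cube
for sigma in (4, 5, 6):
    p_tail = 0.5 * erfc(sigma / sqrt(2))      # one-sided Gaussian tail probability
    print(f"{sigma} sigma: tail prob = {p_tail:.1e}, "
          f"expected false peaks = {n_points * p_tail:,.1f}")
# 5 sigma gives ~2.9e-7 per point, i.e. ~29 spurious 'detections' per cube.
```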

https://theconversation.com/how-big-data-has-created-a-big-crisis-in-science-102835

1 comment:

  1. "I don't believe "sheer size of data set" is causing anyone to panic."

    Maybe not about spurious signals, but certainly about anyone actually having time (or processing power) to find all the interesting objects in the massive data sets. I'm not personally involved, but I know people working on the ASKAP and SKA pipelines who are rightly worried about missing novel events because they have to throw so much data away in the early processing stages. We're lucky FRBs were discovered and proven to be real signals when they were, because I'm told that earlier versions of the ASKAP pipeline would have discarded that data as spurious. It's the same reason that actual scientists looking at the data flagged FRBs as weird artefacts until two telescopes with different gear observed one at the same time and we had actual replication.

    Which is to say that big data has its pitfalls, but as you said, no one working on/with it is unaware of the problems. We all just have to be clever about it.


