I'm a bit suspicious that any kind of "crisis" exists :
https://plus.google.com/u/0/+RhysTaylorRhysy/posts/aXz7V8xZMQn
While we can always improve on methods and statistics, the basic premise here, that "lots of data => improbable events happening by chance", is not exactly obscure or difficult to guess. It's obvious as soon as you learn about Gaussian statistics, or even earlier.
Suppose there are 100 ladies who cannot tell the difference between the tea, but take a guess after tasting all eight cups. There’s actually a 75.6 percent chance that at least one lady would luckily guess all of the orders correctly.
Now, if a scientist saw some lady with a surprising outcome of all correct cups and ran a statistical analysis for her with the same hypergeometric distribution above, then he might conclude that this lady had the ability to tell the difference between each cup. But this result isn’t reproducible. If the same lady did the experiment again she would very likely sort the cups wrongly – not getting as lucky as her first time – since she couldn’t really tell the difference between them.
This small example illustrates how scientists can “luckily” see interesting but spurious signals in a dataset. They may formulate hypotheses after seeing these signals, then use the same dataset to draw conclusions, claiming the signals are real. It may be a while before they discover that their conclusions are not reproducible. The problem is particularly common in big data analysis: with such a large amount of data, some spurious signals may “luckily” occur just by chance.
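For concreteness, here's a quick check of that figure. This isn't from the article, just a sketch assuming Fisher's standard setup of four milk-first and four tea-first cups, with each of the 100 ladies guessing independently:

```python
from math import comb

# Lady tasting tea: 8 cups, 4 milk-first and 4 tea-first.
# A pure guesser must pick which 4 cups were milk-first.
p_single = 1 / comb(8, 4)          # = 1/70, about 1.4%

# Probability that at least one of 100 independent guessers gets every cup right.
p_at_least_one = 1 - (1 - p_single) ** 100

print(f"one lady guesses all correctly: {p_single:.3%}")       # ~1.429%
print(f"at least one of 100 does:       {p_at_least_one:.1%}")  # ~76%
```

The exact answer comes out near 76 per cent; the quoted 75.6 per cent presumably rounds the per-lady chance to 1.4 per cent before compounding. Either way, the point stands: with enough guessers, someone gets "lucky".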
We deal with this in radio astronomy all the time. With >100 million data points per cube, the chance of getting at least one interesting-but-spurious detection is close to 1.0, especially since the noise isn't perfectly Gaussian. We get around this by the simple process of repeat observations, and I find it hard to believe that anyone is seriously unaware that correlation ≠ causation at this point. Charitably, the article may be over-simplifying. While there are certainly plenty of weird, non-intuitive statistical effects at work, I don't believe "sheer size of data set" is causing anyone to panic.
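To put rough numbers on that: the sketch below is not a real pipeline calculation, it simply treats a cube as 10^8 independent, perfectly Gaussian samples (real noise is correlated and has heavier tails, which only makes things worse) and asks how many pure-noise peaks you'd expect above a few example thresholds.

```python
import numpy as np
from scipy.stats import norm

n_voxels = 1e8    # ~10^8 independent samples as a stand-in for one data cube

for sigma in (5, 6, 7):
    p_tail = norm.sf(sigma)            # one-sided Gaussian tail probability
    expected = n_voxels * p_tail       # expected number of spurious peaks
    p_any = 1 - np.exp(-expected)      # Poisson approximation for "at least one"
    print(f"{sigma} sigma: ~{expected:.2g} false peaks expected, "
          f"P(at least one) = {p_any:.2%}")
```

Even at 5σ you'd expect a few dozen pure-noise "detections" per cube, which is exactly why repeat observations are the obvious fix.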
https://theconversation.com/how-big-data-has-created-a-big-crisis-in-science-102835
"I don't believe "sheer size of data set" is causing anyone to panic."
Maybe not about spurious signals, but certainly about anyone actually having time (or processing power) to find all the interesting objects in the massive data sets. I'm not personally involved, but I know people working on the ASKAP and SKA pipelines who are rightly worried about missing novel events because they have to throw so much data away in the early processing stages. We're lucky FRBs were discovered and proven to be real signals when they were, because I'm told that earlier versions of the ASKAP pipeline would have discarded that data as spurious. It's the same reason that actual scientists looking at the data flagged FRBs as weird artefacts until two telescopes with different gear observed one at the same time and we had actual replication.
Which is to say that big data has its pitfalls, but as you said, no one working on/with it is unaware of the problems. We all just have to be clever about it.
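The replication logic in that last point is essentially a coincidence test: a signal seen by two independent instruments at the same moment is very unlikely to be a local artefact in either of them. A minimal, hypothetical sketch of the idea (the candidate timestamps and the 0.1 s matching window are invented for illustration, not taken from any real pipeline):

```python
# Hypothetical coincidence check between candidate event times (seconds)
# recorded independently at two telescopes. Matches within the tolerance
# window are treated as confirmed; lone candidates remain suspect.
telescope_a = [120.004, 893.512, 4021.733]
telescope_b = [120.051, 2500.010, 4021.790]
tolerance = 0.1   # seconds; invented for this example

coincident = [
    (ta, tb)
    for ta in telescope_a
    for tb in telescope_b
    if abs(ta - tb) <= tolerance
]
print(coincident)   # [(120.004, 120.051), (4021.733, 4021.79)]
```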