Sister blog of Physicists of the Caribbean in which I babble about non-astronomy stuff, because everyone needs a hobby

Wednesday, 3 May 2017

The ferocious datasaurus

The different configurations of the data points in the gif all have the same mean, median, and standard deviation... yet they're clearly very different. Isn't statistics fun ?

It can be difficult to demonstrate the importance of data visualization. Some people are of the impression that charts are simply "pretty pictures", while all of the important information can be divined through statistical analysis. An effective (and often used) tool used to demonstrate that visualizing your data is in fact important is Anscome's Quartet. Developed by F.J. Anscombe in 1973, Anscombe's Quartet is a set of four datasets, where each produces the same summary statistics (mean, standard deviation, and correlation), which could lead one to believe the datasets are quite similar. However, after visualizing (plotting) the data, it becomes clear that the datasets are markedly different.

Recently, Alberto Cairo created the Datasaurus dataset which urges people to "never trust summary statistics alone; always visualize your data", since, while the data exhibits normal seeming statistics, plotting the data reveals a picture of a dinosaur. Inspired by Anscombe's Quartet and the Datasaurus, we present, The Datasaurus Dozen.


The key insight behind our approach is that while it is relatively difficult to generate a dataset from scratch with particular statistical properties, it is relatively easy to take an existing dataset, modify it slightly, and maintain those statistical properties. We do this by choosing a point at random, moving it a little bit, then checking that the statistical properties of the set haven't strayed outside of the acceptable bounds (in this particular case, we are ensuring that the means, standard deviations, and correlations remain the same to two decimal places.)

https://www.autodeskresearch.com/publications/samestats

1 comment:

  1. Rhys Taylor Nope, its maddening! ...
    ... and mostly used to conform to whatever model you're proposing, or pushed to fit with whatever solution you you're selling.

    ReplyDelete

Due to a small but consistent influx of spam, comments will now be checked before publishing. Only egregious spam/illegal/racist crap will be disapproved, everything else will be published.

Whose cloud is it anyway ?

I really don't understand the most militant climate activists who are also opposed to geoengineering . Or rather, I think I understand t...