Always Draw a Graph
Carrying on with the statistics that I have been looking into recently, I came across a rather curious collection of data sets. In 1973 an English statistician named Francis Anscombe came up with four collections of points which look completely different when plotted and yet give almost identical results when the most common statistical tests are applied to them blindly.
The first graph shows some positively correlated data points, as you would expect if you plotted age against shoe size in a class or something similar. Number two follows its pattern much more tightly and forms a parabola. In number three all the points lie exactly on a line except for one outlier, the kind you would throw out as experimental error if you received these results in practice. Finally, number four has 10 points at x=8 and a single outlier.
As you might be able to see, all four graphs have the same line of best fit. There is a standard way to fit a straight line to a set of data, a method called linear regression. This method takes the squares of the vertical distances from each point to a candidate line and picks the line that minimises their sum. All four graphs have a line of y=0.500x+3.00, which I have written in that strange way to emphasise how many decimal places they agree to. Out of the four graphs only the first is an appropriate data set to apply this to. Number two isn't a linear relationship and number three should have its outlier ignored before applying it.
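To make the least squares idea concrete, here is a minimal sketch in plain Python. The data values are transcribed from the commonly published table of Anscombe's quartet, and the names `QUARTET` and `fit_line` are my own for illustration. The slope comes from the usual closed-form formula (sum of products of deviations divided by the sum of squared x deviations), which is the line that minimises the squared vertical distances.

```python
# Anscombe's quartet, transcribed from the commonly published table.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I to III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


for name, (xs, ys) in QUARTET.items():
    slope, intercept = fit_line(xs, ys)
    print(f"{name}: y = {slope:.3f}x + {intercept:.2f}")
```

Each of the four printed lines should agree with the fit quoted above to the stated number of decimal places, even though only the first data set really suits a straight line.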
Let's look at some other similarities. Firstly, the mean and the variance of the x values are exactly the same in all four graphs (9 and 11 respectively). In fact, the x values for the first three graphs are identical.
The mean of the y values is less exact, but all four graphs have it as 7.50 correct to two decimal places. The variance of y agrees almost as closely, at 4.125 plus/minus 0.003.
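If you want to check these summary figures yourself, a short sketch using Python's standard `statistics` module will do it. The data is transcribed from the commonly published table of the quartet, and `QUARTET` and `summarise` are names I have made up for this example. Note that `statistics.variance` is the sample variance (dividing by n-1), which is the version that gives 11 for the x values.

```python
import statistics

# Anscombe's quartet, transcribed from the commonly published table.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I to III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return (mean of x, sample variance of x, mean of y, sample variance of y)."""
    return (
        statistics.mean(xs),
        statistics.variance(xs),  # sample variance, divides by n - 1
        statistics.mean(ys),
        statistics.variance(ys),
    )


for name, (xs, ys) in QUARTET.items():
    mean_x, var_x, mean_y, var_y = summarise(xs, ys)
    print(f"{name}: mean x = {mean_x:.2f}, var x = {var_x:.2f}, "
          f"mean y = {mean_y:.2f}, var y = {var_y:.3f}")
```

Looking only at this table of numbers, you would have no reason to suspect the four sets differ at all.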
But the final flourish of the design of these data sets is the correlation coefficient between x and y. If all of the points lay exactly on the same upward-sloping line then they would have a correlation of 1, and if there was absolutely no link between them (imagine just a haze of points) then they would have a coefficient of 0. All four graphs have a correlation coefficient of 0.816 correct to 3 decimal places.
Anscombe set these ingenious graphs up to illustrate why we should always draw a graph when investigating data rather than just looking at the raw statistics. According to him it was all too common for people to look at the mean and variance of data sets and declare them to be functionally identical.