“In God we trust! All others bring data,” is a popular quote attributed to W. Edwards Deming, a physicist by training, a statistician and a pioneer of quality management. By applying statistical methods from the natural sciences to manufacturing in the first half of the 20th century, he was able to drastically increase the industry’s efficiency. In a way, Deming was one of the predecessors of modern data scientists: he applied his scientific methodology to business.
Since data science is a comparatively new field, there are probably more definitions of data science than there are data scientists. In my opinion, data science is not about data, but about a certain way of thinking about data. Today, I will give an example of one of the most frequent tasks of data scientists, and show why you don’t need to be a data scientist to follow the essence of their work in a project.
The pyramid of data understanding
The process of converting data into meaningful information is nowadays called business intelligence. It helps businesses to report data and understand ‘what’, ‘where’, ‘when’ and ‘how much’.
Using data for decision making, by tracking Key Performance Indicators (KPIs) and predicting their future values, is a more sophisticated use of data than reporting it. Such an application of data in the business context is usually termed (business) data analytics. It guides businesses on how to go further and understand the ‘what next’.
Fig. 1: Pyramid of Data Science, Data Analytics and Business Intelligence with their "added values".
However, at no point have we really addressed the ‘why’. From my point of view, understanding – meaning answering the ‘why’ by finding reasons and extracting knowledge using data – is an entirely different process and its owners are data scientists.
Yet another reason why it’s called data science
From my point of view, data scientists try to understand data instead of just crunching it.
How do they extract knowledge from data?
A brief note about knowledge. Anything we know (for example, that the Earth is round instead of flat) is the result of a series of findings that disproved previous “knowledge”. But if anything that we think to be true could be falsified, we actually only know negative truths: things that definitely are not the case. Essentially, everything else is just wishful thinking that has not yet been disproven.
This is disappointing! And rather academic. To convince a client, we need to transform our knowledge about things that are not the case into something actionable, of value. Because of that, data science is not about the data, but about falsifying business hypotheses using data. In the end, the challenge is to formulate the hypotheses in such a way that disproving them actually leads to valuable conclusions.
Unfortunately, this process is full of pitfalls, which is why the skill of disproving hypotheses may be the most important one a data scientist has.
Jumping to conclusions
To show you how that works, let’s go through a full-scale hypothesis test.
There is a group of four datasets designed by Francis Anscombe in 1973 that could lead a data analyst into an ugly trap (Fig. 2). According to the summary statistics, all four datasets are described by almost the same straight line, but in three of the four cases the line is horribly wrong. So, without the plots: How do we know?
Fig. 2: Anscombe's dataset with best fitting straight lines. The lines are practically identical and, according to the statistics, they fit all four datasets equally well, but only in panel (a) does the line really fit.
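To see the trap without the plots, here is a minimal sketch that fits a straight line to each of the four datasets and prints the slope, the y axis offset and the correlation coefficient. The numbers are the published values of Anscombe's quartet; the names `ANSCOMBE` and `summarize` are made up for this illustration, and the fit assumes ordinary least squares.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Panels a-c share the same x values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "a": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return slope m, offset b and correlation r of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    r = sxy / sqrt(sxx * syy)
    return m, b, r


for panel, (xs, ys) in ANSCOMBE.items():
    m, b, r = summarize(xs, ys)
    print(f"panel {panel}: Y = {m:.2f} X + {b:.2f}, r = {r:.2f}")
```

Rounded to two decimals, the four printed lines agree in slope, offset and correlation, which is exactly why these summary numbers alone cannot tell the good fit from the bad ones.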
Imagine that in Fig. 2d the x axis is the price of a car and the y axis is the outside temperature at the car dealership. The obvious conclusion from the plot is that the car price (x) has nothing to do with the temperature (y) – no matter what the temperature, the price is the same.
Still, the line describes the points well (at least according to some statistics). The data analyst would conclude from this model that car prices depend on temperature.
A data scientist could now argue that the analyst simply jumped to conclusions: the analyst never set up a hypothesis, tested it, and at least stated how likely it is to be true or false.
An actual hypothesis
A proper hypothesis would instead be: There is a linear relationship between temperature and car price. The data scientist could then open the toolbox of hypothesis testing and find that the blue line is a very bad fit to the data in panel d, so that the hypothesis would be rejected.
How would that work? Let’s briefly recap what a linear fit actually is: It is the process of adjusting the y axis offset and the slope of a straight line, b and m, to the data. The result is then a description of the form
Y = m X + b.
A computer does this by considering the difference e between the above equation and the data points:
Y = m X + b + e.
In a way, this difference is the remainder the computer cannot describe (yet). If we prefer Latin-sounding names, we call it the residual. The computer tries to figure out the values of m and b for which the residuals of all data points are as small as possible. According to the diagnostics of my statistics program, in each of the four panels of Fig. 2 the computer was equally successful.
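What "as small as possible" means depends on the criterion the program uses. Assuming ordinary least squares, which is the most common choice (the exact criterion is not stated here), the computer minimizes the sum of the squared residuals. For n data points (X_i, Y_i) with mean values \bar{X} and \bar{Y}, this has a closed-form solution:

```latex
\begin{aligned}
S(m, b) &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - m X_i - b \right)^2, \\
m &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad
b = \bar{Y} - m \bar{X}.
\end{aligned}
```

Note that S only measures the overall size of the residuals. It says nothing about whether they still contain structure that a straight line cannot capture.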
Luckily, there is one more thing: The residuals need to fulfill (at least) two conditions. Otherwise a straight line is an insufficient model.
- The values of the residuals need to be independent of the data points themselves (the residuals plotted against the x values of their data points need to look random).
- The residuals need to be random themselves (distributed in the form of a normal distribution).
Fig. 3: Normal and paranormal distributions
Testing the preconditions of the test
So let’s see. The left column in Fig. 4 shows the first condition for the four datasets (rows in the figure) and the right column the second. Every plot in the left column except for the first shows some kind of systematic pattern. This is a strong indicator that only the first fit is actually good.
The panels in the right column should show a (more or less) straight line with y values symmetrical around zero (otherwise the residuals are not normally distributed). This is only the case in the top and bottom panels (in the second and third panels the values range from -1.5 to 0.5 and from -1 to 3, respectively).
Fig. 4: Diagnostics of the four fits. The rows of the Figure correspond to the four datasets, while the columns correspond to conditions one and two.
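The two diagnostics can also be reproduced as plain numbers. The following is a minimal sketch, not the procedure of any particular statistics program: it assumes an ordinary least-squares fit, uses the published values of Anscombe's quartet, and the helper names (`residuals_by_x`, `qq_points`) as well as the plotting positions (i + 0.5) / n for the normal quantiles are choices made for this illustration.

```python
from statistics import NormalDist

# Anscombe's quartet (Anscombe, 1973). Panels a-c share the same x values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "a": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def residuals(xs, ys):
    """Residuals e = Y - (m X + b) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return [y - (m * x + b) for x, y in zip(xs, ys)]


def residuals_by_x(xs, ys):
    """Condition one: residuals ordered by x. A visible trend is a bad sign."""
    return [e for _, e in sorted(zip(xs, residuals(xs, ys)))]


def qq_points(xs, ys):
    """Condition two: (normal quantile, sorted residual) pairs for a Q-Q plot."""
    res = sorted(residuals(xs, ys))
    n = len(res)
    normal = NormalDist()  # standard normal distribution
    return [(normal.inv_cdf((i + 0.5) / n), e) for i, e in enumerate(res)]


for panel, (xs, ys) in ANSCOMBE.items():
    ordered = ", ".join(f"{e:+.2f}" for e in residuals_by_x(xs, ys))
    print(f"panel {panel}, residuals ordered by x: {ordered}")
    for quantile, e in qq_points(xs, ys):
        print(f"    normal quantile {quantile:+.2f} -> residual {e:+.2f}")
```

For panel b, for example, the residuals ordered by x start negative, turn positive in the middle and end negative again. That arc is the kind of systematic pattern that violates the first condition.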
Since both requirements are fulfilled only for the first dataset, the straight lines in Figs. 2b-d are simply wrong, and the hypothesis of a linear link between temperature and car price (panel d) is rejected. This shows that high-level diagnostics cannot be trusted and that a real hypothesis test is more complex. It includes testing the hypotheses behind a hypothesis.
Even though most of this post was about twisting and turning the data, the purpose was always to test a hypothesis. In the case of Fig. 2a (in contrast to all other cases) the residuals are nicely distributed and the hypothesis of a linear relation would not be rejected.
Hypothesis-driven thinking is key
I hope this post did two things:
- It helped you understand why the data scientist in your team keeps asking one more “why?” than you’d expect.
- It gave you one more reason to follow through with hypothesis-driven reasoning in your projects every day! You don’t need to be a data scientist to let yourself be guided by this way of thinking, and it helps steer the client away from hasty conclusions.