The gulf between correlation and causation is as wide as the one between science and pseudo-science.
The best “Data Science” can do is point us in the direction of a scientific hypothesis – the formulation of which is but the first step in the scientific method. It’s a clue, a starting point for investigation, but cannot be a conclusion in itself.
Science seeks to arrive at causality, that is, the reason for a phenomenon, while data science, no matter how carefully constructed the query, can hope to provide nothing more insightful than correlation – which, while true, explains nothing.
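To make the point concrete, here is a minimal sketch in Python, using invented variables rather than real data. Two quantities, `x` and `y`, never influence each other; both depend on a third, unobserved quantity called `hidden`. The names, the noise model, and the sample size are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, seed=0, intervene=False):
    """Correlation of x and y, which never influence each other directly."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)          # unobserved common cause (assumed)
        y = hidden + rng.gauss(0, 1)      # y depends only on the hidden cause
        if intervene:
            x = rng.gauss(0, 1)           # we set x ourselves, ignoring hidden
        else:
            x = hidden + rng.gauss(0, 1)  # x is driven by the hidden cause
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observed = simulate(intervene=False)
experimental = simulate(intervene=True)
print(f"correlation in passively collected data: {observed:.2f}")
print(f"correlation when x is set by experiment: {experimental:.2f}")
```

Under these assumptions the passively collected data should show a correlation near 0.5, while the experimental run should show one near zero. The correlation in the first run is real, yet it says nothing about whether changing `x` would change `y`; only the experiment, the next step of the scientific method, settles that.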
If I know that this winter people are buying more red hats than blue hats, that fact cannot tell me anything about why. Further, it can’t predict with certainty whether any one individual will buy a red hat instead of a blue one, or even whether that individual will buy any hat at all! Though the data reveals that people are buying more red hats this winter, it has no power to explain why.
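A rough sketch of the hat example in Python shows how little the aggregate says about any one shopper. The purchase probabilities below are made up for illustration (30% red, 20% blue, 50% no hat at all), and the simulated shoppers are not real data.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical behaviour of each shopper; these weights are assumptions.
CHOICES = ["red", "blue", "none"]
WEIGHTS = [0.3, 0.2, 0.5]

purchases = rng.choices(CHOICES, weights=WEIGHTS, k=10_000)
totals = Counter(purchases)

# The aggregate fact: which hat colour sold more this winter.
guess = "red" if totals["red"] > totals["blue"] else "blue"

# Use that aggregate fact to "predict" every individual shopper.
hit_rate = sum(p == guess for p in purchases) / len(purchases)

print(f"red sold: {totals['red']}, blue sold: {totals['blue']}")
print(f"share of shoppers the aggregate guess gets right: {hit_rate:.0%}")
```

With these assumed weights, red outsells blue comfortably, yet guessing "red" for every shopper is wrong far more often than it is right, because most shoppers buy something else or nothing. The totals describe the crowd; they carry no reason for anyone's choice.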
What is called “Data Science” is nothing more than the measurement of probability.
But it is not “probable” that the Earth goes around the sun – it is a fact. There is not even the most minuscule probability that, all things being equal, it won’t. Even better, science explains why. A million data points telling us that we go around the sun can’t do that. No explanation arises from this information.
There are many types of logical fallacies, and, based on the graph above, the majority of Americans fall into the trap of at least one. By calling statistical analysis “data science”, we are not doing anything to help.