COLUMN – Ideas in the Second Machine Age: SCIENTIST

A scientist is a person engaging in a systematic activity to acquire knowledge that describes and predicts the natural world [1]. In a more restricted definition, a scientist is somebody who engages in the scientific method. For most (serious) empirical sciences, this scientific method rests on modelling, testability, and falsification.

For decades, modern empirical science has worked by means of falsification. Theory, envisioned in the mind of the researcher, was formulated as hypotheses. Around these hypotheses, testable models were built, which could either be confirmed or rejected by experiment. Researchers first had to spell out the theoretical mechanism linking two observations before they could establish a causal relation with confidence. Without theoretical assumptions, a correlation could just as well be due to coincidence. Theoretical models distinguished correlation from causation, noise from signal.
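To make that hypothesise-model-test loop concrete, here is a minimal sketch in Python; the simulated data, SciPy's two-sample t-test, and the conventional 5% threshold are illustrative assumptions, not part of any particular study.

```python
# A minimal sketch of the hypothesise-model-test loop on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothesis: a treatment shifts the mean of an outcome.
# Model: outcomes in both groups are normally distributed with equal variance.
control = rng.normal(loc=0.0, scale=1.0, size=50)
treatment = rng.normal(loc=0.4, scale=1.0, size=50)

# Test: can the null hypothesis ("no difference in means") be rejected?
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null, the hypothesis survives this test")
else:
    print(f"p = {p_value:.3f}: fail to reject, no evidence for the hypothesis")
```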

In the age of massive data, this fundamental principle – the old approach of hypothesising, modelling, and testing – becomes obsolete for both a practical and a conceptual reason [2]. Practically speaking, when data on the tera- or petabyte scale, with hundreds of millions or billions of observations, is examined, current statistical techniques throw in the towel. Consider linear regression, the technique most commonly applied in the natural and social sciences to establish statistical inference. During model fitting, a relationship between input and outcome, between X and Y, is estimated for a set of observations. To find the degree to which X influences Y, matrix calculations are performed. For 100 x 100 observations, these calculations are neat and easy. For big-data matrices of 100mn x 100mn, the same calculations become ‘too (computationally) expensive’ and meaningless, since at this sample size almost any correlation comes out statistically significant in parametric tests. Moreover, the old-fashioned way of theory-driven science has a conceptual flaw.
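A small numerical sketch (illustrative Python, not an analysis from this column) shows both points at once: ordinary least squares reduces to the matrix calculation beta = (X'X)^-1 X'y, and with ten million observations even a practically negligible effect comes out ‘statistically significant’.

```python
# Illustrative only: large n makes a negligible correlation "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 10_000_000                      # large n, yet tiny by petabyte standards

x = rng.normal(size=n)
y = 0.002 * x + rng.normal(size=n)  # true effect is practically negligible

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p_value:.2e}")  # r is ~0.002, yet p is far below 0.05

# Ordinary least squares via the normal equations, beta = (X'X)^-1 X'y,
# is the matrix calculation referred to above; its cost grows with the
# number of observations and predictors.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(f"intercept = {beta[0]:.4f}, slope = {beta[1]:.4f}")
```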

“All models are wrong, but some are useful”, as statistics mastermind George Box once put it [3]. Today, you might want to rephrase: all models are wrong, and they are less and less useful. A model is a simplified version of reality, stripped of all the complexity negligible for the problem at hand. In the past, whenever scientists modelled something, be it in quantum physics or in macroeconomics, they pragmatically idealised nature for the sake of solvability. Today, however, the simplification shortcut may in fact be a detour. Why should we examine nature via a simplification if we can examine nature directly? Data science and machine learning tell us that we can. In the world of petabytes, correlation might just be enough. Clustering algorithms can show us patterns which no human-made theory could ever have hypothesised.
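As an illustrative sketch of such theory-free pattern finding (scikit-learn and the toy data below are assumptions of this example), k-means clustering groups unlabelled observations without any prior hypothesis about what the groups mean.

```python
# Illustrative sketch: an algorithm finds structure without a theory of it.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=1)

# Unlabelled observations drawn from three unknown "regimes".
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(200, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(200, 2)),
])

# No model of why the groups exist: the algorithm only minimises
# within-cluster distances and reports the pattern it finds.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)
```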

Already today, scientific discoveries about new micro-organisms, deforestation, or enzyme-genome interactions are made by data scientists with little or no theoretical knowledge of the field they are – sometimes by accident – researching. While this ‘theory-free’ paradigm shift is quite fundamental for science, it surprisingly did not emerge from within science. The development of ‘agnostic statistics’ is a child of the web and of the aspiration of internet giants like Google to conquer the world of advertising. Without any knowledge of the culture and conventions of advertising, Google simply relied on better data and better analytical tools. The semantics of causal analysis do not matter for matching ads to content and vice versa. Internet giants, which not only evaluate but also produce petabytes of data on a daily basis, do not need underlying assumptions in order to deliver tailor-made advertising.

Internet giants were among the first to operate in a world where classification and clustering techniques applied to massive amounts of data have made theory obsolete. Forget about theories of human behaviour, from economics to linguistics. Just as Google did, scientists might at some point stop caring why people do things. They do them, they can be tracked, and their behaviour can be predicted with unimagined precision. The numbers will soon speak for themselves.

[1] WIKIPEDIA (2016). SCIENTIST.
[2] ANDERSON, CHRIS (2016). THE END OF THEORY: THE DATA DELUGE MAKES THE SCIENTIFIC METHOD OBSOLETE, WIRED.COM.
[3] “ALL MODELS ARE WRONG” IS A COMMON APHORISM IN STATISTICS. IT IS GENERALLY ATTRIBUTED TO THE STATISTICIAN GEORGE BOX.