Demystifying Big Data
The topic of big data has recently been in the spotlight, coming to the attention not only of data scientists and researchers but also of journalists and Internet enthusiasts. As a buzzword it has penetrated the environment to such an extent that nearly everybody I speak with asks whether my methodological papers can be applied to big data. And honestly, after browsing through countless resources and talking with multiple experts, I can finally conclude: yes. But reaching this conclusion required putting together some big data concepts that I could not find written down explicitly elsewhere. I therefore present my personal point of view below.
For me, the topic of big data was difficult to comprehend because its terminology comes from multiple disciplines. Starting with the term ‘big data’ itself, what differs between the fields is not the word ‘data’, but rather the word ‘big’. Of course, different sciences work with different data, but whether it is a satellite image, a page from a book, or statistical data, all of them may be described by a set of measurable features (such as the distance between relevant pixels, the number of specific words occurring in the text, or summary statistics, respectively). To me, however, the word ‘big’ has two different interpretations, which are probably the source of the confusion.
A typical problem in data science is to explain the behaviour of a variable $Y$ by a set of variables $X_1$, $X_2$, …, $X_k$. To do this we use a sample of data points for which we observe the $Y$ and $X$ values; the sample consists of data examples indexed from 1 to $n$. Depending on whom I talked with, the term ‘big’ referred either to $k$, the number of variables in the model, or to $n$, the number of data examples we work with. Having either a big $k$ or a big $n$ creates two separate sets of problems, which are tackled in different ways by different scientific disciplines. If $k$ and $n$ are big simultaneously, I would call the setup ‘very big data’, but I do not aim at throwing another buzzword into the ether.
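As a minimal sketch of this setup (the sizes and the data are purely illustrative), in NumPy the sample can be held as an $n \times k$ matrix of $X$ values alongside a vector of the $n$ observed $Y$ values:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 100, 5                    # n data examples, k explanatory variables
X = rng.normal(size=(n, k))      # row i holds the values of X_1, ..., X_k for example i
beta = rng.normal(size=k)        # a relation between X and Y (unknown in practice)
y = X @ beta + rng.normal(scale=0.1, size=n)   # the observed Y for each example

print(X.shape, y.shape)          # (100, 5) (100,)
```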
The big $k$ is rather a conceptual problem. Logically, if we have multiple $X$ variables that we do not observe frequently, there may not be enough evidence in the data to provide explanations for each $X$ variable separately. Moreover, even with a sufficiently large dataset, the abundance of $X$ variables may make the model difficult to interpret. This problem is addressed by shrinkage methods, which are a form of variable selection. In a nutshell, shrinkage methods add an extra penalty term to the model which, across the spectrum of possible relations between $Y$ and $X$, punishes specifications with many $X$ variables. Through this shrinking of the parameter space, the area of big data draws immediately on Machine Learning and Artificial Intelligence, which use these techniques extensively.
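A small sketch of the problem and of one shrinkage fix, using a ridge (L2) penalty as an illustrative example; the sizes and the penalty weight `lam` are my own choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 20        # more variables than data examples: big k relative to n
X = rng.normal(size=(n, k))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only two variables matter

# With k > n the matrix X'X is rank deficient, so the ordinary
# least-squares normal equations have no unique solution
print(np.linalg.matrix_rank(X.T @ X))   # rank is at most n = 10, below k = 20

# A ridge penalty adds lam * I to the normal equations, which makes the
# system invertible and pulls all coefficients towards zero
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
print(beta_ridge.shape)                 # one coefficient per variable: (20,)
```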
The big $n$ problem is rather a capacity problem. In theory, if we had a sufficiently big computer, we could load any number of data points; the analytical problem would also eventually be solved, but with a substantial time lag. In practice, time is scarce, and analysts require prompt answers to the most troubling questions. The capacity problems are alleviated by recent advances in computer science. One way is of course to get a more powerful computer, which is called vertical scaling; in practice, however, it is very costly. Another solution is horizontal scaling, which happens to be much more cost-effective. Currently, the most commonly used framework for horizontal scaling is the Hadoop environment, with several built-in modules including a distributed file system and MapReduce. The main idea behind it is to split the big $n$ dataset into smaller pieces, analyse each of them separately, and then combine the results. Hadoop allows such analytics to be carried out over a network, so the system can work on a computer cluster or in a cloud. A simple schematic in Figure 1 illustrates the distinction between the big $k$ and big $n$ problems.
Figure 1. Big data – why big?
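The split-analyse-combine idea can be sketched in plain Python. This is only a toy stand-in for a MapReduce job, with the data and chunk size chosen for illustration; in a real Hadoop setup the chunks would live in a distributed file system and the mappers would run on separate machines:

```python
from functools import reduce

# A toy 'big n' dataset, split into smaller pieces
data = list(range(1, 1_000_001))
chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

# Map: each worker analyses its own piece independently,
# here producing a partial (sum, count) pair
def mapper(chunk):
    return (sum(chunk), len(chunk))

# Reduce: combine the partial results into a global answer
def reducer(a, b):
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(reducer, map(mapper, chunks))
print(total / count)   # mean of the full dataset: 500000.5
```

Note that the partial results must be combinable: a mean cannot be merged directly, which is why each mapper returns a sum and a count instead.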
As a general rule of thumb, in the first case $k > n$ is enough to claim that the problem is ‘big’, as it will require some reduction of the parameter space, or shrinkage. It is more difficult to put a similar condition on $n$. On the one hand, distributed computing and data storage may be applied to relatively small datasets; on the other hand, some really data-heavy processes may be run on a standard computer infrastructure. I hope, however, that this roadmap proves useful when going deeper into the topic of big data.