The advent of big data mirrors our technological evolution as a society: for the first time in history, we can easily and cheaply capture and store massive amounts of data. This transition means that we are no longer limited to statistical sampling and estimation in order to extract meaning from data.
Instead, collecting a complete dataset means that we can analyze it in its entirety as well. Simply put, analyses can now focus on N=all, rather than guessing at a population from a random sample and hoping the sample is representative. “Big data” means that we can have it all.
As Mayer-Schönberger and Cukier put it, “when we talk about big data, we mean ‘big’ less in absolute than in relative terms: relative to the comprehensive set of data.” Instead of just using bits and pieces of the data, we want to process as much of it as we can, finally seeing the forest despite the trees.
This shift in statistical measurement comes with its own set of problems. The larger a dataset, the more likely it is to contain errors, and the less likely analysts are to have time to carefully clean each and every data point. However, data scientists have found that even massive, error-prone datasets can be more reliable than pristine but tiny samples. In a messy dataset, the authors write, “any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture.” Essentially, the messy whole can outperform exact, accurate subsets.
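The claim that a large, messy dataset can beat a small, clean one is easy to check with a toy simulation. The sketch below is not from the book. It assumes the errors are random noise with no systematic bias, and every name and number in it (`compare_estimates`, the sample sizes, the noise level) is made up for illustration.

```python
import random
import statistics


def compare_estimates(trials=100, small_n=30, large_n=5000,
                      true_mean=50.0, spread=10.0, noise=5.0, seed=0):
    """Compare a small clean sample with a large noisy one.

    Returns (small_rmse, large_rmse): the root-mean-square error of each
    estimate of the true mean, averaged over `trials` repetitions.
    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    small_sq_err = []
    large_sq_err = []
    for _ in range(trials):
        # Small sample: few readings, each recorded exactly.
        small = [rng.gauss(true_mean, spread) for _ in range(small_n)]
        # Large sample: many readings, each with extra random measurement error.
        large = [rng.gauss(true_mean, spread) + rng.gauss(0.0, noise)
                 for _ in range(large_n)]
        small_sq_err.append((statistics.fmean(small) - true_mean) ** 2)
        large_sq_err.append((statistics.fmean(large) - true_mean) ** 2)
    small_rmse = statistics.fmean(small_sq_err) ** 0.5
    large_rmse = statistics.fmean(large_sq_err) ** 0.5
    return small_rmse, large_rmse


if __name__ == "__main__":
    small_rmse, large_rmse = compare_estimates()
    print(f"small clean sample, typical error: {small_rmse:.3f}")
    print(f"large messy sample, typical error: {large_rmse:.3f}")
```

Under these assumptions the large, noisy estimate lands much closer to the true mean on average. The advantage disappears if the errors all lean in the same direction, since more data cannot average away a systematic bias.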
As we make inroads into big data, we also make an important shift from results that focus on causation to results concerned only with correlation. Mayer-Schönberger and Cukier describe it this way:
“If millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission, then the exact cause for the improvement in health may be less important than the fact that they lived. Likewise, if we can save money by knowing the best time to buy a plane ticket without understanding the method behind airfare madness, that’s good enough.” — pg. 14
Nowadays, that needs to be enough, and it often is for e-commerce companies looking for profit and doctors looking to save lives. But it also represents a radically different approach to problem-solving than many of us are used to. Rather than adhering strictly to the traditional scientific method, big data allows us to work backward: first collecting data, then analyzing it, and finally drawing conclusions from whatever patterns appear.
This shift away from trying to support or disprove a theory reduces the risk that researchers bend results toward a favored hypothesis, but it also lends itself to directionless investigation, with results shaped by the interests of the analysts exploring the data. Essentially, the only answers that will be found are the ones a researcher chooses to look for.
With its Kindle e-book readers, for example, Amazon.com has the ability to tabulate which sections of books are most highlighted, where readers tend to stop reading and which themes prompt the most user engagement. But since these answers don’t do anything for its long-term business goals, the data just sits there. A publishing company given this same information, however, might use it to tweak advertising, marketing campaigns and even its authors’ writing styles. In this example, both companies have access to the same data, but the ‘answers’ each draws from it may be completely different. In the world of big data, the mindset with which researchers approach a dataset can make all the difference.
Mayer-Schönberger and Cukier cover data ethics, collection techniques and even a shift in our natural thought processes, to name just some of the highlights. At only 200 pages, the book is a quick read, filled with well-thought-out and easy-to-digest examples. You don’t need to be a data expert or computer science whiz to gain something from the text. In fact, the structure of the book suits readers looking for a light introduction to the concept of big data.
Whether your questions are about the history of the field or where it’s headed next, Mayer-Schönberger and Cukier’s “Big Data: A Revolution That Will Transform How We Live, Work, and Think” has something for everyone. You’ll likely walk away feeling impressed, informed and, most importantly, curious about the immense possibilities that lie before us with the study of big data.