By Viktor Mayer-Schönberger & Kenneth Cukier
Drugs are dangerous things. Medicines go through a rigorous vetting process before they are put on the market. Among the things studied is how a medicine interacts with other drugs a patient may be taking. If there is a harmful side effect, it is recorded so doctors know not to prescribe the two together. But with so many treatments around, it is impossible to catch every adverse combination in advance.
That is what makes research released last March by a group of doctors from Stanford and Columbia – not MDs, but computer science PhDs – so startling. The team pored over almost a year of web searches for common symptoms of high blood sugar: persistent hunger and thirst, blurred vision, fatigue, wounds that won’t heal, and so on. Then they checked whether the people who typed those symptoms had also searched for a common antidepressant (paroxetine), a cholesterol-lowering drug (pravastatin), or both.
After crunching 82 million search queries, they struck gold. People who typed in those symptoms were more than twice as likely to have searched for both drugs as for just one. The old searches had turned up a fresh insight: an adverse drug interaction that appeared nowhere in the official medical literature.
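To make the approach concrete, here is a minimal sketch of that kind of co-occurrence analysis – a toy reconstruction, not the researchers’ actual pipeline. The query log, user IDs, symptom list, and numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query log of (user_id, query) pairs; the real study drew on
# tens of millions of anonymized web searches.
QUERIES = [
    ("u1", "blurred vision"), ("u1", "paroxetine"), ("u1", "pravastatin"),
    ("u2", "paroxetine"),
    ("u3", "always thirsty"), ("u3", "paroxetine"), ("u3", "pravastatin"),
    ("u4", "fatigue"), ("u4", "pravastatin"),
    ("u5", "pravastatin"),
]
SYMPTOMS = {"blurred vision", "always thirsty", "constant hunger", "fatigue"}
DRUG_A, DRUG_B = "paroxetine", "pravastatin"

# Group each user's queries into one set.
by_user = defaultdict(set)
for user, query in QUERIES:
    by_user[user].add(query)

def symptom_rate(users):
    """Fraction of the given users who searched for at least one symptom."""
    if not users:
        return 0.0
    return sum(1 for u in users if by_user[u] & SYMPTOMS) / len(users)

one_drug = [u for u, q in by_user.items() if (DRUG_A in q) != (DRUG_B in q)]
both_drugs = [u for u, q in by_user.items() if DRUG_A in q and DRUG_B in q]

# The signal: symptom searches are disproportionately common among users
# who looked up BOTH drugs, hinting at an unrecorded interaction.
print(f"one drug:   {symptom_rate(one_drug):.2f}")    # 0.33
print(f"both drugs: {symptom_rate(both_drugs):.2f}")  # 1.00
```

At the scale of 82 million queries, the same simple counting produces statistically meaningful ratios rather than toy numbers.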
This sort of finding is critical for drug companies to know. It is essential for drug regulators to be aware of. It is priceless for patients who could potentially die from nasty drug combinations. And it was uncovered by big data – the idea that we can do with a vast amount of data things that we simply couldn’t do without it.
As we collect and crunch more data, the good news is that we can do extraordinary things: fight disease, mitigate climate change, unlock mysteries of science. The change in scale leads to a change in state. It upends the nature of business, how government works and the way we live, from healthcare to education. Big data will even change how we think about the world and our place in it.
The bad news is that it raises a host of worries for which society is unprepared. What is clear, however, is that data is becoming the oil of the information age: a raw material and the foundation of new goods and services. To be sure, we have long used information to make decisions. But there is a radical change coming. Society is going from a constant shortage of data to a surfeit of information, and our ability to learn from data has improved as the technical tools have gotten better. Together, this upends everything.
Consider: For centuries, we collected and crunched only a sliver of information, because of the cost and complexity of processing larger amounts. So we made do with less data instead of more data. Next, we relied on data of the cleanest, highest quality possible, since we only tapped a little of it. So we privileged clean data over messy data. And we tried to uncover the reasons behind how the world worked, to generalize. So we searched for causality and resisted mere correlation.
Yet all this was actually a function of a small-data world, when we never had enough information. Change that, and a lot of other things are transformed as well. Suddenly, we don’t have to settle for less data, clean data, and causal explanations. Instead, we can place our trust in more data, messy data, and correlation.
Think of a car engine. Breakdowns rarely happen all at once. Instead, one hears strange noises or the driving “feels funny” a few days in advance. Large commercial fleet operators place sensors in their vehicles to measure the heat and vibration coming from the engine. By capturing these readings in streams of data, one can learn the patterns of a healthy engine, as well as the patterns that signal an impending breakdown.
That way, one can identify when a part is about to fail before it actually breaks. The car can alert the driver to visit a service station for repairs, as if it were clairvoyant. But this required lots of data, meant accepting messy data, and obliged us to give up knowing why the engine was about to break in exchange for the practical knowledge that it was.
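In code, the core of such a system can be surprisingly simple. The sketch below, with made-up sensor readings and an assumed threshold, flags an engine whose vibration drifts too far from the healthy baseline – a correlation-based warning with no model of why.

```python
import statistics

# Hypothetical vibration readings (arbitrary units) from an engine sensor.
# A real fleet would stream thousands of such values per vehicle per day.
healthy_history = [0.42, 0.40, 0.43, 0.41, 0.39, 0.44, 0.42, 0.40, 0.41, 0.43]
recent_readings = [0.44, 0.47, 0.52, 0.58, 0.66]  # drifting upward

# Learn what "normal" looks like from data on healthy engines.
baseline_mean = statistics.mean(healthy_history)
baseline_std = statistics.stdev(healthy_history)

def needs_service(readings, threshold=3.0):
    """Flag the engine if any reading sits too many standard deviations
    from the healthy baseline. We do not know WHY the vibration changed,
    only that engines showing this pattern tend to break down soon."""
    return any(abs(r - baseline_mean) / baseline_std > threshold
               for r in readings)

if needs_service(recent_readings):
    print("Alert: schedule preventive maintenance")
```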
The delivery company UPS has used this big-data approach, called predictive analytics, since the late 2000s to monitor its 60,000 vehicles in the United States and to know when to perform preventive maintenance. A breakdown on the road can wreak havoc, delaying pick-ups and deliveries. To be cautious, UPS used to replace certain parts after two or three years. But that was inefficient, as some of the components were still fine. Since harnessing big data, the company has saved millions of dollars by measuring and monitoring individual parts and replacing them only when necessary. In one instance, the data even revealed that an entire fleet of new vehicles had a defective component that could have spelled trouble had it not been spotted before the vehicles were sent out on the road.
That’s big data. It ushers in three big shifts: more, messy and correlation. Let’s look at each one. First, more. We can finally harness a vast quantity of information, and in some cases, we can analyze all the data about a phenomenon. This lets us drill down into details we could never see before. Second, messy. When we harness more data, we can shed our preference for data of only the best caliber and let in some imperfections. The benefit of using more data outweighs that of using cleaner but less comprehensive data. Third, correlation. Instead of trying to uncover causality, it is often sufficient to uncover practical answers. If some combination of aspirin and orange juice puts a deadly disease into remission, it is less important to know the biological mechanism at work than to just drink the elixir. For many things, with big data it is faster, cheaper and good enough to learn “what,” not “why.”
We can do these things because we have so much more data, and one reason for that is that we are taking more aspects of society and rendering them into data form. For example, last year IBM was granted a U.S. patent on “Securing premises using surface-based computing technology.” That’s intellectual-property mumbo-jumbo for a touch-sensitive floor covering, like a giant smartphone screen. The potential uses are vast. It would be able to know what objects were placed on it. It could turn on the lights in a room or open doors when a person entered. But there’s much more. It might identify individuals by their weight or by the way they stand and walk. And it could tell if someone fell and did not get back up, a crucial feature for an aging society. Retailers could even learn how foot traffic flows through their stores.
With so much data around, and the ability to process it, big data is becoming the bedrock of new companies. Yet the value of data lies in its secondary uses, not simply in the primary purpose for which it was initially collected, which is how we tended to value it in the past. Hence, one global logistics firm reuses data on who sends packages to whom in order to make economic forecasts. A travel site crunches billions of old flight-price records from airlines to predict whether a given airfare is a good deal or whether the price is likely to increase or decrease, empowering consumers with information on the best time to buy their ticket.
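A stripped-down sketch of that kind of fare logic might look like this – illustrative only, with invented prices and a crude trend rule; real systems mine billions of records and many more signals (route, season, airline, seat inventory).

```python
# Hypothetical history for one route: days before departure -> average fare.
fare_history = {60: 210.0, 45: 205.0, 30: 220.0, 21: 240.0, 14: 275.0, 7: 330.0}

def price_direction(days_left):
    """Compare average fares earlier vs. later than this point in time to
    guess whether waiting is likely to cost or save money."""
    earlier = [p for d, p in fare_history.items() if d > days_left]
    later = [p for d, p in fare_history.items() if d <= days_left]
    if not earlier or not later:
        return "unknown"
    earlier_avg = sum(earlier) / len(earlier)
    later_avg = sum(later) / len(later)
    return "likely to rise" if later_avg > earlier_avg else "likely to fall"

print(price_direction(30))  # -> "likely to rise": better to buy now
```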
These extraordinary data services require three things: the data, the skills, and a big-data mindset. Today, the skills are lacking. Few have the mindset, even though the data seems abundant. Yet over time, the skills and creativity will become commonplace, and the most prized part will be the data itself.
At the same time, big data also has a dark side. Privacy is harder to protect because the traditional legal and technical restrictions don’t work well anymore. And in an age of big data, a new problem emerges: propensity – penalizing someone based on what they are predicted to do, not on what they have done. We will also need to be vigilant not to fall victim to the “dictatorship of data,” the idea that we shut off our reasoned judgment and endow data-driven decisions with more authority than they deserve.
Solutions to these thorny problems include a fundamental reform of privacy law and of the technology used to protect personal information. Specifically, we probably need to shift from regulating the collection of data to regulating its uses, to prevent misuse or abuse. In an age when we can be punished by a prediction, we need to guarantee human agency, free will and moral choice. And to prevent big data from becoming a black box, a new class of professionals – call them “algorithmists” – can offer transparency and accountability. Steeped in statistics, data collection and big-data algorithms, they would advise companies and courts. They may do for the big data age what accountants and auditors did for an era 100 years ago, when the data deluge swamping society came in the form of financial information.
The term “big data” is imperfect. Its importance is not just about size. It is about how we harness the new plethora of information to appreciate the world better. And in so doing, understand ourselves.