The New Science of Chance: Big Data Predictions
Defining Big Data and Examples
Big Data, at least in the world of IT number crunchers, has become something of a cause célèbre. Big Data is an element of the Third Platform in IT (visit our blog post about the Third Platform here). IT professionals debate what constitutes a Big Data resource as well as which analytical approaches work best. This post will not settle that debate, but it will help you understand what Big Data is and why it matters. First, we will define a few Big Data terms and then describe its predictive value with examples.
Defining Big Data
Big Data is a general term used to describe the voluminous amount of unstructured data a company creates. Big Data Analytics is the process of mining these enormous, messy stockpiles of seemingly innocuous entries and queries to discover discernible and, ideally, repeatable patterns. Detecting previously unknown correlations and harnessing them in novel ways can produce useful insights, or goods and services of significant value.
Today, the terms Big Data and Big Data Analytics are used synonymously.
To give you an idea of the immense size of information deemed Big Data, most data sets are measured in terms of exabytes. One exabyte is a billion billion, or a quintillion, bytes: enough storage to hold roughly 50,000 years of DVD-quality video.
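That 50,000-year figure is easy to sanity-check. The short sketch below does the arithmetic, assuming a DVD-quality bitrate of about 5 Mbps (an assumption for illustration; real DVD bitrates range from roughly 3 to 9 Mbps):

```python
# Back-of-the-envelope check: how many years of DVD-quality video
# fit in one exabyte (a quintillion bytes)?
EXABYTE = 10**18  # bytes

# Assumed DVD-quality bitrate of ~5 Mbps, converted to bytes per second.
BYTES_PER_SECOND = 5_000_000 / 8

seconds_of_video = EXABYTE / BYTES_PER_SECOND
years_of_video = seconds_of_video / (365 * 24 * 3600)

print(round(years_of_video))  # on the order of 50,000 years
```

The result lands right around the commonly quoted 50,000-year figure, so the claim holds up under a reasonable bitrate assumption.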
There is controversy regarding the above definition of Big Data because some analysts say there are types of structured data that merit inclusion. Regardless of the exact content and size, a practical rule of thumb in its identification has emerged: Big Data is seemingly chaotic information that is deemed too costly in time and dollars to load into a relational database.
Big Data Examples
So far, the most famous example of the power of Big Data is Google’s innovative analysis of search queries and their relationship to potentially fatal outbreaks of the H1N1 flu virus. H1N1 is a mutant hybrid that combines elements of the bird flu and swine flu viruses, and its discovery and rapid spread in 2009 left health officials fearing that a pandemic was sweeping the US.
Health officials saw that the same conditions existed as with the 1918 Spanish flu: millions of people were infected, and no vaccine against the new strain was readily available in all parts of the country. All that jittery officials at the Centers for Disease Control and Prevention (CDC) could do was track the number of new flu cases (as reported by primary care physicians) and try to arrest its march across the country.
The doctors dutifully reported new H1N1 flu patients, but there was a latency problem. People may feel sick for days before going to the doctor, and the CDC tabulated these flu numbers only once a week, causing an average two-week lag between sickness onset and reporting. That is an eternity when fighting a disease that spreads easily with a cough, a sneeze, or even a touch. Unable to pinpoint the spread of H1N1 in near real time, public health officials were groping in the dark for an effective way to head off a potential catastrophe.
Coincidentally, a few weeks before the H1N1 outbreak was discovered, Google quietly published an innovative research paper in the scientific journal Nature. The paper reported an effort to identify areas of the United States infected by the winter flu virus based on what people searched for on the Internet. Google’s computer scientists designed sophisticated software that identified a combination of 45 search terms people had used to gather information about the winter flu (e.g., “medicine for cough and fever”). Using complex and creative mathematical models, the researchers compared these terms against queries collected between 2003 and 2008.
And here is where the term Big Data earns its brand. Google receives more than three billion search queries each day. For their analysis of winter flu outbreaks, the researchers tested 450 million different mathematical models on the 5.5 quadrillion queries received between 2003 and 2008 to ferret out flu-related inputs. Their system looked for correlations between the frequency of certain search queries and the spread of flu over time and space. Using these results, they created a graph of predicted flu activity between 2003 and 2008.
They then compared these results, which were in effect predictions of flu outbreaks, against actual cases reported to the CDC during the same time period. And they hit pay dirt: their predictive figures correlated almost perfectly with official nationwide figures across the United States.
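The core statistical idea, measuring how closely a search term's frequency tracks reported flu cases, can be sketched in a few lines. This is a toy illustration with made-up weekly counts, not Google's actual pipeline or data, which are far larger and not public:

```python
# Toy sketch of the Google Flu Trends idea: score each search term by how
# well its weekly frequency correlates with reported flu cases.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical weekly counts: CDC-reported flu cases vs. two query terms.
cdc_cases = [120, 340, 560, 900, 640, 300]
queries = {
    "medicine for cough and fever": [80, 210, 400, 650, 430, 190],
    "cheap flights to florida":     [300, 310, 290, 305, 295, 315],
}

# Terms whose frequency rises and falls with case counts score near 1;
# unrelated terms score near 0.
scores = {term: pearson(freq, cdc_cases) for term, freq in queries.items()}
```

Run at scale across billions of queries, this kind of scoring is how a handful of flu-related terms can be separated from the noise.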
For the H1N1 outbreak, health officials were now armed with valuable information: where and when large outbreaks were likely to occur. Using this knowledge, they had the vaccine ready for the places that were most affected. Google’s predictive methodology circumvented the natural lags of compiling government statistics and, more than likely, saved lives.
Another example of the life-saving implications of Big Data involves our most populous city, New York. Not only are there more people in New York than in any other city in the country, there are more manholes: 51,000, to be precise. These cast-iron Frisbees weigh up to 300 pounds and, on occasion, blast out of the street as high as three stories.
The city’s commercial electric grid, first lit by Thomas Edison in 1882, contains 21,000 miles of cable, almost enough to circle the globe. Because a manhole cover launching itself like a Titan missile is extremely dangerous, Con Edison hired researchers at Columbia University to figure out a way to predict manhole explosions.
The team had data on cable repairs and installations dating back to the 1880s, plus 10 years of trouble-ticket reports amounting to 61,000 typed documents. These data also contained much irrelevant information, such as parking arrangements for Con Ed vehicles or notes that a customer did not speak English. The researchers developed an algorithm to create order out of the confusion.
Their analysis of the trouble reports showed that the majority of the explosions came from manholes with thick, deteriorating cables. Larger amounts of insulation left more decayed material, which was more vulnerable to the inevitable build-up of methane and other gases present in an underground environment. From this information, the researchers developed the “hot spot theory”: manholes with larger cables were more likely to explode.
Armed with this information, city works officials modified manholes with thick cables and virtually eliminated explosions.
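The prediction itself amounts to ranking: score each manhole by the attributes associated with explosions, then send crews to the highest-scoring ones first. The fields and weights below are invented for illustration and are not the Columbia team's actual model:

```python
# Rough sketch of the "hot spot" idea: score each manhole by cable size
# and age, then inspect the riskiest ones first.
manholes = [
    {"id": "MH-001", "cable_diameter_in": 3.5, "oldest_cable_year": 1921},
    {"id": "MH-002", "cable_diameter_in": 1.0, "oldest_cable_year": 1995},
    {"id": "MH-003", "cable_diameter_in": 2.8, "oldest_cable_year": 1930},
]

def risk_score(manhole, current_year=2013):
    """Arbitrary illustrative weighting of cable thickness and age."""
    age = current_year - manhole["oldest_cable_year"]
    return manhole["cable_diameter_in"] * 10 + age

# Sort so the highest-risk manholes come first in the inspection queue.
inspection_order = sorted(manholes, key=risk_score, reverse=True)
```

The payoff of a ranked list is practical: with 51,000 manholes and limited crews, even a rough ordering concentrates preventive work where it matters most.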
In a more benign example, Nate Silver, a former baseball statistician, used Big Data to predict the 2012 Presidential election with stunning accuracy. Mr. Silver consistently rejected the conventional wisdom that the race was tied. He developed a statistical model that aggregated state- and district-level polling data to predict which states each candidate would win. On election night, he was vindicated: his predictions were 100% correct.
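The heart of poll aggregation is a weighted average per state. The sketch below uses invented poll numbers and weights; Silver's actual model also adjusts for pollster quality, sample size, and demographics:

```python
# Toy poll aggregation: average each state's polls, weighting some polls
# (e.g., more recent ones) more heavily, and call the state for the leader.
def predict_state(polls):
    """polls: list of (candidate_a_pct, candidate_b_pct, weight) tuples."""
    total_weight = sum(w for _, _, w in polls)
    a = sum(pa * w for pa, _, w in polls) / total_weight
    b = sum(pb * w for _, pb, w in polls) / total_weight
    return "A" if a > b else "B"

# Hypothetical polls for one state: (A%, B%, weight).
ohio_polls = [(49, 47, 1.0), (50, 46, 1.5), (47, 48, 0.5)]
print(predict_state(ohio_polls))  # "A"
```

Repeating this for every state, then summing electoral votes, turns a noisy pile of individual polls into a single defensible forecast, which is why the model could confidently call a race that pundits insisted was a toss-up.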
Think of everything you put into the ether of the Internet. Each time you perform a search, create a document, or even type a keystroke, you could be putting out a piece of a puzzle that is waiting to be solved.
Sources:
Wired, 2010
SingularityHUB, 2012
Tech Target, April 2005
World Health Organization, 2009
Centers for Disease Control and Prevention (CDC), 2010
Nature (457), 2009