Defining big data

The Conversation: FCW's reporters and editors respond to your comments.

Big Data

In a comment on FCW's April 15 article, "Sketching the big picture on big data," a reader offered a definition of the term: An easily scalable system of unstructured data with accompanying tools that can efficiently pull structured datasets.

Frank Konkel responds: While I do not disagree with your definition, I believe some people might add to it or subtract from it. Your definition wisely includes "easily scalable," which answers a question that some big data definitions seem to (conveniently?) leave out: how big the big data actually is. The phrase "easily scalable" tells the user there really isn't a limit on size here – if it is scalable, we'll get there.

However, I'm not sure I agree that big data has to be unstructured. For example, the National Oceanic and Atmospheric Administration, an agency within the U.S. Department of Commerce, uses pools of structured data from different sources (including satellites and ground-based observatories) in its climate modeling and weather forecasting. These data troves are large – terabytes and bigger – and in some cases, such as weather prediction, high-end computers spit out storm models in real time several times a day. Is that big data? Depending on who you ask, it might be.

What about at the United States Postal Service? USPS' supercomputing facilities in Minnesota process 6,100 mail pieces per second – about 528 million each day – and screen them for fraud. The time it takes to scan one piece at a post office and compare the data against a database of 400 billion objects? Less than 100 milliseconds. Is that big data? Again, it might depend on who you ask.
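To put those throughput figures in perspective, here is a back-of-envelope sketch – not the Postal Service's actual system, and the round-the-clock assumption is mine – showing how the per-second rate cited above scales to the daily total:

    # Rough check of the USPS throughput figures cited in this column.
    # The constants come from the article; the 24-hour assumption is illustrative.
    pieces_per_second = 6_100          # mail pieces scanned per second
    seconds_per_day = 24 * 60 * 60     # assuming round-the-clock processing

    pieces_per_day = pieces_per_second * seconds_per_day
    print(f"{pieces_per_day:,} pieces per day")   # ~527,040,000, roughly the 528 million cited

    # Each scan is compared against a database of 400 billion objects
    # in under 100 milliseconds per lookup.
    max_lookup_seconds = 0.100
    print(f"lookup budget per piece: {max_lookup_seconds * 1000:.0f} ms")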

In addition, while I agree it's nice to pull structured datasets from unstructured data, I feel one thing missing from most big data definitions is the "why" factor. You're structuring this data – hopefully – for a purpose: to develop actionable insights. Why else would we be doing big data, right? Yet only some definitions seem to include the "value" aspect, one of the "v" words that also include volume, veracity, variety and velocity.

Teradata's Bill Franks, who recently authored a book on big data, argues that value is the single most important factor in all of big data. Is it not reasonable to think that aspect might be outlined in any big data definition?

Because big data is relatively new on the IT scene, I suspect its definition and uses will remain ambiguous for a while. But just as with cloud computing, its definition, along with its practical uses, will be cemented in the years to come.