Intelligence agencies increasingly are looking beyond the satellite photos and secret reports upon which they've traditionally relied for insight into U.S. adversaries' actions and are turning to data-crunching algorithms that can sift through massive piles of disparate information, such as GPS reports, social media posts and online images, said Amr Awadallah, co-founder and chief technology officer of CloudEra, a vendor that maintains and manages big data systems.
While intelligence agencies are the first government entities to mine big data -- data sets too large to be analyzed by desktop analytical tools -- they're unlikely to be the last, Awadallah said.
Agencies managing Social Security, Medicare and Medicaid, for instance, could analyze big data to spot trends in fraud and abuse, and the Transportation Department could crunch through satellite images to get a better sense of traffic patterns on interstate highways.
CloudEra's federal customers include the CIA and the National Security Agency. "I can't talk about what those projects are, but you can imagine how much data they have and what type of things they could be doing with it," he said.
The CIA also indirectly invested in CloudEra through In-Q-Tel, an independent, nonprofit venture capital firm that was started at the spy agency's request and describes its mission as delivering useful technology to the agency.
Awadallah spoke with Nextgov on the sidelines of the Government Big Data Forum that vendor Carahsoft Technology sponsored on March 6.
At the root of most big data crunching systems is the open source software Apache Hadoop. Its first major innovation is the ability to link together multiple computers and servers, either in a proprietary data center or in a computer cloud, and make them work like one huge computer that can scale up for a major task.
The software's second major innovation is the ability to sort through unstructured data, such as all posts under a particular Twitter hashtag or emails containing a particular word or phrase, as well as through more structured data such as spreadsheets.
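The two ideas described above, splitting a job across many machines and pulling signal out of free-form text, can be illustrated with a toy example. The sketch below is not Hadoop code and does not use Hadoop's API; it is a single-process Python imitation of the map-and-reduce pattern, and the sample posts and the names `map_shard`, `reduce_counts` and `count_hashtags` are made up for illustration. Each "shard" stands in for the slice of data one machine would handle.

```python
import re
from collections import Counter
from functools import reduce

# Matches a hashtag such as "#traffic" and captures the word after the "#".
HASHTAG = re.compile(r"#(\w+)")


def map_shard(posts):
    """Map step: count hashtags in one shard of free-text posts."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in HASHTAG.findall(post))
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_hashtags(shards):
    """Run the map step on every shard, then fold the partial results together."""
    partials = map(map_shard, shards)
    return reduce(reduce_counts, partials, Counter())


# Made-up sample data: two shards, as if stored on two different machines.
shards = [
    ["Stuck on I-95 again #traffic", "Great talk today #BigData"],
    ["#bigdata meets #traffic sensors", "No tags here"],
]

totals = count_hashtags(shards)
print(totals.most_common())
```

In a real cluster the map step would run on many machines at once and the framework would handle moving the partial results around; here everything runs in one process so the pattern is easy to follow.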
"The old way of collecting data was to only collect it . . . when a human generates it," Awadallah said, such as by making a purchase or filling out a survey.
"We called that an explicit transaction," he said. "Now we're collecting implicit information. We have all these sensors around humans in mobile devices and satellites taking images and there are Web services collecting information about you all the time nonstop."
The classic example of big data in the private sector is when Google, Facebook or another site mines a user's search history, network of contacts and profile information to micro-target the advertisements she's most likely to click on.
Big data can be used in other commercial ways, though, that have nothing to do with Web activity.
The company Skybox Imaging, for example, has made a business out of sorting through satellite data to deliver commercial intelligence on demand, according to Awadallah.
"So [for example] you can buy a little stream from them that gives you a measure of how many cars are parked at Home Depot in different locations across the country," he said. "If you're a competitor of Home Depot's or if you're a financial analyst who's trying to predict the quarterly earnings of Home Depot that's very valuable information."