One useful definition of the unstructured data that underlies most existing and theoretical big data projects is that it was often collected for some purpose other than the one researchers are now using it for.
That definition was provided by Chris Barrett, executive director of the Virginia Bioinformatics Institute, during a series of presentations Thursday before the President’s Council of Advisors on Science and Technology that focused on the value of data mining for public policy.
Data that was initially collected to measure educational achievement, for instance, could be used to analyze how educational achievement relates to obesity or incarceration rates in a particular community.
This definition points to the potential of big data analysis as more and more information is gathered online and elsewhere, but it also points to some challenges, as outlined by Duncan Watts, a principal researcher at Microsoft’s research division.
First off, a large portion of the data that might be valuable to social scientists, policymakers, urban planners and others is held by private companies that release only portions of it to researchers. Facebook, Amazon, Google, email providers and ratings companies all know certain things about you and about society, in other words, but there’s no way to aggregate that data to draw global insights.
“Many of the questions that are of interest to social science really require us being able to join these different modes of data and to see who are your friends, what are they thinking and what does that mean about what you end up doing,” Watts said. “You cannot answer these questions in any but the most limited way with the data that’s currently assembled.”
Second, even if social scientists were able to draw on that aggregated data, doing so would raise significant privacy concerns among the public.
“This is a very sensitive point because, to some extent, this is what the NSA has been reputedly doing, joining together different sorts of data,” Watts said. “And you can understand how sensitive people are about that. Precisely the reason why this is scientifically interesting is also the reason why it’s so sensitive from a privacy perspective.”
Finally, because much of the data that’s useful to social scientists was gathered for other purposes, there’s often some bias in the data itself, Watts said.
“When you go to Facebook, you’re not seeing some kind of unfiltered representation of what your friends are interested in,” he said. “What you’re seeing is what Facebook’s news ranking algorithm thinks that you’ll find interesting. So when you click on something and the social scientist sees you do that and makes some inference about what you’re sharing and why, it’s hopelessly confounded.”