
Here’s Why You May Never Be Truly Anonymous in a Big Data World

Bruce Rolff/

Big data, the kind that statisticians and computer scientists scour for insights on human beings and our societies, is cooked up using a recipe that’s been used a thousand times. Here’s how it goes: Acquire a trove of people’s highly personal data, such as medical records or shopping history. Run that huge set through a “de-identification” process to anonymize the data. And voilà: individuals become anonymous, chartable, and unencumbered by personal privacy concerns.

So what’s the problem? It turns out that all that de-identified data may not be so anonymous after all.

So argues Arvind Narayanan, a Princeton computer scientist who first made waves in the privacy community by co-authoring a 2006 paper showing that Netflix users and their entire rental histories could be identified by cross-referencing supposedly anonymous Netflix ratings with the Internet Movie Database. Narayanan and fellow Princeton professor Edward Felten delivered the latest blow to the case of de-identification proponents (those who maintain that de-identification is viable) with a July 9 paper that makes a serious case for data paranoia.

They argue that de-identification doesn’t work—in theory or in practice—and that those who say it does are promoting a “false sense of security” by naively underestimating the attackers who might try to deduce personal information from big data. Here are Narayanan and Felten’s main points:

Personal location data isn’t really anonymous

A 2013 study showed that given a large dataset of human mobility data collected from smartphones, 95 percent of individuals were uniquely identifiable from as few as four points—think check-ins or shared photos with geo-location metadata. Even the most devout de-identificationists admit there’s no robust way to anonymize location data. 
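To make the "four points" claim concrete, here is a minimal sketch of how such a uniqueness measurement could be run. It is not the study's code: the `unicity` function, the toy `traces` data, and the (place, hour) representation of a location point are all assumptions made for illustration.

```python
import random

def unicity(traces, k, seed=0):
    """Fraction of users singled out by k points drawn from their own trace.

    `traces` maps a user ID to a set of (place, hour) points. This layout is
    a hypothetical stand-in for real mobility data.
    """
    rng = random.Random(seed)
    unique = 0
    eligible = 0
    for user, points in traces.items():
        if len(points) < k:
            continue  # too few points to draw a sample of size k
        eligible += 1
        sample = rng.sample(sorted(points), k)
        # Every user whose trace contains all k sampled points is a candidate.
        matches = [u for u, other in traces.items()
                   if all(p in other for p in sample)]
        if matches == [user]:
            unique += 1
    return unique / eligible if eligible else 0.0

# Toy data: three people who share a morning cafe but little else.
traces = {
    "a": {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22)},
    "b": {("cafe", 8), ("office", 9), ("bar", 20), ("home", 23)},
    "c": {("cafe", 8), ("school", 9), ("park", 17), ("home", 21)},
}

print(unicity(traces, 4))
```

With only one known point, a shared location such as the cafe leaves several candidates. Each additional point shrinks the candidate set, which is why a handful of check-ins can be enough to single someone out.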

Experts don’t know how vulnerable data is

In a case study of the meticulously de-identified Heritage Health Prize dataset, which contains the medical records of 113,000 patients, the University of Ottawa professor and de-identification expert Khaled El Emam estimated that less than 1 percent of patients could be re-identified. Narayanan, on the other hand, estimated that over 12 percent of patients in the data were identifiable. If an attack is informed by additional, specific information (for example, in an attempt to defame a known figure by exposing private information), it could be orders of magnitude easier to finger an individual within a dataset.

De-identification is hard, and re-identification is forever

De-identifying data is challenging and error-prone. In a recently released dataset of 173 million taxi rides in New York City, it turned out that individual taxis, and even their drivers, could be identified because the hashing (a mathematical function that disguises numbers) of license plate numbers in the data was shoddy.
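A short sketch shows why hashing alone can fail to hide an identifier drawn from a small set of possible values. The article does not say which hash function was used or how the identifiers were formatted, so the unsalted MD5 hash and the four-character format (digit, letter, digit, digit) below are assumptions chosen purely for illustration.

```python
import hashlib
import string
from itertools import product

def md5_hex(text):
    """Unsalted MD5 digest of a string, as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def build_lookup():
    """Hash every ID in the hypothetical format and index the results.

    The format (digit, letter, digit, digit) is an assumption; it allows
    only 10 * 26 * 10 * 10 = 26,000 possible values.
    """
    table = {}
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table

lookup = build_lookup()

# A "de-identified" value as it might appear in a released dataset.
released = md5_hex("7B42")
print(lookup[released])  # the original ID, recovered by table lookup
```

Because every possible identifier can be enumerated and hashed in well under a second, the hash works like a reversible code rather than a disguise. A secret key or random replacement IDs would not be open to this enumeration.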

The thing is, when a person’s anonymity is publicly compromised, the exposure is immortalized online. That can be an even worse problem than a data breach at a company or web app. When a company’s security is breached, cleanup is messy but doable: the flaw is patched, users are alerted, and life goes on. A compromised account can be abandoned; an entire identity cannot.

So should we smash our smartphones, swear off health care, and head for the hills? Not according to the de-identification defender El Emam. He points out that Narayanan did not actually manage to re-identify a single patient in the Heritage Health Prize dataset. “If he is one of the leading re-identification people around,” El Emam says, “then that is pretty strong evidence that de-identification, when done properly, is viable and works well.”

That’s good news for all of us human beings who make up big data. But just because the anonymity of big data hasn’t been definitively broken yet doesn’t mean it’s unbreakable.

