Here’s Why You May Never Be Truly Anonymous in a Big Data World

Bruce Rolff/

Big data—the kind that statisticians and computer scientists scour for insights on human beings and our societies—is cooked up using a recipe that’s been used a thousand times. Here’s how it goes: Acquire a trove of people’s highly personal data—say, medical records or shopping history. Run that huge set through a “de-identification” process to anonymize the data. And voila—individuals become anonymous, chartable, and unencumbered by personal privacy concerns.

So what’s the problem? It turns out that all that de-identified data may not be so anonymous after all.

So argues Arvind Narayanan, a Princeton computer scientist who first made waves in the privacy community by co-authoring a 2006 paper showing that Netflix users and their entire rental histories could be identified by cross-referencing supposedly anonymous Netflix ratings with the Internet Movie Database. Narayanan and fellow Princeton professor Edward Felten dealt the latest blow to proponents of de-identification, who maintain that the technique is viable, with a July 9 paper that makes a serious case for data paranoia.

They argue that de-identification doesn’t work—in theory or in practice—and that those who say it does are promoting a “false sense of security” by naively underestimating the attackers who might try to deduce personal information from big data. Here are Narayanan and Felten’s main points:

Personal location data isn’t really anonymous

A 2013 study showed that given a large dataset of human mobility data collected from smartphones, 95 percent of individuals were uniquely identifiable from as few as four points—think check-ins or shared photos with geo-location metadata. Even the most devout de-identificationists admit there’s no robust way to anonymize location data. 
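To make the idea of being "uniquely identifiable from a few points" concrete, here is a minimal sketch in Python. It is not the study's code or methodology: the `unicity` function, the toy `traces` data, and the representation of a point as a `(cell, hour)` pair are all assumptions made for illustration. For each person it samples k known points and checks whether anyone else's trace also contains all of them.

```python
import random

# Hypothetical toy data: each trace is a set of (cell_id, hour) points.
traces = {
    "a": {(1, 9), (2, 12), (3, 18)},
    "b": {(1, 9), (2, 12), (4, 18)},
    "c": {(1, 9), (5, 12), (6, 18)},
}

def unicity(traces, k, seed=0):
    """Fraction of people singled out by k points sampled from their own trace."""
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for person, points in traces.items():
        if len(points) < k:
            continue  # not enough points to sample from
        eligible += 1
        # sorted() gives random.sample the sequence it requires
        sample = set(rng.sample(sorted(points), k))
        # Everyone whose trace contains all of the sampled points
        matches = [p for p, pts in traces.items() if sample <= pts]
        if matches == [person]:
            unique += 1
    return unique / eligible if eligible else 0.0

share_identified = unicity(traces, k=3)
```

In this toy dataset all three people share the point `(1, 9)`, so one known point can never single anyone out, while knowing all three points of a trace singles out everyone. Real mobility data is far larger, but the check has the same shape.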

Experts don’t know how vulnerable data is

In a case study of the meticulously de-identified Heritage Health Prize dataset, which contains the medical records of 113,000 patients, the University of Ottawa professor and de-identification expert Khaled El Emam estimated that less than 1 percent of patients could be re-identified. Narayanan, on the other hand, estimated that over 12 percent of patients in the data were identifiable. If an attack is informed by additional, specific information (for example, in an attempt to defame a known figure by exposing private information), it could be orders of magnitude easier to finger an individual within a dataset.

De-identification is hard, and re-identification is forever

De-identifying data is challenging and error-prone. In a recently released dataset of 173 million taxi rides in New York City, it turned out that individual taxis, and even their drivers, could be identified because the hashing (a mathematical function that disguises numbers) of license plate numbers in the data was shoddy.
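Why would hashing fail to disguise an identifier? One common reason is that the space of possible identifiers is small enough to enumerate. The sketch below illustrates that general weakness and is not a reconstruction of the taxi dataset: the four-character ID format (digit, letter, digit, digit), the choice of SHA-256, and the helper names are all assumptions made for the example.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def pseudonymize(identifier):
    """Unsalted hash standing in for a weak de-identification step."""
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()

def candidate_ids():
    """Enumerate every ID in a hypothetical format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"

def build_lookup():
    """Hash every possible ID once, giving a reverse table from hash to ID."""
    return {pseudonymize(c): c for c in candidate_ids()}

lookup = build_lookup()            # only 26,000 candidates to hash
released = pseudonymize("5X41")    # what a "de-identified" record would contain
recovered = lookup[released]       # the original identifier, recovered by lookup
```

The strength of the hash function does not matter here. With only 26,000 possible inputs, an attacker can hash every one of them in well under a second and simply look the answer up.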

The thing is, when a person’s anonymity is publicly compromised, it’s immortalized online. That can be an even worse problem than a data breach at a company or web app. When a company’s security is breached, cleanup is messy but doable: the flaw is patched, users are alerted, and life goes on. Abandoning a compromised account is feasible; abandoning an entire identity is not.

So should we smash our smartphones, swear off health care, and head for the hills? Not according to the de-identification defender El Emam. He points out that Narayanan did not actually manage to re-identify a single patient in the Heritage Health Prize dataset. “If he is one of the leading re-identification people around,” El Emam says, “then that is pretty strong evidence that de-identification, when done properly, is viable and works well.”

That’s good news for all of us human beings who make up big data. But just because the anonymity of big data hasn’t been definitively broken yet doesn’t mean it’s unbreakable.
