recommended reading

Here’s Why You May Never Be Truly Anonymous in a Big Data World

Bruce Rolff/

Big data—the kind that statisticians and computer scientists scour for insights on human beings and our societies—is cooked up using a recipe that’s been used a thousand times. Here’s how it goes: Acquire a trove of people’s highly personal data—say, medical records or shopping history. Run that huge set through a “de-identification” process to anonymize the data. And voila—individuals become anonymous, chartable, and unencumbered by personal privacy concerns.

So what’s the problem? It turns out that all that de-identified data may not be so anonymous after all.

So argues Arvind Narayanan, a Princeton computer scientist who first made waves in the privacy community by co-authoring a 2006 paper showing that Netflix users and their entire rental histories could be identified by cross-referencing supposedly anonymous Netflix ratings with the Internet Movie Database. Narayanan and fellow Princeton professor Edward Felten delivered the latest blow to the case of de-identification proponents (those who maintain that de-identification is viable) with a July 9 paper that makes a serious case for data paranoia.

They argue that de-identification doesn’t work—in theory or in practice—and that those who say it does are promoting a “false sense of security” by naively underestimating the attackers who might try to deduce personal information from big data. Here are Narayanan and Felten’s main points:

Personal location data isn’t really anonymous

A 2013 study showed that given a large dataset of human mobility data collected from smartphones, 95 percent of individuals were uniquely identifiable from as few as four points—think check-ins or shared photos with geo-location metadata. Even the most devout de-identificationists admit there’s no robust way to anonymize location data. 

Experts don’t know how vulnerable data is

In a case study of the meticulously de-identified Heritage Health Prize dataset, which contains the medical records of 113,000 patients, the University of Ottawa professor and de-identification expert Khaled El Emam estimated that less that 1 percent of patients could be re-identified. Narayanan, on the other hand, estimated that over 12 percent of patients in the data were identifiable. If an attack is informed by additional, specific information—for example, in an attempt to defame a known figure by exposing private information—it could be orders of magnitude easier to finger an individual within a dataset.

De-identification is hard, and re-identification is forever

De-identifying data is challenging and error-prone. In a recently released dataset of 173 million taxi rides in New York City, it turned out that individual taxis, and even their drivers, could be identified because the hashing (a mathematical function that disguises numbers) of license plate numbers in the data was shoddy.

The thing is, when a person’s anonymity is publicly compromised, it’s immortalized online. That can be an even worse problem than a data breach at a company or web app. When a company’s security is breached, cleanup is messy but doable: the flaw is patched, users are alerted, and life goes on. But abandoning a compromised account is more feasible than abandoning an entire identity.

So should we smash our smartphones, swear off health care, and head for the hills? Not according to the de-identification defender El Emam. He points out that Narayanan did not actually manage to re-identify a single patient in the Heritage Health Prize dataset. “If he is one of the leading re-identification people around,” El Emam says, “then that is pretty strong evidence that de-identification, when done properly, is viable and works well.”

That’s good news for all us human beings who make up big data. But just because the anonymity of big data hasn’t been definitively broken yet doesn’t mean it’s unbreakable.

(Image via Bruce Rolff/

Threatwatch Alert

Thousands of cyber attacks occur each day

See the latest threats


Close [ x ] More from Nextgov

Thank you for subscribing to newsletters from
We think these reports might interest you:

  • Modernizing IT for Mission Success

    Surveying Federal and Defense Leaders on Priorities and Challenges at the Tactical Edge

  • Communicating Innovation in Federal Government

    Federal Government spending on ‘obsolete technology’ continues to increase. Supporting the twin pillars of improved digital service delivery for citizens on the one hand, and the increasingly optimized and flexible working practices for federal employees on the other, are neither easy nor inexpensive tasks. This whitepaper explores how federal agencies can leverage the value of existing agency technology assets while offering IT leaders the ability to implement the kind of employee productivity, citizen service improvements and security demanded by federal oversight.

  • Effective Ransomware Response

    This whitepaper provides an overview and understanding of ransomware and how to successfully combat it.

  • Forecasting Cloud's Future

    Conversations with Federal, State, and Local Technology Leaders on Cloud-Driven Digital Transformation

  • IT Transformation Trends: Flash Storage as a Strategic IT Asset

    MIT Technology Review: Flash Storage As a Strategic IT Asset For the first time in decades, IT leaders now consider all-flash storage as a strategic IT asset. IT has become a new operating model that enables self-service with high performance, density and resiliency. It also offers the self-service agility of the public cloud combined with the security, performance, and cost-effectiveness of a private cloud. Download this MIT Technology Review paper to learn more about how all-flash storage is transforming the data center.


When you download a report, your information may be shared with the underwriters of that document.