Big data sets can be used to uniquely identify individuals.
The National Security Agency is collecting the telephone data of millions of Verizon customers in the United States, according to a Wednesday Guardian report, and the information collected could be incredibly revealing even if it doesn't seem so at first. That's because big data sets -- even supposedly anonymized ones -- can often be used to uniquely identify individuals.
The secret court order forcing Verizon to give up the data over a three-month period doesn't cover the contents of messages or personal information of individual subscribers, The Guardian reported. Instead:
It specifies that the records to be produced include "session identifying information", such as "originating and terminating number," the duration of each call, telephone calling card numbers, trunk identifiers, International Mobile Subscriber Identity (IMSI) number, and "comprehensive communication routing information."
The information is classed as "metadata," or transactional information, rather than communications, and so does not require individual warrants to access. The document also specifies that such "metadata" is not limited to the aforementioned items. A 2005 court ruling judged that cell site location data -- the nearest cell tower a phone was connected to -- was also transactional data, and so could potentially fall under the scope of the order.
The legal underpinnings of the data collection are similar to those behind the recent Justice Department subpoena for the phone records of Associated Press staff.
And, as a concurrent Guardian report points out, the government has long argued that this kind of data is perfectly legal to collect because it's similar to collecting the information on the outside of an envelope. But even that so-called "transactional" data -- phone numbers, phone serial numbers, time and length of calls -- can represent a goldmine of information. Collect enough of it and you can use it later to identify individuals.
That's a fact researchers at MIT and the Université Catholique de Louvain, in Belgium, recently highlighted in their own study of a giant set of phone data. After analyzing 1.5 million cellphone users over the course of 15 months, the researchers found they could uniquely identify 95 percent of cellphone users based on just four data points -- that is, just four instances of where a user was and what hour of the day it was. With just two data points, they could identify more than half of the users. And the researchers suggested that the study may underestimate how easy it is:
For the purpose of re-identification, more sophisticated approaches could collect points that are more likely to reduce the uncertainty, exploit irregularities in an individual's behaviour, or implicitly take into account information such as home and workplace or travels abroad. Such approaches are likely to reduce the number of locations required to identify an individual, vis-à-vis the average uniqueness of traces.
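To see why a handful of points can be enough, here is a minimal sketch in Python. It is not the researchers' code or data: the three traces, the tower names, and the helpers `matching_users` and `unique_fraction` are all made up for illustration. The idea is simply to pick a few (tower, hour) points at random from one person's trace and count how many people in the data set match all of them.

```python
import random

# Hypothetical toy data: each user maps to a set of (cell tower, hour of day)
# points. Real data sets are vastly larger; these values are invented.
traces = {
    "A": {("tower1", 8), ("tower2", 12), ("tower3", 18), ("tower1", 22)},
    "B": {("tower1", 8), ("tower2", 12), ("tower4", 18), ("tower5", 22)},
    "C": {("tower1", 8), ("tower6", 12), ("tower3", 18), ("tower5", 22)},
}

def matching_users(traces, points):
    """Return the users whose trace contains every point in `points`."""
    return [user for user, trace in traces.items() if points <= trace]

def unique_fraction(traces, k, trials=200, seed=0):
    """Estimate how often k random points from a trace match only one user."""
    rng = random.Random(seed)
    users = sorted(traces)
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        # Draw k known points from this user's own trace.
        points = set(rng.sample(sorted(traces[user]), k))
        if matching_users(traces, points) == [user]:
            unique += 1
    return unique / trials

if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, unique_fraction(traces, k))
```

In this toy set, everyone is near tower1 at 8 a.m., so that single point identifies nobody, while all four points together always single out one person. The share of unique matches can only grow as more points are known, which is the pattern the study measured at scale.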
And it's not just phone records that can reveal who you really are. A 2006 New York Times story made it clear just how simple it is to figure out a person's identity based on their web searches. A then-62-year-old woman in Lilburn, Georgia, thought she was perfectly anonymous in her searches for "numb fingers," "60 single men," and "dog that urinates on everything."
But when her AOL search data was released online, it didn't take much to lead reporters from Internet user No. 4417749 to Thelma Arnold. AOL later removed the search data and apologized for its apparently unauthorized release, but the episode serves as an illustration that it doesn't take much to figure out who a person is based on what they're searching for in the supposed privacy of their own home.