How 'Anonymous' Shopping Data Reveals Your Identity

Credit: Dean Bertoncelj/Shutterstock

Your personal shopping habits can be used to identify you with 90 percent accuracy, and the trackers don't need your name, address or even credit-card number, a new study finds.

Women are easier to identify than men, rich people are easier to identify than poorer ones, and, given a large enough data set, true anonymity may be mathematically impossible, according to the study, published in the Jan. 30 issue of the journal Science. The findings may require re-examination of the entire practice of gathering "big data."

Even many people who aren't tech-savvy know that breadcrumbs of data can be used to track an individual's movements, provided that the tracker is armed with a name or another kind of personally identifiable information (PII).

Yves-Alexandre de Montjoye, a graduate student at the Massachusetts Institute of Technology, looked at the metadata from credit-card records: not what was bought or who bought it, but the time, date, place and cost of each transaction. Cardholder names and any other obvious identifiers were scrubbed out, and card account numbers were replaced by randomly assigned ID numbers.

In 90 percent of the cases, it was possible to link those random ID numbers to individuals from only four pieces of metadata, and sometimes only three, since the time of day wasn't always necessary, de Montjoye told Tom's Guide.

"We were actually trying to quantify how many pieces of information were needed," de Montjoye said.

The credit-card data was provided by a single bank in an unnamed country and covered the three months from Jan. 1, 2014 to March 31, 2014, yielding information from 1.1 million cards used in 10,000 shops. De Montjoye wouldn't name the country involved, but said it was one of the 34 members of the Organization for Economic Cooperation and Development, which makes it a rich, probably Western, country.

Putting a face to a number

The reason for the accuracy of identification is actually pretty simple. For example, Jane Doe, whom the researcher would know as just an alphanumerical ID such as "7abc123a," might be one of 1,000 people to use a credit card in a certain pizza shop on a given day. But there would be far fewer people who would use credit cards at both that pizza shop and a certain shoe store on that day, and fewer still who would buy things at three different specific shops on the same day.

From that point, it would be possible to track down other places to which Jane Doe had gone by combing through the database of 1.1 million cards to pull out all her activity. Tack on the price of each transaction (even a price range will do) and the odds of linking Jane to a real name go up dramatically. In fact, with just a few data points, you'd be able to identify an individual user roughly 90 percent of the time, de Montjoye found.
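
The narrowing-down logic can be sketched in a few lines of Python. This is only an illustration, not the study's code: the records, the two extra card IDs, the bookstore and the dates are made up, and the function names (build_traces, matching_cards) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (pseudonymous card ID, shop, day).
# None of this is data from the study.
RECORDS = [
    ("7abc123a", "pizza shop", "2014-01-06"),
    ("7abc123a", "shoe store", "2014-01-06"),
    ("7abc123a", "bookstore", "2014-01-06"),
    ("11ff09c2", "pizza shop", "2014-01-06"),
    ("11ff09c2", "shoe store", "2014-01-06"),
    ("93de77b0", "pizza shop", "2014-01-06"),
    ("93de77b0", "bookstore", "2014-01-07"),
]


def build_traces(records):
    """Group (shop, day) points by pseudonymous card ID."""
    traces = defaultdict(set)
    for card_id, shop, day in records:
        traces[card_id].add((shop, day))
    return traces


def matching_cards(traces, known_points):
    """Return every card ID whose trace contains all the known points."""
    known = set(known_points)
    return {card_id for card_id, points in traces.items() if known <= points}


traces = build_traces(RECORDS)

# What an observer might know about Jane Doe from outside the database.
known_points = [
    ("pizza shop", "2014-01-06"),
    ("shoe store", "2014-01-06"),
    ("bookstore", "2014-01-06"),
]

# Add the known points one at a time and watch the candidate list shrink.
candidates_by_step = [
    sorted(matching_cards(traces, known_points[:k]))
    for k in range(1, len(known_points) + 1)
]

for k, candidates in enumerate(candidates_by_step, start=1):
    print(f"{k} known point(s): {len(candidates)} candidate card(s)")
```

In this toy data, three cards match the pizza shop alone, two match the pizza shop plus the shoe store, and only "7abc123a" matches all three shops. A trace counts as re-identified once a single candidate is left.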

Say Ms. Doe begins each weekday by buying coffee at a Starbucks near Union Square in Manhattan. She often buys lunch at any of half a dozen markets and takeout restaurants nearby. But she buys a subway MetroCard in Park Slope, Brooklyn, and uses her credit card at a dry cleaner's in the same neighborhood.

We've established roughly where Jane Doe lives and works. But she also buys clothes at the upscale department store Barneys, and often uses the online taxi service Uber to get around New York on nights and weekends. Now we know that she makes a comfortable income.

Collect such data over three months, build up a profile, and then correlate it with publicly available data about individuals who fit that profile, such as personal profiles on LinkedIn or Facebook, or where people "check in" on Foursquare, and you'll probably be able to match the randomized ID with Jane Doe.

The method is even more useful if you already know who you are seeking. Say you're the FBI and you want to track Jane Doe, but only have a name and address and a stack of anonymous credit card information. De Montjoye's method makes it simple to match the two up.

"We also studied the effects of gender and income on the likelihood of re-identification," de Montjoye wrote in the paper. "The higher somebody's income is, the easier it is to re-identify him or her. … The odds of women being re-identified are 1.214 times greater than for men."

Crunching data to spit out names

This isn't the first time anyone has studied re-identification of individuals. In 2006, America Online released a database of the search queries of 650,000 AOL users, and researchers quickly found out how to match them up with names using publicly available information. They were able to do so because the anonymization consisted merely of replacing the names with a unique identifier.

The same year, Netflix published a trove of movie recommendations and asked for help from the public in coming up with a better algorithm. But Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, were able to reconstruct the names attached to them by comparing the data to public information on the Internet Movie Database (imdb.com), in that case recommendations from users.

De Montjoye's study takes those methods one step further. It shows that even within a database, it's possible to entirely remove personally identifiable information and still end up with unique identifiers. Only a few data points are needed, and, from there, it's no great feat to merge it with another data set.

More to the point, typical methods of anonymization probably won't work, de Montjoye said. The implication is that given a big enough data set, true anonymization of information might be a mathematical impossibility.

In other words, "Big Data" can never truly be anonymized. Given enough information, and far less than what's available to Google, Facebook, Amazon, Apple or Microsoft, not to mention a marketing-research company such as Acxiom, it's almost certain that a data set can be matched to a real name.

The findings don't surprise Susan Landau, a professor of cybersecurity policy at Worcester Polytechnic Institute in Massachusetts.

"I use an anonymous travel card," Landau said. "The travel we do as tourists: if you know the area of the hotel, combined with the day, you could figure out who we were."

Can you limit deanonymization?

For organizations such as the National Security Agency or Facebook, the deanonymization provided by large data sets is a feature, not a bug. The NSA wants to see as much data about as many individuals as possible, in the name of security, and a big piece of Facebook's business model is selling ads tailored to the interests of users. There are also legitimate reasons to gather large amounts of information such as these for medical and population studies.

As long as the data is going to be collected, Landau said, the key to privacy is to control the data's use and the information derived from it. She noted that, in the medical-research community, scientists who leak personal information can be denied access to the data sets for a time. In that case, the people who use the data police themselves.

"If you can't get the data, you're done as a researcher," Landau said.

Lee Tien, senior staff attorney at the Electronic Frontier Foundation, a digital-rights and privacy advocacy group in San Francisco, said such problems should prompt system designers to rethink how information is gathered. Rather than picking up as much information as possible, Tien said, it might be better to think through exactly what's needed and, most importantly, not keep it around for long.

"One way to do this," Tien said, "is to say [that] entities should not pick it up unless it's absolutely necessary."

It's also possible to offer data that has the same statistical relationships as the data one wants to study, but to pepper it with "false" information in the fields that aren't relevant, Tien said. He noted that the U.S. Census Bureau does this when giving out data to researchers.
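
One simple way to picture Tien's idea is to shuffle a field that the researcher doesn't need, so that each row carries someone else's value while the fields under study stay untouched. The Python sketch below rests on assumptions throughout: the rows, the field names and the scramble_field helper are hypothetical, and nothing here is meant to describe how the Census Bureau actually alters its releases.

```python
import random

# Hypothetical records. Suppose the study only needs age_band and spend,
# so zip_code is the "irrelevant" field that can be falsified.
ROWS = [
    {"age_band": "20-29", "spend": 41.50, "zip_code": "11215"},
    {"age_band": "30-39", "spend": 88.00, "zip_code": "10003"},
    {"age_band": "30-39", "spend": 12.25, "zip_code": "11201"},
    {"age_band": "40-49", "spend": 230.10, "zip_code": "10011"},
]


def scramble_field(rows, field, seed=0):
    """Return copies of rows with one field's values shuffled across rows.

    The other fields are left exactly as they were, so any statistic
    computed from them alone is unchanged. The shuffled field keeps its
    overall distribution but no longer reliably belongs to its row.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: value} for row, value in zip(rows, values)]


released = scramble_field(ROWS, "zip_code")

for row in released:
    print(row)
```

The trade-off is that shuffling destroys any real link between the scrambled field and the rest of the row, so it only makes sense for fields the study never uses.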

De Montjoye added that his research really suggests that the concept of personal information should be rethought. The French agency that governs information privacy, the Commission nationale de l'informatique et des libertés, approaches private information by asking that data sets be "provably anonymous."

"That doesn't scale," de Montjoye said, "and it's probably not achievable."

De Montjoye said his findings don't indicate that the practice of gathering data is bad in itself, but rather that it might be necessary to come up with a better notion of what kinds of information are truly personal.

Data gathering now "relies on this vague notion of personal data, either defined as names or PII," he said. "We're showing this is not enough."