Keeping Secrets: Anonymous Data Isn’t Always Anonymous

Data collection and analysis have enormous potential. Information collected from patients could be used to track the effects of health care reform and to fix models that are not working. Genetic data combined with clinical outcomes data could help researchers link genetic variations to the risk of developing specific diseases or to the outcomes of different treatments.

These and other applications have researchers and statisticians excited, but they come at a substantial risk to privacy, since data can fall into the wrong hands. For example, one survey of doctor-patient data in the state of Washington suggested that information can find its way into a variety of hands, including insurance companies and even human resources departments. If patients could be identified, this information could influence hiring and firing decisions.

It is critical that sensitive information such as health care or legal data be stripped of all identifiers, so that individuals remain anonymous and the information cannot be used to discriminate against anyone. If such privacy cannot be maintained, individuals are unlikely to participate in data collection programs at all, or the resulting restrictions could make such studies difficult.

In principle, stripping identifiers from data seems straightforward. Fields containing names, social security numbers, and similar tell-tale data can be easily removed from a database. In practice, however, it is often possible to reconstruct an individual’s identity by cross-referencing anonymized data against other databases that contain identifiers but no sensitive information.
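The cross-referencing described above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any real attack: every record, name, and field below is hypothetical, and the choice of zip code and birth date as the linking fields is an assumption made for the example.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, other fields kept.
anonymized = [
    {"zip": "02138", "birth_date": "1950-01-15", "diagnosis": "condition A"},
    {"zip": "02138", "birth_date": "1962-03-02", "diagnosis": "condition B"},
    {"zip": "02139", "birth_date": "1962-03-02", "diagnosis": "condition C"},
]

# Hypothetical public list: names present, no sensitive fields.
public = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-15"},
    {"name": "Person Two", "zip": "02138", "birth_date": "1962-03-02"},
    {"name": "Person Three", "zip": "02140", "birth_date": "1971-07-09"},
]

# Fields shared by both data sets (an assumption for this sketch).
LINK_FIELDS = ("zip", "birth_date")


def link_key(record):
    """Return the tuple of linking-field values for one record."""
    return tuple(record[field] for field in LINK_FIELDS)


def link(anonymized_rows, public_rows):
    """Pair a name with a sensitive record only when the match is unambiguous."""
    anonymized_counts = Counter(link_key(row) for row in anonymized_rows)
    public_counts = Counter(link_key(row) for row in public_rows)
    name_by_key = {link_key(row): row["name"] for row in public_rows}

    matches = []
    for row in anonymized_rows:
        k = link_key(row)
        # Require exactly one record on each side with this combination.
        if anonymized_counts[k] == 1 and public_counts[k] == 1:
            matches.append((name_by_key[k], row["diagnosis"]))
    return matches


reidentified = link(anonymized, public)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

The sketch only reports a match when the combination of linking fields is unique in both data sets, which is why rare combinations of ordinary fields are the ones that put a record at risk.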

The classic example of this problem occurred in 1997, when Latanya Sweeney, who was then a graduate student at MIT, found the medical records of Massachusetts Governor William Weld, who had collapsed during a public ceremony. She used Weld’s readily available zip code and birth date to scan the Massachusetts Group Insurance Commission (GIC) database for his records and confirmed the identity using voter-registration records from Cambridge, Massachusetts.

Some have dismissed this as an unusual example: it involved a high-profile public figure and so may not be generally repeatable. However, at the American Association for the Advancement of Science meeting in Chicago earlier this month, Sweeney, who is now a computer science professor at Harvard, presented the results of another sting operation. This time, she purchased a database from the state of Washington for $50 that included all hospitalization records for one year. The data included patient demographic information, diagnoses, the identity of the attending physicians, the hospital, and the method used to pay the bill. It had no patient names or addresses, but it included the zip code.

Sweeney then conducted a search of all news stories in the state that contained the word ‘hospitalized’ during the same period. With a little sleuthing, her team found they could exactly match the information from an article to the database in 43 percent of the cases (the team hired a reporter to confirm the identifications), essentially allowing them to put a name on an anonymized health record.
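The matching step can be illustrated with a short Python sketch. It is not Sweeney's method or data: the records, story details, and field names below are all hypothetical, and the rule used (a story counts as an exact match only when its details fit exactly one record) is an assumption about what "exactly match" means here.

```python
# Hypothetical hospitalization records with no names or addresses.
records = [
    {"id": 1, "zip": "98101", "age": 34, "sex": "M", "hospital": "Hospital A"},
    {"id": 2, "zip": "98101", "age": 34, "sex": "M", "hospital": "Hospital B"},
    {"id": 3, "zip": "98052", "age": 61, "sex": "F", "hospital": "Hospital B"},
    {"id": 4, "zip": "98052", "age": 61, "sex": "F", "hospital": "Hospital B"},
]

# Hypothetical details pulled from news stories that name the patient.
stories = [
    {"name": "Resident A", "age": 34, "sex": "M", "hospital": "Hospital A"},
    {"name": "Resident B", "age": 61, "sex": "F", "hospital": "Hospital B"},
    {"name": "Resident C", "age": 45, "sex": "M", "hospital": "Hospital A"},
]


def matching_records(story, rows):
    """Return every record consistent with all details the story provides."""
    fields = [field for field in story if field != "name"]
    return [row for row in rows if all(row.get(f) == story[f] for f in fields)]


def exact_matches(story_rows, rows):
    """Map each named person to a record id when exactly one record fits."""
    found = {}
    for story in story_rows:
        candidates = matching_records(story, rows)
        if len(candidates) == 1:
            found[story["name"]] = candidates[0]["id"]
    return found


exact = exact_matches(stories, records)
match_rate = len(exact) / len(stories)
print(exact, f"{match_rate:.0%} of stories matched exactly")
```

In this toy data only one of the three stories fits a single record; the other two either fit several records or none, so they leave the patient unidentified.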

The results alarmed some attendees, but others at the session maintained that the identification required significant effort. It was generally agreed that it will never be possible to completely anonymize data, but much depends on the difficulty involved and the potential payoff. “In the real world, the payoffs can be made low enough that people don’t pursue re-identification,” suggested Daniel Barth-Jones, an infectious disease epidemiologist at Columbia University who uses anonymized data in his own research.

That is no doubt a reasonable goal, but research like Sweeney’s suggests that current data practices and regulations may need further refinement before the public is generally comfortable with anonymized data. If the general population is not comfortable with data gathering, then restrictions on use could limit the impact of big data.