Building an Anonymization Pipeline - Creating Safe Data

Imagine you share a fridge with your coworkers and keep your lunch there. You like sushi, and so does your friend Jake. Sometimes, you visit Japanese restaurants together. One day, Jake missed breakfast. He was so hungry that he decided to eat one of your sushi rolls while you were in a meeting. Five minutes later, he ate another one, and then another. By the time the meeting was over, your lunch was over too. Your lunch plans were ruined: the food was gone, and you had a busy day ahead. You expressed your frustration to another friend, Ann: "Jake ate my lunch." Yet there was one more person in the room, and that person liked to gossip. The next day, everyone in the office was discussing the incident. Later, Jake came to you and complained that you could have talked to him first so he could buy you lunch.

You began thinking about what you could have done better to avoid this. As the data owner (custodian), you could have said "X ate my lunch." Is that enough to protect Jake's identity? Well, in this case you applied pseudonymization, which is vulnerable to re-identification. If the "gossip person" knew about your mutual love of sushi, Jake's identity could still be established. If you had said only "ate my lunch" (dropping the name entirely), you would have applied anonymization. Jake's identity is protected, but is the information still useful? Does Ann understand the problem, or does she assume that you are not hungry?
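The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration of the lunch story, not something taken from the book: the record fields, the `pseudonymize` and `anonymize` helpers, and the salt value are all hypothetical, and simply dropping fields is a simplification of what real anonymization involves.

```python
import hashlib


def pseudonymize(record, secret_salt):
    """Replace the name with a stable token; every other attribute stays intact."""
    digest = hashlib.sha256((secret_salt + record["name"]).encode("utf-8")).hexdigest()
    out = dict(record)
    out["name"] = "person_" + digest[:8]
    return out


def anonymize(record, identifying_fields):
    """Drop the name and any attribute that could point back to the person."""
    return {k: v for k, v in record.items() if k not in identifying_fields}


# Hypothetical record describing the incident.
record = {"name": "Jake", "favorite_food": "sushi", "event": "ate my lunch"}

pseudo = pseudonymize(record, secret_salt="hypothetical-salt")
anon = anonymize(record, identifying_fields={"name", "favorite_food"})

print(pseudo)  # name is hidden, but "sushi" still hints at who it was
print(anon)    # only the event remains: safer, but less informative
```

The pseudonymized record still carries `favorite_food`, which is exactly the detail the gossip could use to work out who "X" is. The anonymized record keeps only the event, which illustrates the trade-off between protection and usefulness.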

Data anonymization is not as simple as it seems. This book helps you understand the common terms and issues. It helps you find the right spot on the Identifiability Spectrum (share everything vs. share nothing) and then apply the "Five Safes" risk management framework (Projects, People, Settings, Data, Outputs).

It is written for data architects, engineers, and executives. Data analysts, scientists, and privacy experts might also find it useful. At only 167 pages, it feels like a mix of philosophical and regulatory texts. You won't find a single line of code here, but you will learn more about data privacy, HIPAA, and GDPR. I personally like the authors' stories about AOL, Netflix, and especially credit card transactions.

"Researchers were given access to a sample of financial transactions, including date, place, and amount paid. For example, a person buys a pair of shoes at their favorite store on one day, buys groceries at their local market on another day, and visits a particular coffee shop on another day, leaving a record of how much they paid for each transaction... only four transactions were needed to make 90% of people unique in the data."
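To get a feel for why a handful of transactions can single a person out, here is a toy sketch. It is not the researchers' method and the data is invented for illustration: the `transactions` dictionary and the `unique_fraction` helper are hypothetical. It counts how often a set of k known transactions matches exactly one person.

```python
from itertools import combinations

# Invented toy data: each card maps to a set of (day, place) transactions.
transactions = {
    "card_A": {("mon", "shoe store"), ("tue", "market"), ("wed", "coffee shop")},
    "card_B": {("mon", "shoe store"), ("tue", "market"), ("thu", "bakery")},
    "card_C": {("mon", "shoe store"), ("wed", "coffee shop"), ("fri", "cinema")},
}


def unique_fraction(transactions, k):
    """Fraction of all k-transaction combinations that match exactly one card."""
    total = 0
    unique = 0
    for points in transactions.values():
        for combo in combinations(sorted(points), k):
            total += 1
            # Count the cards whose history contains every transaction in combo.
            matches = sum(1 for pts in transactions.values() if set(combo) <= pts)
            if matches == 1:
                unique += 1
    return unique / total


for k in (1, 2, 3):
    print(k, round(unique_fraction(transactions, k), 2))
```

Even in this tiny example, knowing more transactions makes it more likely that the combination points to a single card, which is the intuition behind the quote.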

If you share sensitive data, like medical records, please be careful. I recommend that you learn about the topic first and then follow the guidelines. Sometimes you have to do far more than change the data. As the authors state, "We have worked with organizations that have spun off new companies that would work only from anonymized data they would provide."

Building an Anonymization Pipeline - Creating Safe Data (O'Reilly, 2020, 167 pages)


Looking for help? Reach out to me any time.

Subscribe to AWS by Vlad Frantskevich
