What Is Differential Privacy?
When the United States gathers census data or a hospital shares medical information of patients for data analysis, personal information of participants is at risk. From each individual database, the risk can be relatively small. Multiple databases together containing small pieces of anonymized information can potentially be used to identify an individual participant. This is the problem differential privacy has been designed to solve.
At its roots, differential privacy is a mathematical way to protect individuals when their data is used in data sets. It ensures that an individual experiences almost no difference in outcome whether or not they participate in information collection. This means that very little additional harm can come to a participant as a result of providing data.
How Differential Privacy Works
Differential privacy is a state-of-the-art definition of privacy used when analyzing large data sets. It guarantees that adversaries cannot discover an individual within the protected data set by comparing the data with other data sets.
In this context, an adversary is a person or organization attempting to glean information about specific individuals within a data set. This could range from a neighbor who wants to gossip to an activist group targeting employees at a company. No matter the adversary, the protection comes from a mathematical guarantee, which can be built into artificial intelligence, intelligent control, machine learning, and other technologies.
Defining Differential Privacy
Differential privacy in relation to data analysis can be informally defined using a before-and-after approach. That is, the analyst should know little more about any individual after analyzing the data than before. Further, any adversary should not have too different a view of any individual after having access to a database.
In a more technical definition, differential privacy provides privacy by process. Specifically, the process introduces randomness into the answers computed from a data set. The process must achieve this without substantially altering the results of the eventual analysis. Present and future sources of auxiliary information (such as other data sets) must not compromise an individual’s privacy.
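One common way to write this guarantee formally is sketched below. The symbols are not part of the text above and are introduced only for illustration: M is a randomized process (mechanism), D and D' are neighboring data sets that differ in one individual's record, S is any set of possible outputs, and epsilon is a privacy parameter.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
\quad \text{for all neighboring } D, D' \text{ and all output sets } S.
```

The smaller epsilon is, the closer the two probabilities must be, and the less any output can reveal about whether a particular individual's record was present.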
How the Math Behind Differential Privacy Works
Differential privacy functions using various mathematical mechanisms. The first is the Laplace mechanism. The Laplace mechanism adds noise drawn from the Laplace probability distribution to the result of a query on a data set. How much noise to add is determined by the sensitivity of the query (how much a single individual’s data can change the result) and by the privacy parameter, rather than by the quantity of data in the set.
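A minimal sketch of the Laplace mechanism follows, assuming the standard formulation in which the noise scale is the query's sensitivity divided by a privacy parameter epsilon. The function name, the sample ages, and the choice of epsilon are hypothetical and chosen only for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # smaller epsilon -> more noise
    return float(true_value + rng.laplace(loc=0.0, scale=scale))

# Hypothetical data: ages of eight participants (invented for this example).
ages = np.array([34, 51, 29, 62, 45, 38, 70, 23])

# Counting query: how many participants are 40 or older?
# Adding or removing one person changes the count by at most 1,
# so the sensitivity of this query is 1.
true_count = int(np.sum(ages >= 40))
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5,
                                rng=np.random.default_rng(0))
print(true_count, noisy_count)
```

The noisy count is what would be released. Lowering epsilon strengthens privacy but widens the spread of the noise around the true count.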
The Laplace mechanism is a general-purpose way of achieving differential privacy and is an additive noise mechanism. The Gaussian mechanism is another example of an additive noise mechanism. The Gaussian mechanism adds noise based on the Gaussian probability distribution, factoring in sensitivity and privacy parameters.
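As a sketch of the Gaussian mechanism, the code below uses the classic calibration in which the standard deviation is sqrt(2 ln(1.25 / delta)) times the L2 sensitivity divided by epsilon. It assumes the relaxed (epsilon, delta) form of differential privacy with epsilon between 0 and 1. The function names and example values are hypothetical.

```python
import math
import numpy as np

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation for the classic Gaussian mechanism."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    if l2_sensitivity <= 0:
        raise ValueError("l2_sensitivity must be positive")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta, rng=None):
    """Return true_value plus zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return float(true_value + rng.normal(loc=0.0, scale=sigma))

# Hypothetical query result with sensitivity 1 (for example, a count).
noisy_value = gaussian_mechanism(120.0, l2_sensitivity=1.0, epsilon=0.5,
                                 delta=1e-5, rng=np.random.default_rng(0))
print(noisy_value)
```

The extra parameter delta allows a small probability that the epsilon bound does not hold, which is the price of using Gaussian rather than Laplace noise in this formulation.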
Another differential privacy mechanism used is the exponential mechanism. Rather than adding noise to a numeric answer, the exponential mechanism selects one output at random from a set of candidate outputs, using a probability distribution that favors candidates that answer the query well. This means that good answers are far more likely to be chosen than poor ones, but no answer is chosen with certainty.
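The exponential mechanism can be sketched as follows, assuming the standard formulation in which each candidate output is chosen with probability proportional to exp(epsilon * utility / (2 * sensitivity)). The function name, the utility function, and the sample records are hypothetical.

```python
import numpy as np

def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * utility(candidate) / (2 * sensitivity))."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(c) for c in candidates], dtype=float)
    # Subtracting the maximum score avoids overflow and does not change
    # the normalized probabilities.
    logits = epsilon * (scores - scores.max()) / (2.0 * sensitivity)
    probs = np.exp(logits)
    probs /= probs.sum()
    index = rng.choice(len(candidates), p=probs)
    return candidates[index], probs

# Hypothetical survey: which answer is the most common?
records = ["red", "blue", "blue", "green", "blue", "red"]
candidates = ["red", "blue", "green"]

# Utility of a candidate = how many records match it. One person joining
# or leaving changes any count by at most 1, so the sensitivity is 1.
choice, probs = exponential_mechanism(
    candidates, lambda c: records.count(c),
    sensitivity=1.0, epsilon=1.0, rng=np.random.default_rng(0))
print(choice, dict(zip(candidates, probs.round(3))))
```

The most common answer receives the highest probability, but every candidate keeps some chance of being selected, which is what hides any single participant's contribution.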
Data That Should Be Kept Invariant in Differential Privacy
In many cases, some data must be kept invariant, or unchanged, with differential privacy for analysis to remain accurate. One example of a data set that needs to be differentially private is biometric data from eye tracking. Virtual reality headsets collect this biometric data to function, and the movement data must remain invariant.
In another example, for the US census, invariant data includes total state population and number of housing units in each census block. Population counts must stay accurate.
In general, the data that is preserved is vital to achieving the results a particular data set is intended to provide.
When Differential Privacy Is Most Useful
Differential privacy is most useful in cases where sensitive data is involved. This can range from big data analytics of 5G technology to medical records in a major hospital. While differential privacy is useful for many data sets, it excels in protection of user data for large data collections.
What Differential Privacy Protects
Differential privacy protects user data from being traced back to individual users. The privacy parameter involved, usually written as epsilon, is known as the privacy budget. It bounds the privacy loss that can result from adding or removing one entry in a data set.
Ensuring that the removal of a single data point cannot noticeably alter the results of an analysis protects users. Differential privacy can help a user feel comfortable participating in a poll, answering census questions, or allowing a hospital to share medical information for scientific use.
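To illustrate how a privacy budget is typically managed, the sketch below assumes basic sequential composition: when several differentially private queries are answered on the same data, their epsilon values add up. The class name and the specific budget values are hypothetical.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = float(total_epsilon)
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one query against the budget and return what remains."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject a
        # query that exactly uses up the remaining budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.spend(0.4))  # first query is allowed
print(budget.spend(0.4))  # second query is allowed
try:
    budget.spend(0.4)     # third query would exceed the total
except RuntimeError as err:
    print(err)
```

Once the budget is exhausted, no further queries are answered, which caps the total privacy loss any participant can incur.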
The Most Common Approaches to Differential Privacy
The Laplace and exponential mechanisms represent the mathematics of differential privacy. These differential privacy mechanisms are the most common formulas for achieving differential privacy. However, other approaches have been developed to ensure privacy protection of user data.
One common approach for differential privacy is based on importance weighting. The method computes weights for an existing data set with no confidentiality issues so that the weighted data set is analogous to a private data set. This method has a provable privacy guarantee: statistical queries are answered approximately in a way that maintains the privacy of confidential information.
Circumstances That Warrant the Use of Differential Privacy
The richer the data in a data set, the more useful that information is for analysis. In the past, this has led to the removal of personally identifiable information, like names, in the hope of anonymizing data. Differential privacy is a good replacement for this outdated practice.
The problem with anonymization is that the richness of the data still allows an adversary to discover an individual within the data set. A classic example of this issue dates from 1997. As discussed by Daniel Barth-Jones in an article for SSRN Electronic Journal, a graduate student identified the medical records of William Weld, then governor of Massachusetts, in an anonymized data set. She did so by comparing voter registration records that included zip codes, dates of birth, and gender with the anonymized data. This allowed her to identify Weld’s medical records.
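The kind of linkage described above can be sketched in a few lines. All records below are invented placeholders, not real data, and the function and field names are hypothetical. The sketch joins an "anonymized" data set to a public one on zip code, date of birth, and sex, and reports records that match exactly one named person.

```python
def link_records(anonymized, public, keys=("zip", "birth_date", "sex")):
    """Return (name, record) pairs where an anonymized record matches
    exactly one person in the public data on the given keys."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    matches = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], record))
    return matches

# Invented public records (for example, a voter roll with names).
voter_roll = [
    {"name": "Voter A", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter B", "zip": "00001", "birth_date": "1962-03-15", "sex": "F"},
    {"name": "Voter C", "zip": "00002", "birth_date": "1950-01-01", "sex": "M"},
]

# Invented "anonymized" records: names removed, other fields kept.
hospital = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition X"},
    {"zip": "00003", "birth_date": "1980-07-04", "sex": "F", "diagnosis": "condition Y"},
]

matches = link_records(hospital, voter_roll)
for name, record in matches:
    print(name, "->", record["diagnosis"])
```

Removing names did nothing to stop the match, because the remaining fields were enough to single out one person.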
How to Implement Differential Privacy
Differential privacy is used in tandem with technologies to develop secure databases across a variety of fields. With internet access widely available in many communities and data sources available for purchase, differential privacy is becoming ever more valuable. Implementation of differential privacy gets more complicated with advancing technologies. Privacy concerns raised by smart cities could be solved by differential privacy, but implementing it across interconnected devices can be challenging.
Challenges and Limitations of Differential Privacy
Using Laplace noise helps disguise data, but too much noise renders data useless for analysis. Differential privacy is designed for interactive statistical queries made to a database. This means that the presence or absence of a single user’s data should not affect the results of a query to the data set. Small amounts of noise are required for averages and general conclusions derived from analysis. Queries returning specific records need substantially more noise.
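The trade-off between noise and usefulness can be made concrete with a small calculation. The sketch below assumes a mean query over values known to lie between a lower and an upper bound, a publicly known number of records n, and neighboring data sets that differ by replacing one record. Under those assumptions the sensitivity of the mean is (upper - lower) / n. The function name and the numbers are hypothetical.

```python
def laplace_scale_for_mean(lower, upper, n, epsilon):
    """Laplace noise scale for a differentially private mean of n bounded
    values. The scale also equals the expected absolute error of the
    Laplace noise."""
    if n <= 0 or epsilon <= 0 or upper <= lower:
        raise ValueError("need n > 0, epsilon > 0, and upper > lower")
    sensitivity = (upper - lower) / n  # one record moves the mean this much
    return sensitivity / epsilon

# Values bounded between 0 and 100, with epsilon fixed at 1.0.
for n in (1, 100, 10_000):
    print(n, laplace_scale_for_mean(0, 100, n, epsilon=1.0))
```

With 10,000 records the typical error is a small fraction of the value range, while a query that depends on a single record needs noise as large as the whole range, which leaves little useful signal.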
The US Census Bureau used differential privacy to disseminate the 2020 census results. Using differential privacy in this context poses a challenge. Census data does not require masking respondent characteristics, something differential privacy does well, but data users must not be able to identify individual respondents.
Microdata, or individual data derived from real people, can challenge the core of differential privacy. The smaller the set of data points collected, the easier it is to identify individual participants.
Continuous data collection also imposes limits on the use of differential privacy. If the differentially private data is published in multiple instances over time, knowing when a user’s data was added allows adversaries to pinpoint their data.
Steps Needed to Implement Differential Privacy
Successfully applying differential privacy to a data set requires careful decisions. Mathematicians designing and implementing differential privacy algorithms weigh accuracy of data against privacy of the users. The amount of noise added to a data set depends on how sensitive the information is. It’s necessary to maintain the granularity of data, especially for data sets with diverse use cases.
Machine learning is one example of a technology into which differential privacy can be built. Machine learning algorithms cannot encode general patterns without also gathering specific user data. They cannot determine that smokers are more likely to have lung cancer without learning that Jane Doe, who participated, smokes and has cancer.
To make a machine learning algorithm differentially private, a method called Model Agnostic Private Learning can be used. This process is designed to work across many platforms and assumes access to a limited amount of public data. The goal of this process is to provide a private data set that is safe to publish.
The Need for a Mathematical Definition of Privacy
Privacy can have a variety of ambiguous definitions. Privacy might be defined differently in separate fields. For differential privacy to have value, privacy must be rigidly defined. Any definition of privacy must be mathematically rigorous, so technologies like machine learning can utilize it. The mathematics allow for universal use of the term privacy when it comes to protecting user data. Fields like social science that analyze big data sets can rely on the mathematical definition of privacy.
Use Cases for Differential Privacy
Differential privacy can be used in a variety of fields to ensure data privacy. The US 2020 Census used differential privacy, and it is being applied in machine learning and a variety of other fields.
Broad Use Cases for Differential Privacy
Current and upcoming technologies are making data collection and data traffic more common. One major example of this is the advent of 5G and 5G connectivity. 5G will make it easier to collect statistical information from user behavior than ever before. Differential privacy can be used to keep this data relevant while protecting the data user.
Differential privacy is also broadly usable in the mathematics of new and emerging technologies. The mathematical framework of differential privacy is useful for machine learning, deep learning algorithms, neural networks, and more.
Advantages of Differential Privacy Over Other Privacy Measures
There are many advantages of using differential privacy, the most obvious of which is protecting the identities of the users within data sets. The exponential mechanism will reliably provide differentially private data when a data set is queried.
With differential privacy, privacy loss is controllable and can be measured. Differential privacy is more resistant to privacy attacks on the basis of auxiliary information, or information from separately available data sets. In general, differential privacy protects personal data far better than traditional methods while enabling data mining and statistics queries.
Why Differential Privacy Is a Game Changer
Differential privacy allows large data sets to be useful without endangering the contributors of the data set. Because noise is added and random data is pulled, the result is equitable analysis and improved security.
Differential privacy techniques can be used across multiple fields of study, and because user privacy is mathematically defined, these fields can work together without confusion. Thus, differential privacy benefits both data users and data scientists who analyze private information.
Differential Privacy in the Future
Differential privacy is the foundation for the future of data privacy. Mathematicians and scientists are already building on the Laplace and exponential mechanisms to create more complicated privacy algorithms. Engineers are incorporating differential privacy into machine learning for satellite antennae. Companies and governments are using differential privacy to help keep individuals safe from a data breach or excessive privacy loss. Differential privacy provides the potential for more privacy at a time when more data is readily available for study and analysis.