Differential Privacy

Why are other approaches to privacy not enough?

Most approaches used for privacy protection are susceptible to certain attacks. For example, the homogeneity attack is a known attack on k-Anonymity that lets an adversary learn the exact value of a sensitive attribute from released data without even re-identifying the individual. None of these approaches offers a formal privacy guarantee. Differential Privacy (DP), by contrast, provides robust privacy guarantees against a wide range of privacy attacks. It ensures that a data subject is not adversely affected by their entry or participation in a database, while maximizing utility/data accuracy (as opposed to random/empty outputs) for queries to the database. It guarantees that the impact on a participant of a study is essentially the same whether or not they took part: it is the conclusions reached by the study that affect the participant, not their presence or absence in the data set.

How does DP protect privacy?

Differential Privacy is a method of protecting individuals’ privacy when their data is collected and analyzed. When managing a sensitive database, it’s essential to ensure that adversaries cannot learn confidential or sensitive information from statistics computed over the data. Differential Privacy makes it harder for an adversary to breach privacy by adding noise to the true answers of queries to the database. DP ensures that the outputs of a query on two neighbouring databases are nearly indistinguishable. Two databases \(x\) and \(x'\) are called neighbouring databases if they differ in the data of a single individual. DP thus ensures that the observed output of a query does not reveal which of \(x\) or \(x'\) was the input database.

Formal definition of Differential Privacy

We say that a mechanism \(M\) satisfies differential privacy if for all neighbouring databases \(x\) and \(x'\), and all possible outputs \(S\),

\[P[M(x)=S] \le e^\epsilon P[M(x')=S]\]

holds.

The mechanism is \(\epsilon\)-differentially private in such a case. For any pair of neighbouring databases, an observer cannot gain more than a small amount of probabilistic information about any single individual. Adding or removing an individual’s data changes the output distribution, but the definition ensures that the ratio between the probabilities of any output under the two databases is bounded by a factor of \(e^\epsilon\). \(\epsilon\) in the above definition is called the privacy budget (also the privacy parameter or the privacy loss parameter).

How is \(\epsilon\) related to levels of privacy? If \(\epsilon\) is small, the mechanism produces very similar output distributions on neighbouring inputs, which implies a higher level of privacy. If \(\epsilon\) is large, the output distributions are allowed to differ more, and therefore privacy is weaker.
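
To make the definition concrete, here is a small illustration using randomized response, a classic mechanism that is not discussed in this post but is often used as a first example: each respondent reports their true answer with probability 3/4 and a random answer otherwise. The ratio of output probabilities under the two possible true answers is at most 3, so the mechanism is \(\epsilon\)-DP with \(\epsilon = \ln 3 \approx 1.1\). The sketch below is purely illustrative.

```python
import math
import random

def randomized_response(truth: bool) -> bool:
    """Report the true answer with probability 3/4, a random answer otherwise."""
    if random.random() < 0.5:        # first coin flip: heads -> answer truthfully
        return truth
    return random.random() < 0.5     # tails -> report a uniformly random answer

# Probability of reporting "yes" given each possible true answer:
p_yes_given_yes = 0.75   # 1/2 (truthful) + 1/2 * 1/2 (random "yes")
p_yes_given_no = 0.25    # 1/2 * 1/2 (random "yes")

# The ratio of output probabilities is bounded by 3, so epsilon = ln(3) ~= 1.10.
epsilon = math.log(p_yes_given_yes / p_yes_given_no)
print(f"randomized response satisfies {epsilon:.2f}-DP")
```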

How to achieve DP?

The easiest way to achieve DP for a query is to add random noise to its answer. This can be done by

  1. Computing a function \(F\) of the data,
  2. Adding some noise to the result value.

The noise must be large enough to hide an individual’s contribution, so that the definition of differential privacy is satisfied, but not so large that the result becomes too noisy to be useful. For some functions \(F\), however, an individual’s contribution can change the true result more than for other functions. This concept is captured by sensitivity: the larger the possible change, the higher the sensitivity. The sensitivity of a function measures how much a single individual’s data can change the function’s output in the worst case. Typically, to achieve \(\epsilon\)-differential privacy with a fixed \(\epsilon\), you have to add more noise when the sensitivity of the function is higher (to compensate).
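
A standard way to instantiate this recipe (not named in the text above) is the Laplace mechanism: add noise drawn from a Laplace distribution whose scale is the query’s sensitivity divided by \(\epsilon\). Below is a minimal sketch; the function name and the example query are assumptions made for illustration.

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Release a noisy answer: true value plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a count query (sensitivity 1, see below) answered with epsilon = 0.5.
ages = [23, 35, 41, 52, 29, 67]
true_count = sum(1 for a in ages if a > 40)          # true answer: 3
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```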

For a function \(f:D\rightarrow \mathbb{R}^k\), the sensitivity of \(f\) is:

\[\Delta f=\underset{D_1,D_2}{\max}||f(D_1)-f(D_2)||_1\]

where datasets \(D_1\) and \(D_2\) are neighbouring datasets. This is called the \(l_1\)-sensitivity of the function \(f\). The \(l_1\)-sensitivity gives an upper bound on how much we need to perturb the output to preserve privacy. This discussion brings query sensitivity into the picture. Let’s talk about that.

Query Sensitivity

The sensitivity of a query function is defined as the maximum absolute difference between the query’s results on two neighbouring databases, taken over all possible pairs of neighbouring databases. There are different types of queries, each with its own sensitivity. One type is the count query, which counts the number of rows in the dataset that satisfy the property specified in the query. For example, a count query might count the number of rows where a certain attribute equals a specific value. The \(l_1\)-sensitivity of a counting query is 1, because adding a row to the dataset can increase the output by at most 1: either the new row has the desired property and the count increases by 1, or it does not and the count stays the same (the count may correspondingly decrease by at most 1 when a row is removed). Therefore, the sensitivity of the count query is 1. In contrast, the sensitivity of a sum query depends on the contents of the dataset. A sum query adds up the attribute values of the dataset’s rows. In this case, adding a new row can increase the query result by as much as the largest attribute value that can appear, which could be arbitrarily large and is not necessarily 1 as in the count query.
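
The difference between the two query types can be seen in a short sketch. The dataset, the thresholds, and the clipping bound below are hypothetical; clipping each value to a known range is one common way (assumed here, not stated above) to give a sum query a finite sensitivity.

```python
import numpy as np

incomes = [32_000, 48_500, 120_000, 56_700, 73_200]   # hypothetical attribute values

# Count query: "how many rows have income > 50,000?"
# Adding or removing one row changes the answer by at most 1, so sensitivity = 1.
count_answer = sum(1 for x in incomes if x > 50_000)
count_sensitivity = 1.0

# Sum query: "what is the total income?"
# One added row can shift the answer by its entire value, so the sensitivity is
# only bounded if each contribution is clipped to a known range, e.g. [0, 200_000].
clip_bound = 200_000
sum_answer = sum(min(max(x, 0), clip_bound) for x in incomes)
sum_sensitivity = float(clip_bound)

# Calibrate Laplace noise to each query's sensitivity.
epsilon = 1.0
noisy_count = count_answer + np.random.laplace(scale=count_sensitivity / epsilon)
noisy_sum = sum_answer + np.random.laplace(scale=sum_sensitivity / epsilon)
print(noisy_count, noisy_sum)
```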

\((\epsilon, \delta)\)-DP

A mechanism \(M\) satisfies \((\epsilon, \delta)\)-differential privacy if for all neighbouring databases \(x\) and \(x'\), and all possible outputs \(S\),

\[P[M(x)=S] \le e^\epsilon P[M(x')=S] +\delta\]

\((\epsilon, \delta)\)-DP ensures that for all neighbouring databases, the absolute value of the privacy loss is bounded by \(\epsilon\) with probability at least \(1-\delta\). \(\epsilon\) is independent of the size of the database, whereas with \(\delta\) the chance of a privacy leak can grow with the size of the database. Hence, ideally, we want to set \(\delta\) to be less than the inverse of the size of the database. However, it is worth noting that \((\epsilon, \delta)\)-DP can fail in a stark way: with probability up to \(\delta\), the mechanism may produce an output that occurs when the individual’s data is present but never occurs otherwise.
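
One well-known mechanism that satisfies \((\epsilon, \delta)\)-DP (not covered in the text above) is the Gaussian mechanism, which adds normally distributed noise calibrated to the query’s \(l_2\)-sensitivity. The sketch below uses the classical calibration \(\sigma \ge \sqrt{2\ln(1.25/\delta)}\,\Delta_2 f/\epsilon\), which is valid for \(\epsilon < 1\); the example query and parameter values are assumptions for illustration.

```python
import numpy as np

def gaussian_mechanism(true_answer: float, l2_sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Release a noisy answer satisfying (epsilon, delta)-DP via the classical
    Gaussian mechanism; this calibration assumes 0 < epsilon < 1."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return true_answer + np.random.normal(loc=0.0, scale=sigma)

# Example: a count query (l2-sensitivity 1) on a database of 100,000 rows,
# with delta chosen below 1 / (database size), as suggested above.
noisy = gaussian_mechanism(true_answer=4_213, l2_sensitivity=1.0,
                           epsilon=0.9, delta=1e-6)
print(noisy)
```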