
Introduction

This is an interactive supplement to the lecture "Privacy with Ɛ-Differential Privacy". It is not yet a stand-alone manuscript and is really complete only in combination with the presentation, but the content will be continuously expanded and improved.

Our data

In the following, we use sample data describing the income of different individuals. We examine how adding a single data point changes the outcome of statistical analyses on the dataset, and how this compromises the privacy of the person to whom the added data belongs.

In the following, we represent individual records as individual blocks.


Data set $ D $

The data set $D$ contains all points of the data set $D'$, except for the single point that is added to $D'$.

Data set $ D' $

The data set $D'$ is identical to the data set $D$ except for a single added data point. In the following, we call this added data point the difference point.

A first statistic

One of the simplest insights we can gain from our data is the distribution of the incomes of the individuals in the dataset. To do this, we first form income groups. For example, a group may contain all records with an income between 50,000 € and 60,000 €. We then count the number of records in each group, which corresponds to the statistical frequency of that group.

\begin{equation} X = X_t = \sum\limits_{i=1}^N x_i \end{equation}

where $x_i = 1$ if the income $E_i$ of data point $i$ lies in the income range $[E^g_\mathrm{min}, E^g_\mathrm{max}]$ of group $g$, and $x_i = 0$ otherwise:

\begin{equation} x_i = \left\{\begin{array}{rcl} 1 & , & E_i \in [E^g_\mathrm{min},E^g_\mathrm{max}] \\ 0 & , & E_i \notin [E^g_\mathrm{min},E^g_\mathrm{max}] \\ \end{array}\right. \end{equation}

Conversion to code

// We calculate the frequency of a given income group in a data set
const frequency = (d, incomeGroup) =>
  d.filter(row => row.income >= incomeGroup.min
               && row.income < incomeGroup.max).length
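
For example, with a few hypothetical records (the concrete values here are made up for illustration), the counts for $D$ and $D'$ differ by at most 1:

// Hypothetical example data: Dprime contains one additional record
const D = [{income: 42000}, {income: 55000}, {income: 71000}]
const Dprime = D.concat([{income: 58000}])
const group = {min: 50000, max: 60000}
console.log(frequency(D, group))      // 1
console.log(frequency(Dprime, group)) // 2: differs by exactly 1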

Result

Attack on the anonymized data

An attacker who knows all data values $x_i$ except for a single value $x_j$ can easily calculate the missing value $x_j$ from the result $X_t$:

\begin{equation} x_j = X_t - \sum\limits_{i \ne j}^N x_i \end{equation}

Since the relationship between the output value and the input values is deterministic, the attacker can predict with 100% confidence whether the added data point is in the income range under consideration:
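
A minimal sketch of this differencing attack (the function name and the attacker's background knowledge are hypothetical):

// Differencing attack: recover the unknown indicator x_j from the
// published count X_t and the N-1 indicator values the attacker knows.
const recoverIndicator = (publishedCount, knownIndicators) =>
  publishedCount - knownIndicators.reduce((sum, x) => sum + x, 0)

// If X_t = 2 and the known indicators are [1, 0, 0], the missing
// person's indicator must be 1: their income lies in the group.
console.log(recoverIndicator(2, [1, 0, 0])) // 1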

Adding noise

To make such attacks more difficult, we could add a random value $n$ to the true result value $X_t$: $X = X_t + n$. This makes it difficult for the attacker to estimate the original value $x_j$, because he/she does not know the added random value $n$:

\begin{equation} x_j = X - \sum\limits_{i \ne j}^N x_i - n \end{equation}

However, there are still edge cases where such noisy data can reveal origin values. Here is an example: We add a random value in the range $[-3, 3]$ to the result from above. We repeat this and plot the frequency of the observed values. As before, green bars mark the result for the data set $D$, and red bars mark the result for the data set $D'$.
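
As a sketch, such a noisy release could look like this (assuming, as in the demo, a uniformly distributed integer between $-3$ and $3$):

// Naive noisy count: add a uniform random integer from [-3, 3].
// As the edge-case argument below shows, this is not a good idea.
const uniformNoise = () => Math.floor(Math.random() * 7) - 3
const noisyFrequency = (d, incomeGroup) =>
  frequency(d, incomeGroup) + uniformNoise()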

Do you notice a problem here? No? Then take a look at the edges of the frequency distribution: adding the difference point shifts the probability distribution to the right by a maximum amount of 1 (if the data point is part of the group under consideration). That is, if an attacker observes the value at the far right edge, he/she immediately knows that the data point in question must be in the group, and has thus uncovered the person's income. Why? Let's look at the frequencies:

Crucial for the attacker is the ratio of the probabilities of the observed values: if a given value is equally likely under $D$ and $D'$, the attacker can at best guess how the data point $x_j$ contributed to the result. The more this ratio differs from 1, however, the more information the observed result provides to the attacker. In the case above, the probability that an attacker uncovers our data is still 25%!

That is, crucial to the security of our noise-based anonymization is the maximum (or minimum) ratio of the probabilities of a given result value under the two data sets $D$ and $D'$:

\begin{equation} \frac{\mathrm{P}(X = x|D)}{\mathrm{P}(X = x|D')} \end{equation}

To find the worst possible case, we need to consider this probability ratio over all possible result values:

\begin{equation} \alpha = \max\limits_{x} \frac{\mathrm{P}(X = x|D)}{\mathrm{P}(X = x|D')} \end{equation}
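
We can check this empirically by simulating many noisy releases for $D$ and $D'$ and comparing the observed frequencies. The following Monte Carlo sketch is only an assumption of how such a test could look; `release` stands for any noisy mechanism applied to the true counts:

// Monte Carlo estimate of the worst-case probability ratio alpha.
const estimateAlpha = (release, trueCountD, trueCountDprime, runs) => {
  const histD = {}, histDprime = {}
  for (let i = 0; i < runs; i++) {
    const a = release(trueCountD)
    const b = release(trueCountDprime)
    histD[a] = (histD[a] || 0) + 1
    histDprime[b] = (histDprime[b] || 0) + 1
  }
  let alpha = 0
  for (const x of Object.keys(histD)) {
    // A value that is possible under D but never occurs under D'
    // yields an infinite ratio: exactly the edge case from above.
    const ratio = histDprime[x] ? histD[x] / histDprime[x] : Infinity
    alpha = Math.max(alpha, ratio)
  }
  return alpha
}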

The higher the value $\alpha$, the more information an attacker can derive from an observed result value in the worst case. In practice, we also write $\alpha = \exp(\epsilon)$, as this allows us to estimate the privacy loss even for more complex cases. Indeed, we often do not want to publish just one statistic, but several at once. For example, for our income data we might publish a mean, the frequencies considered above, and quantile values. Each individual data point would then contribute to all of these values. Accordingly, we need to consider the likelihood ratio not only for individual values, but for all values together, to obtain an estimate of the privacy loss. For example, if our data point enters two different values $X$ and $Y$, an attacker could again look at the probabilities for combinations of values $(X, Y)$. If the values $X$ and $Y$ are independent, their probability ratio factorizes:

\begin{equation} \frac{\mathrm{P}(X = x, Y = y|D)}{\mathrm{P}(X = x, Y = y|D')} = \frac{\mathrm{P}(X = x|D)}{\mathrm{P}(X = x|D')}\frac{\mathrm{P}(Y = y|D)}{\mathrm{P}(Y = y|D')} \le \alpha^2 = \exp(2\epsilon) \end{equation}

assuming that the two values each satisfy DP with parameter $\epsilon$. In the case that the values $X$ and $Y$ are not independent, the ratio still stays below this bound (the proof of this is a bit more involved, though). The privacy loss defined above is thus additive, which is a very useful property: if we know that we want to publish a total of $n$ results based on a data value $x$, we can easily estimate the maximum privacy loss as $n\cdot\epsilon$. We can thus define a privacy budget against which we can plan our publication.
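
In code, planning against such a budget is then a simple division (a minimal sketch; the function name is ours):

// By additivity, each of n planned releases may spend at most
// epsilonTotal / n of the overall privacy budget.
const perReleaseEpsilon = (epsilonTotal, n) => epsilonTotal / n

console.log(perReleaseEpsilon(1.0, 4)) // 0.25 per published statistic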

Example: Geometric mechanism

The geometric mechanism adds to a discrete result value a random value whose distribution follows the geometric distribution:

\begin{equation} P(X = x) = (1-p)^x\cdot p \end{equation}

This distribution is defined on the range of values $\mathbb{N}_0 = \{0,1,2,\ldots\}$. The geometric distribution can also be made two-sided, i.e. defined on all integers $\mathbb{Z}$:

\begin{equation} P(X = x) = \left\{\begin{array}{lc} p & x = 0 \\ 0.5\,(1-p)^{|x|}\cdot p & x \ne 0 \end{array}\right. \end{equation}

The following code snippet implements the two-sided function:

// Draws geometric noise for a given epsilon. With symmetric = true,
// this samples from the two-sided geometric distribution.
const geometricNoise = (epsilon, symmetric) => {
  let p = Math.exp(-epsilon)
  // First decide whether to return 0, with the probability that the
  // respective (one- or two-sided) distribution assigns to 0.
  let pv = Math.random()
  if (symmetric){
    if (pv < (1-p)/(1+p)) {
      return 0
    }
  } else if (pv > p) {
    return 0
  }
  // For very small p the noise is almost surely 0; returning early
  // also avoids numerical problems in the logarithms below.
  if (p < 1e-6) {
    return 0
  }
  // Sample the magnitude k >= 1 by inverse transform sampling
  // of the geometric tail.
  pv = Math.random()
  let pe = 1.0 - p + p*pv
  let k = Math.floor(Math.log(1-pe)/Math.log(p))
  // In the symmetric case, flip the sign with probability 0.5.
  if (symmetric && Math.random() < 0.5) {
    return -k
  }
  return k
}
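
Applied to our frequency statistic from above, a differentially private release is then a single line (a sketch combining the two functions defined earlier):

// DP release of a frequency: true count plus two-sided geometric noise.
// The sensitivity of a counting query is 1, so epsilon is used directly.
const dpFrequency = (d, incomeGroup, epsilon) =>
  frequency(d, incomeGroup) + geometricNoise(epsilon, true)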

Epsilon

Change the value of $ \epsilon $ to get a sense of how the parameter affects the accuracy of the result values and the likelihood of success for an attacker. In general, the smaller $ \epsilon $, the lower the potential privacy loss for victims, but the higher the standard deviation of the resulting data.
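
You can also reproduce this trade-off numerically by sampling the mechanism many times and measuring the spread of the noise; a small sketch:

// Empirical standard deviation of the two-sided geometric noise.
const noiseStdDev = (epsilon, runs) => {
  const samples = Array.from({length: runs}, () => geometricNoise(epsilon, true))
  const avg = samples.reduce((s, x) => s + x, 0) / runs
  const variance = samples.reduce((s, x) => s + (x - avg) ** 2, 0) / runs
  return Math.sqrt(variance)
}

console.log(noiseStdDev(1.0, 100000)) // roughly 1.4
console.log(noiseStdDev(0.1, 100000)) // roughly 14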

Sensitivity

In our example above, adding a data point to our data set changed the result by at most an amount of 1 (since we were calculating frequencies). But what if we want to calculate a function where a single data point has a larger effect on the result? For example, we might be interested in the mean, which is calculated as

\begin{equation} \bar{E} = \frac{1}{N}\sum\limits_{i=1}^N e_i \end{equation}

where $e_i$ is the income of person $i$. How strongly a single data point can influence the result of this function depends, on the one hand, on the possible range of values (in this case the possible salary range) and, on the other hand, on the number of data points $N$. If $e_\mathrm{max}$ is the maximum salary to be considered, the sensitivity of the mean value $\bar{E}$ is therefore approximately

\begin{equation} \delta f(\bar{E}) \approx \frac{e_\mathrm{max}}{N} \end{equation}

Our data set has $N$ entries; if we assume a maximum income of 100,000 €, the sensitivity is approximately $\delta f(\bar{E}) = 100{,}000~€/N$. To protect the mean using differential privacy, we would actually need a different mechanism, since the mean is real-valued and the geometric mechanism can only be applied to discrete data. However, we can discretize the mean in order to process it with this mechanism. If we choose the discretization interval identical to the sensitivity $\delta f$, we do not need to modify our mechanism above. If we want greater accuracy, we need to modify the mechanism accordingly.

// We calculate the mean value of the incomes
const mean = (d, min, max) => {
  if (d.length === 0)
    throw new Error('empty list received')
  let m = 0
  d.forEach(row => {
    // Values outside [min, max] would break the sensitivity bound
    if (row.income < min || row.income > max)
      throw new Error('out of bounds value detected')
    m += row.income
  })
  return m/d.length
}
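
Combining this with the geometric mechanism yields a sketch of a discretized DP mean (our own illustration, using the approximate sensitivity $e_\mathrm{max}/N$ as the grid size):

// DP mean: round the true mean to multiples of the sensitivity,
// add two-sided geometric noise on that grid, then map back.
const dpMean = (d, min, max, epsilon) => {
  const sensitivity = max / d.length
  const gridValue = Math.round(mean(d, min, max) / sensitivity)
  return (gridValue + geometricNoise(epsilon, true)) * sensitivity
}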

Testing DP mechanisms