Every individual has a (potentially incomplete, potentially of any length) vector of features: \(x_i\).

We need a function to generate arbitrary summaries of these features. A linear combination seems the most obvious, but we also want to ensure that information about individual features (and what about small groups of features?) is not revealed.

Neil suggested that the sum of a subset of features would work - I think we also need to normalise the features to make that work (e.g. age+income might not make sense by default). Maybe I'm thinking too literally...

Anyway. We have some function \(f(x_i,w_t)\) which generates a summary from these features. The vector \(w_t\) tells the function which features to use. Presumably we could also add noise (i.e. differential-privacy-like ideas).
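To make that concrete, a minimal sketch of what such a summary function might look like, assuming \(f\) is just the masked sum of the selected features (the noise_std parameter is a placeholder for the noise idea; the right amount is still to be determined):

import numpy as np

def summary(x, w, noise_std=0.0):
    # x: an individual's feature vector; w: 0/1 vector selecting which features to use.
    # noise_std is a placeholder for the differential-privacy-style noise idea.
    return np.dot(x, w) + np.random.randn() * noise_std

For the example below, summary(np.array([30, 25, 20, 9]), np.array([1, 0, 1, 0])) would return 50 plus noise.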

We, as the verifier, also hold some of \(x_i\) ourselves.

The aim is to:

1. Determine whether the prover is the person, \(i\).
2. Avoid the prover releasing any data about themselves (e.g. the summary can't provide a method for revealing income, etc.).

We must assume that: 1. There are correlations in the values of \(\mathbf{x}\).

A simple example

A person has four features:

- age: 30
- income (in £k/year): 25
- distance from London (in 10s of km): 20
- weight in stone: 9

(OK, a slightly ludicrous example! But I'm trying to avoid the normalisation debate for now. Ignore all issues around e.g. how to include children, non-earners, etc., and non-normal... etc.)

To illustrate, consider data about 1000 people, for these four features:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#UK average age: 39.9, average income: 21.3, avg distance 20, weight:11.9
mu = np.array([39.9,21.3,20,11.9])
cov = np.diag([11**2,11**2,10**2,3**2]);
cov[0,1] = cov[1,0] = 10.5**2
cov[0,2] = cov[2,0] = -6**2
cov[1,2] = cov[2,1] = -2.8**2
cov[0,3] = cov[3,0] = 3.45**2
cov[1,3] = cov[3,1] = 4.5**2
cov[2,3] = cov[3,2] = 0

print (np.linalg.det(cov))
x = np.random.multivariate_normal(mu,cov,1000)
x[0,:] = np.array([30,25,20,9])

#plt.plot(x[:,2],x[:,1],'x')
#plt.plot(x[0,2],x[0,1],'ro')

plt.plot(x[:,0],x[:,1],'x')
plt.plot(x[0,0],x[0,1],'ro')
270504.0

Out[1]:
(Scatter plot of age against income for the 1000 samples; the red dot marks person \(i=0\).)

We ask the prover \(i=0\) for a summary with \(w_t = \left( 1,0,1,0 \right)^\top\). We'll assume that \(f\) is the sum of these features.

In [2]:
w = np.array([1,0,1,0])
f = np.dot(x[0,:],w)+np.random.randn()*4.; #added some randomness (to do: determine correct value for this)
print(f)
49.9422381407

How likely is this? We know some of the features of this user, e.g. \(k_i = [1,1,0,1]^\top\). We can create a new distribution, with the summary as one of the dimensions:

(\(f, x_2, x_4\))

If the previous covariance matrix was:

\(\Sigma =\left( \begin{array}{cccc} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} & \Sigma_{14} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \Sigma_{24} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} & \Sigma_{34} \\ \Sigma_{41} & \Sigma_{42} & \Sigma_{43} & \Sigma_{44} \\ \end{array} \right) \)

It is now:

\(\Sigma =\left( \begin{array}{ccc} \Sigma_{ff} & \Sigma_{f2} & \Sigma_{f4} \\ \Sigma_{2f} & \Sigma_{22} & \Sigma_{24} \\ \Sigma_{4f} & \Sigma_{42} & \Sigma_{44} \\ \end{array} \right) \)

Where: \(\Sigma_{fi} = \sum_{j \in F} \Sigma_{ij}\)

(F is the set of features included, i.e. the non-zeros in \(w_t\)).

\(\mu_{f} = \sum_{j \in F} \mu_{j}\)

Side note on covariances

Ignoring the means for now, assume three random variables \(X_i, X_j, X_k\), and suppose our summary is the sum of two of them, \(X_i + X_j\). The covariance between this summary and the third variable is \(\langle(X_i + X_j) X_k\rangle = \langle X_i X_k + X_j X_k\rangle = \langle X_i X_k\rangle + \langle X_j X_k\rangle = \Sigma_{ik} + \Sigma_{jk}\).
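A quick numerical sanity check of this identity (a sketch using float versions of the In [1] covariance values, so the fractional entries aren't truncated; i=age, j=distance, k=income):

import numpy as np

# Check Cov(X_i + X_j, X_k) = Sigma_ik + Sigma_jk empirically.
mu_chk = np.array([39.9, 21.3, 20, 11.9])
cov_chk = np.diag([11.**2, 11.**2, 10.**2, 3.**2])
cov_chk[0,1] = cov_chk[1,0] = 10.5**2
cov_chk[0,2] = cov_chk[2,0] = -6.**2
cov_chk[1,2] = cov_chk[2,1] = -2.8**2
cov_chk[0,3] = cov_chk[3,0] = 3.45**2
cov_chk[1,3] = cov_chk[3,1] = 4.5**2
samples = np.random.multivariate_normal(mu_chk, cov_chk, 200000)
s = samples[:,0] + samples[:,2]                                   # the summary: age + distance
print(np.cov(s, samples[:,1])[0,1], cov_chk[0,1] + cov_chk[2,1])  # empirical vs analytic covariance with income

The two printed values should agree up to sampling error.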

Back to the problem.

We can test the likelihood of the summary by looking at the pdf.

In [3]:
import math
def log_multv_normal(mu,cov,xs):
    # log-pdf of a multivariate normal with mean mu and covariance cov,
    # evaluated at each row of xs
    logp = np.log(math.pow(np.linalg.det(cov),-.5))
    logp += np.log(math.pow(2*np.pi,-len(mu)/2.))   # note the float division
    inv = np.linalg.inv(cov)
    logps = np.zeros([xs.shape[0],1])
    for i,x in enumerate(xs):
        logps[i] = logp + (-.5 * np.dot(np.dot((x-mu),inv),(x-mu).T))
    return logps
In [4]:
def popsample(mu,cov,w,f,x):
    '''Determines pdf of x being sampled from distribution. w is a vector of elements included in summary.'''
    exc = np.nonzero(w==0)[0]  # indices of features excluded from the summary
    inc = np.nonzero(w==1)[0]  # indices of features included in the summary
    newcov = np.zeros([1+len(exc),1+len(exc)])
    newcov[0,0] = cov[np.ix_(inc,inc)].sum()
    for i,idx in enumerate(exc):
        newcov[i+1,0] = cov[idx,inc].sum()
        newcov[0,i+1] = cov[idx,inc].sum()
    newcov[np.ix_(np.arange(1,1+len(exc)),np.arange(1,1+len(exc)))] = cov[np.ix_(exc,exc)]

    newmu = np.zeros(1+len(exc))
    newmu[0] = mu[inc].sum()
    newmu[1:] = mu[exc]

    # assume we know all of the excluded parameters - we need to marginalise if we don't.
    t = np.hstack([f,x[exc]])
    t = np.array([t]).T  # get it into the right shape
    l = log_multv_normal(newmu,newcov,t.T)
    return l,newmu,newcov,t

print(x[0,:])
l,newmu,newcov,t = popsample(mu,cov,w,f,x[0,:])
[ 30.  25.  20.   9.]

In [5]:
import scipy.stats
inv = np.linalg.inv(newcov)
r = np.dot(np.dot((t.T-newmu),inv),(t.T-newmu).T).item()
print("CDF outside Mahalanobis radius %0.3f of centre: %0.3f" % (r, 1-scipy.stats.chi2.cdf(r, len(newmu))))
CDF outside Mahalanobis radius 7.429 of centre: 0.059

In [6]:
#alternative less likely example
x[0,:] = [39,29,10,11.9]
f = np.dot(x[0,:],w)+np.random.randn()*4.;
l,newmu,newcov,t = popsample(mu,cov,w,f,x[0,:])
inv = np.linalg.inv(newcov)
r = np.dot(np.dot((t.T-newmu),inv),(t.T-newmu).T).item()
print("CDF outside Mahalanobis radius %0.3f of centre: %0.3f" % (r, 1-scipy.stats.chi2.cdf(r, len(newmu))))
CDF outside Mahalanobis radius 6.395 of centre: 0.094

Conditional probability distribution of a multivariate Gaussian

Suppose \(\mathbf{x}\), made up of two sub-vectors \(\mathbf{x_i}\) and \(\mathbf{x_j}\), is normally distributed with mean \(\mathbf{\mu}\) and covariance \(\Sigma\), where both parameters are partitioned into the components associated with the two sub-vectors.

\(\mathbf{\mu} =\left( \begin{array}{c} \mu_{i} \\ \mu_{j} \\ \end{array} \right) \)

\(\Sigma =\left( \begin{array}{cc} \Sigma_{ii} & \Sigma_{ij} \\ \Sigma_{ji} & \Sigma_{jj} \\ \end{array} \right) \)

The conditional distribution of \(\mathbf{x_i}\) given \(\mathbf{x_j}\) is normal with mean:

\(\mathbf{\mu_{i|j}} = \mathbf{\mu_i} + \Sigma_{ij} \Sigma_{jj}^{-1} \left(\mathbf{x_j}-\mathbf{\mu_j} \right)\)

\(\Sigma_{i|j} = \Sigma_{ii} - \Sigma_{ij} \Sigma_{jj}^{-1} \Sigma_{ij}^\top\)
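A small helper implementing these two formulas (a sketch; the function name and argument layout are mine, not from the notebook cells above):

import numpy as np

def condition_gaussian(mu, cov, i_idx, j_idx, xj):
    # Mean and covariance of x_i given x_j = xj, for a joint Gaussian (mu, cov).
    i_idx, j_idx = np.asarray(i_idx), np.asarray(j_idx)
    S_ij = cov[np.ix_(i_idx, j_idx)]
    K = np.dot(S_ij, np.linalg.inv(cov[np.ix_(j_idx, j_idx)]))  # Sigma_ij Sigma_jj^-1
    mu_cond = mu[i_idx] + np.dot(K, xj - mu[j_idx])
    cov_cond = cov[np.ix_(i_idx, i_idx)] - np.dot(K, S_ij.T)
    return mu_cond, cov_cond

For example, condition_gaussian(np.array([1.,2.]), np.array([[3.,2.],[2.,4.]]), [0], [1], np.array([4.])) gives mean 2.0 and variance 2.0 for \(X_1\) given \(X_2=4\).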

The probability density of \(f=\sum_i X_i\)

We want to understand how much knowledge we gain about a given feature \(x_i\).

We know from above that, given the set of features \(F\) included in the summary random variable \(f\), its mean and covariances are:

\(\mu_{f} = \sum_{j \in F} \mu_{j}\)

\(\Sigma_{fi} = \sum_{j \in F} \Sigma_{ij}\)

\(\Sigma_{ff} = \sum_{i \in F} \sum_{j \in F} \Sigma_{ij}\)

We are interested in what our new estimates are for \(X_i\).

\(\mathbf{\mu_{i|f}} = \mathbf{\mu_i} + \Sigma_{if} \Sigma_{ff}^{-1} \left( f_v-\mu_f \right)\)

\(\Sigma_{i|f} = \Sigma_{ii} - \Sigma_{if} \Sigma_{ff}^{-1} \Sigma_{if}^\top\)

Example:

\(\mathbf{\mu} =\left( \begin{array}{c} 1 \\ 2 \\ \end{array} \right) \)

\(\Sigma =\left( \begin{array}{cc} 3 & 2 \\ 2 & 4 \\ \end{array} \right) \)

In [7]:
D = 2;
mu = np.array([1,2]);
cov = np.array([[3,2],[2,4]]);
x = np.random.multivariate_normal(mu,cov,3000);
plt.plot(x[:,0],x[:,1],'+')
plt.xlim([-6,6])
plt.ylim([-6,6])
plt.plot([1],[2],'or')
plt.figure()
Out[7]:
(Scatter plot of the 3000 samples from this 2D Gaussian; the red dot marks the mean \((1, 2)\).)

We are told that the summary \(f=5\), so this means we know that \(X_1 + X_2 = 5\). What do we now know about \(X_1\) and \(X_2\)?

Before, the marginal on \(X_1\) had \(\mu_1 = 1\) and \(\sigma_1^2 = 3\).

Substituting into the earlier equations:

\(\mu_{f} = 1 + 2 = 3\)

\(\Sigma_{f1} = 3+2 = 5\)

\(\Sigma_{f2} = 2+4 = 6\)

\(\Sigma_{ff} = 3+2+2+4 = 11\)

\(\mathbf{\mu_{1|f}} = 1 + 5 \times 11^{-1} \times \left( 5 - 3 \right) = 1.91\)

\(\Sigma_{1|f} = 3 - 5 \times 11^{-1} \times 5 = 3 - 2.27 = 0.73\)

\(\mathbf{\mu_{2|f}} = 2 + 6 \times 11^{-1} \times \left( 5 - 3 \right) = 3.09\)

\(\Sigma_{2|f} = 4 - 6 \times 11^{-1} \times 6 = 4 - 3.27 = 0.73\)
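The same numbers fall out if we evaluate the formulas numerically (a quick sanity check, reusing \(\mu\) and \(\Sigma\) from In [7]):

import numpy as np

mu_ex = np.array([1., 2.])
cov_ex = np.array([[3., 2.], [2., 4.]])
f_obs = 5.0                                     # the revealed summary X_1 + X_2

mu_f = mu_ex.sum()                              # 3
S_ff = cov_ex.sum()                             # 11
S_f = cov_ex.sum(axis=1)                        # [5, 6] = [Sigma_f1, Sigma_f2]
print(mu_ex + S_f/S_ff*(f_obs - mu_f))          # conditional means: [1.909..., 3.090...]
print(np.diag(cov_ex) - S_f**2/S_ff)            # conditional variances: [0.727..., 0.727...]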

In [8]:
D = 2;
mu = np.array([1,2]);
cov = np.array([[3,2],[2,4]]);
x = np.random.multivariate_normal(mu,cov,3000);
plt.plot(x[:,0],x[:,1],'+')
plt.xlim([-6,6])
plt.ylim([-6,6])
plt.plot([1],[2],'or')
plt.plot([10,-5],[-5,10],'k-')
plt.vlines(0,-6,6)
plt.hlines(0,-6,6)
plt.plot([1.91,1.91],[3.09-np.sqrt(0.72),3.09+np.sqrt(0.72)],'k-')
plt.plot([1,1],[2-np.sqrt(4),2+np.sqrt(4)],'k-')
Out[8]:
(The same scatter, with the line \(x_1+x_2=5\) and two vertical bars showing the spread of \(x_2\) before and after the summary is revealed.)

The two vertical lines indicate the spread of likely \(x_2\) values before (with red dot) and after the summary is revealed, given the correlation structure.

To add random noise to the summary value \(f\) we can modify the covariances. One might assume that adding uncorrelated multivariate Gaussian noise (diagonal covariance, with per-feature noise standard deviations \(\sigma_i\)) would make sense. However, the articles I've read so far suggest that non-Gaussian noise is better (to check/read).

\(\mu_{f} = \sum_{j \in F} \mu_{j}\)

\(\Sigma_{fi} = \sum_{j \in F} \Sigma_{ij}\)

\(\Sigma_{ff} = \sum_{i \in F} \sum_{j \in F} \left( \Sigma_{ij} + s_{ij}^2 \right)\)

where

\(s_{ij}=\sigma_i\) if \(i=j\)

\(s_{ij}=0\) otherwise

so we can just add \(\sum_{i \in F} \sigma_i^2\) to \(\Sigma_{ff}\):

\(\Sigma_{ff} = \sum_{i \in F} \sum_{j \in F} \Sigma_{ij} + \sum_{i \in F} \sigma_i^2\)

Note: adding uncertainty to one feature's contribution doesn't help any more than adding it to another. Not a surprise, as we only observe a single, summed value. What would help is scaling some values (as this will drag them further into the noise generated by this extra term).

So the extra information we gain about a feature \(X_i\) is the reduction in its variance, from \(\Sigma_{ii}\) to \(\Sigma_{i|f}\).
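To see the effect of the extra noise term numerically, a sketch (using float versions of the In [1] covariance values, the age+distance summary, and the \(\sigma=4\) noise from In [2]; these are illustrative choices, not final ones) comparing the posterior variance on income with and without the noise:

import numpy as np

cov_n = np.array([[121.,   110.25, -36.,  11.9025],
                  [110.25, 121.,   -7.84, 20.25  ],
                  [-36.,   -7.84,  100.,  0.     ],
                  [11.9025, 20.25, 0.,    9.     ]])
F = [0, 2]                            # features in the summary (age, distance)
i = 1                                 # the feature we care about (income)
S_if = cov_n[i, F].sum()              # Sigma_if
for noise_var in [0.0, 4.0**2]:       # no noise vs. the sigma=4 used in In [2]
    S_ff = cov_n[np.ix_(F, F)].sum() + noise_var
    print(noise_var, cov_n[i, i], cov_n[i, i] - S_if**2/S_ff)  # prior vs posterior variance

With these numbers the posterior variance on income only rises from roughly 51 to roughly 57 (against a prior variance of 121), consistent with the note above that noise on the single summed value doesn't buy much.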

Differential Privacy

From The Algorithmic Foundations of Differential Privacy by Dwork and Roth

General definition (ideal): A promise by a curator to a data subject that they will not be affected by being included in a database, regardless of what other sources are available. An analyst will learn nothing about an individual while learning something about the population.

Key phrase: Linkage attack - using data from a second, auxiliary data source to infer additional information from our 'anonymised' database.

Features of differential privacy and privacy in general

The Model of Computation

Key phrase: Non-interactive model (a database that will only have a fixed, known set of queries made against it). This gives the best accuracy for a given disclosure, as the curator can correlate the noise, knowing the structure of the queries. Maybe useful for the privacy mechanism you're envisaging?

Side note: Surveying illegal behaviour: The participant is asked to flip a coin. If tails they report truth. If heads, they flip the coin again. Report yes if heads, no if tails.

Flipping coins to randomise answers
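A quick sketch of the coin-flipping scheme from the side note above, including the correction that recovers an estimate of the true rate (the 10% true rate is made up for illustration):

import numpy as np

def randomised_response(truth):
    if np.random.rand() < 0.5:        # first flip: tails -> report the truth
        return truth
    return np.random.rand() < 0.5     # heads: flip again, report yes if heads, no if tails

true_rate = 0.1                        # made-up true rate of the sensitive behaviour
answers = [randomised_response(np.random.rand() < true_rate) for _ in range(100000)]
observed = np.mean(answers)
print(observed, 2*observed - 0.5)      # E[observed] = 0.25 + 0.5*true_rate, so invert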

Distance between databases

Use the \(l_1\) norm. \(||x||_1 = \sum_{i=1}^{|\chi|} |x_i|\) where \(x_i\) is the number of entries of type i. So the distance between two databases is \(||x-y||_1\).
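A toy illustration of the histogram view (the counts here are made up): two databases are adjacent when the \(l_1\) distance between their histograms is at most 1.

import numpy as np

x = np.array([3, 1, 0, 5])        # counts of entries of each type
y = np.array([3, 1, 0, 4])        # one record of the last type removed
z = np.array([2, 1, 1, 5])        # one record changed from type 0 to type 2
print(np.abs(x - y).sum())        # 1 -> x and y are adjacent
print(np.abs(x - z).sum())        # 2 -> changing a record moves distance 2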

The Differential Privacy Rule

A randomisation algorithm \(M\) is \((\varepsilon, \delta)\)-differentially private if \(\forall S \subseteq \mathrm{Range}(M)\) and for all \(x,y \in \mathbb{N}^{|\chi|}\) such that \(||x-y||_1 \leq 1\):

\(P[M(x) \in S] \leq e^\varepsilon P[M(y) \in S] + \delta\)

In simpler terms

Suppose we have a database \(x\) that has been modified by at most one item to become database \(y\). The probability of any given output of the randomisation function applied to \(x\) will be similar to the probability of that output under \(y\), with a bound of:

\(P_x \leq e^\varepsilon P_y + \delta\)

which means that the two probabilities \(P_x\) and \(P_y\) must be similar.

If \(\delta \neq 0\) there might be some (low-probability) outputs for which \(M(y)\) is far more likely than \(M(x)\).

Privacy Loss

Given the output of the randomisation function is \(\xi\), the privacy loss is defined as the log-likelihood ratio of the two databases:

\(L^{(\xi)}_{M(x)||M(y)} = ln {{P[M(x)=\xi]} \over {P[M(y)=\xi]}}\)
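For example, for the coin-flipping scheme above (treating the two 'databases' as an individual whose true answer is yes vs. no), \(P[\text{report yes} \mid \text{yes}] = 3/4\) and \(P[\text{report yes} \mid \text{no}] = 1/4\), so the privacy loss is

\(L = \ln \frac{3/4}{1/4} = \ln 3 \approx 1.1\)

i.e. (if I've read this right) that scheme is \((\ln 3, 0)\)-differentially private.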

Summary:

The \((\varepsilon, \delta)\) - differential privacy rule is:

\(P[M(x) \in S] \leq e^\varepsilon P[M(y) \in S] + \delta\)

which implies that:

for all adjacent x,y, the absolute value of \(L\) will be bounded by \(\varepsilon\) with probability of at least \(1-\delta\).

Separate queries

Worst case, if \(k\) queries are made, then:

\(\varepsilon' = k \varepsilon\)

\(\delta' = k \delta\)

Group privacy

(particularly relevant)

Worst case, if \(k\) rows are included, then:

\(\varepsilon' = k \varepsilon\)

\(\delta' = k e^{(k-1)\varepsilon} \delta\)

Individual utility

An individual's utility will not be harmed by more than a factor of \(e^\varepsilon \approx 1 + \varepsilon\)
