Monte Carlo Markov Chain, Bayes and Matrix Factorisation

The method in the previous notebook was based on minimising the sum-squared-error. There are fundamentally good reasons for choosing this as a cost function, which might become apparent in the following. In this notebook, I review the idea of the monte carlo markov chain, and Gibbs sampling (a simple MCMC method), most of this is from Bishop. I next apply these ideas to the paper by Salakhutdinov and Mnih (2008).

Monte Carlo Markov Chain is a method for estimating the distribution of a probability distribution. It works by repeatedly estimating a new state from a previous state. More likely states will occur more frequently, and so, over time the states visited will begin to represent the true distribution. There are proper proofs of this!

We have some mysterious high-dimensional distribution \(p(\mathbf{z})\), which we want to estimate.

If we start in state \(\mathbf{z}^{(1)}\) then we create a new state \(\mathbf{z}^{(2)}\) sampled from \(q(\mathbf{z} | \mathbf{z}^{(1)})\)

Basic Metropolis

Based on Bishop

We consider a vector \(\mathbf{z}\), which has a probability of \(p(\mathbf{z}) = \tilde{p}(\mathbf{z}) / Z_p\). Where \(\tilde{p}(\mathbf{z})\) is an un-normalised distribution and \(Z_p\) is the normalising constant.

For the basic metropolis version, we assume that the direction of events doesn't matter: \(q(\mathbf{z}^{(1)} | \mathbf{z}^{(2)}) = q(\mathbf{z}^{(2)} | \mathbf{z}^{(1)})\) (symmetric).

There are a few other rules, but we'll first make a little function to generate a new vector, and another to tell us its probability.

In [445]:
import numpy as np

#Generate a new vector, sampled from a gaussian distribution centred at oldVector, with standard deviaton s
def genNewVector(oldVector=[0.,0.],s=2):
    return oldVector + np.random.randn(1,2)*s

#Probability of vect
def prob(vect):
    mean = np.array([3,3])
 #   covar = np.array([[3,3],[0,1]])
    covar = np.array([[2,1],[1,2]])
    #left out normalisation term
    return np.exp(-1/2*(np.dot(np.dot((vect - mean) , np.linalg.inv(covar)) , (vect-mean).T)))

So we now do the Metropolis sampling...

A new sample \(\mathbf{z}^*\) is made, and is accepted with probability \(min \left(1,p(\mathbf{z}^*)/p(\mathbf{z}^{(\tau)})\right)\). In other words, it is accepted with a probability that is the proportion of the likelihood of the new one over the last one, and if the new one is better then it is accepted with 100% probability.

In [446]:
def doMetropolis(s=2,N=1000):
    samples = np.zeros([0,2]);
    rejects = np.zeros([0,2]);
    v = genNewVector();
    oldp = prob(v)
    for it in range(N): 
        newv = genNewVector(v,s)
        if (np.random.sample()<prob(newv)/prob(oldp)):
            v = newv.copy()
        else:
            rejects = np.vstack([rejects,newv])
        samples = np.vstack([samples,v])
    return samples, rejects

We now apply the above algorithm. In the plot below we can see the random walk that the Metropolis sampling takes. With red crosses being rejected samples. Note also that many of the sample dots will have several samples at that location. I've jittered the points to show them more clearly.

In [460]:
%matplotlib inline
import matplotlib.pyplot as plt

samples, rejects = doMetropolis(2)
plt.scatter(samples[0::10,0]+np.random.randn(len(samples[0::10,1]))*.1,samples[0::10,1]+np.random.randn(len(samples[0::10,1]))*.1,marker='.')
plt.scatter(rejects[0::10,0],rejects[0::10,1],marker='x',color='r')
plt.xlim([-10,15]);
plt.ylim([-5,15]);

Random walk very slow at covering state space: distance covered \(\propto \sqrt{\tau}\)

Bit of a problem if the new vector generation has too small a variance:

In [406]:
samples, rejects = doMetropolis(0.2)
plt.scatter(samples[0::10,0],samples[0::10,1],marker='.',color='b')
plt.scatter(rejects[0::10,0],rejects[0::10,1],marker='x',color='r')
plt.xlim([-15,15]);
plt.ylim([-5,15]);

...or too big...

In [407]:
samples, rejects = doMetropolis(20)
plt.scatter(samples[0::10,0],samples[0::10,1],marker='.',color='b')
plt.scatter(rejects[0::10,0],rejects[0::10,1],marker='x',color='r')
plt.xlim([-15,15]);
plt.ylim([-5,15]);

Gibbs Sampling

Instead of resampling the whole vector \(\mathbf{z}\) we sample just one value, conditioned on the remaining element(s):

\(p(z_i|\mathbf{z}_{\i})\)

E.g. for one iteration, we must first find a new value for \(z_1\):

\(z_1^{(2)} = p(z_1|\mathbf{z}_{\1}^{(1)})\)

And again for \(z_2\):

\(z_2^{(2)} = p(z_2|\mathbf{z}_{\2}^{(1)})\)

Note on the Metropolis-Hastings Algorithm [recommend one skips this section]

The Metropolis-Hastings algorithm (which is an extension of the Metropolis algorithm) can handle the situation in which transition probabilities vary, depending on direction. It does this by dividing the probability of a vector by the probability of arriving at that location, and finding the ratio of this value with the value at the prior vector location:

\(A(\mathbf{z^*},\mathbf{z}) = {{p(\mathbf{z^*})}\over{q_k(\mathbf{z^*}|\mathbf{z}})} / {{p(\mathbf{z})}\over{q_k(\mathbf{z}|\mathbf{z^*}})}\)

TODO: Explain why

In this case, the chance of getting there is the very conditional distribution we're sampling from, so these two terms cancel out, and so the new vector is accepted each time! TODO: Explain in more detail

Posterior and Marginals of a Multivariate Gaussian

To apply this to a multivariate normal distribution, we use the follow features of the conditional distribution.

Given a multivariate normal distribution \(p(\mathbf{x}_i, \mathbf{x}_j)\) with a mean vector made up of \(\mathbf{\mu}_{i}\) and \(\mathbf{\mu}_{j}\), and a covariance matrix made up of four matrices; \(\Sigma_{ii}, \Sigma_{ij}, \Sigma_{ji}, \Sigma_{jj}\). The mean and covariance of the conditional distribution of \(p(\mathbf{x}_i|\mathbf{x}_j)\)

\(\mathbf{\mu}_{i|j} = \mathbf{\mu}_i + \Sigma_{ij} \Sigma_{jj}^{-1} (\mathbf{x_j}-\mathbf{\mu}_j)\)

\(\Sigma_{i|j} = \Sigma_ii - \Sigma_{ij} \Sigma_{jj}^{-1} \Sigma_{ji}\)

In [463]:
def genNewVector(sampleIndex=1,oldVector=np.array([0.,0.])):
    #our 'secret' hidden distribution we are trying to determine:
    mean = np.array([3,3])
    covar = np.array([[2,1],[1,2]])
    
    #I'm not sure how to handle this, the syntax seems to need to change depending on whether the components of the covar matrix are scalar or not
    conditionIndices = int(np.hstack([range(sampleIndex),range(sampleIndex+1,oldVector.shape[0])]))
    posterior_mean = mean[sampleIndex] + np.dot(np.dot(covar[sampleIndex,conditionIndices],(covar[conditionIndices][conditionIndices])**-1),oldVector[conditionIndices]-mean[conditionIndices])
#    posterior_mean = mean[sampleIndex] + np.dot(np.dot(covar[sampleIndex,conditionIndices],np.linalg.inv(covar[conditionIndices][:,conditionIndices])),oldVector[conditionIndices]-mean[conditionIndices])
    posterior_covar = covar[sampleIndex][sampleIndex] - np.dot(np.dot(covar[sampleIndex][conditionIndices],(covar[conditionIndices][conditionIndices])**-1),covar[sampleIndex,conditionIndices].T)
#    posterior_covar = covar[sampleIndex][sampleIndex] - np.dot(np.dot(covar[sampleIndex][conditionIndices],np.linalg.inv(covar[conditionIndices][:,conditionIndices])),covar[sampleIndex,conditionIndices].T)
    
    newSample = np.random.randn() * np.sqrt(posterior_covar) + posterior_mean
    newVector = oldVector.copy()
    newVector[sampleIndex] = newSample
    return newVector

#Repeatedly sample using Gibbs Sampling
def doGibbs(N=100):
    samples = np.zeros([0,2]);
    v = genNewVector();
    oldp = prob(v)
    for it in range(N):   #repeated sampling...
        for axis in [0,1]: #sample in the two axes in the data.
            newv = genNewVector(axis,v)
            v = newv
            samples = np.vstack([samples,v])
    return samples

We can see how Gibbs sampling works by plotting these samples:

In [464]:
import matplotlib.pyplot as plt
samples = doGibbs(100)
plt.figure(num=None, figsize=(16, 12), dpi=80, facecolor='w', edgecolor='k')
plt.plot(samples[:,0],samples[:,1])
Out[464]:
[<matplotlib.lines.Line2D at 0x7f03f2988c10>]
In [462]:
np.set_printoptions(precision=3)
np.set_printoptions(suppress=True)

Gibb's sampling for probabilistic matrix factorization

The SVD and similar techniques give us estimates for particular elements of \(R_ij\) but fail to tell us how accurate these estimates might be.

We can consider the estimates of \(R_ij\) as samples from a normal distribution, to take into account the uncertainty. Given a user specific latent feature vector \(\mathbf{U_i}\) and a movie latent feature vector \(\mathbf{V_j}\) our estimate for a given rating is \(\mathbf{U_i}^\top \mathbf{V_j}\). So the probability of a rating \(R_{ij}\) is:

\(P(R_{ij}|U,V,\alpha) = N(R_{ij}|U_i^\top V_j, \alpha)\)

If we assume all rating are independent, we can find the product over all \(i\) and \(j\).

Gibbs Sampling

From Salakhutdinov and Mnih (2008).

For Gibbs sampling to work we want to sample the conditional distribution over one of the feature vectors, given all the other features, etc. In otherwords, we want:

\(P(U_{i}|R,V,\alpha)\)

We can use Bayes' rule:

\(P(U_{i}|R,V,\alpha) = {{P(R|U_{i}, V, \alpha) P(U_{i}|V \alpha)} \over {P(R|V, \alpha)}}\)

For now I'm going to assume the prior is flat over the range we're interested in, and I'm going to ignore the denominator as this is constant over \(U_{i}\).

The next thing to note is that, by writing \(P(R|U_{i}, V, \alpha^{-1})\) as:

\(P(R_{1:}, R_{2:},...,R_{m:}|U_{i}, V, \alpha^{-1})\)

Where \(R_{1:}\) means the 1st row. We can assume that, as above the terms of \(R\) are independent:

\(P(R_{1:}, R_{2:},...,R_{m:}|U_{i}, V, \alpha^{-1}) = P(R_{1:}|U_{i}, V, \alpha^{-1}) \times P(R_{2:}|U_{i}, V, \alpha^{-1}) \times ... \times P(R_{m:}|U_{i}, V, \alpha^{-1})\)

We can see that the only term on the right which is dependent on \(U_i\) is \(P(R_{i:}|U_{i}, V, \alpha^{-1})\), and so, as the other terms are constant we can write that:

\(P(U_{i}|R,V,\alpha^{-1}) \propto P(R_{i:}|U_{i}, V, \alpha^{-1})\)

Factorising the \(R_{i:}\) vector again;

\(P(U_{i}|R,V,\alpha^{-1}) \propto \prod_{j=1}^M P(R_{ij}|U_{i}, V_{j}, \alpha^{-1})\)

Writing this expression using the normal distribution:

\(P(U_{i}|R,V,\alpha^{-1}) \propto \prod_{j=1}^M {1 \over {2 \pi \sqrt{\alpha^{-1}}}} exp \left[ -{{(R_{ij}-{U_i^\top V_j})^2} \over {2 \alpha^{-1}}} \right]\)

Moving the normalisation term into the 'proportional' part:

\(P(U_{i}|R,V,\alpha^{-1}) \propto exp \left[ \sum_{j=1}^M -{{( R_{ij}-{U_i^\top V_j})^2} \over {2 \alpha^{-1}}} \right] = exp \left[ \sum_{j=1}^M -{1 \over {2 \alpha^{-1}}}{(R_{ij}^2- 2 R_{ij}{U_i^\top V_j} - ({U_i^\top V_j})^2)} \right]\)

\(exp \left[ -{1 \over {2 \alpha^{-1}}} \left( \sum_{j=1}^M R_{ij}^2 - \sum_{j=1}^M {2 R_{ij}{U_i^\top V_j}} + \sum_{j=1}^M({U_i^\top V_j})^2) \right) \right]\)

Note we can write \((U_i^\top V_j)^2 = U_i^\top V_j V_j^\top U_i = U_i^\top (V_j V_j^\top) U_i\)

and further can move the \(U_i^\top U_i\) outside the brackets to one side.

\(exp \left[ -{1 \over {2 \alpha^{-1}}} \left( \sum_{j=1}^M R_{ij}^2 - \sum_{j=1}^M {2 R_{ij}{U_i^\top V_j}} + U_i U_i^\top \sum_{j=1}^M(V_j V_j^\top) \right) \right]\)

\(exp \left[ -{\sum_{j=1}^M(V_j V_j^\top) \over {2 \alpha^{-1}}} \left( \sum_{j=1}^M R_{ij}^2/\sum_{j=1}^M(V_j V_j^\top) - U_i^\top \sum_{j=1}^M {2 R_{ij}{V_j}}/\sum_{j=1}^M(V_j V_j^\top) + U_i U_i^\top \sum_{j=1}^M(V_j V_j^\top)/\sum_{j=1}^M(V_j V_j^\top) \right) \right]\)

\(exp \left[ -{\sum_{j=1}^M(V_j V_j^\top) \over {2 \alpha^{-1}}} \left( \sum_{j=1}^M R_{ij}^2/\sum_{j=1}^M(V_j V_j^\top) - U_i^\top \sum_{j=1}^M {2 R_{ij}{V_j}}/\sum_{j=1}^M(V_j V_j^\top) + U_i U_i^\top \right) \right]\)

If we write:

\(\Delta = \alpha \sum_{j=1}^M V_j V_j^\top\)

Then substitute this in:

\(exp \left[ -{\Delta \over 2} \left( \sum_{j=1}^M R_{ij}^2 \Delta^{-1} \alpha - U_i^\top \sum_{j=1}^M {2 R_{ij}{V_j}} \Delta^{-1} \alpha + U_i U_i^\top \right) \right]\)

We can write this as:

\(exp \left[ -{\Delta \over 2} \left( U_i - \sum_{j=1}^M {R_{ij}{V_j}} \Delta^{-1} \alpha \right)^2 \right]\)

This can be seen with a bit of handwaving, in particular, that \(\left( \sum_{j=1}^M {R_{ij}{V_j}} \Delta^{-1} \alpha \right)^2 = \sum_{j=1}^M R_{ij}^2 \Delta^{-1} \alpha\)

\((RV \alpha^{-1} V^{-2} \alpha)^2 = R^2 \alpha^{-1} V^{-2} \alpha\)

\((RV V^{-2})^2 = R^2 V^{-2}\)

\((RV^{-1})^2 = R^2 V^{-2}\)

\(R^2 V^{-2} = R^2 V^{-2}\)

In [479]:
In [479]:
In [480]:
#Calculates the mean and InverseVariance of the Normal Dist for P(U_index|R,V,alpha)
def CalcCondDistU(index,R,V,alpha):
    #Calculate outer product of V_j: with itself, sum them
    VVsum = np.zeros([V.shape[1],V.shape[1]])
    for j in range(V.shape[0]):
        VVsum += np.outer(V[j,:],V[j,:]) 
    
    #Inverse Covariance
    Lambda = alpha*VVsum
    #Calculate the sum of the products of V_j: with R_index,j
    VRsum = np.zeros([V.shape[1]])       
    for j in range(V.shape[0]):
        VRsum += R[index,j]*V[j,:]
        
    mu = alpha*np.dot(np.linalg.inv(Lambda),VRsum)
    return mu,Lambda
In [509]:
#Generate a target matrix...
U = np.array([[1,0],[2,1]]);
V = np.array([[1,2],[1,3],[2,1]]);
R = np.dot(U,V.transpose())

#Now scramble up the factor matrices....
U = np.random.randn(U.shape[0],U.shape[1])
V = np.random.randn(V.shape[0],V.shape[1])

#These matrices are used to store the sum of all the estimtes, which we then use to find the mean with.
newU = np.zeros(U.shape)
newV = np.zeros(V.shape)

Nit = 50 #number of iterations
Us = np.zeros(U.shape)
Vs = np.zeros(V.shape)
for it in range(Nit):
    for index in range(R.shape[0]):
        mu,InvCov = CalcCondDistU(index,R,V,1000.)
        newU[index,:] = np.random.multivariate_normal(mu,np.linalg.inv(InvCov))
    U = newU
    for index in range(R.shape[1]):
        mu,InvCov = CalcCondDistU(index,R.transpose(),U,1000.) #we can use the other function, just transpose etc...
        newV[index,:] = np.random.multivariate_normal(mu,np.linalg.inv(InvCov))
    V = newV
    
    Us += U
    Vs += V

U=Us/Nit
V=Vs/Nit

print "Estimate"
print np.dot(U,V.transpose())
print "Target"
print R
Estimate
[[ 0.998  1.002  2.015]
 [ 3.985  4.987  4.965]]
Target
[[1 1 2]
 [4 5 5]]

Although this can find the answer, it is quite unstable and needs regularisation. This is achieved by using a prior. Something we will look at in the next lesson....