Lesson 3: Matrix Factorisation - sum squared error

Simon Funk's Version

This tutorial is completely based on the explanation given by Simon Funk.

First the quick matrix version;

We have a matrix \(A\) which we are trying to approximate with the product of \(U\) and \(V\):

\(A \approx UV\)

We do this by minimising the sum squared error on all the elements in the matrix. If the row of \(U\) can be written \(u_i\) and the column of \(V\) can be written \(v_j\); the calculation for the \(i,j\) element of \(A\) will be the matrix product of these two vectors; \(u_i . v_j\). So we can write the error function as:

\(E = \sum_i \sum_j \left(A_{ij} - u_i v_j \right)^2\)

We want to minimise \(E\), with respect to these two vectors, \(U_i\) first:

\({{\delta E \over \delta u_i}} = 2 \sum_j \left(A_{ij} - u_i v_j \right) v_j\)

To detour slightly, and check this long-hand (i.e. avoid vector calculus):

The earlier expression for \(E\) can be written as:

\(E = \sum_i \sum_j \left(A_{ij} - \sum_f (U_{if} V_{fj}) \right)^2\)

Where \(f\) is the index of the 'feature'.

We can again differentiate, but this time with respect to \(U_{if}\):

\({{\delta E \over \delta U_{if}}} = 2 \sum_j \left(A_{ij} - \sum_f (U_{if} V_{fj}) \right) V_{fj}\)

Where we can simplify f (U{if} V_{fj}) as \(u_i v_j\) again, and note that if we apply the rule across all features \(f\) we will have the matrix-expression from earlier.

The same logic can be applied to find the derivative wrt \(v_j\), leaving us with:

\({{\delta E \over \delta v_j}} = 2 \sum_j \left(A_{ij} - u_i v_j \right) u_i\)

We can solve this numerically (which will be useful later) by simply following the gradient of the error function. Note also we can do this with a loop, over all j, (ignore the 2).

So one iteration looks like:

\(D = A - U.V\)

for every value of \(j\): \(\Delta u_i = r_{l} D_{ij} v_j\)

for every value of \(i\): \(\Delta v_j = r_{l} D_{ij} u_i\)

where \(r_l\) is the learning rate.

Properly, \(U\) and \(V\) should be updated once per iteration, but this will do for now (with a low learning rate).

The data from the last notebook;

In [19]:
import numpy as np
np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)

A=np.array([[5,0,5,0],[4,1,3,0],[0,4,1,5],[5,1,3,1],[4,0,4,1],[1,3,0,4],[1,3,0,3],[3,2,4,1],[0,5,0,5],[0,4,1,4]])
A
Out[19]:
array([[5, 0, 5, 0],
       [4, 1, 3, 0],
       [0, 4, 1, 5],
       [5, 1, 3, 1],
       [4, 0, 4, 1],
       [1, 3, 0, 4],
       [1, 3, 0, 3],
       [3, 2, 4, 1],
       [0, 5, 0, 5],
       [0, 4, 1, 4]])
In [20]:
W=np.array([[1,1,0,1],[0,1,1,0],[1,1,0,1],[1,0,1,1],[1,0,1,1],[1,0,0,1],[0,1,1,1],[0,1,1,0],[1,1,1,0],[1,1,1,1]])
W
Out[20]:
array([[1, 1, 0, 1],
       [0, 1, 1, 0],
       [1, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 0, 1, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 1],
       [0, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1]])

The code to perform the algorithm described above,

In [26]:
#factorises matrix A, using features number of factors
def factoriseMatrix(A,features):
    U = np.random.randn(A.shape[0],features) #initialising with random numbers
    V = np.random.randn(features,A.shape[1]) #seems to be good, if non-deterministic
    lr = 0.05      #the learning rate
    for it in range(500):  #could replace with a check on when err**2 has stopped changing much
        lr = lr * 0.995    #simulated annealing sort of
        err = A - np.dot(U,V)  #how far we are from the actual matrix
        deltaU = np.zeros(U.shape);
        deltaV = np.zeros(V.shape);
        for i in range(A.shape[0]): #for each user...
            for j in range(A.shape[1]): #for each film...     
                deltaU[i,:] += err[i,j]*V[:,j]; #add the derivative to the vector
                deltaV[:,j] += err[i,j]*U[i,:];
        U += deltaU * lr;
        V += deltaV * lr;
    return U,V,np.sum(err**2) 

#run the algorithm
U,V,e = factoriseMatrix(A,2)
print "%0.2f" % e
6.71

Compare the approximation with 2 factors with the actual data (with four columns):

In [27]:
print np.dot(U,V)
print A
[[ 5.25  0.13  4.71 -0.1 ]
 [ 3.71  0.54  3.34  0.41]
 [ 0.41  4.34  0.53  4.71]
 [ 4.24  1.04  3.84  0.94]
 [ 4.18  0.6   3.77  0.46]
 [ 0.48  3.36  0.56  3.65]
 [ 0.5   2.87  0.56  3.1 ]
 [ 3.64  1.51  3.31  1.48]
 [-0.09  4.77  0.11  5.21]
 [ 0.43  3.84  0.53  4.17]]
[[5 0 5 0]
 [4 1 3 0]
 [0 4 1 5]
 [5 1 3 1]
 [4 0 4 1]
 [1 3 0 4]
 [1 3 0 3]
 [3 2 4 1]
 [0 5 0 5]
 [0 4 1 4]]

We can see something of how the two feature vectors are organised by looking at \(U\) and \(V\). It is interesting that even these two rows aren't very independent, but if we reduce the number of features to one the estimate becomes very poor. This is because we've not subtracted the mean, explaining why two features are required.

In [28]:
V
Out[28]:
array([[-1.1 , -2.13, -1.07, -2.28],
       [ 1.77, -0.9 ,  1.55, -1.06]])

If we subtract the mean from the two dimensions of A, and try with just one factor,

In [29]:
U,V,e = factoriseMatrix(A,1)
print "Error, with one factor, without demeaning: %0.2f" % e

meanA = A.mean(axis=1)
B = A - meanA[:, None]

meanB = B.mean(axis=0)
C = B - meanB[None, :]

U,V,e = factoriseMatrix(C,1)
print "Error, with one factor, *with* demeaning: %0.2f" % e
print np.dot(U,V)+meanA[:, None]+meanB[None, :]
print A
print "---"
print "The V matrix:"
print V
Error, with one factor, without demeaning: 134.92
Error, with one factor, *with* demeaning: 6.62
[[ 5.28  0.14  4.69 -0.11]
 [ 3.75  0.53  3.3   0.42]
 [ 0.42  4.35  0.52  4.71]
 [ 4.25  1.03  3.81  0.92]
 [ 4.21  0.6   3.74  0.46]
 [ 0.5   3.35  0.51  3.64]
 [ 0.54  2.85  0.51  3.1 ]
 [ 3.67  1.53  3.31  1.49]
 [-0.07  4.78  0.09  5.2 ]
 [ 0.46  3.85  0.51  4.17]]
[[5 0 5 0]
 [4 1 3 0]
 [0 4 1 5]
 [5 1 3 1]
 [4 0 4 1]
 [1 3 0 4]
 [1 3 0 3]
 [3 2 4 1]
 [0 5 0 5]
 [0 4 1 4]]
---
The V matrix:
[[ 1.7  -1.47  1.46 -1.68]]

Clearly, to use one factor, we needed to subtract the mean. Hopefully it's fairly clear why this is the case (hint: If the mean isn't subtracted, one factor is effectively dedicated to adding the mean, while the other factor can deal with the differences between columns).

Missing Data

Finally, handling missing data. This is roughly the same function, but in this case only the non-zero elements of \(W\) are used. Note that \(A\) is really still a complete matrix, so we can compare the results from the function with the elements the function ignored, effectively using them as test data.

In [31]:
#W is a matrix of missing features (1=available, 0=missing)
#this function also subtracts the mean of A
def factoriseMatrixMissingData(A,W,features):
        
    meanA = A.mean(axis=1)
    A = A - meanA[:, None]
    
    meanB = A.mean(axis=0)
    A = A - meanB[None, :]

    U = np.random.randn(A.shape[0],features)
    V = np.random.randn(features,A.shape[1])
    #U = np.ones([A.shape[0],2])*.0
    #V = np.ones([2,A.shape[1]])*.0
    lr = 0.05
    for it in range(200):
        lr = lr * 0.99
        err = (A - np.dot(U,V))*W
        deltaU = np.zeros(U.shape);
        deltaV = np.zeros(V.shape);
        for i in range(A.shape[0]): #for each user...
            for j in range(A.shape[1]): #for each film...     
                deltaU[i,:] += err[i,j]*V[:,j];
                deltaV[:,j] += err[i,j]*U[i,:];
        U += deltaU * lr;
        V += deltaV * lr;
    guess = np.dot(U,V)+meanA[:, None]+meanB[None, :]
    TrainErr =np.sum(((A - guess)*W)**2)
    TestErr =np.sum(((A - guess)*(1-W))**2)

    return U,V,TrainErr,TestErr, guess

Running the above function with different numbers of features (between 1 to 4), shows that the test data becomes less well fitted with more features (an example of over fitting). Not surprisingly, looking at the data, there's only one clear feature, which is why that does best.

In [32]:
errs = [];
for testit in range(10):
    U,V,TrainErr,TestErr,guess = factoriseMatrixMissingData(A,W,1) #ONE FACTOR
    errs.append(TestErr);

print errs
print guess
print A
print V
[74.467275062268968, 74.467295168100677, 74.467168944365369, 74.467091780692087, 74.467121089418882, 74.467296872929225, 74.467301111848627, 74.467326024472044, 74.46659382048972, 74.466155160927954]
[[ 5.29  0.3   4.59  0.06]
 [ 3.38  0.93  2.93  0.87]
 [ 0.17  4.42  0.4   4.81]
 [ 4.32  1.08  3.79  0.96]
 [ 4.08  0.82  3.55  0.7 ]
 [ 0.54  3.22  0.61  3.5 ]
 [ 0.25  3.    0.33  3.29]
 [ 3.89  1.43  3.44  1.36]
 [-0.25  4.76  0.06  5.2 ]
 [ 0.37  3.81  0.52  4.14]]
[[5 0 5 0]
 [4 1 3 0]
 [0 4 1 5]
 [5 1 3 1]
 [4 0 4 1]
 [1 3 0 4]
 [1 3 0 3]
 [3 2 4 1]
 [0 5 0 5]
 [0 4 1 4]]
[[ 1.52 -1.22  1.24 -1.41]]

In [33]:
errs = [];
for testit in range(10):
    U,V,TrainErr,TestErr,guess = factoriseMatrixMissingData(A,W,2) #TWO FACTORS
    errs.append(TestErr);

print np.mean(errs)
90.2272590926

In [34]:
errs = [];
for testit in range(10):
    U,V,TrainErr,TestErr,guess = factoriseMatrixMissingData(A,W,3) #THREE FACTORS
    errs.append(TestErr);

print np.mean(errs)
99.1708252073

In [35]:
errs = [];
for testit in range(10):
    U,V,TrainErr,TestErr,guess = factoriseMatrixMissingData(A,W,4) #FOUR FACTORS
    errs.append(TestErr);

print np.mean(errs)
103.168736084

Next, applying this method to the movieLens data...