Lesson 4: Matrix Factorisation for the MovieLens Data

To remind ourselves of the problem:

This is an example of 'collaborative filtering', defined by Wikipedia as either:

  1. The process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.

  2. A method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

Let's load our data and organise it. For this task we want the data in a matrix (called 'data') of users vs films, with the ratings filling the elements. We also need a weight matrix ('w') indicating which elements contain ratings. Note that these matrices are very sparse, and in a real implementation one would want to make use of this sparsity to optimise the algorithms.

In [2]:
import pandas as pd
ratings = pd.read_table('ml-1m/ratings.dat',sep='::',engine='python',names=['user','movie','rating','time'])

#for the purposes of this notebook we're just using the ratings...
#users = pd.read_table('ml-1m/users.dat',sep='::',engine='python',names=['user','gender','age','occupation','zip'])
#movies = pd.read_table('ml-1m/movies.dat',sep='::',engine='python',names=['movie','title','genre'])
#movielens = pd.merge(pd.merge(ratings,users),movies)

Putting this data into a massive matrix:

In [3]:
import numpy as np
data = np.zeros([np.max(ratings.user),np.max(ratings.movie)])
w = np.zeros([np.max(ratings.user),np.max(ratings.movie)])
for d in ratings.values:
    data[d[0]-1,d[1]-1]=d[2] #the rating user d[0] gave movie d[1]
    w[d[0]-1,d[1]-1]=1       #mark this element as observed
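
As mentioned above, these matrices are mostly empty. Here is a sketch of how the same ratings could be held in a sparse structure instead (this assumes scipy is available; the names 'sparse_data' and 'sparse_w' are just illustrative, and the rest of this notebook sticks with the dense arrays):

In []:
#a sketch of holding the ratings sparsely rather than in the dense 'data' and 'w' arrays
from scipy.sparse import coo_matrix
sparse_data = coo_matrix((ratings.rating, (ratings.user-1, ratings.movie-1)))
sparse_w = coo_matrix((np.ones(len(ratings)), (ratings.user-1, ratings.movie-1)))
print(sparse_data.shape)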

The function from the last lesson, which minimises the sum-squared error of the UV estimate of A over the observed elements, using gradient descent with 'features' latent factors:

In [4]:
#W is a mask of which ratings are available (1=available, 0=missing)
#this function also subtracts the row (user) and column (film) means of A
#note: the means are computed over all elements, including the zeros at missing entries
def factoriseMatrixMissingData(A,W,features):
    meanA = A.mean(axis=1)
    A = A - meanA[:, None]

    meanB = A.mean(axis=0)
    A = A - meanB[None, :]

    #random initialisation of the two factor matrices
    U = np.random.randn(A.shape[0],features)
    V = np.random.randn(features,A.shape[1])
    lr = 0.005 #learning rate, decayed a little each iteration
    for it in range(200):
        lr = lr * 0.99
        err = (A - np.dot(U,V))*W #error only counted where a rating exists
        deltaU = np.zeros(U.shape)
        deltaV = np.zeros(V.shape)
        for i in range(A.shape[0]): #for each user...
            for j in range(A.shape[1]): #for each film...
                deltaU[i,:] += err[i,j]*V[:,j]
                deltaV[:,j] += err[i,j]*U[i,:]
        U += deltaU * lr
        V += deltaV * lr

        #add the means back on to get predictions on the original rating scale
        guess = np.dot(U,V)+meanA[:, None]+meanB[None, :]
        #errors are measured in the mean-subtracted space, against UV directly
        TrainErr = np.sum(((A - np.dot(U,V))*W)**2)
        TestErr = np.sum(((A - np.dot(U,V))*(1-W))**2)
    #    print "it %i of 200 [%0.2f, %0.2f]" % (it,TrainErr,TestErr)
    return U,V,TrainErr,TestErr, guess
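
The double loop over users and films is what makes this function slow. The same gradient can be computed with two matrix products; here is a sketch of one vectorised update step (the name 'factorisationStep' is just illustrative, and this function isn't used below):

In []:
#a sketch of the gradient step using matrix products instead of the double loop
def factorisationStep(A, W, U, V, lr):
    err = (A - np.dot(U, V)) * W
    deltaU = np.dot(err, V.T)  #equivalent to accumulating err[i,j]*V[:,j] over j
    deltaV = np.dot(U.T, err)  #equivalent to accumulating err[i,j]*U[i,:] over i
    return U + deltaU * lr, V + deltaV * lr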

I'm impatient, and this code is NOT fast or clever, so I decided just to look at a subset of users vs a subset of films. I've printed the first 10x10 chunk of the matrix, illustrating its sparsity.

In [14]:
subsetSizeUsers = 400
subsetSizeMovies = 800
smalldata = data[0:subsetSizeUsers,0:subsetSizeMovies]
smallw = w[0:subsetSizeUsers,0:subsetSizeMovies]

np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)

print smalldata[0:10,0:10]
[[ 5.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  2.  0.  0.  0.  0.]
 [ 4.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  4.  0.  0.  0.  0.]
 [ 4.  0.  0.  3.  0.  0.  0.  0.  0.  0.]
 [ 5.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 5.  5.  0.  0.  0.  0.  4.  0.  0.  0.]]
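
To put a number on that sparsity, a quick check (not part of the original pipeline) is the fraction of elements in the subset that actually hold a rating:

In []:
#fraction of the 400x800 subset that has a rating
print("fraction rated: %0.3f" % smallw.mean())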

Run the filter...

In [15]:
U,V,TrainErr,TestErr, guess = factoriseMatrixMissingData(smalldata,smallw,20)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-15-8e0ab6f0f6f9> in <module>()
----> 1 U,V,TrainErr,TestErr, guess = factoriseMatrixMissingData(smalldata,smallw,20)

<ipython-input-4-2c99f3e64f2a> in factoriseMatrixMissingData(A, W, features)
     22             for j in range(A.shape[1]): #for each film...
     23                 deltaU[i,:] += err[i,j]*V[:,j];
---> 24                 deltaV[:,j] += err[i,j]*U[i,:];
     25         U += deltaU * lr;
     26         V += deltaV * lr;

KeyboardInterrupt: 
In []:
print guess[:10,:10]
print smalldata[:10,:10]

The predictions fit the training values nicely, but we need to do some leave-n-out validation to believe this.

A bit of Python fiddling gets us the xs and ys of the elements we know. We then iterate through these pairs, making a temporary copy of the weight matrix with the element in question removed (set to zero), try to estimate the value of this now-missing element, and see how close our guess is.

In [8]:
xs,ys = np.nonzero(smallw)
In [16]:
results = np.empty((0,2), float)
it = 0
for x,y in zip(xs,ys):
    print "%d" % it
    it += 1
    trainw = smallw.copy()
    trainw[x,y] = 0 #hide this one rating from the factorisation
    U,V,TrainErr,TestErr, guess = factoriseMatrixMissingData(smalldata,trainw,10)
    g = guess[x,y]      #the prediction for the held-out rating
    a = smalldata[x,y]  #the actual rating
    results = np.vstack([results,[g,a]])
    if (it>50):
        break
0
1
2
...
50

In [17]:
results
Out[17]:
array([[ 5.02,  5.  ],
       [ 3.05,  5.  ],
       [ 3.31,  5.  ],
       [ 5.51,  4.  ],
       [ 5.02,  1.  ],
       [ 1.99,  2.  ],
       [ 4.11,  5.  ],
       [ 4.5 ,  4.  ],
       [ 4.66,  3.  ],
       [ 5.33,  3.  ],
       [ 4.1 ,  4.  ],
       [ 2.68,  3.  ],
       [ 4.83,  5.  ],
       [ 4.28,  4.  ],
       [ 4.5 ,  5.  ],
       [ 5.11,  4.  ],
       [ 3.21,  5.  ],
       [ 3.22,  4.  ],
       [ 3.64,  5.  ],
       [ 2.95,  5.  ],
       [ 3.35,  2.  ],
       [ 3.36,  3.  ],
       [ 1.38,  1.  ],
       [ 3.8 ,  5.  ],
       [ 3.2 ,  4.  ],
       [ 4.08,  4.  ],
       [ 3.52,  3.  ],
       [ 2.44,  3.  ],
       [ 2.33,  4.  ],
       [ 2.89,  3.  ],
       [ 3.22,  5.  ],
       [ 2.24,  2.  ],
       [ 2.83,  2.  ],
       [ 2.1 ,  4.  ],
       [ 2.38,  4.  ],
       [ 4.01,  3.  ],
       [ 1.92,  2.  ],
       [ 2.21,  3.  ],
       [ 3.69,  3.  ],
       [ 0.24,  3.  ],
       [ 3.51,  3.  ],
       [ 3.78,  3.  ],
       [ 1.48,  2.  ],
       [ 3.88,  4.  ],
       [-1.15,  3.  ],
       [ 3.9 ,  3.  ],
       [ 4.55,  3.  ],
       [ 3.61,  4.  ],
       [ 3.78,  2.  ],
       [ 4.83,  1.  ],
       [ 2.56,  2.  ]])
In [18]:
np.mean(np.abs(np.round(results[:,0])-results[:,1]))
Out[18]:
1.1764705882352942
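
Another common way to report this is the root-mean-square error of the unrounded predictions; a quick sketch:

In []:
#RMSE of the unrounded predictions against the held-out ratings
print("RMSE: %0.3f" % np.sqrt(np.mean((results[:,0]-results[:,1])**2)))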
In [19]:
#THIS IS NOT HOW TO DO PERMUTATION TESTING, PLEASE IGNORE.
#null=np.mean(np.abs(np.random.permutation(results[:,0])-results[:,1]))
#np.mean(np.abs(np.round(np.random.permutation(results[:,0]))-results[:,1]))
Out[19]:
1.2549019607843137

Ideally we would do some permutation testing to see if we're doing better than chance. But that'll take ages.
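
One cheap sanity check we can do in the meantime is to shuffle the predictions we already have relative to the true ratings many times, and see where our actual error falls in that null distribution. This is only a sketch, reusing the 51 results from above rather than refitting anything:

In []:
#shuffle-based null: how often does a random pairing of predictions to ratings
#do at least as well as the real pairing?
real_err = np.mean(np.abs(np.round(results[:,0])-results[:,1]))
null_errs = []
for rep in range(1000):
    shuffled = np.random.permutation(results[:,0])
    null_errs.append(np.mean(np.abs(np.round(shuffled)-results[:,1])))
print("real error: %0.3f" % real_err)
print("proportion of shuffles at least as good: %0.3f" % np.mean(np.array(null_errs) <= real_err))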

In [25]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(results[:,0],results[:,1],'x') #predicted vs actual rating for each held-out element
plt.plot([1,5],[1,5]) #the line a perfect predictor would lie on
Out[25]:
[<matplotlib.lines.Line2D at 0x7f8c9e729810>]
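
Another quick way to look at the same results is a histogram of the prediction errors; a sketch:

In []:
#histogram of (predicted - actual) for the held-out ratings
plt.figure()
plt.hist(results[:,0]-results[:,1], bins=10)
plt.xlabel('predicted - actual rating')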