This and the following notebooks are principally for me to remind myself how to do some basic things with pandas, Python and Bayes, and to try out various methods and ideas.
Before we get anywhere, the first two notebooks are detours into 1) pandas and 2) SVD.
The standard example used in explaining matrix factorisation is that of ratings of films. Imagine you have a million users and a thousand films. Each user has reviewed a few dozen films. The task is usually thought of as trying to predict what rating users would have given films that they have not yet rated.
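To get a feel for how sparse that ratings matrix is, here is a rough back-of-the-envelope calculation. The figure of 36 ratings per user is only my stand-in for "a few dozen"; it is an assumption, not a number from any dataset.

```python
# Size of the hypothetical users-by-films ratings matrix described above.
n_users = 1_000_000
n_films = 1_000

# Assumption: "a few dozen" ratings per user is taken to be 36.
ratings_per_user = 36

cells = n_users * n_films            # every possible (user, film) pair
known = n_users * ratings_per_user   # pairs that actually have a rating
fraction_known = known / cells

print(f"{fraction_known:.1%} of the matrix is observed")
```

Under that assumption only a few percent of the entries are known, and the prediction task is about filling in the rest.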
Before we worry about that task, this notebook takes a detour into the data, how to process it, and what it consists of.
The data we'll use in this example is the MovieLens dataset (see http://files.grouplens.org/datasets/movielens/).
The first part of this discussion is a rehashing of several helpful and detailed tutorials, including Greg Reda's http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/ and Bengfort's https://districtdatalabs.silvrback.com/computing-a-bayesian-estimate-of-star-rating-means and others! (sorry, can't remember where this all came from).
The MovieLens data consists of three files: movies.dat, ratings.dat and users.dat.
As an example, the ratings file looks like:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
...etc
We know what the columns mean from the readme file (user, movie, rating and the time/date when the rating was given).
import pandas as pd
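# '::' is a multi-character separator, so the Python parsing engine is needed
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', engine='python', names=['user','movie','rating','time'])
users = pd.read_table('ml-1m/users.dat', sep='::', engine='python', names=['user','gender','age','occupation','zip'])
# some titles in movies.dat are not valid UTF-8 (latin-1 is assumed here)
movies = pd.read_table('ml-1m/movies.dat', sep='::', engine='python', encoding='latin-1', names=['movie','title','genre'])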
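If you don't have the files to hand, the parsing can be tried on the three sample lines shown earlier. This is just a sketch: the `ratings_sample` name and the extra `when` column are mine, and the conversion assumes the timestamps are seconds since the Unix epoch, as the MovieLens readme describes.

```python
import io
import pandas as pd

# The three sample lines from the ratings file excerpt.
sample = "1::1193::5::978300760\n1::661::3::978302109\n1::914::3::978301968\n"

# A multi-character separator such as '::' needs engine='python'.
ratings_sample = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    engine="python",
    names=["user", "movie", "rating", "time"],
)

# Assumption: 'time' is seconds since the Unix epoch; convert it to a datetime.
ratings_sample["when"] = pd.to_datetime(ratings_sample["time"], unit="s")

print(ratings_sample)
```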
We can merge these three tables. With no key specified, pandas joins on the columns that share a name in both tables ('user' for ratings and users, then 'movie' for the result and movies).
movielens = pd.merge(pd.merge(ratings,users),movies)
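To check what the implicit join is doing, here is a small sketch on made-up tables (the `*_toy` frames and their contents are invented for illustration). It compares the implicit merge with one where the key columns are named explicitly.

```python
import pandas as pd

# Tiny invented tables with the same shared column names as the real data.
ratings_toy = pd.DataFrame({"user": [1, 1, 2], "movie": [10, 20, 10], "rating": [5, 3, 4]})
users_toy = pd.DataFrame({"user": [1, 2], "gender": ["F", "M"]})
movies_toy = pd.DataFrame({"movie": [10, 20], "title": ["Film A", "Film B"]})

# With no 'on' argument, merge joins on the columns the two frames share.
implicit = pd.merge(pd.merge(ratings_toy, users_toy), movies_toy)

# Naming the keys gives the same result and documents the intent.
explicit = ratings_toy.merge(users_toy, on="user").merge(movies_toy, on="movie")

print(implicit)
```

Naming the keys is a bit more typing, but it guards against an accidental join if two tables ever share a column name that is not meant to be a key.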
A few examples of messing about with this table:
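# mean rating per title, one column per gender (newer pandas uses index/columns instead of rows/cols)
mean_ratings = movielens.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
# number of ratings per title and gender
temp = movielens.pivot_table('rating', index='title', columns='gender', aggfunc='count')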
temp[:10]
Let's just get those titles which have more than 250 ratings...
ratings_count = movielens.groupby('title').size()
len(ratings_count.index[ratings_count>250])
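The same count-then-filter pattern on a made-up table, as a sketch (the titles and the threshold of 2 are invented). It also shows that `>` and `>=` pick out different sets, which matters if you want "at least" rather than "more than".

```python
import pandas as pd

# Invented data: film A has 3 ratings, B has 2, C has 1.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Number of ratings per title.
counts = toy.groupby("title").size()

# Strictly more than 2 ratings versus at least 2 ratings.
more_than_two = counts.index[counts > 2]
at_least_two = counts.index[counts >= 2]

print(list(more_than_two), list(at_least_two))
```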
Find the mean and standard deviation of the ratings, here broken down by age group and gender. Note we can group by something else by changing the index argument, for example:
index=['movie'] or
index=['user'] or
index=['age']
etc....
print(movielens.pivot_table('rating', columns='gender', index=['age'], aggfunc='mean'))
print(movielens.pivot_table('rating', columns='gender', index=['age'], aggfunc='std'))
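For reference, a pivot table of this kind is a groupby followed by an unstack. A small sketch on invented data (the `toy` frame is made up), using the keyword names that current pandas expects:

```python
import pandas as pd

# Invented ratings with an age group and a gender for each row.
toy = pd.DataFrame({
    "age": [18, 18, 18, 25, 25, 25],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "rating": [4, 3, 5, 2, 4, 5],
})

# Mean rating per age group (rows) and gender (columns).
by_age = toy.pivot_table(values="rating", index="age", columns="gender", aggfunc="mean")

# The same table built by hand: group, aggregate, then move gender to columns.
same = toy.groupby(["age", "gender"])["rating"].mean().unstack("gender")

print(by_age)
```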
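We can see which films men and women rate most highly (among titles with more than 250 ratings):
# sort_values replaces the old sort_index(by=...)
topM = mean_ratings[ratings_count>250].sort_values(by='M', ascending=False)[:10]
topF = mean_ratings[ratings_count>250].sort_values(by='F', ascending=False)[:10]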
topM
topF
Female preferred:
# copy so that adding a column does not write into a slice of mean_ratings
ratings_active = mean_ratings[ratings_count>250].copy()
ratings_active['diff'] = (ratings_active['F']-ratings_active['M'])
ratings_active.sort_values(by='diff',ascending=False)[:5]
Male preferred:
# copy so that adding a column does not write into a slice of mean_ratings
ratings_active = mean_ratings[ratings_count>250].copy()
ratings_active['diff'] = (ratings_active['M']-ratings_active['F'])
ratings_active.sort_values(by='diff',ascending=False)[:5]
Next, matrix factorisation...