Lesson 1: (Detour) Python, pandas and Movie Ratings (MovieLens)

This and following notebooks are principally for me to remind myself how to do some basic things with pandas, python and Bayes, and to try out various methods and ideas.

Before we even start to get anywhere, the first two notebooks are detours into 1) pandas and 2) SVD

The standard example used in explaining matrix factorisation is that of ratings of films. Imagine you have a million users and a thousand films. Each user has reviewed a few dozen films. The task is usually thought of as trying to predict what rating users would have given films that they have not yet rated.

Before we worry about that task, this notebook takes a detour into the data, how to process it, and what it consists of.

The data we'll use in this example is the MovieLens dataset (see http://files.grouplens.org/datasets/movielens/ )

The first part of this discussion is a rehashing of several helpful and detailed tutorials, including Greg Reda's http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/ and Bengfort's https://districtdatalabs.silvrback.com/computing-a-bayesian-estimate-of-star-rating-means and others! (sorry, can't remember where this all came from).

Loading the data

The movielens data consists of movies.dat, ratings.dat and users.dat.

As an example, the ratings file looks like:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
...etc

We know what the columns mean from the readme file (user, movie, rating and the time/date when the rating was given).

In [36]:
import pandas as pd
ratings = pd.read_table('ml-1m/ratings.dat',sep='::',names=['user','movie','rating','time'])
users = pd.read_table('ml-1m/users.dat',sep='::',names=['user','gender','age','occupation','zip'])
movies = pd.read_table('ml-1m/movies.dat',sep='::',names=['movie','title','genre'])

We can merge these three tables, pandas is clever enough to realise that columns with the same name need to be joined.

In [3]:
movielens = pd.merge(pd.merge(ratings,users),movies)

Accessing the data

A few examples of messing about with this table:

  1. How many were rated by each gender?

Some films have very few ratings...
In [42]:
mean_ratings = movielens.pivot_table('rating',rows=['title'],cols='gender',aggfunc='mean')
temp = movielens.pivot_table('rating',rows=['title'],cols='gender',aggfunc='count')
temp[:10]
Out[42]:
gender F M
title
$1,000,000 Duck (1971) 16 21
'Night Mother (1986) 36 34
'Til There Was You (1997) 37 15
'burbs, The (1989) 92 211
...And Justice for All (1979) 35 164
1-900 (1994) 1 1
10 Things I Hate About You (1999) 232 468
101 Dalmatians (1961) 187 378
101 Dalmatians (1996) 150 214
12 Angry Men (1957) 141 475

10 rows × 2 columns

  1. Just use films with sufficient numbers of reviews.

Let's just get those titles which have at least 250 reviews...

In [30]:
ratings_count = movielens.groupby('title').size()
len(ratings_count.index[ratings_count>250])
Out[30]:
1214

  1. Mean and standard deviation

Find the mean and standard deviation of all the films. Note we can get more detail by adding:

rows=['movie']
or
rows=['user']
or
rows=['age']

etc....

In [44]:
print movielens.pivot_table('rating',cols='gender',rows=['age'],aggfunc='mean')
print movielens.pivot_table('rating',cols='gender',rows=['age'],aggfunc='std')
gender         F         M
age                       
1       3.616291  3.517461
18      3.453145  3.525476
25      3.606700  3.526780
35      3.659653  3.604434
45      3.663044  3.627942
50      3.797110  3.687098
56      3.915534  3.720327

[7 rows x 2 columns]
gender         F         M
age                       
1       1.192325  1.214797
18      1.177290  1.161670
25      1.106069  1.132786
35      1.076955  1.078132
45      1.072365  1.062387
50      1.033605  1.069039
56      1.036587  1.066283

[7 rows x 2 columns]

  1. Gender preferences...

We can see what films men and women prefer

In [45]:
topM = mean_ratings[ratings_count>250].sort_index(by='M', ascending=False)[:10]
topF = mean_ratings[ratings_count>250].sort_index(by='F', ascending=False)[:10]
In [46]:
topM
Out[46]:
gender F M
title
Godfather, The (1972) 4.314700 4.583333
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.481132 4.576628
Shawshank Redemption, The (1994) 4.539075 4.560625
Raiders of the Lost Ark (1981) 4.332168 4.520597
Usual Suspects, The (1995) 4.513317 4.518248
Star Wars: Episode IV - A New Hope (1977) 4.302937 4.495307
Schindler's List (1993) 4.562602 4.491415
Wrong Trousers, The (1993) 4.588235 4.478261
Close Shave, A (1995) 4.644444 4.473795
Rear Window (1954) 4.484536 4.472991

10 rows × 2 columns

In [47]:
topF
Out[47]:
gender F M
title
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248

10 rows × 2 columns

  1. Or look at the most divisive films

Female preferred:

In [48]:
ratings_active = mean_ratings[ratings_count>250]
ratings_active['diff'] = (ratings_active['F']-ratings_active['M'])
ratings_active.sort_index(by='diff',ascending=False)[:5]
Out[48]:
gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 0.676359
Grease (1978) 3.975265 3.367041 0.608224
Little Women (1994) 3.870588 3.321739 0.548849
Steel Magnolias (1989) 3.901734 3.365957 0.535777

5 rows × 3 columns

Male preferred:

In [49]:
ratings_active = mean_ratings[ratings_count>250]
ratings_active['diff'] = (ratings_active['M']-ratings_active['F'])
ratings_active.sort_index(by='diff',ascending=False)[:5]
Out[49]:
gender F M diff
title
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787

5 rows × 3 columns

Next, matrix factorisation...