If not many people have rated a film, then learning that someone hasn't reviewed it shifts the posterior over their attributes very little — the "not rated" outcome is almost equally likely for everyone.
Quick example:
Suppose 1 in 100 young people and 3 in 100 old people have reviewed film A.
If we're told the user hasn't reviewed film A, we want the posterior over their age, \(a\): \(P(a|r_A=False)\).
Consider: \(P(a|r_A) = {{P(r_A|a) P(a)} \over {P(r_A)}} \propto P(r_A|a) P(a)\)
Assume the prior \(P(a)\) is flat. If they reviewed the film,
\(P(a=Young|r_A=True) \propto P(r_A=True | a=Young) \propto 1/100\)
\(P(a=Old|r_A=True) \propto P(r_A=True | a=Old) \propto 3/100\)
So \(P(a=Young|r_A=True) = {{1/100} \over {1/100 + 3/100}} = 1 / (1+3) = 1/4\)
Quite a strong signal.
If they didn't review the film,
\(P(a=Young|r_A=False) \propto P(r_A=False | a=Young) \propto 99/100\)
\(P(a=Old|r_A=False) \propto P(r_A=False | a=Old) \propto 97/100\)
So \(P(a=Young|r_A=False) = {{99/100} \over {99/100 + 97/100}} = 99 / (99+97) = 0.5051\)
Almost no signal.
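As a sanity check, the two updates above can be run numerically (a minimal sketch; `p_rate` just holds the toy 1/100 and 3/100 review rates from the example):

```python
# Toy check of the two Bayesian updates above, using the example's
# per-group review rates for film A: 1/100 (young) and 3/100 (old).
p_rate = {'young': 1 / 100.0, 'old': 3 / 100.0}

def posterior_young(rated):
    # Likelihood of the observation under each age, with a flat prior P(a),
    # so the posterior is just the normalised likelihood.
    lik = dict((a, p if rated else 1 - p) for a, p in p_rate.items())
    return lik['young'] / (lik['young'] + lik['old'])

print(posterior_young(True))   # "has reviewed": posterior ~0.25, a strong signal
print(posterior_young(False))  # "hasn't reviewed": ~0.505, almost no signal
```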
A big problem is that any given person in the database has reviewed far fewer films than they have actually seen.
import pandas as pd
# the '::' separator needs the python parsing engine
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', engine='python',
                        names=['user','movie','rating','time'])
users = pd.read_table('ml-1m/users.dat', sep='::', engine='python',
                      names=['user','gender','age','occupation','zip'])
movies = pd.read_table('ml-1m/movies.dat', sep='::', engine='python',
                       names=['movie','title','genre'])
movielens = pd.merge(pd.merge(ratings, users), movies)
# number of ratings per (movie, age-group) cell
count_ratings = movielens.pivot_table('rating', index=['movie','title'],
                                      columns=['age'], aggfunc='count')
maxmovie = count_ratings.max()     # the most-rated film's count, per age group
totalcount = count_ratings.sum()   # total number of ratings, per age group
1.0*maxmovie/totalcount
So in the MovieLens database even the most-rated film still accounts for only about 0.3% of the ratings in each age group.
In particular, the median number of films rated per user is 96.
count_ratings_by_user = movielens.pivot_table('rating', index=['user'], aggfunc='count')
count_ratings_by_user.median()
Maybe we could correct for this.
We currently have information from the movie database and the census.
In particular:
- the number of times each film has been rated by each age group, which gives us \(P(r_i | a)\) (where \(r_i\) is whether film \(i\) has been rated, and \(a\) is the person's age);
- the age distribution in the OA, which gives us \(P(a | l)\) (where \(l\) is the location).
\(P(a | r_i, l) = {{P(r_i | a, l) P(a | l)} \over {P(r_i | l)}}\)
We assume \(P(r_i | a, l) = P(r_i | a)\) (i.e. rating behaviour is conditionally independent of location, given age).
So:
\(P(a | r_i, l) \propto P(r_i | a) P(a | l)\)
We know \(P(r_i | a)\) from the MovieLens database and \(P(a | l)\) from the census.
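Putting the two sources together, here is a minimal sketch of the update. The numbers below are made-up illustration values, not actual MovieLens or census figures, and it extends the single-film formula to several films under the same conditional-independence assumption (naive-Bayes style):

```python
import numpy as np

# P(r_i | a): per-age probability of having rated each film i (illustrative values).
p_rated_given_age = np.array([[0.01, 0.03],   # film A: [young, old]
                              [0.05, 0.02]])  # film B: [young, old]
# P(a | l): age distribution in the user's output area (illustrative values).
p_age_given_loc = np.array([0.6, 0.4])        # [young, old]

def age_posterior(rated, p_r_a=p_rated_given_age, p_a_l=p_age_given_loc):
    """P(a | r, l) proportional to P(a | l) * prod_i P(r_i | a),
    assuming the r_i are conditionally independent given age."""
    rated = np.array(rated)[:, None]           # one row per film
    lik = np.where(rated, p_r_a, 1 - p_r_a)    # P(r_i | a) for each observation
    post = p_a_l * lik.prod(axis=0)            # prior times product of likelihoods
    return post / post.sum()                   # normalise

# A user who rated film B but not film A:
print(age_posterior([False, True]))  # ~ [0.79, 0.21]: probably young
```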
First, how to get the possible values of a field (useful if we want to generate questions automatically).
import json
import urllib2
# get the religion options (for all of England+Wales)
j = json.load(urllib2.urlopen('http://data.ons.gov.uk/ons/api/data/dataset/QS208EW.json?context=Census&geog=2011WARDH&dm/2011WARDH=K04000001&totals=false&apikey=cHkIiioOQX'))
# print json.dumps(j, indent=4)
descs = j['ons.dataPackage']['ons.structureData']['message.CodeLists']['structure.CodeList'][1]['structure.Code']
for d in descs:
    if type(d['structure.Description']) is list:
        for item in d['structure.Description']:
            if item['@xml.lang'] == 'en':
                print item['$']
List all possible "concepts" available:
#list all concepts...
j = json.load(open('census_concepts.json'))
for concept in j['ons']['conceptList']['concept']:
    print concept['names']['name'][0]['$']
Example of getting age info from the ONS:
import xml.etree.ElementTree as ET
import urllib2
from StringIO import StringIO
from zipfile import ZipFile
import pandas as pd
import numpy as np

def getAgeDist(geo, postcode):
    pathToONS = 'http://data.ons.gov.uk/ons/api/data/dataset/'
    dataSet = 'QS103EW'  # QS103EW = age by year
    apiKey = 'cHkIiioOQX'
    #TODO: Check if postcode's been resolved
    #TODO: Handle postcodes of the wrong form (i.e. not of 'S1 2AB').
    geoArea = geo[geo.PCD7 == postcode.upper()].OA11CD.values[0]  # output area code
    geographicalHierarchy = '2011STATH'
    url = ('%s%s/dwn.csv?context=Census&geog=%s&dm/%s=%s&totals=false&apikey=%s' %
           (pathToONS, dataSet, geographicalHierarchy, geographicalHierarchy, geoArea, apiKey))
    response = urllib2.urlopen(url)
    xml_data = response.read()
    root = ET.fromstring(xml_data)
    href = root[1][0][0].text  # TODO: get the path to the href using names, not indices
    url = urllib2.urlopen(href)
    zipfile = ZipFile(StringIO(url.read()))
    for filename in zipfile.namelist():
        if filename[-3:] == 'csv':
            # skip the metadata rows at the top of the ONS csv
            data = pd.read_csv(zipfile.open(filename),
                               skiprows=np.append(np.array(range(8)), 10))
    data = data.T
    popages = data[0].values[3:]
    return popages
geo = pd.read_csv('census_geography/PCD11_OA11_LSOA11_MSOA11_LAD11_EW_LU.csv');
getAgeDist(geo,'s6 1ab')