Experimentation: Accessing the ONS Census database, figuring out the psychic and other related things

How does NOT rating a film affect the result?

If only a few people have rated a film, then learning that someone has not rated it shifts our estimate of their probability very little.

Quick example:

Suppose 1 in 100 young people has reviewed film A, and 3 in 100 old people have.

If we're told that the user hasn't reviewed film A, we want the probability of their age \(a\): \(P(a|r_A=False)\)

Consider: \(P(a|r_A) = {{P(r_A|a) P(a)} \over {P(r_A)}} \propto P(r_A|a) P(a)\)

Assume \(P(a)\) is flat. If they reviewed the film,

\(P(a=Young|r_A=True) \propto P(r_A=True | a=Young) \propto 1/100\)

\(P(a=Old|r_A=True) \propto P(r_A=True | a=Old) \propto 3/100\)

So \(P(a=Young|r_A=True) = {{1/100} \over {1/100 + 3/100}} = 1 / (1+3) = 1/4\)

Quite a strong signal.

If they didn't review the film,

\(P(a=Young|r_A=False) \propto P(r_A=False | a=Young) \propto 99/100\)

\(P(a=Old|r_A=False) \propto P(r_A=False | a=Old) \propto 97/100\)

So \(P(a=Young|r_A=False) = {{99/100} \over {99/100 + 97/100}} = 99 / (99+97) \approx 0.5051\)

Almost no signal.
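The worked example above can be sketched in a few lines; the numbers are the toy ones from the text, with a flat prior over the two age groups:

```python
# Toy likelihoods from the example: P(r_A=True | a)
p_rated = {'young': 1 / 100.0, 'old': 3 / 100.0}

def posterior(rated):
    # Likelihood of the observation (rated or not) under each age group,
    # then normalise; the flat prior P(a) cancels out.
    like = {a: (p if rated else 1 - p) for a, p in p_rated.items()}
    total = sum(like.values())
    return {a: l / total for a, l in like.items()}

print(posterior(True))   # reviewed: young ~ 0.25, old ~ 0.75
print(posterior(False))  # not reviewed: young ~ 0.505, old ~ 0.495
```

As in the text, a rating is a strong signal while its absence barely moves the posterior.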

A big problem is that any given person in the database has reviewed far fewer films than they have actually seen.

In [36]:
import pandas as pd
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', engine='python',
                        names=['user','movie','rating','time'])
users = pd.read_table('ml-1m/users.dat', sep='::', engine='python',
                      names=['user','gender','age','occupation','zip'])
movies = pd.read_table('ml-1m/movies.dat', sep='::', engine='python',
                       names=['movie','title','genre'])
movielens = pd.merge(pd.merge(ratings, users), movies)
# Number of ratings per movie, split by age band
count_ratings = movielens.pivot_table('rating', index=['movie','title'],
                                      columns=['age'], aggfunc='count')
maxmovie = count_ratings.max()
totalcount = count_ratings.sum()
1.0*maxmovie/totalcount
Out[36]:
age
1      0.004116
18     0.003896
25     0.003372
35     0.003146
45     0.003085
50     0.003421
56     0.004745
dtype: float64

So in the MovieLens database, even the most-rated film accounts for only about 0.3&ndash;0.5% of each age group's ratings.

In particular, the median number of films rated per user is 96.

In [60]:
count_ratings_by_user = movielens.pivot_table('rating', index=['user'], aggfunc='count')
count_ratings_by_user.median()
Out[60]:
96.0

Maybe we could correct for this.
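One hedged possibility: if we assumed each user has seen some multiple \(k\) of the films they rated, we could inflate \(P(r_A=True|a)\) accordingly. A sketch, where the value of \(k\) is a made-up illustrative factor, not something measured from the data:

```python
# Illustrative only: inflate the rating probability by an assumed
# seen-to-rated ratio k, capping at 1. k=5.0 is a placeholder value.
def seen_prob(p_rated, k=5.0):
    return min(1.0, k * p_rated)

# With the toy numbers from the example above:
print(seen_prob(1 / 100.0))  # young
print(seen_prob(3 / 100.0))  # old
```

Whether a single global \(k\) is defensible (rating propensity may itself vary with age) is exactly the open question.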

Combining the probabilities... [this bit doesn't yet work]

We currently have information from the MovieLens database and the census.

In particular:

- the number of times the film has been rated by each age group, which gives us \(P(r_i | a)\) (where \(r_i\) indicates whether film \(i\) has been rated, and \(a\) is the person's age);
- the age distribution in the OA, which gives us \(P(a | l)\) (where \(l\) is location).

\(P(a | r_i, l) = {{P(r_i | a, l) P(a | l)} \over {P(r_i | l)}}\)

We assume \(P(r_i | a, l) = P(r_i | a)\) (i.e. rating behaviour is conditionally independent of location, given age).

So:

\(P(a | r_i, l) \propto P(r_i | a) P(a | l)\)

We know \(P(r_i | a)\) from the MovieLens database and \(P(a | l)\) from the census.
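The combination \(P(a | r_i, l) \propto P(r_i | a) P(a | l)\) is then just an elementwise product and a normalisation. A minimal sketch, where both arrays are made-up numbers over three age bands, not values from the real datasets:

```python
import numpy as np

# Toy inputs over the same age bands (illustrative numbers only):
p_r_given_a = np.array([0.01, 0.03, 0.05])  # P(r_i | a) from MovieLens
p_a_given_l = np.array([0.5, 0.3, 0.2])     # P(a | l) from the census

# Elementwise product, then normalise to get the posterior P(a | r_i, l)
unnorm = p_r_given_a * p_a_given_l
post = unnorm / unnorm.sum()
print(post)
```

With several films, the \(P(r_i|a)\) factors for each observed \(r_i\) would be multiplied in the same way before normalising (assuming conditional independence between films given age).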

Accessing the ONS Census data

First, how to get the possible values of a field (useful if we want to generate questions automatically).

In [70]:
import json
import urllib2

#get the religion options (for all of England+Wales)
j=json.load(urllib2.urlopen('http://data.ons.gov.uk/ons/api/data/dataset/QS208EW.json?context=Census&geog=2011WARDH&dm/2011WARDH=K04000001&totals=false&apikey=cHkIiioOQX'))

# Pull the English descriptions out of the second code list in the response
descs = j['ons.dataPackage']['ons.structureData']['message.CodeLists']['structure.CodeList'][1]['structure.Code']
for d in descs:
    if type(d['structure.Description']) is list:
        for item in d['structure.Description']:
            if item['@xml.lang'] == 'en':
                print item['$']
All categories: Religion
Christian
Buddhist
Hindu
Jewish
Muslim
Sikh
Other religion
No religion
Religion not stated

List all possible "concepts" available:

In [71]:
#list all concepts...
j = json.load(open('census_concepts.json'))
for concept in j['ons']['conceptList']['concept']:
    print concept['names']['name'][0]['$']
Family type
Marital and civil partnership status
Sex
Second address
Residence type
Population density
People sleeeping rough
Living arrangements
Household deprivation dimensions
Household composition
Dependent children
Age
Welsh language
Religion
Proficiency in English
Passports
National identity
Main language
Ethnic group
Country of birth
Provision of unpaid care
Long-term health problem or disability
General health
Tenure
Persons per room
Persons per bedroom
Occupancy rating
Number of rooms
Number of bedrooms
Household space status
Household size
Dwellings
Communal establishments
Central heating
Cars or vans
Accommodation type
Year last worked
Social grade
Occupation
National Statistics Socio-economic Classification
Industry
Hours worked
Economic activity
Method of travel to work
Year of arrival in the UK
Length of residence in the UK
Age of arrival in the UK
Highest level of qualification
Qualifications
Migration
Travel to work
Labour Market
Qualifications
Housing and Accommodation
Health
Ethnicity, Identity, Language and Religion
Demography

Example of getting age info from the ONS:

In [2]:
import xml.etree.ElementTree as ET
import urllib2
from StringIO import StringIO
from zipfile import ZipFile
import pandas as pd
import numpy as np

def getAgeDist(geo, postcode):
    pathToONS = 'http://data.ons.gov.uk/ons/api/data/dataset/'
    dataSet = 'QS103EW'  # QS103EW = age by single year
    apiKey = 'cHkIiioOQX'
    #TODO: Check if postcode's been resolved
    #TODO: Handle postcodes of wrong form (i.e. not of 'S1  2AB').
    geoArea = geo[geo.PCD7==postcode.upper()].OA11CD.values[0]  # output area code
    geographicalHierarchy = '2011STATH'

    url = ('%s%s/dwn.csv?context=Census&geog=%s&dm/%s=%s&totals=false&apikey=%s' %
           (pathToONS, dataSet, geographicalHierarchy, geographicalHierarchy, geoArea, apiKey))
    xml_data = urllib2.urlopen(url).read()
    root = ET.fromstring(xml_data)
    href = root[1][0][0].text  #TODO: get the path to the href using names, not indices

    # The API returns a link to a zip file containing the CSV of counts
    zipfile = ZipFile(StringIO(urllib2.urlopen(href).read()))
    for filename in zipfile.namelist():
        if filename[-3:] == 'csv':
            # skip the metadata/header rows of the ONS CSV
            data = pd.read_csv(zipfile.open(filename),
                               skiprows=np.append(np.array(range(8)), 10))

    data = data.T
    popages = data[0].values[3:]  # per-year population counts for the OA
    return popages

geo = pd.read_csv('census_geography/PCD11_OA11_LSOA11_MSOA11_LAD11_EW_LU.csv');
getAgeDist(geo,'s6  1ab')
Out[2]:
array([5, 5, 2, 1, 4, 6, 4, 2, 0, 3, 2, 1, 1, 1, 2, 1, 1, 5, 3, 5, 3, 1, 5,
       3, 6, 1, 5, 3, 2, 4, 2, 1, 1, 5, 1, 2, 2, 2, 1, 1, 0, 2, 3, 3, 3, 1,
       5, 1, 7, 6, 3, 11, 7, 5, 7, 1, 0, 2, 2, 2, 3, 4, 1, 1, 3, 3, 3, 1,
       2, 2, 1, 5, 0, 2, 1, 0, 4, 3, 0, 0, 3, 2, 0, 0, 0, 2, 0, 2, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=object)
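To plug this into \(P(a | l)\) above, the per-year counts need normalising and, if we want to match MovieLens, binning into its age bands. A sketch, assuming the bin edges follow MovieLens's published bands (Under 18, 18&ndash;24, 25&ndash;34, 35&ndash;44, 45&ndash;49, 50&ndash;55, 56+) and `popages` is the array returned above:

```python
import numpy as np

def age_dist_to_prior(popages):
    # Convert per-year Output Area counts into P(a | l), binned to the
    # seven MovieLens age bands (edges assumed from the MovieLens README).
    counts = np.asarray(popages, dtype=float)
    edges = [0, 18, 25, 35, 45, 50, 56, len(counts)]
    binned = np.array([counts[lo:hi].sum()
                       for lo, hi in zip(edges[:-1], edges[1:])])
    return binned / binned.sum()

# e.g. with a toy flat population of one person per year of age:
print(age_dist_to_prior(np.ones(100)))
```

The result lines up index-for-index with the seven age columns in the `count_ratings` pivot table earlier.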