MovieLens Case Study with Python

Problem Objective:

Perform the analysis using Exploratory Data Analysis (EDA) techniques. Find the features that affect the rating of a movie and build a model to predict movie ratings. Domain: Entertainment

Dataset Description:

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

Analysis Tasks to be performed:

1. Import the three datasets

In [1]:
import numpy as np #import the numpy library for numerical operations
import pandas as pd #import the pandas library for data wrangling operations
In [2]:
colnames_movies = ['MovieID','Title','Genres']
df_movies = pd.read_csv('movies.dat', delimiter='::', engine='python', names=colnames_movies) #open the movies.dat datafile
df_movies.head()
Out[2]:
   MovieID  Title                               Genres
0        1  Toy Story (1995)                    Animation|Children's|Comedy
1        2  Jumanji (1995)                      Adventure|Children's|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama
4        5  Father of the Bride Part II (1995)  Comedy
In [3]:
df_movies.shape #movies.dat has 3883 observations & 3 variables
Out[3]:
(3883, 3)
In [4]:
colnames_ratings=['UserID','MovieID','Rating','Timestamp']
df_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python', names=colnames_ratings) #open the ratings.dat datafile
df_ratings.head()
Out[4]:
   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291
In [5]:
df_ratings.shape
Out[5]:
(1000209, 4)
In [6]:
colnames_users=['UserID','Gender','Age','Occupation','Zip-code']
df_users = pd.read_csv('users.dat', delimiter='::', engine='python', names=colnames_users) #open the users.dat datafile
df_users.head()
Out[6]:
   UserID  Gender  Age  Occupation  Zip-code
0       1  F         1          10     48067
1       2  M        56          16     70072
2       3  M        25          15     55117
3       4  M        45           7     02460
4       5  M        25          20     55455
In [7]:
df_users.shape
Out[7]:
(6040, 5)
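
The three loads above all follow the same pattern. As a minimal sketch of that pattern, the snippet below reads a few '::'-separated lines from an in-memory string instead of a file. The rows are copied from the users.dat preview above; the name df_sample and the dtype argument are additions for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Three lines in the same '::'-separated layout as users.dat
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

colnames_users = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# A multi-character separator needs the Python parsing engine.
# dtype=str keeps the leading zero in zip codes such as 02460.
df_sample = pd.read_csv(sample, sep='::', engine='python',
                        names=colnames_users, dtype={'Zip-code': str})
print(df_sample)
```

Reading the zip code as a string is a design choice: it is an identifier rather than a quantity, so no arithmetic is lost and leading zeros survive.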

Create a new dataset [Master_Data] with the following columns:

MovieID, Title, UserID, Age, Gender, Occupation, Rating.

(Hint: (i) Merge two tables at a time. (ii) Merge the tables using the two keys MovieID and UserID.)

In [8]:
#merge the ratings dataframe with the movies dataframe on column MovieID
#(arguments left at their pandas defaults are omitted)
df_ratings_movies = pd.merge(df_ratings, df_movies, how='inner', on='MovieID', sort=True)
df_ratings_movies.shape
Out[8]:
(1000209, 6)
In [9]:
df_ratings_movies.head()
Out[9]:
   UserID  MovieID  Rating  Timestamp  Title             Genres
0       1        1       5  978824268  Toy Story (1995)  Animation|Children's|Comedy
1       6        1       4  978237008  Toy Story (1995)  Animation|Children's|Comedy
2       8        1       4  978233496  Toy Story (1995)  Animation|Children's|Comedy
3       9        1       5  978225952  Toy Story (1995)  Animation|Children's|Comedy
4      10        1       5  978226474  Toy Story (1995)  Animation|Children's|Comedy
In [10]:
#now merge the users dataframe with the ratings_movies dataframe on column UserID
#(arguments left at their pandas defaults are omitted)
df_ratings_movies_users = pd.merge(df_ratings_movies, df_users, how='inner', on='UserID', sort=True)
df_ratings_movies_users.shape
Out[10]:
(1000209, 10)
In [11]:
df_ratings_movies_users.head()
Out[11]:
   UserID  MovieID  Rating  Timestamp  Title                                      Genres                                Gender  Age  Occupation  Zip-code
0       1        1       5  978824268  Toy Story (1995)                           Animation|Children's|Comedy           F         1          10     48067
1       1       48       5  978824351  Pocahontas (1995)                          Animation|Children's|Musical|Romance  F         1          10     48067
2       1      150       5  978301777  Apollo 13 (1995)                           Drama                                 F         1          10     48067
3       1      260       4  978300760  Star Wars: Episode IV - A New Hope (1977)  Action|Adventure|Fantasy|Sci-Fi       F         1          10     48067
4       1      527       5  978824195  Schindler's List (1993)                    Drama|War                             F         1          10     48067
In [12]:
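#keep the requested columns plus Genres, which is needed later for feature engineering
#.copy() makes an independent dataframe so that later column assignments do not raise SettingWithCopyWarning
df_master_data = df_ratings_movies_users[['MovieID', 'Title', 'UserID', 'Age', 'Gender', 'Occupation', 'Rating','Genres']].copy()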
In [13]:
df_master_data.head()
Out[13]:
   MovieID  Title                                      UserID  Age  Gender  Occupation  Rating  Genres
0        1  Toy Story (1995)                                1    1  F               10       5  Animation|Children's|Comedy
1       48  Pocahontas (1995)                               1    1  F               10       5  Animation|Children's|Musical|Romance
2      150  Apollo 13 (1995)                                1    1  F               10       5  Drama
3      260  Star Wars: Episode IV - A New Hope (1977)       1    1  F               10       4  Action|Adventure|Fantasy|Sci-Fi
4      527  Schindler's List (1993)                         1    1  F               10       5  Drama|War
In [14]:
df_master_data.shape
Out[14]:
(1000209, 8)
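
The two-step merge can also be written as one chain. The sketch below uses three tiny made-up frames (ratings, movies and users are illustrative stand-ins, not the real data) and adds the validate argument, which the original notebook does not use, to check that each key is unique in the lookup table.

```python
import pandas as pd

# Tiny stand-ins for the three tables (values are illustrative)
ratings = pd.DataFrame({'UserID': [1, 1, 2], 'MovieID': [1, 48, 1], 'Rating': [5, 5, 4]})
movies = pd.DataFrame({'MovieID': [1, 48],
                       'Title': ['Toy Story (1995)', 'Pocahontas (1995)']})
users = pd.DataFrame({'UserID': [1, 2], 'Gender': ['F', 'M'],
                      'Age': [1, 56], 'Occupation': [10, 16]})

# validate='many_to_one' raises MergeError if a key is duplicated on the right side
master = (ratings
          .merge(movies, on='MovieID', how='inner', validate='many_to_one')
          .merge(users, on='UserID', how='inner', validate='many_to_one'))

# Keep only the columns requested for Master_Data
master = master[['MovieID', 'Title', 'UserID', 'Age', 'Gender', 'Occupation', 'Rating']]
print(master)
```

With an inner join and unique keys on the right, the row count of the result equals the row count of the ratings table, which matches the 1,000,209 rows seen above.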

Explore the datasets using visual representations (graphs or tables) and include your comments on the following:

1. User Age Distribution

In [15]:
from matplotlib import pyplot as plt #import the matplotlib pyplot subpackage
In [16]:
plt.hist(df_master_data.Age, bins = 7)
plt.show()
#the histogram below shows that age group 25 (i.e. 25-34 years) submitted the most ratings
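
The Age column holds a handful of category codes rather than exact ages, so counting each code directly can be easier to read than a histogram. With seven equal-width bins over the range 1 to 56, codes 50 and 56 land in the same bin and one bin stays empty. The sketch below is one possible alternative; the label mapping follows the MovieLens 1M README, and the ages series is made-up illustrative data standing in for df_master_data['Age'].

```python
import pandas as pd

# Age codes as documented in the MovieLens 1M README
age_labels = {1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
              45: '45-49', 50: '50-55', 56: '56+'}

# Illustrative values only; in the notebook this would be df_master_data['Age']
ages = pd.Series([25, 25, 18, 35, 25, 56, 1, 50])

# Count each code, keep the codes in age order, then attach readable labels
age_counts = ages.value_counts().sort_index().rename(index=age_labels)
print(age_counts)
```

Calling age_counts.plot(kind='bar') on the result would draw one bar per age group.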

2. User rating of the movie “Toy Story”

In [17]:
df_bytitle = df_master_data.groupby('Title')
In [18]:
df_bytitle.get_group('Toy Story (1995)')
Out[18]:
        MovieID  Title             UserID  Age  Gender  Occupation  Rating  Genres
0             1  Toy Story (1995)       1    1  F               10       5  Animation|Children's|Comedy
452           1  Toy Story (1995)       6   50  F                9       4  Animation|Children's|Comedy
554           1  Toy Story (1995)       8   25  M               12       4  Animation|Children's|Comedy
693           1  Toy Story (1995)       9   25  M               17       5  Animation|Children's|Comedy
799           1  Toy Story (1995)      10   35  F                1       5  Animation|Children's|Comedy
...         ...  ...                  ...  ...  ...            ...     ...  ...
997248        1  Toy Story (1995)    6022   25  M               17       5  Animation|Children's|Comedy
997538        1  Toy Story (1995)    6025   25  F                1       5  Animation|Children's|Comedy
998170        1  Toy Story (1995)    6032   45  M                7       4  Animation|Children's|Comedy
998355        1  Toy Story (1995)    6035   25  F                1       4  Animation|Children's|Comedy
999868        1  Toy Story (1995)    6040   25  M                6       3  Animation|Children's|Comedy

2077 rows × 8 columns

In [19]:
#Get the average ratings for all the movies
df_master_data.groupby(['Title'])['Rating'].mean()
Out[19]:
Title
$1,000,000 Duck (1971)                        3.027027
'Night Mother (1986)                          3.371429
'Til There Was You (1997)                     2.692308
'burbs, The (1989)                            2.910891
...And Justice for All (1979)                 3.713568
                                                ...   
Zed & Two Noughts, A (1985)                   3.413793
Zero Effect (1998)                            3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995)    3.500000
Zeus and Roxanne (1997)                       2.521739
eXistenZ (1999)                               3.256098
Name: Rating, Length: 3706, dtype: float64
In [20]:
#get the average user rating for the movie - Toy Story (1995)
np.average(df_bytitle.get_group('Toy Story (1995)').Rating)
#shows that Toy Story (1995) has an average user rating of about 4.15
Out[20]:
4.146846413095811
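
Besides the mean, the spread of ratings for a single title can be informative. The sketch below uses a boolean mask instead of a groupby object; the frame df holds made-up rows and stands in for df_master_data.

```python
import pandas as pd

# Illustrative rows; in the notebook this would be df_master_data
df = pd.DataFrame({
    'Title': ['Toy Story (1995)'] * 4 + ['Jumanji (1995)'] * 2,
    'Rating': [5, 4, 4, 3, 3, 2],
})

# A boolean mask selects one title without building a groupby object first
toy_story = df.loc[df['Title'] == 'Toy Story (1995)', 'Rating']

print(toy_story.mean())                        # average rating for the title
print(toy_story.value_counts().sort_index())   # number of ratings per star value
```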

3. Top 25 movies by viewership rating

In [21]:
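#rank every movie by its mean rating, highest first
#note: the result holds all 3706 rated titles; Top25_Movies.head(25) keeps only the first 25
Top25_Movies = df_master_data.groupby(['MovieID','Title'])['Rating'].mean().sort_values(ascending=False)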
In [22]:
Top25_Movies
Out[22]:
MovieID  Title                                                     
3382     Song of Freedom (1936)                                        5.0
3172     Ulysses (Ulisse) (1954)                                       5.0
3607     One Little Indian (1973)                                      5.0
3656     Lured (1947)                                                  5.0
3280     Baby, The (1973)                                              5.0
                                                                      ... 
3228     Wirey Spindell (1999)                                         1.0
3651     Blood Spattered Bride, The (La Novia Ensangrentada) (1972)    1.0
641      Little Indian, Big City (Un indien dans la ville) (1994)      1.0
1165     Bloody Child, The (1996)                                      1.0
1430     Underworld (1997)                                             1.0
Name: Rating, Length: 3706, dtype: float64
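
The ranking above lists every title, and a title with a single 5-star rating ties for first place. One possible refinement, sketched below on made-up data, is to require a minimum number of ratings before ranking and then keep the first N rows. The threshold min_ratings and the names stats and top_movies are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Illustrative ratings; in the notebook this would be df_master_data
df = pd.DataFrame({
    'Title': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Rating': [5, 4, 5, 5, 3, 4, 4],
})

min_ratings = 3   # hypothetical threshold; tune it for the real data
top_n = 2         # would be 25 for the task above

# Mean rating and number of ratings per title
stats = df.groupby('Title')['Rating'].agg(['mean', 'count'])

# Drop thinly rated titles, then rank by mean and keep the first top_n rows
top_movies = (stats[stats['count'] >= min_ratings]
              .sort_values('mean', ascending=False)
              .head(top_n))
print(top_movies)
```

Here title B has a perfect mean but only one rating, so it is filtered out before ranking.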

4. Find the ratings for all the movies reviewed by a particular user (UserID = 2696)

In [23]:
df_byuserID = df_master_data.groupby('UserID') 
In [24]:
df_byuserID.get_group(2696) #the output is not very readable; let us try an alternative method
Out[24]:
        MovieID  Title                                           UserID  Age  Gender  Occupation  Rating  Genres
440667      350  Client, The (1994)                                2696   25  M                7       3  Drama|Mystery|Thriller
440668      800  Lone Star (1996)                                  2696   25  M                7       5  Drama|Mystery
440669     1092  Basic Instinct (1992)                             2696   25  M                7       4  Mystery|Thriller
440670     1097  E.T. the Extra-Terrestrial (1982)                 2696   25  M                7       3  Children's|Drama|Fantasy|Sci-Fi
440671     1258  Shining, The (1980)                               2696   25  M                7       4  Horror
440672     1270  Back to the Future (1985)                         2696   25  M                7       2  Comedy|Sci-Fi
440673     1589  Cop Land (1997)                                   2696   25  M                7       3  Crime|Drama|Mystery
440674     1617  L.A. Confidential (1997)                          2696   25  M                7       4  Crime|Film-Noir|Mystery|Thriller
440675     1625  Game, The (1997)                                  2696   25  M                7       4  Mystery|Thriller
440676     1644  I Know What You Did Last Summer (1997)            2696   25  M                7       2  Horror|Mystery|Thriller
440677     1645  Devil's Advocate, The (1997)                      2696   25  M                7       4  Crime|Horror|Mystery|Thriller
440678     1711  Midnight in the Garden of Good and Evil (1997)    2696   25  M                7       4  Comedy|Crime|Drama|Mystery
440679     1783  Palmetto (1998)                                   2696   25  M                7       4  Film-Noir|Mystery|Thriller
440680     1805  Wild Things (1998)                                2696   25  M                7       4  Crime|Drama|Mystery|Thriller
440681     1892  Perfect Murder, A (1998)                          2696   25  M                7       4  Mystery|Thriller
440682     2338  I Still Know What You Did Last Summer (1998)      2696   25  M                7       2  Horror|Mystery|Thriller
440683     2389  Psycho (1998)                                     2696   25  M                7       4  Crime|Horror|Thriller
440684     2713  Lake Placid (1999)                                2696   25  M                7       1  Horror|Thriller
440685     3176  Talented Mr. Ripley, The (1999)                   2696   25  M                7       4  Drama|Mystery|Thriller
440686     3386  JFK (1991)                                        2696   25  M                7       1  Drama|Mystery
In [25]:
#alternative method
df_master_data[df_master_data['UserID'] == 2696][['Title','Rating']]
Out[25]:
        Title                                           Rating
440667  Client, The (1994)                                   3
440668  Lone Star (1996)                                     5
440669  Basic Instinct (1992)                                4
440670  E.T. the Extra-Terrestrial (1982)                    3
440671  Shining, The (1980)                                  4
440672  Back to the Future (1985)                            2
440673  Cop Land (1997)                                      3
440674  L.A. Confidential (1997)                             4
440675  Game, The (1997)                                     4
440676  I Know What You Did Last Summer (1997)               2
440677  Devil's Advocate, The (1997)                         4
440678  Midnight in the Garden of Good and Evil (1997)       4
440679  Palmetto (1998)                                      4
440680  Wild Things (1998)                                   4
440681  Perfect Murder, A (1998)                             4
440682  I Still Know What You Did Last Summer (1998)         2
440683  Psycho (1998)                                        4
440684  Lake Placid (1999)                                   1
440685  Talented Mr. Ripley, The (1999)                      4
440686  JFK (1991)                                           1

Feature Engineering

Find out all the unique genres (Hint: split the data in column genre making a list and then process the data to find out only the unique categories of genres)

In [26]:
df_master_data.head()
Out[26]:
   MovieID  Title                                      UserID  Age  Gender  Occupation  Rating  Genres
0        1  Toy Story (1995)                                1    1  F               10       5  Animation|Children's|Comedy
1       48  Pocahontas (1995)                               1    1  F               10       5  Animation|Children's|Musical|Romance
2      150  Apollo 13 (1995)                                1    1  F               10       5  Drama
3      260  Star Wars: Episode IV - A New Hope (1977)       1    1  F               10       4  Action|Adventure|Fantasy|Sci-Fi
4      527  Schindler's List (1993)                         1    1  F               10       5  Drama|War
In [27]:
#a hard-coded list works but is brittle: it must be kept in sync with the data by hand
#("Children's" is written with double quotes so that the apostrophe is kept)
genres = ['Action','Adventure','Animation',"Children's",'Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir',
          'Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
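
Following the hint above, the unique genres can also be derived from the data instead of being typed in. The sketch below splits each pipe-separated string, flattens the resulting lists and removes duplicates. The genres_col series is a small illustrative stand-in for df_master_data['Genres'].

```python
import pandas as pd

# Illustrative values in the same pipe-separated layout as the Genres column
genres_col = pd.Series([
    "Animation|Children's|Comedy",
    "Drama|War",
    "Comedy|Romance",
])

# Split each string into a list, flatten with explode, then de-duplicate
unique_genres = sorted(genres_col.str.split('|').explode().unique())
print(unique_genres)
```

On the full dataset it is likely cheaper to run this on the distinct Genres strings first (for example after drop_duplicates), since many rows repeat the same combination.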

Create a separate column for each genre category with a one-hot encoding (1 and 0) indicating whether or not the movie belongs to that genre.

In [28]:
#a more concise way with pandas: str.get_dummies splits on "|" and returns one 0/1 dummy column per genre
df_genres = df_master_data['Genres'].str.get_dummies("|")
In [29]:
df_genres
Out[29]:
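         Action  Adventure  Animation  Children's  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0             0          0          1           1       1      0            0      0        0          0       0        0        0        0       0         0    0        0
1             0          0          1           1       0      0            0      0        0          0       0        1        0        1       0         0    0        0
2             0          0          0           0       0      0            0      1        0          0       0        0        0        0       0         0    0        0
3             1          1          0           0       0      0            0      0        1          0       0        0        0        0       1         0    0        0
4             0          0          0           0       0      0            0      1        0          0       0        0        0        0       0         0    1        0
...         ...        ...        ...         ...     ...    ...          ...    ...      ...        ...     ...      ...      ...      ...     ...       ...  ...      ...
1000204       0          0          0           0       0      0            0      1        0          1       0        0        0        0       0         0    0        0
1000205       1          0          0           0       0      0            0      0        0          0       0        0        0        0       1         0    0        0
1000206       0          0          0           0       0      1            0      1        0          0       0        0        0        0       0         0    0        0
1000207       0          0          1           1       1      0            0      0        0          0       0        0        0        0       0         0    0        0
1000208       0          0          0           0       1      0            0      0        0          0       0        0        0        0       0         0    0        0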

1000209 rows × 18 columns

In [30]:
#merge by index: this works here because df_genres was derived from df_master_data and shares its index;
#in general, merging on an explicit key column is safer
df_master_data = pd.merge(df_master_data, df_genres, how='inner',
         left_index=True, right_index=True, sort=True)
In [31]:
df_master_data.head()
Out[31]:
   MovieID  Title                                      UserID  Age  Gender  Occupation  Rating  Genres                                Action  Adventure  ...  Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0        1  Toy Story (1995)                                1    1  F               10       5  Animation|Children's|Comedy                0          0  ...        0          0       0        0        0        0       0         0    0        0
1       48  Pocahontas (1995)                               1    1  F               10       5  Animation|Children's|Musical|Romance       0          0  ...        0          0       0        1        0        1       0         0    0        0
2      150  Apollo 13 (1995)                                1    1  F               10       5  Drama                                      0          0  ...        0          0       0        0        0        0       0         0    0        0
3      260  Star Wars: Episode IV - A New Hope (1977)       1    1  F               10       4  Action|Adventure|Fantasy|Sci-Fi            1          1  ...        1          0       0        0        0        0       1         0    0        0
4      527  Schindler's List (1993)                         1    1  F               10       5  Drama|War                                  0          0  ...        0          0       0        0        0        0       0         0    1        0

5 rows × 26 columns
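
Because the dummy columns share the index of the frame they were built from, DataFrame.join is a compact alternative to the index-based merge. The sketch below shows the idea on two made-up rows; df and df_encoded are illustrative names rather than the notebook's own variables.

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['Toy Story (1995)', "Schindler's List (1993)"],
    'Genres': ["Animation|Children's|Comedy", 'Drama|War'],
})

# str.get_dummies returns a 0/1 frame with the same index as df
dummies = df['Genres'].str.get_dummies('|')

# join aligns on the index, so each row keeps its own genre flags
df_encoded = df.join(dummies)
print(df_encoded)
```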

Determine the features affecting the ratings of any particular movie.

Develop an appropriate model to predict the movie ratings

In [32]:
#This is treated as a multi-class classification problem: the target Rating takes the integer values 1 to 5.
#We need to predict the Rating of a movie and find out which variables are significant.
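
One simple way to start looking for features that affect ratings is to compare the mean rating across the values of each candidate feature. The sketch below does this on a small made-up frame; it is an exploratory aid under that assumption, not a significance test, and the group counts are shown because means over very few ratings are unreliable.

```python
import pandas as pd

# Illustrative rows; in the notebook this would be df_master_data
df = pd.DataFrame({
    'Gender': ['F', 'F', 'M', 'M', 'M', 'F'],
    'Age':    [25, 25, 35, 25, 35, 35],
    'Rating': [5, 4, 3, 4, 2, 4],
})

# Mean rating and number of ratings per group, one candidate feature at a time
for feature in ['Gender', 'Age']:
    summary = df.groupby(feature)['Rating'].agg(['mean', 'count'])
    print(summary, end='\n\n')

by_gender = df.groupby('Gender')['Rating'].mean()
```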
In [33]:
df_master_data['Rating'].hasnans
Out[33]:
False
In [34]:
df_master_data['Rating']=df_master_data.Rating.astype('int') #convert Rating to integer
In [35]:
df_master_data['Age']=df_master_data.Age.astype('int') #convert Age to integer
In [36]:
df_master_data['Occupation']=df_master_data.Occupation.astype('int') #convert Occupation to integer

Model Selection Process

In [37]:
from sklearn.model_selection import train_test_split #used to split the dataset into training and test sets

Perform Exploratory Data Analysis (EDA) for the Master Data Set

In [38]:
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [39]:
# machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
In [40]:
#Visualize user age distribution
df_master_data['Age'].value_counts().plot(kind='barh',alpha=0.7,figsize=(10,10))
plt.show()
In [41]:
df_master_data.Age.plot.hist(bins=25)
plt.title("Distribution of users' ages")
plt.ylabel('count of ratings') #the master data has one row per rating, so each user is counted once per rating
plt.xlabel('Age')
Out[41]:
Text(0.5, 0, 'Age')
In [42]:
#Visualize overall rating by users
df_master_data['Rating'].value_counts().plot(kind='bar',alpha=0.7,figsize=(10,10))
plt.show()

Perform Machine Learning Algorithms

In [43]:
#Use the following features: MovieID, Age, Occupation
features = df_master_data[['MovieID','Age','Occupation']].values
In [44]:
#Use Rating as the label, i.e. the response variable
labels = df_master_data[['Rating']].values
In [45]:
#Create train and test data set
train, test, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=1)
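
Selecting the label with double brackets yields a column vector of shape (n, 1), which is what triggers the DataConversionWarning in the model cells further down. The sketch below shows one way to avoid it, using ravel() as the warning itself suggests. The data is synthetic, and the stratify argument is an optional addition that the original split does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.integers(0, 50, size=(200, 3))   # synthetic stand-in for MovieID, Age, Occupation
labels_2d = rng.integers(1, 6, size=(200, 1))   # column vector, like df[['Rating']].values

# ravel() flattens (n, 1) to (n,), the shape scikit-learn estimators expect for y
labels = labels_2d.ravel()

# stratify keeps the share of each rating value similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=1, stratify=labels)

print(X_train.shape, X_test.shape, y_train.shape)
```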
In [46]:
#Create a histogram for movie
df_master_data.MovieID.plot.hist(bins=25)
plt.title("Number of ratings by MovieID")
plt.ylabel('Number of ratings')
plt.xlabel('MovieID')
Out[46]:
Text(0.5, 0, 'MovieID')
In [47]:
#Create a histogram for age
df_master_data.Age.plot.hist(bins=25)
plt.title("Number of ratings by Age")
plt.ylabel('Number of ratings')
plt.xlabel('Age')
Out[47]:
Text(0.5, 0, 'Age')
In [48]:
#Create a histogram for occupation
df_master_data.Occupation.plot.hist(bins=25)
plt.title("Number of ratings by Occupation")
plt.ylabel('Number of ratings')
plt.xlabel('Occupation')
Out[48]:
Text(0.5, 0, 'Occupation')
In [49]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(train, train_labels)
Y_pred = logreg.predict(test)
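# note: this cell and the model cells that follow score each model on the training data,
# not on the held-out test set
acc_log = round(logreg.score(train, train_labels) * 100, 2)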
acc_log
C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[49]:
34.87
# Support Vector Machines (cell left disabled; no output recorded)
# svc = SVC()
# svc.fit(train, train_labels)
# Y_pred = svc.predict(test)
# acc_svc = round(svc.score(train, train_labels) * 100, 2)
# acc_svc
In [50]:
# K Nearest Neighbors Classifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(train, train_labels)
Y_pred = knn.predict(test)
acc_knn = round(knn.score(train, train_labels) * 100, 2)
acc_knn
<ipython-input-50-595daf63fa48>:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  knn.fit(train, train_labels)
Out[50]:
43.99
In [51]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(train, train_labels)
Y_pred = gaussian.predict(test)
acc_gaussian = round(gaussian.score(train, train_labels) * 100, 2)
acc_gaussian
C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Out[51]:
34.88
In [52]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(train, train_labels)
Y_pred = perceptron.predict(test)
acc_perceptron = round(perceptron.score(train, train_labels) * 100, 2)
acc_perceptron
C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Out[52]:
26.04
# Linear SVC (cell left disabled; no output recorded)
# linear_svc = LinearSVC()
# linear_svc.fit(train, train_labels)
# Y_pred = linear_svc.predict(test)
# acc_linear_svc = round(linear_svc.score(train, train_labels) * 100, 2)
# acc_linear_svc

# Stochastic Gradient Descent (cell left disabled; no output recorded)
# sgd = SGDClassifier()
# sgd.fit(train, train_labels)
# Y_pred = sgd.predict(test)
# acc_sgd = round(sgd.score(train, train_labels) * 100, 2)
# acc_sgd
In [54]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(train, train_labels)
Y_pred = decision_tree.predict(test)
acc_decision_tree = round(decision_tree.score(train, train_labels) * 100, 2)
acc_decision_tree
Out[54]:
55.68
# Random Forest (cell left disabled; no output recorded)
# random_forest = RandomForestClassifier(n_estimators=100)
# random_forest.fit(train, train_labels)
# Y_pred = random_forest.predict(test)
# random_forest.score(train, train_labels)
# acc_random_forest = round(random_forest.score(train, train_labels) * 100, 2)
# acc_random_forest
In [58]:
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 
               'Naive Bayes', 'Perceptron', 
              'Decision Tree'],
    'Score': [acc_knn, acc_log, 
              acc_gaussian, acc_perceptron, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
Out[58]:
   Model                Score
4  Decision Tree        55.68
0  KNN                  43.99
2  Naive Bayes          34.88
1  Logistic Regression  34.87
3  Perceptron           26.04

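From the above accuracy scores, Decision Tree scores highest of the five models with 55.68% accuracy. Note that these scores are computed on the training data, so they likely overstate how well each model would predict unseen ratings; the held-out test set should be scored before choosing a model.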
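
To illustrate why the choice of evaluation data matters, the sketch below fits a decision tree to synthetic features that carry no information about the labels. The data and the variable names are assumptions made for illustration; the real notebook would pass its own train and test arrays instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 3))            # synthetic features, unrelated to the labels
y = rng.integers(1, 6, size=1000)    # random "ratings" from 1 to 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# An unconstrained tree can memorize its training rows
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)      # accuracy on held-out data
print(f"train: {train_acc:.2f}, test: {test_acc:.2f}")
```

The tree reproduces its training labels perfectly even though there is nothing to learn, while accuracy on the held-out rows stays near chance. The gap on the real data will differ, but the same comparison applies to every model in the table.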
