Here, you are asked to analyze the data using Exploratory Data Analysis (EDA) techniques. The goal is to find the features affecting the ratings of any particular movie and to build a model that predicts movie ratings. Domain: Entertainment

Dataset Description:

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

Analysis Tasks to be performed:

1. Import the three datasets

import numpy as np #import the numpy library for numerical operations
import pandas as pd #import the pandas library for data wrangling operations

colnames_movies = ['MovieID','Title','Genres']
df_movies = pd.read_csv('movies.dat', delimiter='::', engine='python', names=colnames_movies) #open the movies.dat datafile
df_movies.head()

df_movies.shape #movies.dat has 3883 observations & 3 variables

(3883, 3)

colnames_ratings=['UserID','MovieID','Rating','Timestamp']
df_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python', names=colnames_ratings) #open the ratings.dat datafile
df_ratings.head()

df_ratings.shape

(1000209, 4)

colnames_users=['UserID','Gender','Age','Occupation','Zip-code']
df_users = pd.read_csv('users.dat', delimiter='::', engine='python', names=colnames_users) #open the users.dat datafile
df_users.head()

df_users.shape

(6040, 5)

Create a new dataset [Master_Data] with the following columns

MovieID Title UserID Age Gender Occupation Rating.

(Hint: (i) Merge two tables at a time. (ii) Merge the tables using the two primary keys MovieID & UserID)

#merge the ratings dataframe with the movies dataframe on column MovieID
#(arguments left at their defaults are omitted)
df_ratings_movies = pd.merge(df_ratings, df_movies, how='inner', on='MovieID', sort=True)
df_ratings_movies.shape

(1000209, 6)

df_ratings_movies.head()

#now merge the users dataframe with the ratings_movies dataframe on column UserID
#(arguments left at their defaults are omitted)
df_ratings_movies_users = pd.merge(df_ratings_movies, df_users, how='inner', on='UserID', sort=True)
df_ratings_movies_users.shape

(1000209, 10)

df_ratings_movies_users.head()

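#Genres is kept in addition to the requested columns because it is needed for feature engineering later
#.copy() makes df_master_data independent of the merged dataframe, so later column assignments do not raise SettingWithCopyWarning
df_master_data = df_ratings_movies_users[['MovieID', 'Title', 'UserID', 'Age', 'Gender', 'Occupation', 'Rating','Genres']].copy()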

df_master_data.head()

df_master_data.shape

(1000209, 8)
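The row counts above stay at 1,000,209 after each merge, which suggests that every rating found a matching movie and user. A minimal sketch of how that check could be made explicit is shown below. It uses toy stand-in frames (toy_ratings and toy_movies are illustrative names, not part of the dataset) and the validate argument of pd.merge, which raises an error if the lookup table contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (values are illustrative only).
toy_ratings = pd.DataFrame({
    'UserID': [1, 1, 2],
    'MovieID': [10, 20, 10],
    'Rating': [5, 3, 4],
})
toy_movies = pd.DataFrame({
    'MovieID': [10, 20],
    'Title': ['Movie A', 'Movie B'],
})

# validate='many_to_one' raises pandas.errors.MergeError if MovieID is
# duplicated in the right-hand (movies) table.
toy_merged = pd.merge(toy_ratings, toy_movies, how='inner', on='MovieID',
                      validate='many_to_one')

# An inner join against a complete lookup table should keep every rating row.
rows_preserved = len(toy_merged) == len(toy_ratings)
print(toy_merged)
print('rows preserved:', rows_preserved)
```

The same pattern would apply to the real frames by swapping in df_ratings and df_movies, assuming MovieID is unique in movies.dat.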

Explore the datasets using visual representations (graphs or tables), also include your comments on the following:

1. User Age Distribution

from matplotlib import pyplot as plt #import the matplotlib pyplot subpackage

plt.hist(df_master_data.Age, bins = 7)
plt.show()
#the histogram shows that age group 25 (i.e. 25-34 years) has submitted the most ratings
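The Age column holds group codes rather than actual ages. With bins = 7 spread evenly over the range 1 to 56, codes 50 and 56 land in the same bin, so the histogram cannot separate those two groups. A small sketch of an alternative is to count the codes directly and attach readable labels. The mapping below follows the age ranges given in the MovieLens 1M README; toy_age is a made-up stand-in for df_master_data['Age'].

```python
import pandas as pd

# Age codes and their ranges as documented in the MovieLens 1M README.
age_labels = {1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
              45: '45-49', 50: '50-55', 56: '56+'}

# Toy age column standing in for df_master_data['Age'] (illustrative values).
toy_age = pd.Series([25, 25, 18, 35, 50, 56, 25, 1])

# Map each code to its label, then count how many ratings fall in each group.
age_counts = toy_age.map(age_labels).value_counts()
print(age_counts)
```

On the real data, age_counts.plot(kind='bar') would give one bar per age group.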

2. User rating of the movie “Toy Story”

df_bytitle = df_master_data.groupby('Title')

df_bytitle.get_group('Toy Story (1995)')

#Get the average ratings for all the movies
df_master_data.groupby(['Title'])['Rating'].mean()

Title
$1,000,000 Duck (1971)                        3.027027
'Night Mother (1986)                          3.371429
'Til There Was You (1997)                     2.692308
'burbs, The (1989)                            2.910891
...And Justice for All (1979)                 3.713568
                                                ...   
Zed & Two Noughts, A (1985)                   3.413793
Zero Effect (1998)                            3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995)    3.500000
Zeus and Roxanne (1997)                       2.521739
eXistenZ (1999)                               3.256098
Name: Rating, Length: 3706, dtype: float64

#get the average user rating for the movie - Toy Story (1995)
np.average(df_bytitle.get_group('Toy Story (1995)').Rating)
#shows that Toy Story (1995) has an average user rating of about 4.15

4.146846413095811

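3. Top 25 movies by viewership rating

#rank every movie by its mean rating, highest first; Top25_Movies.head(25) restricts the result to the top 25
Top25_Movies = df_master_data.groupby(['MovieID','Title'])['Rating'].mean().sort_values(ascending=False)

Top25_Movies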

MovieID  Title                                                     
3382     Song of Freedom (1936)                                        5.0
3172     Ulysses (Ulisse) (1954)                                       5.0
3607     One Little Indian (1973)                                      5.0
3656     Lured (1947)                                                  5.0
3280     Baby, The (1973)                                              5.0
                                                                      ... 
3228     Wirey Spindell (1999)                                         1.0
3651     Blood Spattered Bride, The (La Novia Ensangrentada) (1972)    1.0
641      Little Indian, Big City (Un indien dans la ville) (1994)      1.0
1165     Bloody Child, The (1996)                                      1.0
1430     Underworld (1997)                                             1.0
Name: Rating, Length: 3706, dtype: float64
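A ranking by mean rating alone can be dominated by movies that have only a handful of ratings, since a single 5-star vote yields a perfect average. One possible refinement, sketched here on toy data, is to require a minimum number of ratings before a movie is eligible. The threshold min_ratings and the toy frame are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy ratings table (illustrative values only).
toy = pd.DataFrame({
    'Title': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Rating': [4, 5, 4, 5, 3, 4, 3],
})

min_ratings = 3  # hypothetical threshold; tune it for the real data

# Compute the mean rating and the number of ratings per movie.
stats = toy.groupby('Title')['Rating'].agg(['mean', 'count'])

# Keep movies with enough ratings, then take the 25 best by mean rating.
top_movies = (stats[stats['count'] >= min_ratings]
              .sort_values('mean', ascending=False)
              .head(25))
print(top_movies)
```

Movie B has a perfect mean but only one rating, so it is excluded from top_movies.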

4. Find the ratings for all the movies reviewed by a particular user with user ID = 2696

df_byuserID = df_master_data.groupby('UserID')

df_byuserID.get_group(2696) #this returns every column, which is hard to read. Let us try an alternative method

#alternative method
df_master_data[df_master_data['UserID'] == 2696][['Title','Rating']]

Feature Engineering

Find out all the unique genres (Hint: split the data in column genre making a list and then process the data to find out only the unique categories of genres)

df_master_data.head()

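#below is a hard-coded (and therefore error-prone) way of listing the genres
#"Children's" needs double quotes: 'Children''s' is two adjacent string literals that Python joins into 'Childrens'
genres = ['Action','Adventure','Animation',"Children's",'Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir',
          'Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']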
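The hint asks for the unique genres to be derived from the data rather than typed in. A minimal sketch of that approach follows, using a toy Genres column (toy_genres is an illustrative stand-in for df_master_data['Genres']): split each string on the pipe character, flatten the resulting lists with explode, and keep the distinct values.

```python
import pandas as pd

# Toy Genres column in the same pipe-separated format as movies.dat.
toy_genres = pd.Series(["Animation|Children's|Comedy", "Comedy|Romance", "Drama"])

# Split on '|', flatten the lists into one long Series, and collect the distinct genres.
unique_genres = sorted(toy_genres.str.split('|').explode().unique())
print(unique_genres)
```

Running the same expression on df_movies['Genres'] instead of the merged table would avoid splitting a million rows, since each movie appears there only once.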

Create a separate column for each genre category with a one-hot encoding (1 and 0) indicating whether or not the movie belongs to that genre.

#below is a more efficient way using pandas: str.get_dummies adds one 0/1 dummy column per genre
df_genres = df_master_data['Genres'].str.get_dummies("|")

df_genres

#merge on the index; this is safe here only because df_genres was derived from df_master_data and shares its index
#(in general, merging on an explicit key column is the better practice)
df_master_data = pd.merge(df_master_data, df_genres, how='inner',
         left_index=True, right_index=True, sort=True)

df_master_data.head()

Determine the features affecting the ratings of any particular movie.

Develop an appropriate model to predict the movie ratings

#This is treated as a classification problem, with each Rating value as a class.
#We need to predict the Rating of a movie and find out which variables are significant

df_master_data['Rating'].hasnans

False

df_master_data['Rating']=df_master_data.Rating.astype('int') #convert Rating to integer

df_master_data['Age']=df_master_data.Age.astype('int') #convert Age to integer

df_master_data['Occupation']=df_master_data.Occupation.astype('int') #convert Occupation to integer

Model Selection Process

from sklearn.model_selection import train_test_split #used to split the dataset into training and test sets

Perform Exploratory Data Analysis (EDA) for the Master Data Set

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

#Visualize user age distribution
df_master_data['Age'].value_counts().plot(kind='barh',alpha=0.7,figsize=(10,10))
plt.show()

#each row of df_master_data is one rating, so the bars count ratings, not distinct users
df_master_data.Age.plot.hist(bins=25)
plt.title("Distribution of ratings by user age")
plt.ylabel('count of ratings')
plt.xlabel('Age')

Text(0.5, 0, 'Age')

#Visualize overall rating by users
df_master_data['Rating'].value_counts().plot(kind='bar',alpha=0.7,figsize=(10,10))
plt.show()

Perform Machine Learning Algorithms

#Use the following features: MovieID, Age, Occupation
features = df_master_data[['MovieID','Age','Occupation']].values

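#Use rating as label, i.e. the response variable
#note: [['Rating']] yields a column vector of shape (n, 1); scikit-learn expects shape (n,) and
#emits a DataConversionWarning otherwise. df_master_data['Rating'].values gives the 1-D form.
labels = df_master_data[['Rating']].values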

#Create train and test data set
train, test, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=1)

#Create a histogram for movie
df_master_data.MovieID.plot.hist(bins=25)
plt.title("Distribution of ratings by MovieID")
plt.ylabel('count of ratings')
plt.xlabel('MovieID')
plt.show()

#Create a histogram for age
df_master_data.Age.plot.hist(bins=25)
plt.title("Distribution of ratings by Age")
plt.ylabel('count of ratings')
plt.xlabel('Age')
plt.show()

#Create a histogram for occupation (occupation codes run from 0 to 20, hence 21 bins)
df_master_data.Occupation.plot.hist(bins=21)
plt.title("Distribution of ratings by Occupation")
plt.ylabel('count of ratings')
plt.xlabel('Occupation')
plt.show()

# Logistic Regression

logreg = LogisticRegression()
logreg.fit(train, train_labels)
Y_pred = logreg.predict(test)
acc_log = round(logreg.score(train, train_labels) * 100, 2)
acc_log

C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

34.87
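The ConvergenceWarning above suggests scaling the data or raising max_iter. A minimal sketch of both remedies is given below, under the assumption that standardising the three features is acceptable for this model. It runs on synthetic data shaped like the MovieID, Age and Occupation features (the random values are placeholders, not MovieLens data) and passes a 1-D label array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (MovieID, Age, Occupation) feature matrix.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(1, 3953, size=n),                     # MovieID-like values
    rng.choice([1, 18, 25, 35, 45, 50, 56], size=n),   # Age codes
    rng.integers(0, 21, size=n),                       # Occupation codes
])
y = rng.integers(1, 6, size=n)  # ratings 1-5 as a 1-D array

# Standardise the features before fitting and allow more solver iterations.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X, y)
toy_pred = scaled_logreg.predict(X)
print(toy_pred[:10])
```

Because MovieID is an identifier rather than a quantity, scaling it helps the solver but does not make it a meaningful numeric predictor for a linear model.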

# K Nearest Neighbors Classifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(train, train_labels)
Y_pred = knn.predict(test)
acc_knn = round(knn.score(train, train_labels) * 100, 2)
acc_knn

<ipython-input-50-595daf63fa48>:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  knn.fit(train, train_labels)

43.99

# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(train, train_labels)
Y_pred = gaussian.predict(test)
acc_gaussian = round(gaussian.score(train, train_labels) * 100, 2)
acc_gaussian

C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

34.88

# Perceptron

perceptron = Perceptron()
perceptron.fit(train, train_labels)
Y_pred = perceptron.predict(test)
acc_perceptron = round(perceptron.score(train, train_labels) * 100, 2)
acc_perceptron

C:\Users\ns45237\Anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

26.04

# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(train, train_labels)
Y_pred = decision_tree.predict(test)
acc_decision_tree = round(decision_tree.score(train, train_labels) * 100, 2)
acc_decision_tree

55.68
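The task also asks which features affect the ratings. One way to get a rough indication from a fitted tree is its feature_importances_ attribute. The sketch below fits a tree on synthetic data (random placeholders, so the importances themselves carry no meaning) and only demonstrates the mechanics. Impurity-based importances tend to favour features with many distinct values such as MovieID, so they should be read with caution.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (MovieID, Age, Occupation) feature matrix.
rng = np.random.default_rng(1)
n = 400
feature_names = ['MovieID', 'Age', 'Occupation']
X = np.column_stack([
    rng.integers(1, 3953, size=n),
    rng.choice([1, 18, 25, 35, 45, 50, 56], size=n),
    rng.integers(0, 21, size=n),
])
y = rng.integers(1, 6, size=n)  # ratings 1-5

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each feature name with its impurity-based importance (the values sum to 1).
importances = dict(zip(feature_names, tree.feature_importances_))
for name, value in importances.items():
    print(f'{name}: {value:.3f}')
```

On the real data the same two lines after fitting decision_tree would list the importances of the features used in this post.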

models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 
               'Naive Bayes', 'Perceptron', 
              'Decision Tree'],
    'Score': [acc_knn, acc_log, 
              acc_gaussian, acc_perceptron, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

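From the above accuracy scores, the Decision Tree appears to be the most suitable model, with 55.68% accuracy. Note that every score above is computed with score(train, train_labels), so it measures how well each model fits the training set rather than how it performs on the held-out test set.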
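To compare models on unseen data, the predictions for the test split can be scored against test_labels. A minimal sketch using accuracy_score is shown below on synthetic data (random placeholders standing in for the real features and labels), printing training and test accuracy side by side.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (MovieID, Age, Occupation) features and the ratings.
rng = np.random.default_rng(2)
n = 400
X = np.column_stack([
    rng.integers(1, 3953, size=n),
    rng.choice([1, 18, 25, 35, 45, 50, 56], size=n),
    rng.integers(0, 21, size=n),
])
y = rng.integers(1, 6, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Training accuracy shows fit; test accuracy estimates performance on unseen rows.
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f'train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}')
```

In the notebook above, the equivalent call would be decision_tree.score(test, test_labels), and likewise for the other models.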

Output of df_movies.head():

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

Output of df_ratings.head():

	UserID	MovieID	Rating	Timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

Output of df_ratings_movies_users.head():

	UserID	MovieID	Rating	Timestamp	Title	Genres	Gender	Age	Occupation	Zip-code
0	1	1	5	978824268	Toy Story (1995)	Animation|Children's|Comedy	F	1	10	48067
1	1	48	5	978824351	Pocahontas (1995)	Animation|Children's|Musical|Romance	F	1	10	48067
2	1	150	5	978301777	Apollo 13 (1995)	Drama	F	1	10	48067
3	1	260	4	978300760	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Fantasy|Sci-Fi	F	1	10	48067
4	1	527	5	978824195	Schindler's List (1993)	Drama|War	F	1	10	48067

Output of df_byuserID.get_group(2696):

	MovieID	Title	UserID	Age	Gender	Occupation	Rating	Genres
440667	350	Client, The (1994)	2696	25	M	7	3	Drama|Mystery|Thriller
440668	800	Lone Star (1996)	2696	25	M	7	5	Drama|Mystery
440669	1092	Basic Instinct (1992)	2696	25	M	7	4	Mystery|Thriller
440670	1097	E.T. the Extra-Terrestrial (1982)	2696	25	M	7	3	Children's|Drama|Fantasy|Sci-Fi
440671	1258	Shining, The (1980)	2696	25	M	7	4	Horror
440672	1270	Back to the Future (1985)	2696	25	M	7	2	Comedy|Sci-Fi
440673	1589	Cop Land (1997)	2696	25	M	7	3	Crime|Drama|Mystery
440674	1617	L.A. Confidential (1997)	2696	25	M	7	4	Crime|Film-Noir|Mystery|Thriller
440675	1625	Game, The (1997)	2696	25	M	7	4	Mystery|Thriller
440676	1644	I Know What You Did Last Summer (1997)	2696	25	M	7	2	Horror|Mystery|Thriller
440677	1645	Devil's Advocate, The (1997)	2696	25	M	7	4	Crime|Horror|Mystery|Thriller
440678	1711	Midnight in the Garden of Good and Evil (1997)	2696	25	M	7	4	Comedy|Crime|Drama|Mystery
440679	1783	Palmetto (1998)	2696	25	M	7	4	Film-Noir|Mystery|Thriller
440680	1805	Wild Things (1998)	2696	25	M	7	4	Crime|Drama|Mystery|Thriller
440681	1892	Perfect Murder, A (1998)	2696	25	M	7	4	Mystery|Thriller
440682	2338	I Still Know What You Did Last Summer (1998)	2696	25	M	7	2	Horror|Mystery|Thriller
440683	2389	Psycho (1998)	2696	25	M	7	4	Crime|Horror|Thriller
440684	2713	Lake Placid (1999)	2696	25	M	7	1	Horror|Thriller
440685	3176	Talented Mr. Ripley, The (1999)	2696	25	M	7	4	Drama|Mystery|Thriller
440686	3386	JFK (1991)	2696	25	M	7	1	Drama|Mystery

Output of df_master_data[df_master_data['UserID'] == 2696][['Title','Rating']]:

	Title	Rating
440667	Client, The (1994)	3
440668	Lone Star (1996)	5
440669	Basic Instinct (1992)	4
440670	E.T. the Extra-Terrestrial (1982)	3
440671	Shining, The (1980)	4
440672	Back to the Future (1985)	2
440673	Cop Land (1997)	3
440674	L.A. Confidential (1997)	4
440675	Game, The (1997)	4
440676	I Know What You Did Last Summer (1997)	2
440677	Devil's Advocate, The (1997)	4
440678	Midnight in the Garden of Good and Evil (1997)	4
440679	Palmetto (1998)	4
440680	Wild Things (1998)	4
440681	Perfect Murder, A (1998)	4
440682	I Still Know What You Did Last Summer (1998)	2
440683	Psycho (1998)	4
440684	Lake Placid (1999)	1
440685	Talented Mr. Ripley, The (1999)	4
440686	JFK (1991)	1

Data Science Experiments

MovieLens Case Study with Python