Data Pre-processing for Machine Learning

 

Traditional Computer Computation:

(Data & Program) -> Computer -> Output

Machine Learning:

(Data & Output) -> Computer -> Program/Model

The generated model is then used to predict outcomes for new, unseen data.

Machine Learning (ML) is a subset of Artificial Intelligence (AI).

In this exercise, let us learn the below data pre-processing techniques:

1. Data Acquisition
2. Data Exploration
3. Data Wrangling
4. Data Manipulation
In [3]:
import os
os.getcwd() #gets current working directory
Out[3]:
'C:\\Users\\ns45237\\working_neeraj\\2_machine_learning'
In [38]:
#Data Acquisition
#Loading the CSV file in Python
import pandas as pd                       #pandas is Python's data-wrangling package
df = pd.read_csv('BostonHousing.csv')     #read the CSV file into a pandas DataFrame
#df.to_csv("/home/neerajshinde/Data/BostonHousing.csv")      #write the DataFrame out to a CSV file
#df = pd.read_excel('BostonHousing.xlsx') #read an XLSX file into a pandas DataFrame
#df.to_excel("/home/neerajshinde/Data/BostonHousing.xlsx")   #write the DataFrame out to an XLSX file
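read_csv accepts many optional parameters worth knowing. A minimal sketch (df_subset is an illustrative name, and the na_values strings and usecols selection here are not required by this dataset):

df_subset = pd.read_csv(
    'BostonHousing.csv',
    sep=',',                           #field delimiter (',' is already the default)
    na_values=['NA', '?', 'missing'],  #extra strings to treat as NaN on load
    usecols=['crim', 'rm', 'medv'],    #read only a subset of the columns
)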
In [21]:
df.head(5) #display the first 5 observations from the DataFrame
Out[21]:
      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio       b  lstat  medv
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
In [13]:
df.shape  #display the no. of rows and columns (observations and variables)
Out[13]:
(506, 14)
In [20]:
df.tail(5) #display the last 5 observations from the DataFrame
Out[20]:
        crim   zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio       b  lstat  medv
501  0.06263  0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273     21.0  391.99   9.67  22.4
502  0.04527  0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273     21.0  396.90   9.08  20.6
503  0.06076  0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
504  0.10959  0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273     21.0  393.45   6.48  22.0
505  0.04741  0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273     21.0  396.90   7.88  11.9
In [22]:
df.columns     #display the column names from the DataFrame
Out[22]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')
In [25]:
#select specific rows from the DataFrame
df.iloc[0]     #displays 1st row - note index always starts from 0
Out[25]:
crim         0.00632
zn          18.00000
indus        2.31000
chas         0.00000
nox          0.53800
rm           6.57500
age         65.20000
dis          4.09000
rad          1.00000
tax        296.00000
ptratio     15.30000
b          396.90000
lstat        4.98000
medv        24.00000
Name: 0, dtype: float64
In [30]:
df.iloc[-1]     #displays last row in the DataFrame
Out[30]:
crim         0.04741
zn           0.00000
indus       11.93000
chas         0.00000
nox          0.57300
rm           6.03000
age         80.80000
dis          2.50500
rad          1.00000
tax        273.00000
ptratio     21.00000
b          396.90000
lstat        7.88000
medv        11.90000
Name: 505, dtype: float64
In [31]:
df.iloc[:,-1]     #display the last column in the DataFrame
Out[31]:
0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: medv, Length: 506, dtype: float64
In [32]:
df.iloc[:,0:2]     #display the 1st two columns
Out[32]:
        crim    zn
0    0.00632  18.0
1    0.02731   0.0
2    0.02729   0.0
3    0.03237   0.0
4    0.06905   0.0
..       ...   ...
501  0.06263   0.0
502  0.04527   0.0
503  0.06076   0.0
504  0.10959   0.0
505  0.04741   0.0

506 rows × 2 columns
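iloc selects by integer position; its label-based counterpart is loc. A short sketch (the column names are from the Boston dataset above):

df.loc[0:4, ['crim', 'rm', 'medv']]   #rows 0-4 by label (inclusive, unlike iloc) plus three named columns
df.loc[:, 'crim':'nox']               #all rows, the contiguous column slice from crim to nox
df[df['medv'] > 30]                   #boolean filtering: only rows where medv exceeds 30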

Let us try to read another dataset SalaryGender.csv

In [41]:
df_sg = pd.read_csv("SalaryGender.csv")     #read the CSV file into a DataFrame
df_sg                                       #display the DataFrame
Out[41]:
    Salary  Gender  Age  PhD
0    140.0       1   47    1
1     30.0       0   65    1
2     35.1       0   56    0
3     30.0       1   23    0
4     80.0       0   53    1
..     ...     ...  ...  ...
95    18.6       1   26    0
96   152.0       1   56    1
97     1.8       1   28    0
98    35.0       0   44    0
99     4.0       0   24    0

100 rows × 4 columns

In [43]:
df_sg.shape                                  #display the shape of the DataFrame rows X columns
Out[43]:
(100, 4)
In [48]:
#check the datatype of the column Salary (Specific column)
df_sg['Salary'].dtype
Out[48]:
dtype('float64')
In [47]:
#check the datatype of all columns in one shot
df_sg.dtypes
Out[47]:
Salary    float64
Gender      int64
Age         int64
PhD         int64
dtype: object
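As a side note, info() packs the shape, column dtypes, non-null counts, and memory usage into a single call:

df_sg.info()      #one-call overview of the DataFrame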
In [53]:
#display the unique values in a column
df_sg['Age'].unique()        #note: in a notebook, only the last expression of a cell is displayed
df_sg['Gender'].unique()
Out[53]:
array([1, 0], dtype=int64)
In [55]:
#display all values in a particular column
df_sg['Gender'].values
Out[55]:
array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0], dtype=int64)
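Beyond unique() and values, value_counts() tallies how often each value occurs, which is a quick way to check class balance:

df_sg['Gender'].value_counts()                 #counts per value (50 zeros and 50 ones here)
df_sg['Gender'].value_counts(normalize=True)   #the same counts expressed as proportions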
In [59]:
#display statistical values for numeric (continuous) columns
df_sg['Salary'].mean()
Out[59]:
52.52450000000001
In [60]:
df_sg['Salary'].median()
Out[60]:
39.3
In [63]:
df_sg['Salary'].mode()
Out[63]:
0    30.0
dtype: float64
In [66]:
df_sg.mean()      #display the mean of every column. The results make sense for numeric continuous data; for the 0/1 columns Gender and PhD the "mean" is just the proportion of 1s
Out[66]:
Salary    52.5245
Gender     0.5000
Age       46.8800
PhD        0.3900
dtype: float64
In [70]:
df_sg['Salary'].mean(axis=0)     #axis=0 (the default, and the only valid axis for a Series) averages down the rows; on a DataFrame, axis=1 would average across the columns
Out[70]:
52.52450000000001
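Rather than calling mean(), median(), and mode() separately, describe() returns the count, mean, standard deviation, minimum, quartiles, and maximum in one shot:

df_sg['Salary'].describe()      #count, mean, std, min, 25%, 50% (median), 75%, max
df_sg.describe()                #the same summary for every numeric column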

Using Seaborn (a statistical plotting library built on top of Matplotlib) to plot a Heatmap of the Correlations in the Data

In [71]:
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df.corr()
In [87]:
correlations     #display the correlation matrix
#the correlation between 2 variables (columns) tells us how strongly they are linearly related
#a value moving towards +1 indicates an increasingly strong positive correlation
#a value moving towards -1 indicates an increasingly strong negative correlation; values near 0 indicate little or no linear relationship
Out[87]:
             crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat      medv
crim     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
zn      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
indus    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
chas    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
nox      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
rm      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
age      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
dis     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
rad      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
tax      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
ptratio  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
b       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
lstat    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
medv    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
In [88]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlations,square = True, cmap = "bwr")
#with the "bwr" colormap, blue (towards -1) signifies strong negative correlation
#and red (towards +1) signifies strong positive correlation
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d227f6ca0>
In [76]:
plt.yticks(rotation=0)     #keep the y-axis tick labels horizontal
plt.xticks(rotation=90)    #rotate the x-axis tick labels vertically
Out[76]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]),
 <a list of 6 Text major ticklabel objects>)
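plt.xticks and plt.yticks act on the currently active figure, so to take effect reliably they should run in the same cell as the heatmap. A combined sketch (vmin/vmax pin the colour scale to the full -1..+1 range):

ax = sns.heatmap(data=correlations, square=True, cmap="bwr", vmin=-1, vmax=1)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()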
Problem Statement: Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, has approached you with the task of understanding why these schools are underperforming. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers. The data includes various demographic, school faculty, and income variables.

Objective: Perform exploratory data analysis, which includes determining the type of the data and correlation analysis over the same. You need to convert the data into useful information:

1. Read the data into a pandas DataFrame
2. Describe the data to find more details
3. Find the correlation between 'reduced_lunch' and 'school_rating'
In [77]:
df_school = pd.read_csv('middle_tn_schools.csv')     #read the CSV file into a DataFrame
In [78]:
df_school.shape
Out[78]:
(347, 15)
In [79]:
df_school.head(5)
Out[79]:
                          name  school_rating   size  reduced_lunch  state_percentile_16  state_percentile_15  stu_teach_ratio    school_type  avg_score_15  avg_score_16  full_time_teachers  percent_black  percent_white  percent_asian  percent_hispanic
0  Allendale Elementary School            5.0  851.0           10.0                 90.2                 95.8             15.7         Public          89.4          85.2                54.0            2.9           85.5            1.6               5.6
1          Anderson Elementary            2.0  412.0           71.0                 32.8                 37.3             12.8         Public          43.0          38.3                32.0            3.9           86.7            1.0               4.9
2             Avoca Elementary            4.0  482.0           43.0                 78.4                 83.6             16.6         Public          75.7          73.0                29.0            1.0           91.5            1.2               4.4
3                Bailey Middle            0.0  394.0           91.0                  1.6                  1.0             13.1  Public Magnet           2.1           4.4                30.0           80.7           11.7            2.3               4.3
4          Barfield Elementary            4.0  948.0           26.0                 85.3                 89.2             14.8         Public          81.3          79.6                64.0           11.8           71.2            7.1               6.0
In [84]:
#Find the Correlation between 'reduced_lunch' and 'school_rating' columns
correlation_school = df_school[['reduced_lunch','school_rating']].corr()
In [85]:
correlation_school
Out[85]:
               reduced_lunch  school_rating
reduced_lunch       1.000000      -0.815757
school_rating      -0.815757       1.000000
In [86]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlation_school,square = True, cmap = "bwr")
Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d2275b340>
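A correlation of about -0.82 is a strong inverse relationship: the higher the share of students on reduced lunch, the lower the school rating tends to be. A scatter plot sketch (regplot also overlays a linear fit) shows this directly:

sns.regplot(x='reduced_lunch', y='school_rating', data=df_school)
plt.show()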

Data Exploration

Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. Based on feedback from their customers, they are coming up with a new model, so they have to explore the current dataset to derive further insights out of it.

Objective: Import the dataset and explore it for dimensionality, data types, and the average horsepower across all the cars. Also, identify a few of the most correlated features, which would help in the modification.
In [89]:
df_mtcars = pd.read_csv('mtcars.csv')      #read the CSV into a DataFrame
In [90]:
df_mtcars.head(5)                          #display the 1st 5 rows 
Out[90]:
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
In [92]:
df_mtcars.shape                            #display the shape of the DataSet
Out[92]:
(32, 12)
In [93]:
df_mtcars.dtypes                           #display the DataTypes of all the columns
Out[93]:
model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object
In [124]:
df_mtcars.groupby(['model'])['hp'].mean()                     #average horsepower per model; each model appears once, so this simply lists each car's hp
Out[124]:
model
AMC Javelin            150
Cadillac Fleetwood     205
Camaro Z28             245
Chrysler Imperial      230
Datsun 710              93
Dodge Challenger       150
Duster 360             245
Ferrari Dino           175
Fiat 128                66
Fiat X1-9               66
Ford Pantera L         264
Honda Civic             52
Hornet 4 Drive         110
Hornet Sportabout      175
Lincoln Continental    215
Lotus Europa           113
Maserati Bora          335
Mazda RX4              110
Mazda RX4 Wag          110
Merc 230                95
Merc 240D               62
Merc 280               123
Merc 280C              123
Merc 450SE             180
Merc 450SL             180
Merc 450SLC            180
Pontiac Firebird       175
Porsche 914-2           91
Toyota Corolla          65
Toyota Corona           97
Valiant                105
Volvo 142E             109
Name: hp, dtype: int64
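Because each model occurs exactly once in mtcars, the groupby above simply lists each car's horsepower. The single figure the objective asks for, the average across all the cars, is just the column mean:

df_mtcars['hp'].mean()      #average horsepower across all 32 cars -> 146.6875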
In [108]:
correlation_mtcars = df_mtcars.corr()     #derive the correlation amongst all the variables
correlation_mtcars                        #display the correlation table
Out[108]:
           mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
mpg   1.000000 -0.852162 -0.847551 -0.776168  0.681172 -0.867659  0.418684  0.664039  0.599832  0.480285 -0.550925
cyl  -0.852162  1.000000  0.902033  0.832447 -0.699938  0.782496 -0.591242 -0.810812 -0.522607 -0.492687  0.526988
disp -0.847551  0.902033  1.000000  0.790949 -0.710214  0.887980 -0.433698 -0.710416 -0.591227 -0.555569  0.394977
hp   -0.776168  0.832447  0.790949  1.000000 -0.448759  0.658748 -0.708223 -0.723097 -0.243204 -0.125704  0.749812
drat  0.681172 -0.699938 -0.710214 -0.448759  1.000000 -0.712441  0.091205  0.440278  0.712711  0.699610 -0.090790
wt   -0.867659  0.782496  0.887980  0.658748 -0.712441  1.000000 -0.174716 -0.554916 -0.692495 -0.583287  0.427606
qsec  0.418684 -0.591242 -0.433698 -0.708223  0.091205 -0.174716  1.000000  0.744535 -0.229861 -0.212682 -0.656249
vs    0.664039 -0.810812 -0.710416 -0.723097  0.440278 -0.554916  0.744535  1.000000  0.168345  0.206023 -0.569607
am    0.599832 -0.522607 -0.591227 -0.243204  0.712711 -0.692495 -0.229861  0.168345  1.000000  0.794059  0.057534
gear  0.480285 -0.492687 -0.555569 -0.125704  0.699610 -0.583287 -0.212682  0.206023  0.794059  1.000000  0.274073
carb -0.550925  0.526988  0.394977  0.749812 -0.090790  0.427606 -0.656249 -0.569607  0.057534  0.274073  1.000000
In [113]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlation_mtcars,square = True, cmap = "Oranges")
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d2154b640>
# Interpretation: cyl and disp are highly correlated variables (r ≈ 0.90); mpg is strongly negatively correlated with wt, cyl, and disp
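To surface such pairs programmatically instead of eyeballing the heatmap, a small sketch (not part of the original notebook) ranks the feature pairs by absolute correlation:

corr_pairs = correlation_mtcars.unstack()              #flatten the matrix into a Series of pairwise correlations
corr_pairs = corr_pairs[corr_pairs < 1.0]              #drop the self-correlations on the diagonal
corr_pairs.abs().sort_values(ascending=False).head(6)  #strongest pairs (each pair appears twice)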

Data Wrangling

Data Wrangling addresses the below problems:

1. Missing Data: use the isna() function to find NaN values, then impute numeric data with the mean() or median() depending on the business decision.
2. Noisy Data: distinguish outliers from anomalies in the data, then detect and treat the outliers.
3. Inconsistent Data: standardise formats and resolve contradictory values.

Done well, wrangling helps develop a more accurate model and prevents data leakage.

Problem Statement: Load the load_diabetes dataset internally from sklearn and check for any missing value or outlier data in the 'data' column. If any irregularities are found, treat them accordingly.

Objective: Perform missing value and outlier data treatment.
In [147]:
#load the load_diabetes dataset from SKLEARN
from sklearn.datasets import load_diabetes
In [148]:
diabetes = load_diabetes()       #load the dataset into a variable; Type = sklearn.utils.Bunch (renamed so we don't shadow the load_diabetes function)
print(diabetes.DESCR)            #describe the dataset
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
In [139]:
#convert the dataset to a DataFrame
df_diabetes = pd.DataFrame(diabetes.data)
In [142]:
df_diabetes.head(5)       #display first 5 rows of the DataFrame
Out[142]:
          0         1         2         3         4         5         6         7         8         9
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641
In [172]:
#give names to the columns (the Bunch also exposes the real names via diabetes.feature_names)
df_diabetes.columns = ['Column1','Column2','Column3','Column4','Column5','Column6','Column7','Column8','Column9','Column10']
df_diabetes.head(2)
Out[172]:
    Column1   Column2   Column3   Column4   Column5   Column6   Column7   Column8   Column9  Column10
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
In [152]:
df_diabetes.shape        #display the shape of the data
Out[152]:
(442, 10)
In [151]:
#check if there are any null values in the data
df_diabetes.isna().any()
Out[151]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
Interpretation: since there are no null values, we do not need to impute missing values.
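Had any nulls been present, the imputation step would look like the following sketch (toy data, not from this dataset):

import numpy as np
toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0]})   #toy column with one missing value
toy['x'].isna().sum()                                #-> 1 missing value found
toy['x'] = toy['x'].fillna(toy['x'].median())        #impute the NaN with the column median (2.0)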
In [167]:
#Detect Outliers in each column of the DataFrame using a BOXPLOT
sns.boxplot(x=df_diabetes.iloc[:,2])              #shows outliers in column 3
#shows 3 outliers
Out[167]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24f24cd0>
In [165]:
sns.boxplot(x=df_diabetes.iloc[:,0])              #shows outliers in column 1
#shows no outliers
Out[165]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24e7d0a0>
In [174]:
sns.boxplot(df_diabetes['Column5'])                #show outliers in Column5 by referring to the column name
Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24fd60d0>
In [179]:
#let us treat the outliers in Column5 by filtering them out
mask = df_diabetes['Column5'] <= 0.13      #keep only values <= 0.13; the outliers above this threshold are dropped
df1_out_rem = df_diabetes[mask]            #create a DataFrame with the filtered data
sns.boxplot(x=df1_out_rem['Column5'])      #boxplot for the filtered data (outliers removed)
Out[179]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d250ce9d0>
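The 0.13 cutoff above is read off the boxplot by eye. A more general sketch derives the fences from the interquartile range, the same 1.5×IQR rule the boxplot whiskers use:

q1 = df_diabetes['Column5'].quantile(0.25)         #first quartile
q3 = df_diabetes['Column5'].quantile(0.75)         #third quartile
iqr = q3 - q1                                      #interquartile range
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr          #standard whisker fences
df_no_outliers = df_diabetes[df_diabetes['Column5'].between(lower, upper)]
sns.boxplot(x=df_no_outliers['Column5'])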
Problem Statement: As a macroeconomic analyst at the Organisation for Economic Co-operation and Development (OECD), your job is to collect relevant data for analysis. You have three countries in the north_america DataFrame and one country in the south_america DataFrame. As these sit in two separate DataFrames, it is hard to compare the average labor hours between North America and South America. If all the countries were in the same DataFrame, this comparison would be much easier.

Objective: Demonstrate concatenation.
In [181]:
df_north = pd.read_csv('north_america_2000_2010.csv')    #read the 1st file
In [197]:
df_north.shape
Out[197]:
(3, 12)
In [198]:
df_north.head(5)
Out[198]:
  Country    2000    2001    2002    2003    2004  2005    2006    2007  2008    2009    2010
0  Canada  1779.0  1771.0  1754.0  1740.0  1760.0  1747  1745.0  1741.0  1735  1701.0  1703.0
1  Mexico  2311.2  2285.2  2271.2  2276.5  2270.6  2281  2280.6  2261.4  2258  2250.2  2242.4
2     USA  1836.0  1814.0  1810.0  1800.0  1802.0  1799  1800.0  1798.0  1792  1767.0  1778.0
In [182]:
df_south = pd.read_csv('south_america_2000_2010.csv')    #read the 2nd file
In [194]:
df_south.shape
Out[194]:
(1, 12)
In [195]:
df_south.head(5)
Out[195]:
  Country  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009    2010
0   Chile  2263  2242  2250  2235  2232  2157  2165  2128  2095  2074  2069.6
In [192]:
df_america = pd.concat([df_north,df_south], axis=0)      #concatenate the 2 DataFrames; axis=0 stacks them row-wise (vertically)
In [199]:
df_america.shape
Out[199]:
(4, 12)
In [201]:
df_america.head(5)                                       #display the concatenated output
Out[201]:
  Country    2000    2001    2002    2003    2004  2005    2006    2007  2008    2009    2010
0  Canada  1779.0  1771.0  1754.0  1740.0  1760.0  1747  1745.0  1741.0  1735  1701.0  1703.0
1  Mexico  2311.2  2285.2  2271.2  2276.5  2270.6  2281  2280.6  2261.4  2258  2250.2  2242.4
2     USA  1836.0  1814.0  1810.0  1800.0  1802.0  1799  1800.0  1798.0  1792  1767.0  1778.0
0   Chile  2263.0  2242.0  2250.0  2235.0  2232.0  2157  2165.0  2128.0  2095  2074.0  2069.6
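Notice that the Chile row kept its original index label 0, so df_america now has a duplicate index. Passing ignore_index=True (sketch) renumbers the rows:

df_america = pd.concat([df_north, df_south], axis=0, ignore_index=True)   #rows renumbered 0..3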
In [204]:
print('Thank You! Neeraj Shinde: 18-Oct-2020 11:26 PM IST')
Thank You! Neeraj Shinde: 18-Oct-2020 11:26 PM IST
