Data Pre-processing for Machine Learning

 

Traditional Computer Computation:

(Data & Program) -> Computer -> Output

Machine Learning:

(Data & Output) -> Computer -> Program/Model

The generated model is then used to predict outcomes for new, unseen data.

Machine Learning (ML) is a subset of Artificial Intelligence (AI).

In this exercise, let us learn the below data pre-processing techniques:

1. Data Acquisition
2. Data Exploration
3. Data Wrangling
4. Data Manipulation
In [3]:
import os
os.getcwd() #gets current working directory
Out[3]:
'C:\\Users\\ns45237\\working_neeraj\\2_machine_learning'
In [38]:
#Data Acquisition
#Loading the CSV file in Python
import pandas as pd                       #pandas is Python's data-wrangling package
df = pd.read_csv('BostonHousing.csv')     #read the CSV file into a pandas DataFrame
#df.to_csv("/home/neerajshinde/Data/BostonHousing.csv")      #write the DataFrame out to a CSV file
#df = pd.read_excel('BostonHousing.xlsx') #read an XLSX file into a pandas DataFrame
#df.to_excel("/home/neerajshinde/Data/BostonHousing.xlsx")   #write the DataFrame out to an XLSX file
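read_csv accepts many optional parameters worth knowing. A minimal sketch (df_subset is an illustrative name, and the na_values strings and usecols selection here are not required by this dataset):

df_subset = pd.read_csv(
    'BostonHousing.csv',
    sep=',',                           #field delimiter (',' is already the default)
    na_values=['NA', '?', 'missing'],  #extra strings to treat as NaN on load
    usecols=['crim', 'rm', 'medv'],    #read only a subset of the columns
)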
In [21]:
df.head(5) #display the first 5 observations from the DataFrame
Out[21]:
      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio       b  lstat  medv
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
In [13]:
df.shape  #display the no. of rows and columns (observations and variables)
Out[13]:
(506, 14)
In [20]:
df.tail(5) #display the last 5 observations from the DataFrame
Out[20]:
        crim   zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio       b  lstat  medv
501  0.06263  0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273     21.0  391.99   9.67  22.4
502  0.04527  0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273     21.0  396.90   9.08  20.6
503  0.06076  0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
504  0.10959  0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273     21.0  393.45   6.48  22.0
505  0.04741  0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273     21.0  396.90   7.88  11.9
In [22]:
df.columns     #display the column names from the DataFrame
Out[22]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')
In [25]:
#select specific rows from the DataFrame
df.iloc[0]     #displays 1st row - note index always starts from 0
Out[25]:
crim         0.00632
zn          18.00000
indus        2.31000
chas         0.00000
nox          0.53800
rm           6.57500
age         65.20000
dis          4.09000
rad          1.00000
tax        296.00000
ptratio     15.30000
b          396.90000
lstat        4.98000
medv        24.00000
Name: 0, dtype: float64
In [30]:
df.iloc[-1]     #displays last row in the DataFrame
Out[30]:
crim         0.04741
zn           0.00000
indus       11.93000
chas         0.00000
nox          0.57300
rm           6.03000
age         80.80000
dis          2.50500
rad          1.00000
tax        273.00000
ptratio     21.00000
b          396.90000
lstat        7.88000
medv        11.90000
Name: 505, dtype: float64
In [31]:
df.iloc[:,-1]     #display the last column in the DataFrame
Out[31]:
0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: medv, Length: 506, dtype: float64
In [32]:
df.iloc[:,0:2]     #display the 1st two columns
Out[32]:
        crim    zn
0    0.00632  18.0
1    0.02731   0.0
2    0.02729   0.0
3    0.03237   0.0
4    0.06905   0.0
..       ...   ...
501  0.06263   0.0
502  0.04527   0.0
503  0.06076   0.0
504  0.10959   0.0
505  0.04741   0.0

506 rows × 2 columns
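iloc selects by integer position; its label-based counterpart is loc. A short sketch (the column names are from the Boston dataset above):

df.loc[0:4, ['crim', 'rm', 'medv']]   #rows 0-4 by label (inclusive, unlike iloc) plus three named columns
df.loc[:, 'crim':'nox']               #all rows, the contiguous column slice from crim to nox
df[df['medv'] > 30]                   #boolean filtering: only rows where medv exceeds 30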

Let us try to read another dataset SalaryGender.csv

In [41]:
df_sg = pd.read_csv("SalaryGender.csv")     #read the CSV file into a DataFrame
df_sg                                       #display the DataFrame
Out[41]:
    Salary  Gender  Age  PhD
0    140.0       1   47    1
1     30.0       0   65    1
2     35.1       0   56    0
3     30.0       1   23    0
4     80.0       0   53    1
..     ...     ...  ...  ...
95    18.6       1   26    0
96   152.0       1   56    1
97     1.8       1   28    0
98    35.0       0   44    0
99     4.0       0   24    0

100 rows × 4 columns

In [43]:
df_sg.shape                                  #display the shape of the DataFrame rows X columns
Out[43]:
(100, 4)
In [48]:
#check the datatype of the column Salary (Specific column)
df_sg['Salary'].dtype
Out[48]:
dtype('float64')
In [47]:
#check the datatype of all columns in one shot
df_sg.dtypes
Out[47]:
Salary    float64
Gender      int64
Age         int64
PhD         int64
dtype: object
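As a side note, info() packs the shape, column dtypes, non-null counts, and memory usage into a single call:

df_sg.info()      #one-call overview of the DataFrame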
In [53]:
#display the unique values in a column
df_sg['Age'].unique()        #note: in a notebook, only the last expression of a cell is displayed
df_sg['Gender'].unique()
Out[53]:
array([1, 0], dtype=int64)
In [55]:
#display all values in a particular column
df_sg['Gender'].values
Out[55]:
array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0], dtype=int64)
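Beyond unique() and values, value_counts() tallies how often each value occurs, which is a quick way to check class balance:

df_sg['Gender'].value_counts()                 #counts per value (50 zeros and 50 ones here)
df_sg['Gender'].value_counts(normalize=True)   #the same counts expressed as proportions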
In [59]:
#display statistical values for numeric (continuous) columns
df_sg['Salary'].mean()
Out[59]:
52.52450000000001
In [60]:
df_sg['Salary'].median()
Out[60]:
39.3
In [63]:
df_sg['Salary'].mode()
Out[63]:
0    30.0
dtype: float64
In [66]:
df_sg.mean()      #display the mean of every column. The results make sense for numeric continuous data; for the 0/1 columns Gender and PhD the "mean" is just the proportion of 1s
Out[66]:
Salary    52.5245
Gender     0.5000
Age       46.8800
PhD        0.3900
dtype: float64
In [70]:
df_sg['Salary'].mean(axis=0)     #axis=0 (the default, and the only valid axis for a Series) averages down the rows; on a DataFrame, axis=1 would average across the columns
Out[70]:
52.52450000000001
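Rather than calling mean(), median(), and mode() separately, describe() returns the count, mean, standard deviation, minimum, quartiles, and maximum in one shot:

df_sg['Salary'].describe()      #count, mean, std, min, 25%, 50% (median), 75%, max
df_sg.describe()                #the same summary for every numeric column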

Using Seaborn (a statistical plotting library built on top of Matplotlib) to plot a Heatmap of the Correlations in the Data

In [71]:
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df.corr()
In [87]:
correlations     #display the correlation matrix
#the correlation between 2 variables (columns) tells us how strongly they are linearly related
#a value moving towards +1 indicates an increasingly strong positive correlation
#a value moving towards -1 indicates an increasingly strong negative correlation; values near 0 indicate little or no linear relationship
Out[87]:
             crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat      medv
crim     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
zn      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
indus    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
chas    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
nox      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
rm      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
age      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
dis     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
rad      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
tax      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
ptratio  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
b       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
lstat    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
medv    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
In [88]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlations,square = True, cmap = "bwr")
#with the "bwr" colormap, blue (towards -1) signifies strong negative correlation
#and red (towards +1) signifies strong positive correlation
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d227f6ca0>
In [76]:
plt.yticks(rotation=0)     #keep the y-axis tick labels horizontal
plt.xticks(rotation=90)    #rotate the x-axis tick labels vertically
Out[76]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]),
 <a list of 6 Text major ticklabel objects>)
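plt.xticks and plt.yticks act on the currently active figure, so to take effect reliably they should run in the same cell as the heatmap. A combined sketch (vmin/vmax pin the colour scale to the full -1..+1 range):

ax = sns.heatmap(data=correlations, square=True, cmap="bwr", vmin=-1, vmax=1)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()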
Problem Statement: Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, has approached you with the task of understanding why these schools are underperforming. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers. The data includes various demographic, school faculty, and income variables.

Objective: Perform exploratory data analysis, which includes determining the type of the data and correlation analysis over the same. You need to convert the data into useful information:

1. Read the data into a pandas DataFrame
2. Describe the data to find more details
3. Find the correlation between 'reduced_lunch' and 'school_rating'
In [77]:
df_school = pd.read_csv('middle_tn_schools.csv')     #read the CSV file into a DataFrame
In [78]:
df_school.shape
Out[78]:
(347, 15)
In [79]:
df_school.head(5)
Out[79]:
                          name  school_rating   size  reduced_lunch  state_percentile_16  state_percentile_15  stu_teach_ratio    school_type  avg_score_15  avg_score_16  full_time_teachers  percent_black  percent_white  percent_asian  percent_hispanic
0  Allendale Elementary School            5.0  851.0           10.0                 90.2                 95.8             15.7         Public          89.4          85.2                54.0            2.9           85.5            1.6               5.6
1          Anderson Elementary            2.0  412.0           71.0                 32.8                 37.3             12.8         Public          43.0          38.3                32.0            3.9           86.7            1.0               4.9
2             Avoca Elementary            4.0  482.0           43.0                 78.4                 83.6             16.6         Public          75.7          73.0                29.0            1.0           91.5            1.2               4.4
3                Bailey Middle            0.0  394.0           91.0                  1.6                  1.0             13.1  Public Magnet           2.1           4.4                30.0           80.7           11.7            2.3               4.3
4          Barfield Elementary            4.0  948.0           26.0                 85.3                 89.2             14.8         Public          81.3          79.6                64.0           11.8           71.2            7.1               6.0
In [84]:
#Find the Correlation between 'reduced_lunch' and 'school_rating' columns
correlation_school = df_school[['reduced_lunch','school_rating']].corr()
In [85]:
correlation_school
Out[85]:
               reduced_lunch  school_rating
reduced_lunch       1.000000      -0.815757
school_rating      -0.815757       1.000000
In [86]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlation_school,square = True, cmap = "bwr")
Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d2275b340>
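A correlation of about -0.82 is a strong inverse relationship: the higher the share of students on reduced lunch, the lower the school rating tends to be. A scatter plot sketch (regplot also overlays a linear fit) shows this directly:

sns.regplot(x='reduced_lunch', y='school_rating', data=df_school)
plt.show()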

Data Exploration

Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. Based on feedback from their customers, they are coming up with a new model, so they have to explore the current dataset to derive further insights out of it.

Objective: Import the dataset and explore it for dimensionality, data types, and the average horsepower across all the cars. Also, identify a few of the most correlated features, which would help in the modification.
In [89]:
df_mtcars = pd.read_csv('mtcars.csv')      #read the CSV into a DataFrame
In [90]:
df_mtcars.head(5)                          #display the 1st 5 rows 
Out[90]:
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
In [92]:
df_mtcars.shape                            #display the shape of the DataSet
Out[92]:
(32, 12)
In [93]:
df_mtcars.dtypes                           #display the DataTypes of all the columns
Out[93]:
model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object
In [124]:
df_mtcars.groupby(['model'])['hp'].mean()                     #average horsepower per model; each model appears once, so this simply lists each car's hp
Out[124]:
model
AMC Javelin            150
Cadillac Fleetwood     205
Camaro Z28             245
Chrysler Imperial      230
Datsun 710              93
Dodge Challenger       150
Duster 360             245
Ferrari Dino           175
Fiat 128                66
Fiat X1-9               66
Ford Pantera L         264
Honda Civic             52
Hornet 4 Drive         110
Hornet Sportabout      175
Lincoln Continental    215
Lotus Europa           113
Maserati Bora          335
Mazda RX4              110
Mazda RX4 Wag          110
Merc 230                95
Merc 240D               62
Merc 280               123
Merc 280C              123
Merc 450SE             180
Merc 450SL             180
Merc 450SLC            180
Pontiac Firebird       175
Porsche 914-2           91
Toyota Corolla          65
Toyota Corona           97
Valiant                105
Volvo 142E             109
Name: hp, dtype: int64
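Because each model occurs exactly once in mtcars, the groupby above simply lists each car's horsepower. The single figure the objective asks for, the average across all the cars, is just the column mean:

df_mtcars['hp'].mean()      #average horsepower across all 32 cars -> 146.6875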
In [108]:
correlation_mtcars = df_mtcars.corr()     #derive the correlation amongst all the variables
correlation_mtcars                        #display the correlation table
Out[108]:
           mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
mpg   1.000000 -0.852162 -0.847551 -0.776168  0.681172 -0.867659  0.418684  0.664039  0.599832  0.480285 -0.550925
cyl  -0.852162  1.000000  0.902033  0.832447 -0.699938  0.782496 -0.591242 -0.810812 -0.522607 -0.492687  0.526988
disp -0.847551  0.902033  1.000000  0.790949 -0.710214  0.887980 -0.433698 -0.710416 -0.591227 -0.555569  0.394977
hp   -0.776168  0.832447  0.790949  1.000000 -0.448759  0.658748 -0.708223 -0.723097 -0.243204 -0.125704  0.749812
drat  0.681172 -0.699938 -0.710214 -0.448759  1.000000 -0.712441  0.091205  0.440278  0.712711  0.699610 -0.090790
wt   -0.867659  0.782496  0.887980  0.658748 -0.712441  1.000000 -0.174716 -0.554916 -0.692495 -0.583287  0.427606
qsec  0.418684 -0.591242 -0.433698 -0.708223  0.091205 -0.174716  1.000000  0.744535 -0.229861 -0.212682 -0.656249
vs    0.664039 -0.810812 -0.710416 -0.723097  0.440278 -0.554916  0.744535  1.000000  0.168345  0.206023 -0.569607
am    0.599832 -0.522607 -0.591227 -0.243204  0.712711 -0.692495 -0.229861  0.168345  1.000000  0.794059  0.057534
gear  0.480285 -0.492687 -0.555569 -0.125704  0.699610 -0.583287 -0.212682  0.206023  0.794059  1.000000  0.274073
carb -0.550925  0.526988  0.394977  0.749812 -0.090790  0.427606 -0.656249 -0.569607  0.057534  0.274073  1.000000
In [113]:
#plot the heatmap of the above correlation
sns.heatmap(data = correlation_mtcars,square = True, cmap = "Oranges")
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d2154b640>
# Interpretation: cyl and disp are highly correlated variables (r ≈ 0.90); mpg is strongly negatively correlated with wt, cyl, and disp
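To surface such pairs programmatically instead of eyeballing the heatmap, a small sketch (not part of the original notebook) ranks the feature pairs by absolute correlation:

corr_pairs = correlation_mtcars.unstack()              #flatten the matrix into a Series of pairwise correlations
corr_pairs = corr_pairs[corr_pairs < 1.0]              #drop the self-correlations on the diagonal
corr_pairs.abs().sort_values(ascending=False).head(6)  #strongest pairs (each pair appears twice)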

Data Wrangling

Data Wrangling addresses the below problems:

1. Missing Data: use the isna() function to find NaN values, then impute numeric data with the mean() or median() depending on the business decision.
2. Noisy Data: distinguish outliers from anomalies in the data, then detect and treat the outliers.
3. Inconsistent Data: standardise formats and resolve contradictory values.

Done well, wrangling helps develop a more accurate model and prevents data leakage.

Problem Statement: Load the load_diabetes dataset internally from sklearn and check for any missing value or outlier data in the 'data' column. If any irregularities are found, treat them accordingly.

Objective: Perform missing value and outlier data treatment.
In [147]:
#load the load_diabetes dataset from SKLEARN
from sklearn.datasets import load_diabetes
In [148]:
diabetes = load_diabetes()       #load the dataset into a variable; Type = sklearn.utils.Bunch (renamed so we don't shadow the load_diabetes function)
print(diabetes.DESCR)            #describe the dataset
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
In [139]:
#convert the dataset to a DataFrame
df_diabetes = pd.DataFrame(diabetes.data)
In [142]:
df_diabetes.head(5)       #display first 5 rows of the DataFrame
Out[142]:
          0         1         2         3         4         5         6         7         8         9
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641
In [172]:
#give names to the columns (the Bunch also exposes the real names via diabetes.feature_names)
df_diabetes.columns = ['Column1','Column2','Column3','Column4','Column5','Column6','Column7','Column8','Column9','Column10']
df_diabetes.head(2)
Out[172]:
    Column1   Column2   Column3   Column4   Column5   Column6   Column7   Column8   Column9  Column10
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
In [152]:
df_diabetes.shape        #display the shape of the data
Out[152]:
(442, 10)
In [151]:
#check if there are any null values in the data
df_diabetes.isna().any()
Out[151]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
Interpretation: since there are no null values, we do not need to impute missing values.
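Had any nulls been present, the imputation step would look like the following sketch (toy data, not from this dataset):

import numpy as np
toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0]})   #toy column with one missing value
toy['x'].isna().sum()                                #-> 1 missing value found
toy['x'] = toy['x'].fillna(toy['x'].median())        #impute the NaN with the column median (2.0)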
In [167]:
#Detect Outliers in each column of the DataFrame using a BOXPLOT
sns.boxplot(x=df_diabetes.iloc[:,2])              #shows outliers in column 3
#shows 3 outliers
Out[167]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24f24cd0>
In [165]:
sns.boxplot(x=df_diabetes.iloc[:,0])              #shows outliers in column 1
#shows no outliers
Out[165]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24e7d0a0>
In [174]:
sns.boxplot(df_diabetes['Column5'])                #show outliers in Column5 by referring to the column name
Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d24fd60d0>
In [179]:
#let us treat the outliers in Column5 by filtering them out
mask = df_diabetes['Column5'] <= 0.13      #keep only values <= 0.13; the outliers above this threshold are dropped
df1_out_rem = df_diabetes[mask]            #create a DataFrame with the filtered data
sns.boxplot(x=df1_out_rem['Column5'])      #boxplot for the filtered data (outliers removed)
Out[179]:
<matplotlib.axes._subplots.AxesSubplot at 0x19d250ce9d0>
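The 0.13 cutoff above is read off the boxplot by eye. A more general sketch derives the fences from the interquartile range, the same 1.5×IQR rule the boxplot whiskers use:

q1 = df_diabetes['Column5'].quantile(0.25)         #first quartile
q3 = df_diabetes['Column5'].quantile(0.75)         #third quartile
iqr = q3 - q1                                      #interquartile range
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr          #standard whisker fences
df_no_outliers = df_diabetes[df_diabetes['Column5'].between(lower, upper)]
sns.boxplot(x=df_no_outliers['Column5'])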
Problem Statement: As a macroeconomic analyst at the Organisation for Economic Co-operation and Development (OECD), your job is to collect relevant data for analysis. You have three countries in the north_america DataFrame and one country in the south_america DataFrame. As these sit in two separate DataFrames, it is hard to compare the average labor hours between North America and South America. If all the countries were in the same DataFrame, this comparison would be much easier.

Objective: Demonstrate concatenation.
In [181]:
df_north = pd.read_csv('north_america_2000_2010.csv')    #read the 1st file
In [197]:
df_north.shape
Out[197]:
(3, 12)
In [198]:
df_north.head(5)
Out[198]:
  Country    2000    2001    2002    2003    2004  2005    2006    2007  2008    2009    2010
0  Canada  1779.0  1771.0  1754.0  1740.0  1760.0  1747  1745.0  1741.0  1735  1701.0  1703.0
1  Mexico  2311.2  2285.2  2271.2  2276.5  2270.6  2281  2280.6  2261.4  2258  2250.2  2242.4
2     USA  1836.0  1814.0  1810.0  1800.0  1802.0  1799  1800.0  1798.0  1792  1767.0  1778.0
In [182]:
df_south = pd.read_csv('south_america_2000_2010.csv')    #read the 2nd file
In [194]:
df_south.shape
Out[194]:
(1, 12)
In [195]:
df_south.head(5)
Out[195]:
  Country  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009    2010
0   Chile  2263  2242  2250  2235  2232  2157  2165  2128  2095  2074  2069.6
In [192]:
df_america = pd.concat([df_north,df_south], axis=0)      #concatenate the 2 DataFrames; axis=0 stacks them row-wise (vertically)
In [199]:
df_america.shape
Out[199]:
(4, 12)
In [201]:
df_america.head(5)                                       #display the concatenated output
Out[201]:
  Country    2000    2001    2002    2003    2004  2005    2006    2007  2008    2009    2010
0  Canada  1779.0  1771.0  1754.0  1740.0  1760.0  1747  1745.0  1741.0  1735  1701.0  1703.0
1  Mexico  2311.2  2285.2  2271.2  2276.5  2270.6  2281  2280.6  2261.4  2258  2250.2  2242.4
2     USA  1836.0  1814.0  1810.0  1800.0  1802.0  1799  1800.0  1798.0  1792  1767.0  1778.0
0   Chile  2263.0  2242.0  2250.0  2235.0  2232.0  2157  2165.0  2128.0  2095  2074.0  2069.6
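Notice that the Chile row kept its original index label 0, so df_america now has a duplicate index. Passing ignore_index=True (sketch) renumbers the rows:

df_america = pd.concat([df_north, df_south], axis=0, ignore_index=True)   #rows renumbered 0..3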
In [204]:
print('Thank You! Neeraj Shinde: 18-Oct-2020 11:26 PM IST')
Thank You! Neeraj Shinde: 18-Oct-2020 11:26 PM IST
