In which we plot an excessive number of graphs, fix our problems with null values, re-run our algorithm, and significantly improve our accuracy.

Source: https://somewan.design

We made a fairly strong start last time: we took on a new problem - predicting house prices - did some data exploration, and got a first cut of a model set up. The main task for today is to do some more data exploration, which will hopefully lead on to some new ideas about how to improve our model. One approach that I'm already considering is to just chuck in as many variables as possible and see what happens.

Mobile users - once again, if the section below is not rendering well, then please consider rotating to landscape view to see if that improves the layout. I've also made a few changes to the layout of the final post, which will hopefully help somewhat.
In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
In [2]:
path = "C:\Work\Machine Learning experiments\Kaggle\House Price"
os.chdir(path)
In [3]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
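As a quick sanity check on what we've just loaded:

# train is (1460, 81) including SalePrice; test is (1459, 80) without it
print(train_data.shape, test_data.shape)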
In [4]:
# I'm using this instead of the head command as it renders better for mobile users
train_data.iloc[1:,1:5]
Out[4]:
1459 rows × 4 columns
In [5]:
TrainColumns = train_data.columns
train_data.columns
Out[5]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object')
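Before plotting everything, a quick tally of the column dtypes shows how the variables split between categorical and numerical - the 'object' columns are the categorical ones:

# Count how many columns pandas parsed as each dtype
train_data.dtypes.value_counts()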
In [6]:
# Let's just go ahead and graph all 80 variables against sale price:
# I think this will be instructive in terms of building intuition
fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)
fig.set_figheight(18)
fig.set_figwidth(18)
for i in range(1, 10):
    TempPivot = train_data.groupby(TrainColumns[i])['SalePrice'].mean().sort_values()
    ax = fig.add_subplot(3, 3, i)
    TempPivot.plot.bar(ax=ax)
In [7]:
fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)
fig.set_figheight(18)
fig.set_figwidth(18)
for i in range(10, 19):
    TempPivot = train_data.groupby(TrainColumns[i])['SalePrice'].mean().sort_values()
    ax = fig.add_subplot(3, 3, i - 9)
    TempPivot.plot.bar(ax=ax)
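The cells covering the remaining columns follow the same pattern, so I won't reproduce them all here. For reference, the whole sweep can be collapsed into a single paged loop - a sketch of the same idea rather than the code actually run:

# Plot every column against mean SalePrice, nine panels per figure
n_cols = len(TrainColumns) - 1            # skip 'Id' at position 0
for start in range(1, n_cols, 9):
    fig = plt.figure(figsize=(18, 18))
    fig.subplots_adjust(hspace=0.5, wspace=0.5)
    for j, col in enumerate(TrainColumns[start:start + 9]):
        if col == 'SalePrice':
            continue                      # don't plot the target against itself
        ax = fig.add_subplot(3, 3, j + 1)
        train_data.groupby(col)['SalePrice'].mean().sort_values().plot.bar(ax=ax)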
In [14]:
# Okay, that was all of them. The categorical ones rendered well, the numerical ones less so,
# but we can still see the overall shape of each relationship.
# What is our takeaway from this? Generally, most of the variables look pretty useful!
# As a next step, let's try to just chuck all the variables into the Random Forest algorithm
# and see what happens. Since most of them seem to differentiate the sale price, I'm hoping
# this will lead to an increase in accuracy.
# According to things I've read online, the Random Forest algorithm is fairly (but not
# completely) robust to outliers, multicollinearity, and non-linearity. So I'm not going to
# adjust for these issues now, but we can possibly return to them at a later stage and see
# if it helps.
# [Message from the future] - I tried to use all the variables in the Random Forest, but this
# threw an error due to the presence of null values in some of the columns. We therefore need
# to add a step that removes all columns containing a null value, and fit against what is left.
In [15]:
train_data.columns
features = []
In [16]:
frames = [train_data, test_data]
Total_data = pd.concat(frames)
Total_data.iloc[1:,1:5]
Out[16]:
2918 rows × 4 columns
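One thing to note: the test rows have no SalePrice, so after the concat that column now contains nulls - which is why it will land in the dropped list below. A quick check:

# Each of the 1459 test rows contributes a null SalePrice
Total_data['SalePrice'].isnull().sum()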
In [17]:
# Run through all the columns: if a column contains any null value (in train or test),
# add it to the featuresNotUsed list; otherwise add it to the features list.
# Id and SalePrice are excluded, since neither should be used as a predictor.
features = []
featuresNotUsed = []
for i in TrainColumns:
    if not Total_data[i].isnull().values.any():
        if i not in ('Id', 'SalePrice'):
            features.append(i)
    else:
        featuresNotUsed.append(i)
features
Out[17]:
['MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SaleCondition']
In [18]:
featuresNotUsed
Out[18]:
['MSZoning', 'LotFrontage', 'Alley', 'Utilities', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SalePrice']
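Before moving on, it's worth a quick look at how bad the null problem actually is in the columns we just dropped - a small check using the list we built above:

# Number of nulls in each dropped column, worst offenders first
Total_data[featuresNotUsed].isnull().sum().sort_values(ascending=False)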
In [19]:
Y = train_data['SalePrice']
In [20]:
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
In [21]:
# The below is copied from the Titanic challenge. We're making sure that every
# dummy column appears in both the train and test sets, otherwise we'll get
# errors. It's nice to have built up a few code snippets I can just copy and
# paste without having to reinvent the wheel.
missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column missing from one set, with a default value of 0
for c in missing_cols:
    X[c] = 0
for c in missing_cols2:
    X_test[c] = 0
# Put the train set's columns in the same order as the test set's
X = X[X_test.columns]
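As an aside, pandas can do this whole reconciliation in a single call with DataFrame.align, which pads the columns missing from either side and returns both frames with matching column order - an equivalent sketch rather than the snippet I actually ran:

# Outer-join the two column sets, filling any column missing from
# one side with zeros, and matching the column order across both
X, X_test = X.align(X_test, join='outer', axis=1, fill_value=0)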
In [22]:
X.iloc[1:,1:5]
Out[22]:
1459 rows × 4 columns
In [23]:
X_test.iloc[1:,1:5]
Out[23]:
1458 rows × 4 columns
In [24]:
RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X, Y)
predictions = RFmodel.predict(X_test)
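Before uploading, it's possible to get a rough local estimate of the leaderboard score, since Kaggle scores this competition on the root mean squared error of the log prices. A quick cross-validation sketch on (approximately) the same metric - the exact number will differ from the leaderboard:

from sklearn.model_selection import cross_val_score

# 5-fold CV scored with (negative) mean squared log error;
# the square root of the mean approximates the leaderboard metric
scores = cross_val_score(RFmodel, X, Y, cv=5,
                         scoring='neg_mean_squared_log_error')
print(np.sqrt(-scores.mean()))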
In [56]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predictions})
output.to_csv('my_submission - V2 - RF.csv',index=False)
So all that remains is to upload our submission and see how we performed. And voilà! We've improved our score from approximately 0.2 down to 0.14780, moving us up to 2830th out of 4891 entries. Tune in next time, when we do some more data cleansing, deal with those pesky null values, and marginally improve our model.