
An Actuary learns Machine Learning - Part 7 - Sub-plots / Null Values / Random Forests

4/2/2021

 

In which we plot an excessive number of graphs, fix our problems with null values, re-run our algorithm, and significantly improve our accuracy.





Header image source: https://somewan.design


We made a fairly strong start last time: we began a new problem - predicting house prices - did some data exploration, and got a first cut of a model set up. The main task for today is some more data exploration, which will hopefully lead on to some new ideas about how to improve our model. One approach I'm already considering is to just chuck in as many variables as possible and see what happens.

Mobile users - once again, if the section below is not rendering well, then please consider rotating to landscape view to see if that improves the layout. I've also made a few changes to the final layout which will hopefully help somewhat.
In [1]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
In [2]:
path = r"C:\Work\Machine Learning experiments\Kaggle\House Price"  # raw string so the backslashes aren't treated as escape characters
os.chdir(path)
In [3]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [4]:
# I'm using this instead of the head command as it renders better for mobile users

train_data.iloc[1:,1:5]
Out[4]:
MSSubClass MSZoning LotFrontage LotArea
1 20 RL 80.0 9600
2 60 RL 68.0 11250
3 70 RL 60.0 9550
4 60 RL 84.0 14260
5 50 RL 85.0 14115
... ... ... ... ...
1455 60 RL 62.0 7917
1456 20 RL 85.0 13175
1457 70 RL 66.0 9042
1458 20 RL 68.0 9717
1459 20 RL 75.0 9937

1459 rows × 4 columns

In [5]:
TrainColumns = train_data.columns
train_data.columns
Out[5]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
In [6]:
# Let's just go ahead and graph all 80 variables against sale price:
# I think this will be instructive in terms of building intuition

fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)
fig.set_figheight(18)
fig.set_figwidth(18)
for i in range(1,10):
    TempPivot = train_data.groupby(TrainColumns[i])['SalePrice'].mean().sort_values()
    ax = fig.add_subplot(3,3,i)
    TempPivot.plot.bar(ax=ax)  # plot onto the subplot we just created
In [7]:
fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)
fig.set_figheight(18)
fig.set_figwidth(18)
for i in range(10,19):  # start from column 10 - the original range(11,20) skipped it
    TempPivot = train_data.groupby(TrainColumns[i])['SalePrice'].mean().sort_values()
    ax = fig.add_subplot(3,3,i-9)
    TempPivot.plot.bar(ax=ax)
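The cell numbering jumps from In [7] to In [14], so the cells plotting the remaining columns have been cut for space. For reference, a single loop covering every column in pages of nine subplots might look like the sketch below (not from the original notebook):

import math

cols = TrainColumns[1:-1]  # skip 'Id' and 'SalePrice'
pages = math.ceil(len(cols) / 9)
for page in range(pages):
    fig = plt.figure(figsize=(18, 18))
    fig.subplots_adjust(hspace=0.5, wspace=0.5)
    # Up to nine columns per figure, each plotted as mean SalePrice by level
    for j, col in enumerate(cols[page * 9:(page + 1) * 9]):
        ax = fig.add_subplot(3, 3, j + 1)
        train_data.groupby(col)['SalePrice'].mean().sort_values().plot.bar(ax=ax)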
In [14]:
# Okay, that was all of them. The categorical variables rendered well, the numerical ones
# less so, but we can still see the overall shape of each relationship.

# What is our takeaway from this? Generally, most of the variables look pretty useful!

# As a next step, let's try to just chuck all the variables into the Random Forest algorithm
# and see what happens. Since most of them seem to differentiate the sale price, I'm hoping
# this will lead to an increase in accuracy.

# According to things I've read online, the Random Forest algorithm is fairly (but not
# completely) robust to outliers, multicollinearity, and non-linearity. So I'm not going to
# adjust for these issues now, but we can possibly return to them at a later stage and see
# if that helps.

# [Message from the future] - I tried to use all the variables in the Random Forest, but this
# threw up an error due to the presence of null values in some of the columns. We therefore
# need to add a step where we remove all columns which contain a null value, and just fit
# against what is left.
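As a quick diagnostic (a sketch, not from the original post), we can count the nulls per column in each dataset before deciding what to drop:

# Count nulls per column in train and test, and show only the affected columns
null_counts = pd.concat(
    [train_data.isnull().sum(), test_data.isnull().sum()],
    axis=1, keys=['train', 'test']
)
print(null_counts[(null_counts > 0).any(axis=1)])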
In [16]:
frames = [train_data, test_data]

Total_data = pd.concat(frames)
Total_data.iloc[1:,1:5]
Out[16]:
MSSubClass MSZoning LotFrontage LotArea
1 20 RL 80.0 9600
2 60 RL 68.0 11250
3 70 RL 60.0 9550
4 60 RL 84.0 14260
5 50 RL 85.0 14115
... ... ... ... ...
1454 160 RM 21.0 1936
1455 160 RM 21.0 1894
1456 20 RL 160.0 20000
1457 85 RL 62.0 10441
1458 60 RL 74.0 9627

2918 rows × 4 columns

In [17]:
# We now run through all the columns: if a column contains any null value we add it to the
# featuresNotUsed list, otherwise we put it in the features list (excluding Id and SalePrice)

features = []
featuresNotUsed = []

for i in TrainColumns:
    if Total_data[i].isnull().values.any():
        featuresNotUsed.append(i)
    elif i not in ('Id', 'SalePrice'):
        features.append(i)

features
Out[17]:
['MSSubClass',
 'LotArea',
 'Street',
 'LotShape',
 'LandContour',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'RoofStyle',
 'RoofMatl',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'Heating',
 'HeatingQC',
 'CentralAir',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'PavedDrive',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold',
 'SaleCondition']
In [18]:
featuresNotUsed
Out[18]:
['MSZoning',
 'LotFrontage',
 'Alley',
 'Utilities',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Electrical',
 'BsmtFullBath',
 'BsmtHalfBath',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SalePrice']
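As an aside, the same split can be written more compactly with list comprehensions - a sketch that should be equivalent to the loop above:

# Columns with no nulls anywhere (excluding Id and SalePrice) vs. columns containing nulls
features = [c for c in TrainColumns
            if not Total_data[c].isnull().any() and c not in ('Id', 'SalePrice')]
featuresNotUsed = [c for c in TrainColumns if Total_data[c].isnull().any()]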
In [19]:
Y = train_data['SalePrice']
In [20]:
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
In [21]:
# The below is just copied and pasted from the Titanic challenge. We're making
# sure that all our variables are in both the test and train sets, otherwise
# we'll get errors. It's nice to have built up a few code snippets I can just
# copy and paste without having to reinvent the wheel

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)

# Add any columns missing from one set with a default value of 0
for c in missing_cols:
    X[c] = 0

for c in missing_cols2:
    X_test[c] = 0

# Ensure the columns in the train set are in the same order as in the test set
X = X[X_test.columns]
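Incidentally, pandas can do this union-and-reorder in a single call with DataFrame.align - a sketch that should produce the same result as the snippet above:

# Align the two frames on the union of their columns, filling any gaps with 0
X, X_test = X.align(X_test, join='outer', axis=1, fill_value=0)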
In [22]:
X.iloc[1:,1:5]
Out[22]:
LotArea OverallQual OverallCond YearBuilt
1 9600 6 8 1976
2 11250 7 5 2001
3 9550 7 5 1915
4 14260 8 5 2000
5 14115 5 5 1993
... ... ... ... ...
1455 7917 6 5 1999
1456 13175 6 6 1978
1457 9042 7 9 1941
1458 9717 5 6 1950
1459 9937 5 6 1965

1459 rows × 4 columns

In [23]:
X_test.iloc[1:,1:5]
Out[23]:
LotArea OverallQual OverallCond YearBuilt
1 14267 6 6 1958
2 13830 5 5 1997
3 9978 6 6 1998
4 5005 8 5 1992
5 10000 6 5 1993
... ... ... ... ...
1454 1936 4 7 1970
1455 1894 4 5 1970
1456 20000 5 7 1960
1457 10441 5 5 1992
1458 9627 7 5 1993

1458 rows × 4 columns

In [24]:
# Fit a Random Forest with default hyperparameters, then predict on the test set
RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X, Y)
predictions = RFmodel.predict(X_test)
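Before uploading, we could also estimate the leaderboard score locally. Kaggle scores this competition on the root mean squared error of logged sale prices, which sklearn's squared-log-error scorer approximates closely; the sketch below (not in the original notebook) uses 5-fold cross-validation:

from sklearn.model_selection import cross_val_score

# cross_val_score returns negative MSLE; negate and square-root to get RMSLE
neg_msle = cross_val_score(RandomForestRegressor(random_state=1), X, Y,
                           scoring='neg_mean_squared_log_error', cv=5)
print(np.sqrt(-neg_msle).mean())  # should be in the same ballpark as the Kaggle score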
In [56]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predictions})  # Kaggle expects 'Id', not 'ID'
output.to_csv('my_submission - V2 - RF.csv', index=False)

So all that remains is to upload our submission and see how we performed.

And voila! We've improved our score from approximately 0.2 down to 0.14780, moving us up to 2,830th out of 4,891 entries.

Tune in next time, when we do some more data cleansing, deal with those pesky null values, and marginally improve our model.

