
An Actuary learns Machine Learning - Part 8 - Data Cleaning / more Null Values / more Random Forests

6/2/2021

 

In which we deal with those pesky null values, add additional variables to our Random Forest model, but only actually improve our score by a marginal amount.

[Header image source: https://somewan.design]

The task today is fairly simple. Last time we excluded a number of variables from our model because some entries had null values, and the Random Forest algorithm cannot handle columns containing null values. This time, we're going to replace those null values in a vaguely sensible way, so as to bring these additional variables back in, and hopefully improve our model in the process.

So without further ado, here is the Notebook for today (mobile users are recommended to switch to landscape mode).
In [26]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
In [6]:
path = r"C:\Work\Machine Learning experiments\Kaggle\House Price"  # raw string, so the backslashes aren't treated as escape characters
os.chdir(path)
In [27]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [28]:
train_data.iloc[1:,1:7]
Out[28]:
MSSubClass MSZoning LotFrontage LotArea Street Alley
1 20 RL 80.0 9600 Pave NaN
2 60 RL 68.0 11250 Pave NaN
3 70 RL 60.0 9550 Pave NaN
4 60 RL 84.0 14260 Pave NaN
5 50 RL 85.0 14115 Pave NaN
... ... ... ... ... ... ...
1455 60 RL 62.0 7917 Pave NaN
1456 20 RL 85.0 13175 Pave NaN
1457 70 RL 66.0 9042 Pave NaN
1458 20 RL 68.0 9717 Pave NaN
1459 20 RL 75.0 9937 Pave NaN

1459 rows × 6 columns

In [29]:
# We see above that the 'Alley' column has a large number of NaN values, which the
# Random Forest algorithm fell down on when we tried to use it last time.
# We need to replace these.

TrainColumns = train_data.columns
train_data.columns
Out[29]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
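As a quick sanity check (my addition, not part of the original notebook), we can list exactly which columns contain nulls before filling them:

# Extra check: count null values per column, showing only the columns with nulls
null_counts = train_data.isnull().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))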
In [31]:
# The issue we had last time with using all the variables is that the Random Forest
# did not like the null values in some of the columns, which led us to drop those columns.
# To bring them back in, we are going to have to replace the null values in the training
# and test set.

# I quickly looked through where the null values were, and generally they could just be
# replaced with a 0. For values such as LotFrontage, which were null when the house had
# no street frontage, it makes sense to use 0. For columns that used null when a value
# was unknown, I think it's still okay to use 0 - we are going to encode all our
# categorical variables anyway, so as long as we treat everything consistently, 0 will
# just replace 'unknown'.

# In the next part, I simply fill the nulls across all of the feature columns. Checking
# which columns actually contained nulls first was not strictly necessary, but I think
# it's useful as a sense check of what we are doing.

features = TrainColumns[1:79]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
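As an aside (my rephrasing, with the same effect as the loop above), pandas can fill all of the feature columns in one vectorized call:

# Equivalent to the loop above: fill nulls in all feature columns at once
train_data[features] = train_data[features].fillna(0)
test_data[features] = test_data[features].fillna(0)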
In [33]:
train_data.iloc[1:,1:7]
Out[33]:
MSSubClass MSZoning LotFrontage LotArea Street Alley
1 20 RL 80.0 9600 Pave 0
2 60 RL 68.0 11250 Pave 0
3 70 RL 60.0 9550 Pave 0
4 60 RL 84.0 14260 Pave 0
5 50 RL 85.0 14115 Pave 0
... ... ... ... ... ... ...
1455 60 RL 62.0 7917 Pave 0
1456 20 RL 85.0 13175 Pave 0
1457 70 RL 66.0 9042 Pave 0
1458 20 RL 68.0 9717 Pave 0
1459 20 RL 75.0 9937 Pave 0

1459 rows × 6 columns

In [32]:
# We can see that the 'Alley' column now has 0s in place of NaN,
# which means we can now use this column in our Random Forest.

Y = train_data['SalePrice']
In [18]:
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
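For anyone who hasn't seen one-hot encoding before, here is a tiny illustration of what pd.get_dummies does to a categorical column (the demo frame below is made up for this post, not part of the notebook):

# Hypothetical mini-example of one-hot encoding
demo = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
print(pd.get_dummies(demo))
# Produces one indicator column per category: Street_Grvl and Street_Pave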
In [19]:
columns = X.columns
ColList = columns.tolist()

# get_dummies can produce different columns for the train and test sets, because a
# category may appear in one set but not the other. Find the columns missing from each.
missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)

# Add any missing columns to each set with a default value of 0
for c in missing_cols:
    X[c] = 0

for c in missing_cols2:
    X_test[c] = 0

# Ensure the columns of the training set are in the same order as the test set
X = X[X_test.columns]
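A hedged aside: pandas can do this whole reconciliation in a single call with DataFrame.align, which pads the missing columns in each frame with a fill value and matches the column order. This is my suggestion rather than the notebook's approach:

# Alternative with the same effect: align the two frames on their columns
X, X_test = X.align(X_test, join='outer', axis=1, fill_value=0)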
In [20]:
X.iloc[1:10,1:5]
Out[20]:
LotFrontage LotArea OverallQual OverallCond
1 80.0 9600 6 8
2 68.0 11250 7 5
3 60.0 9550 7 5
4 84.0 14260 8 5
5 85.0 14115 5 5
6 75.0 10084 8 5
7 0.0 10382 7 6
8 51.0 6120 7 5
9 50.0 7420 5 6
In [22]:
X_test.iloc[1:10,1:5]
Out[22]:
LotFrontage LotArea OverallQual OverallCond
1 81.0 14267 6 6
2 74.0 13830 5 5
3 78.0 9978 6 6
4 43.0 5005 8 5
5 75.0 10000 6 5
6 0.0 7980 6 7
7 63.0 8402 6 5
8 85.0 10176 7 5
9 70.0 8400 4 5
In [23]:
RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
predictions = RFmodel.predict(X_test)
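The notebook only validates via the Kaggle leaderboard. As an aside, a minimal local holdout check (my addition - the variable names and the 80/20 split are arbitrary choices) might look like this, evaluating on the log scale since that is how Kaggle scores this competition:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative sketch: hold out 20% of the training data as a local validation set
X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.2, random_state=1)
holdout_model = RandomForestRegressor(random_state=1)
holdout_model.fit(X_tr, Y_tr)
val_pred = holdout_model.predict(X_val)
# Kaggle scores RMSE on log(SalePrice), so evaluate on the log scale
print(np.sqrt(mean_squared_error(np.log(Y_val), np.log(val_pred))))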
In [40]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predictions})
output.to_csv('my_submission - V3 - RF.csv', index=False)
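Before uploading, a quick sanity check (my addition) that the file has the shape Kaggle expects:

# The test set for this competition has 1,459 rows, so the submission
# should have 1,459 predictions and two columns
print(output.shape)
print(output.head())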

And all that remains is to submit our predictions to Kaggle:

[Screenshot: Kaggle leaderboard score for the new submission]

Okay, so we've improved our score, moving from 0.148 to 0.146, which is good, and we've also jumped up a few places on the leaderboard. That said, the movement is actually pretty modest! We really didn't get much more out of our model by adding those extra variables, which is interesting. My guess is that either our Random Forest model is not fully exploiting the extra information, or, more likely, we've got a lot of correlation and collinearity among the variables.
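To make that hypothesis concrete, here is a minimal sketch (my addition, not from the notebook) of how we could surface the most correlated pairs among the encoded features:

# Sketch: list the ten most correlated pairs of encoded features
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))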

That's all for today - we made some progress, but not much. Tune in next time, when I'm planning on doing a couple of things: I want to get some sort of K-fold cross-validation working, firstly to see how this approach compares to validating against the test set, and secondly so that we can use it to start assessing whether we can get a better model by removing some of the correlations within the variables.
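As a preview, a minimal sketch of what that K-fold validation could look like (my addition, assuming scikit-learn's built-in 'neg_root_mean_squared_error' scorer, and again working on the log scale to mirror Kaggle's metric):

from sklearn.model_selection import cross_val_score

# Illustrative sketch: 5-fold cross-validation on log(SalePrice)
cv_scores = cross_val_score(RandomForestRegressor(random_state=1), X, np.log(Y),
                            scoring='neg_root_mean_squared_error', cv=5)
print(-cv_scores.mean())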
