
An Actuary learns Machine Learning - Part 9 - Cross Validation / Label Encoding / Feature Engineering

10/2/2021

 

In which we set up K-fold cross validation to assess model performance, spend quite a while tweaking our model, run hyper-parameter tuning, and then end up not actually improving our Kaggle score.



Image source: https://somewan.design


There are a few things I'd like to accomplish today. First, I'd like to generate some sort of accuracy score using just the training data, so we're going to set up K-fold cross validation and explore how it works. Secondly, there are some minor tweaks I'd like to make to the variables: encoding some of them differently, and adding a couple of new ones. Finally, we're going to run hyper-parameter tuning to see whether that improves performance. Once we've got K-fold CV set up, we'll be able to investigate any improvement without uploading to Kaggle, which will be interesting.
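
Before diving in, here's a rough sketch of what K-fold CV is doing under the hood. This is a simplified illustration only; sklearn's cross_val_score handles all of this for us below, and uses the estimator's default .score method, which for a regressor is R-squared rather than classification accuracy.

from sklearn.model_selection import KFold

def manual_cv_scores(model, X, Y, k=10):
    # Split the rows into k folds; train on k-1 of them and score on the held-out fold
    scores = []
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        model.fit(X.iloc[train_idx], Y.iloc[train_idx])
        scores.append(model.score(X.iloc[val_idx], Y.iloc[val_idx]))
    return scores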

Mobile users - once again, if the below is not rendering well, please try rotating to landscape view.

In [1]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
In [2]:
path = r"C:\Work\Machine Learning experiments\Kaggle\House Price"  # raw string so the backslashes aren't treated as escape characters
os.chdir(path)
In [3]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [4]:
features = ['Neighborhood','OverallQual','OverallCond','BldgType','HouseStyle']
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]
RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[4]:
RandomForestRegressor(random_state=1)
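As an aside, the manual column-alignment loop in the cell above can be collapsed into a single pandas call - a sketch, assuming the same X and X_test:

# Align the two dummy-encoded frames on columns, filling unseen categories with 0
X, X_test = X.align(X_test, join='outer', axis=1, fill_value=0)

We'll stick with the explicit loop in what follows, since it makes the handling of the mismatched columns obvious.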
In [5]:
# Let's see how this model scores using cross validation

scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[5]:
[0.7499322339157881, 0.029658128350339423]
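One caveat worth flagging: the 75% above is an R-squared score (the default for a regressor), whereas Kaggle scores this competition on the RMSE of log sale prices. If we wanted our CV figure on the competition's scale, something like the following should work - a sketch, using the X, Y and RFmodel above:

# RMSE of log prices, per Kaggle's evaluation metric; lower is better
log_rmse = -cross_val_score(RFmodel, X, np.log(Y), cv=10,
                            scoring='neg_root_mean_squared_error')
[log_rmse.mean(), log_rmse.std()]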
In [6]:
# Okay, we scored 75% according to the CV measure. Let's run the same process on
# the other models we've set up so far - first the version where we drop all
# columns which have a NaN in them

TrainColumns = train_data.columns
frames = [train_data, test_data]

Total_data = pd.concat(frames)

features = []
featuresNotUsed = []

for i in TrainColumns:
    if not Total_data[i].isnull().values.any():
        if i not in ('Id', 'SalePrice'):
            features.append(i)
    else:
        featuresNotUsed.append(i)
        
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)

predictions = RFmodel.predict(X_test)
In [7]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[7]:
[0.8408222755620794, 0.048773947402117185]
In [8]:
# Okay, our score has improved - we're now at around 84%. Let's try the version
# where we fill in all the NaNs with 0s

features = TrainColumns[1:79]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[8]:
RandomForestRegressor(random_state=1)
In [9]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[9]:
[0.858899065803126, 0.05176910554982082]
In [10]:
# Okay, we've used CV and tested our three models 'internally' rather
# than against the test set, and we've got broadly similar answers.
# Our first model scored about 75%, our second 84%, and our third 86%.
# Interestingly, our std dev went up at each step, which matches what
# we would expect: our model has become more accurate, but at the cost of
# becoming more complex and moving towards over-fitting

# Let's run some hyper-parameter tuning on our most successful model


#%%Set up space of possible hyper-params

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
In [11]:
RFmodel = RandomForestRegressor(random_state=1)

# Random search of parameters, using 3-fold cross validation,
# trying 10 random combinations, and using all available cores
RFmodel_random = RandomizedSearchCV(estimator=RFmodel, param_distributions=random_grid,
                                    n_iter=10, cv=3, verbose=2, random_state=1, n_jobs=-1)
# Fit the random search model
RFmodel_random.fit(X, Y)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.4min finished
Out[11]:
RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(random_state=1),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=1, verbose=2)
In [13]:
RFmodel_random.best_params_
Out[13]:
{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 40,
 'bootstrap': False}
In [14]:
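# Note: passing the RandomizedSearchCV object straight to cross_val_score
# re-runs the whole 30-fit search inside each of the 10 outer folds
# ('nested' cross validation), which is why the fitting logs repeat below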
scores = cross_val_score(RFmodel_random, X, Y, cv=10)
[scores.mean(), scores.std()]
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.1min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.0min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.0min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Out[14]:
[0.8686354196314181, 0.039199207651046085]
In [15]:
# While setting up the previous version I realised that we'd missed out
# one variable - SaleCondition had not been included. Let's quickly
# re-run the model, this time including it.

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

features = TrainColumns[1:80]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[15]:
RandomForestRegressor(random_state=1)
In [16]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[16]:
[0.859128262549325, 0.0536366209565343]
In [17]:
# Let's try a few tweaks and see how our score changes.
# I've been reading a few online tutorials, and there are a
# few things people have done which seem like good ideas

# First - some of the categorical variables have an inherent ordering,
# e.g. Exterior Quality comes in the form 'poor/average/good/excellent',
# whereas we've encoded them using one-hot encoding, which ignores
# this ordering

# Taken from the following:
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

frames = [train_data, test_data]
total_data = pd.concat(frames)

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(total_data[c].values)) 
    total_data[c] = lbl.transform(list(total_data[c].values))

train_data = total_data.iloc[0:1460]
test_data = total_data.iloc[1460:2919]

features = total_data.columns
features = features.drop('Id')
features = features.drop('SalePrice')
features
Out[17]:
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')
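One caveat on this approach: LabelEncoder assigns its integer codes alphabetically, so for a quality grade like ExterQual the codes don't actually follow the quality ordering we were trying to capture. Genuinely preserving the order would need a hand-built mapping applied to the raw string column before any encoding - a sketch only (the ranking below is an assumed one, and we don't run this here):

# Hypothetical ordinal mapping for a quality column - not executed in this notebook
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
total_data['ExterQual'] = total_data['ExterQual'].map(quality_map)

Worth bearing in mind when we look at the scores below.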
In [18]:
pd.options.mode.chained_assignment = None  # default='warn'; suppressed because train_data and test_data are slices of total_data
In [19]:
for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)

Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[19]:
RandomForestRegressor(random_state=1)
In [20]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[20]:
[0.8602885832464457, 0.05312459024978507]
In [21]:
# Okay, so that change seems to have done very little - our score has
# moved by less than 1%
In [22]:
# Let's try one more thing - let's add another variable corresponding
# to total square footage. Note this is also taken from:
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [23]:
features = TrainColumns[1:80]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
frames = [train_data, test_data]
total_data = pd.concat(frames)

total_data['TotalSF'] = total_data['TotalBsmtSF'] + total_data['1stFlrSF'] + total_data['2ndFlrSF']
In [24]:
features = total_data.columns
features = features.drop('Id')
features = features.drop('SalePrice')
features
Out[24]:
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'TotalSF'],
      dtype='object')
In [25]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(total_data[c].values)) 
    total_data[c] = lbl.transform(list(total_data[c].values))

train_data = total_data.iloc[0:1460]
test_data = total_data.iloc[1460:2919]

Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[25]:
RandomForestRegressor(random_state=1)
In [26]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[26]:
[0.8667013439741871, 0.04532036910923081]
In [ ]:
# This actually appears to improve our model!
# Let's quickly run hyper-parameter tuning on this version,
# which combines the best of everything we've done above
In [27]:
#%%Set up space of possible hyper-params

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [28]:
RFmodel = RandomForestRegressor(random_state=1)

# Random search of parameters, using 3-fold cross validation,
# trying 30 random combinations, and using all available cores
RFmodel_random = RandomizedSearchCV(estimator=RFmodel, param_distributions=random_grid,
                                    n_iter=30, cv=3, verbose=2, random_state=1, n_jobs=-1)
# Fit the random search model
RFmodel_random.fit(X, Y)
predictions = RFmodel_random.predict(X_test)
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  4.6min finished
In [29]:
RFmodel_random.best_params_
Out[29]:
{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 40,
 'bootstrap': False}
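As an aside, rather than retyping the winning parameters, the tuned (and already fitted) model can be pulled out directly via the standard sklearn attribute:

# Equivalent to rebuilding the model from best_params_ and refitting on all the data
RFmodel = RFmodel_random.best_estimator_

Spelling the parameters out, as below, just makes the chosen settings explicit.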
In [30]:
RFmodel = RandomForestRegressor(n_estimators=1600,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                max_features='sqrt',
                                max_depth=40,
                                bootstrap=False,
                                random_state=1)
In [31]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[31]:
[0.877473773223001, 0.03875787125404092]
In [65]:
scores = cross_val_score(RFmodel_random, X, Y, cv=5)
[scores.mean(), scores.std()]
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.5min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.3min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.2min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.1min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.1min finished
Out[65]:
[0.8769289170485848, 0.02251117120317736]
In [66]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predictions})  # Kaggle expects the column to be named 'Id'
output.to_csv('my_submission - V4 - RF.csv', index=False)
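Before uploading, a quick sanity check on the submission file never hurts - my addition:

print(output.shape)   # expect (1459, 2): one row per test house
print(output.head())  # eyeball the first few predictions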

And all that remains is to upload our latest submission and see how it performs. Fingers crossed.

[Screenshot: Kaggle submission score]

And our score has not actually improved! Hmmm...

I'm slightly stumped here, as I really thought it would improve given that we've tidied up our variables and added hyper-parameter tuning. There are a few possibilities:
  • Our model is over-fitting the training data, so even though we're getting higher CV scores, it isn't actually becoming more predictive.
  • It's possible we've made an error somewhere.
  • We're coming up against the limits of what a Random Forest algorithm can achieve.
  • Our change of encoding regime (moving from one-hot encoding to label encoding) could have made the model worse.
  • The test set may not be sufficiently big: the difference between the predictions of our latest model and our previous best is fairly small, so the test set may not accurately reflect the difference in predictive power.

So that's all for today. Slightly disappointing that we didn't improve our score, but we did learn a few new tricks which will hopefully come in useful later on.

Tune in next time when we have our final attempt at the House Price Kaggle competition.
