
An Actuary learns Machine Learning - Part 9 - Cross Validation / Label Encoding / Feature Engineering

10/2/2021

 

In which we set up K-fold cross validation to assess model performance, spend quite a while tweaking our model, run hyper-parameter tuning, and then end up not actually improving our Kaggle score.



Image source: https://somewan.design


There are a few things I'd like to accomplish today. First, I'd like to generate some sort of accuracy score using just the training data, so we're going to set up K-fold cross validation and explore how it works. Secondly, there are some minor tweaks I'd like to make to the variables: encoding some of them differently, and adding a couple of new ones. Finally, we're going to run hyper-parameter tuning to see whether that improves performance. Once we've got K-fold CV set up, we'll be able to investigate any improvement without uploading to Kaggle, which will be interesting.
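
Before diving in, here's a rough sketch of what K-fold CV is doing under the hood. This is a simplified illustration only; sklearn's cross_val_score handles all of this for us below, and uses the estimator's default .score method, which for a regressor is R-squared rather than classification accuracy.

from sklearn.model_selection import KFold

def manual_cv_scores(model, X, Y, k=10):
    # Split the rows into k folds; train on k-1 of them and score on the held-out fold
    scores = []
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        model.fit(X.iloc[train_idx], Y.iloc[train_idx])
        scores.append(model.score(X.iloc[val_idx], Y.iloc[val_idx]))
    return scores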

Mobile users - once again, if the below is not rendering well, please try rotating to landscape view.

In [1]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
In [2]:
path = r"C:\Work\Machine Learning experiments\Kaggle\House Price"  # raw string so the backslashes aren't treated as escape characters
os.chdir(path)
In [3]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [4]:
features = ['Neighborhood','OverallQual','OverallCond','BldgType','HouseStyle']
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]
RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[4]:
RandomForestRegressor(random_state=1)
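As an aside, the manual column-alignment loop in the cell above can be collapsed into a single pandas call - a sketch, assuming the same X and X_test:

# Align the two dummy-encoded frames on columns, filling unseen categories with 0
X, X_test = X.align(X_test, join='outer', axis=1, fill_value=0)

We'll stick with the explicit loop in what follows, since it makes the handling of the mismatched columns obvious.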
In [5]:
# Let's see how this model scores using cross validation

scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[5]:
[0.7499322339157881, 0.029658128350339423]
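One caveat worth flagging: the 75% above is an R-squared score (the default for a regressor), whereas Kaggle scores this competition on the RMSE of log sale prices. If we wanted our CV figure on the competition's scale, something like the following should work - a sketch, using the X, Y and RFmodel above:

# RMSE of log prices, per Kaggle's evaluation metric; lower is better
log_rmse = -cross_val_score(RFmodel, X, np.log(Y), cv=10,
                            scoring='neg_root_mean_squared_error')
[log_rmse.mean(), log_rmse.std()]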
In [6]:
# Okay, we scored 75% according to the CV measure. Let's run the same process on
# the other models we've set up so far - first the version where we drop all
# columns which have a NaN in them

TrainColumns = train_data.columns
frames = [train_data, test_data]

Total_data = pd.concat(frames)

features = []
featuresNotUsed = []

for i in TrainColumns:
    if not Total_data[i].isnull().values.any():
        if i not in ('Id', 'SalePrice'):
            features.append(i)
    else:
        featuresNotUsed.append(i)
        
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)

predictions = RFmodel.predict(X_test)
In [7]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[7]:
[0.8408222755620794, 0.048773947402117185]
In [8]:
# Okay, our score has improved - we're now at around 84%. Let's try the version
# where we fill in all the NaNs with 0s

features = TrainColumns[1:79]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[8]:
RandomForestRegressor(random_state=1)
In [9]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[9]:
[0.858899065803126, 0.05176910554982082]
In [10]:
# Okay, we've used CV and tested our three models 'internally' rather
# than against the test set, and we've got broadly similar answers.
# Our first model scored about 75%, our second 84%, and our third 86%.
# Interestingly, our std dev went up at each step, which matches what
# we would expect: our model has become more accurate, but at the cost of
# becoming more complex and moving towards over-fitting

# Let's run some hyper-parameter tuning on our most successful model


#%%Set up space of possible hyper-params

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
In [11]:
RFmodel = RandomForestRegressor(random_state=1)

# Random search of parameters, using 3-fold cross validation,
# trying 10 random combinations, and using all available cores
RFmodel_random = RandomizedSearchCV(estimator=RFmodel, param_distributions=random_grid,
                                    n_iter=10, cv=3, verbose=2, random_state=1, n_jobs=-1)
# Fit the random search model
RFmodel_random.fit(X, Y)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.4min finished
Out[11]:
RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(random_state=1),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=1, verbose=2)
In [13]:
RFmodel_random.best_params_
Out[13]:
{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 40,
 'bootstrap': False}
In [14]:
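# Note: passing the RandomizedSearchCV object straight to cross_val_score
# re-runs the whole 30-fit search inside each of the 10 outer folds
# ('nested' cross validation), which is why the fitting logs repeat below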
scores = cross_val_score(RFmodel_random, X, Y, cv=10)
[scores.mean(), scores.std()]
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.1min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.0min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.0min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Out[14]:
[0.8686354196314181, 0.039199207651046085]
In [15]:
# While setting up the previous version I realised that we'd missed out
# one variable - SaleCondition had not been included. Let's quickly
# re-run the model, this time including it.

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

features = TrainColumns[1:80]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[15]:
RandomForestRegressor(random_state=1)
In [16]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[16]:
[0.859128262549325, 0.0536366209565343]
In [17]:
# Let's try a few tweaks and see how our score changes.
# I've been reading a few online tutorials, and there are a
# few things people have done which seem like good ideas

# First - some of the categorical variables have an inherent ordering,
# e.g. Exterior Quality comes in the form 'poor/average/good/excellent',
# whereas we've encoded them using one-hot encoding, which ignores
# this ordering

# Taken from the following:
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

frames = [train_data, test_data]
total_data = pd.concat(frames)

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(total_data[c].values)) 
    total_data[c] = lbl.transform(list(total_data[c].values))

train_data = total_data.iloc[0:1460]
test_data = total_data.iloc[1460:2919]

features = total_data.columns
features = features.drop('Id')
features = features.drop('SalePrice')
features
Out[17]:
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')
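One caveat on this approach: LabelEncoder assigns its integer codes alphabetically, so for a quality grade like ExterQual the codes don't actually follow the quality ordering we were trying to capture. Genuinely preserving the order would need a hand-built mapping applied to the raw string column before any encoding - a sketch only (the ranking below is an assumed one, and we don't run this here):

# Hypothetical ordinal mapping for a quality column - not executed in this notebook
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
total_data['ExterQual'] = total_data['ExterQual'].map(quality_map)

Worth bearing in mind when we look at the scores below.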
In [18]:
pd.options.mode.chained_assignment = None  # default='warn'; suppressed because train_data and test_data are slices of total_data
In [19]:
for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)

Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[19]:
RandomForestRegressor(random_state=1)
In [20]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[20]:
[0.8602885832464457, 0.05312459024978507]
In [21]:
# Okay, so that change seems to have done very little - our score has
# moved by less than 1%
In [22]:
# Let's try one more thing - let's add another variable corresponding
# to total square footage. Note this is also taken from:
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [23]:
features = TrainColumns[1:80]

for i in features:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    
frames = [train_data, test_data]
total_data = pd.concat(frames)

total_data['TotalSF'] = total_data['TotalBsmtSF'] + total_data['1stFlrSF'] + total_data['2ndFlrSF']
In [24]:
features = total_data.columns
features = features.drop('Id')
features = features.drop('SalePrice')
features
Out[24]:
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'TotalSF'],
      dtype='object')
In [25]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(total_data[c].values)) 
    total_data[c] = lbl.transform(list(total_data[c].values))

train_data = total_data.iloc[0:1460]
test_data = total_data.iloc[1460:2919]

Y = train_data['SalePrice']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

missing_cols = set(X_test.columns) - set(X.columns)
missing_cols2 = set(X.columns) - set(X_test.columns)
# Add any column present in only one of the two frames to the other,
# with a default value of 0
for c in missing_cols:
    X[c] = 0
    
for c in missing_cols2:
    X_test[c] = 0
# Ensure the columns of the train set are in the same order as the test set
X = X[X_test.columns]

RFmodel = RandomForestRegressor(random_state=1)
RFmodel.fit(X,Y)
Out[25]:
RandomForestRegressor(random_state=1)
In [26]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[26]:
[0.8667013439741871, 0.04532036910923081]
In [ ]:
# This actually appears to improve our model!
# Let's quickly run hyper-parameter tuning on this version,
# which combines the best of everything we've done above
In [27]:
#%%Set up space of possible hyper-params

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [28]:
RFmodel = RandomForestRegressor(random_state=1)

# Random search of parameters, using 3-fold cross validation,
# trying 30 random combinations, and using all available cores
RFmodel_random = RandomizedSearchCV(estimator=RFmodel, param_distributions=random_grid,
                                    n_iter=30, cv=3, verbose=2, random_state=1, n_jobs=-1)
# Fit the random search model
RFmodel_random.fit(X, Y)
predictions = RFmodel_random.predict(X_test)
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  4.6min finished
In [29]:
RFmodel_random.best_params_
Out[29]:
{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 40,
 'bootstrap': False}
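As an aside, rather than retyping the winning parameters, the tuned (and already fitted) model can be pulled out directly via the standard sklearn attribute:

# Equivalent to rebuilding the model from best_params_ and refitting on all the data
RFmodel = RFmodel_random.best_estimator_

Spelling the parameters out, as below, just makes the chosen settings explicit.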
In [30]:
RFmodel = RandomForestRegressor(n_estimators=1600,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                max_features='sqrt',
                                max_depth=40,
                                bootstrap=False,
                                random_state=1)
In [31]:
scores = cross_val_score(RFmodel, X, Y, cv=10)
[scores.mean(), scores.std()]
Out[31]:
[0.877473773223001, 0.03875787125404092]
In [65]:
scores = cross_val_score(RFmodel_random, X, Y, cv=5)
[scores.mean(), scores.std()]
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.5min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.3min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.2min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.1min finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  3.1min finished
Out[65]:
[0.8769289170485848, 0.02251117120317736]
In [66]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predictions})  # Kaggle expects the column to be named 'Id'
output.to_csv('my_submission - V4 - RF.csv', index=False)
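Before uploading, a quick sanity check on the submission file never hurts - my addition:

print(output.shape)   # expect (1459, 2): one row per test house
print(output.head())  # eyeball the first few predictions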

And all that remains is to upload our latest submission and see how it performs. Fingers crossed.

[Screenshot: Kaggle submission score]

And our score has not actually improved! Hmmm...

I'm slightly stumped here, as I really thought it would improve given that we've tidied up our variables and added hyper-parameter tuning. There are a few possibilities:
  • Our model is over-fitting the training data, so even though we're getting higher CV scores, it isn't actually becoming more predictive.
  • It's possible we've made an error somewhere.
  • We're coming up against the limits of what a Random Forest algorithm can achieve.
  • Our change of encoding regime (moving from one-hot encoding to label encoding) could have made the model worse.
  • The test set may not be sufficiently big: the difference between the predictions of our latest model and our previous best is fairly small, so the test set may not accurately reflect the difference in predictive power.

So that's all for today. Slightly disappointing that we didn't improve our score, but we did learn a few new tricks which will hopefully come in useful later on.

Tune in next time when we have our final attempt at the House Price Kaggle competition.
