THE REINSURANCE ACTUARY
  • Blog
  • Project Euler
  • Category Theory
  • Disclaimer

An Actuary learns Machine Learning - Part 13 - Kaggle Tabular Playground Competition - June 22

1/7/2022

 




In which we recreate the previous analysis, but in Python this time, and then add a new submission that uses the mean rather than the median to impute missing values.





Source: https://somewan.design

Here's the extract from the Jupyter notebook; it's fairly self-explanatory, so I won't spend much time working through what's happening.
In [7]:
!pip install pandas
!pip install sklearn
Requirement already satisfied: pandas in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (1.4.3)
Requirement already satisfied: numpy>=1.18.5 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from pandas) (1.23.0)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: six>=1.5 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Requirement already satisfied: sklearn in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (0.0)
Requirement already satisfied: scikit-learn in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from sklearn) (1.1.1)
Requirement already satisfied: joblib>=1.0.0 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from scikit-learn->sklearn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from scikit-learn->sklearn) (3.1.0)
Requirement already satisfied: numpy>=1.17.3 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from scikit-learn->sklearn) (1.23.0)
Requirement already satisfied: scipy>=1.3.2 in c:\users\admin\anaconda3\envs\notebook\lib\site-packages (from scikit-learn->sklearn) (1.8.1)
In [8]:
import pandas as pd
import numpy as np
from sklearn import impute
In [9]:
#%% Run imports

mypath=r"C:\Users\Admin\OneDrive - BerkRe UK\Documents\Work\Machine Learning experiments\Kaggle - June - Missing Values\Data\data.csv"
TrainData = pd.read_csv(mypath)

mypath=r"C:\Users\Admin\OneDrive - BerkRe UK\Documents\Work\Machine Learning experiments\Kaggle - June - Missing Values\Data\Missing Values.csv"
MissingValues = pd.read_csv(mypath)

mypath=r"C:\Users\Admin\OneDrive - BerkRe UK\Documents\Work\Machine Learning experiments\Kaggle - June - Missing Values\Data\sample_submission.csv"
SampleSub = pd.read_csv(mypath)
In [10]:
imputer = impute.SimpleImputer(strategy='median')

imputer.fit(TrainData)
Xtrans = imputer.transform(TrainData)

TrainColumns = TrainData.columns.tolist()
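As a quick aside (this toy example isn't part of the original notebook), the median strategy simply fills each NaN with that column's median, and `fit_transform` combines the two calls above:

```python
import numpy as np
from sklearn import impute

# Median imputation on toy data: the NaN in the single column
# should be replaced by the median of [1, 2, 10], i.e. 2.0
imputer = impute.SimpleImputer(strategy='median')
X = np.array([[1.0], [2.0], [10.0], [np.nan]])
Xtrans = imputer.fit_transform(X)  # fit + transform in one call
```

Note the median (2.0) is much less affected by the outlier 10.0 than the mean would be, which is one reason it's a common default.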
In [11]:
# Then need to extract the relevant 'missing' values

sub = []

for i in range(len(MissingValues)):  # 1,000,000 (row, column) pairs to predict
    Row = MissingValues['Row'][i]
    Column = MissingValues['Column'][i]
    indexC = TrainColumns.index(Column)
    sub.append(Xtrans[Row][indexC])

SampleSub['value'] = pd.DataFrame(sub, columns=['value'])
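That Python-level loop over a million rows is slow; as a sketch (using tiny stand-in data with the same `Row`/`Column` layout as `MissingValues`), the lookup can instead be done in one shot with NumPy integer-array indexing:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Xtrans / TrainColumns / MissingValues,
# just to demonstrate the vectorised lookup
Xtrans = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
TrainColumns = ['F_1', 'F_2']
MissingValues = pd.DataFrame({'Row': [0, 1, 1],
                              'Column': ['F_2', 'F_1', 'F_2']})

# Map column names to integer positions once, then index in a single call
col_idx = MissingValues['Column'].map(
    {c: i for i, c in enumerate(TrainColumns)}).to_numpy()
row_idx = MissingValues['Row'].to_numpy()
sub = Xtrans[row_idx, col_idx]  # one element per (row, column) pair
```

The dictionary built from `enumerate(TrainColumns)` avoids the repeated `list.index` scan, and the fancy-indexing step runs in C rather than Python.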
In [12]:
SampleSub.to_csv(r"C:\Users\Admin\OneDrive - BerkRe UK\Documents\Work\Machine Learning experiments\Kaggle - June - Missing Values\Submissions\Sub 3 - recreate median in Python v2.csv",index=False)

Note that we haven't really done any actual data analysis yet; we've just been getting the basic framework set up, but that's okay.

Just so that we can say we made some actual progress today, let's quickly run a version where we impute the missing values using the mean rather than the median and see what result that gives.

I only need to change a single line in the code above to do this, so I won't paste it again below; let's just look straight at the results:
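For reference, the one changed line is the imputer's strategy; everything else in the earlier cells stays the same. On toy data the effect looks like this:

```python
import numpy as np
from sklearn import impute

imputer = impute.SimpleImputer(strategy='mean')  # was strategy='median'

# Quick check: the NaN should now be filled with the column mean of [1, 3]
X = np.array([[1.0], [3.0], [np.nan]])
Xtrans = imputer.fit_transform(X)
```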

[Screenshot: Kaggle submission score]

We've got a very marginal improvement over our previous best: 1.417 -> 1.416.

Note, I'm not really sure it's best practice to use the leaderboard like this to compare my successive attempts, I should probably set up my own test/train split so as to avoid over-fitting to the public leaderboard. Maybe something to do next time?
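A minimal sketch of that idea (synthetic data, not the competition files): hide a random sample of known values, impute, and score the hidden cells with RMSE locally, so we never need to touch the public leaderboard to compare strategies:

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # stand-in for fully observed training data

# Hide 5% of entries at random and remember their true values
mask = rng.random(X.shape) < 0.05
X_holdout = X.copy()
X_holdout[mask] = np.nan

# Impute and score only the artificially hidden cells
imputed = SimpleImputer(strategy='mean').fit_transform(X_holdout)
rmse = np.sqrt(np.mean((imputed[mask] - X[mask]) ** 2))
```

With standard-normal columns, mean imputation should score an RMSE of roughly 1 here; swapping in other strategies and comparing this local number is the over-fitting-safe version of what the leaderboard is doing.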

So tune in next time, where we'll do some EDA and also think about what sort of model makes sense to impose on the data, so as to give ourselves some structure to work with.



    Author

    I work as an actuary and underwriter at a global reinsurer in London.

    I mainly write about Maths, Finance, and Technology.

    If you would like to get in touch, then feel free to send me an email at:

    LewisWalshActuary@gmail.com


