An Actuary learns Machine Learning - Part 13 - Kaggle Tabular Playground Competition - June 22
In which we recreate the previous analysis, but in Python this time. And then add a new submission using the mean rather than the median to impute missing values.
An Actuary learns Machine Learning - Part 12 - Kaggle Tabular Playground Competition - June 22
In which we start a new Kaggle competition, submit a dummy attempt, and then build a very basic Excel model to establish a baseline for future progress.
An Actuary learns Machine Learning - Part 11 - Titanic revisited & Gradient Boosting Classifiers
In which we try out the best performing algorithm from our house price prediction problem - Gradient Boosted Regression - on the Titanic problem, but don't actually manage to improve on our old score...
An Actuary learns Machine Learning - Part 10 - More label encoding / Gradient Boosted Regression
In which we correct our label encoding method from last time, try out a new algorithm - Gradient Boosted Regression - and finally manage to improve our score (by quite a lot, it turns out).
An Actuary learns Machine Learning - Part 9 - Cross Validation / Label Encoding / Feature Engineering
In which we set up K-fold Cross Validation to assess model performance, spend quite a while tweaking our model, use hyper-parameter tuning, but then end up not actually improving our model.
An Actuary learns Machine Learning - Part 8 - Data Cleaning / more Null Values / more Random Forests
In which we deal with those pesky null values, add additional variables to our Random Forest model, but only actually improve our score by a marginal amount.
In which we plot an excessive number of graphs, fix our problems with null values, re-run our algorithm, and significantly improve our accuracy.
In which we start a new Kaggle challenge, try out a new Python IDE, build our first regression model, but most importantly - make these blog posts look much cleaner.
In which we take our final stab at the titanic challenge by ‘throwing the kitchen sink’ at the problem, setting up another 5 different machine learning models and seeing if they improve our performance (hint they do not, but hopefully it's still interesting)
An Actuary learns Machine Learning – Part 4 – Error correction/data cleansing/Feature Engineering
In which we do more data exploration, find and then fix a mistake in our previous model, spend some time on feature engineering, and manage to set a new high-score.
An Actuary learns Machine Learning – Part 3 – Automatic testing/feature importance/K-fold cross validation
In which we don’t actually improve our model but we do improve our workflow - being able to check our test score ourselves, analysing the importance of each variable using an algorithm, and then using an algorithm to select the best hyper-parameters
In which we build our first machine learning model in Python, beat our previous Excel model on our first attempt, and then fail multiple times to improve this new model…
In which we enter a machine learning competition, predict who survived the titanic, build an Excel model, and then realise it performs no better than Kaggle’s ‘test submission’...
"I don't know what you mean by 'glory,' " Alice said.
Humpty Dumpty smiled contemptuously. "Of course you don't—till I tell you. I meant 'there's a nice knock-down argument for you!' "
"But 'glory' doesn't mean 'a nice knock-down argument'," Alice objected.
"When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean—neither more nor less."
"The question is," said Alice, "whether you can make words mean so many different things."
"The question is," said Humpty Dumpty, "which is to be master—that's all."
I don't think Lewis Carroll had 'Big Data' or 'Machine Learning' in mind when he penned these words, but I think the quote is quite apt in this context. All too often these buzzwords seem to fall foul of the Humpty Dumpty principle: they mean just what the speaker chooses them to mean, regardless of what the words actually mean to anyone else. So what do these terms actually mean?
The field of study which investigates algorithms that give computers the ability to learn without being explicitly programmed.
What do we mean by ‘learn’ in this context? The definition used by Machine Learning practitioners, originally stated by Tom Mitchell, is:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
So what problems can Machine Learning algorithms be applied to? The main advances in machine learning have been in the following areas:
A trait shared by all these problems is that previously computers were thought to be incapable of tackling them. This is one reason why Machine Learning is such an exciting and growing field of study.
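To make the experience / task / performance definition quoted above a bit more concrete, here is a minimal toy sketch in Python. It is purely illustrative and not taken from any real library: the hidden cutoff of 37, the function names and the training numbers are all assumptions I have made up. The task T is deciding whether a number is 'large', the experience E is a growing pile of labelled examples, and the performance measure P is the share of the numbers 0 to 99 classified correctly.

```python
# Toy illustration of "learning from experience" (all names are hypothetical).
# Task T: decide whether a whole number from 0 to 99 is "large".
# Experience E: labelled examples of numbers that are / are not large.
# Performance P: the share of the numbers 0..99 that are classified correctly.

HIDDEN_CUTOFF = 37  # assumption: the rule the program has to discover


def true_label(x):
    """The correct answer, which the learner only sees via labelled examples."""
    return x >= HIDDEN_CUTOFF


def learn_threshold(examples):
    """Guess a cutoff halfway between the largest 'small' and the smallest 'large' example seen."""
    lo, hi = 0, 100  # before any examples arrive, the cutoff could be anywhere
    for x, is_large in examples:
        if is_large:
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return (lo + hi) / 2


def accuracy(threshold):
    """Performance measure P: fraction of the numbers 0..99 classified correctly."""
    correct = sum((x >= threshold) == true_label(x) for x in range(100))
    return correct / 100


# Made-up training inputs; the labels come from the hidden rule.
training_inputs = [10, 90, 20, 60, 30, 45, 35, 40, 36, 37]
experience = [(x, true_label(x)) for x in training_inputs]

scores = []
for n in (2, 4, 6, 8, 10):
    threshold = learn_threshold(experience[:n])
    scores.append(accuracy(threshold))
    print(f"{n:2d} examples -> cutoff guess {threshold:5.1f}, accuracy {scores[-1]:.2f}")
```

With this particular ordering of examples, the accuracy climbs from 0.87 after two examples to 1.00 after all ten. Nobody ever tells the program that the cutoff is 37; its performance at the task improves simply because it has seen more examples, which is the sense of 'learn' in the definition. A different ordering of examples could make the accuracy dip temporarily along the way, so treat this as a sketch of the idea rather than a guarantee.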
If you'd like to know more about Machine Learning then Andrew Ng at Stanford University has released a really good free online course through Coursera which can be accessed through the following link:
Big Data can be defined as data which conforms to the 3Vs. Big Data is available at a higher volume, higher velocity (rate at which data is generated) and/or greater variety than normal data sources.
So for example, looking at an insurance company, claims data would not count as Big Data: the volume will be fairly low, the velocity will be slow, and the data will be fairly uniform.
The browsing patterns of an aggregator website, on the other hand, would count as Big Data. For example, the amount of time someone spends on Comparethemarket.com, their clicks, what they search for, how many searches they make, how often they return to the website before making a purchase, etc. would count as Big Data. There would be a massive volume of data to analyse and the data would be available in real time. (It wouldn’t meet the variety criterion, but that’s not a necessary condition.)
Due to the need to extract useful information from Big Data, and the difficulties created by the 3Vs, we cannot rely on traditional methods of data analysis. Given the volume and velocity of Big Data, we require methods of analysis that do not need to be programmed explicitly, and this is where Machine Learning fits in. Machine Learning in the guise of speech and handwriting recognition can also be important if the data generated is in audio or handwritten form but needs to be combined with other data.
Data Mining is a catch-all term for the process of analysing and summarising data into useful information. Data may be in the form of Big Data, and methods used may be based on Machine Learning (where the algorithm learns from the data) or may be more traditional.
Data Visualisation is the process of creating visual graphics that aid in understanding and exploring data. It has become increasingly important for two reasons. Firstly, the rise in the volume of data sets means that new methods are required to understand data. Secondly, an increase in computing power means that more advanced visualisation techniques are now possible.
Data Science is a broad term which encompasses processes which aim to extract knowledge or insight from data. Data Science therefore includes all the previous fields.
For example, in carrying out an analysis, we will first collect our data, which may or may not be in the form of Big Data. We will then mine our data, possibly using Machine Learning, and then present our results through Data Visualisation.
I work as an actuary and underwriter at a global reinsurer in London.