In which we try out the best performing algorithm from our house price prediction problem - Gradient Boosted Regression - on the Titanic problem, but don't actually manage to improve on our old score...
This post is a follow-up to two previous posts, which I would recommend reading first:
Since our last post, the loss creep for the July 2021 German flooding has continued; sources are now talking about a EUR 8bn (\$9.3bn) insured loss. This figure is in respect of Germany only, excluding Belgium, France, etc., and is up from \$8.3bn previously.
But interestingly (and bear with me, I promise there is something interesting about this), when we compare this \$9.3bn loss to the OEP table in our previous modelling, it puts the flooding at just past a 1-in-200 level.
Photo @ Jonathan Kemper - https://unsplash.com/@jupp
Here are two events that you might think were linked:
Every year around the month of May, the National Oceanic and Atmospheric Administration (NOAA) releases their predictions on the severity of the forthcoming Atlantic Hurricane season.
Around the same time, US insurers will be busy negotiating their upcoming 1st June or 1st July annual reinsurance renewals with their reinsurance panel. At the renewal (for a price to be negotiated) they will purchase reinsurance which will in effect offload a portion of their North American windstorm risk.
You might reasonably think – ‘if there is an expectation that windstorms will be particularly severe this year, then more risk is being transferred and so the price should be higher’. And if the NOAA predicts an above average season, shouldn’t we expect more windstorms? In which case, wouldn't it make sense if the pricing zig-zags up and down in line with the NOAA predictions for the year?
Well in practice, no, it just doesn’t really happen like that.
Source: NASA - Hurricane Florence, from the International Space Station
This post is a follow-up to a previous post, which I would recommend reading first if you haven't already:
In our previous modelling, in order to assess how extreme the 2021 German floods were, we compared the consensus estimate at the time for the floods (\$6bn insured loss) against a distribution parameterised using historic flood losses in Germany from 1994-2020. Since I posted that modelling, however, the consensus estimate has changed, as often happens in these cases. The insurance press is now reporting a value of around \$8.3bn. So what does that do to our modelling and our conclusions from last time?
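To make the idea of comparing a loss estimate against a fitted distribution a bit more concrete, here is a minimal sketch. It assumes a lognormal severity fitted by maximum likelihood, which may not be the distribution used in the original modelling, and the loss figures in `synthetic_losses` are invented purely for illustration; they are not the 1994-2020 German flood data.

```python
import math
from statistics import NormalDist, mean, pstdev

def fit_lognormal(losses):
    """Maximum likelihood fit: mean and population std dev of the log-losses."""
    logs = [math.log(x) for x in losses]
    return mean(logs), pstdev(logs)

def return_period(loss, mu, sigma):
    """Return period in years = 1 / P(annual loss exceeds `loss`)."""
    exceedance = 1.0 - NormalDist(mu, sigma).cdf(math.log(loss))
    return 1.0 / exceedance

# Synthetic annual insured losses in $bn (hypothetical, NOT real data)
synthetic_losses = [0.1, 0.3, 0.2, 1.5, 0.4, 0.8, 0.25, 2.5, 0.6, 0.15]

mu, sigma = fit_lognormal(synthetic_losses)
rp_old = return_period(6.0, mu, sigma)   # earlier consensus estimate
rp_new = return_period(8.3, mu, sigma)   # updated consensus estimate
print(f"$6.0bn: roughly 1-in-{rp_old:.0f}; $8.3bn: roughly 1-in-{rp_new:.0f}")
```

The point of the sketch is only the mechanics: a higher loss estimate moves further into the tail, so the implied return period lengthens, and with a heavy-tailed fit it can lengthen quite quickly.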
As I’m sure you are aware, July 2021 saw some of the worst flooding in Germany in living memory. Die Welt currently has the death toll for Germany at 166.
Obviously this is a very sad time for Germany, but one aspect of the reporting that caught my attention was how much emphasis was placed on climate change when reporting on the floods. For example, the BBC, the Guardian, and even the Telegraph all bring up the role that climate change played in contributing to the severity of the flooding.
The question that came to my mind is: can we really infer the presence of climate change from this one event alone? The flooding has been described as a ‘1-in-100 year event’, but is this borne out when we analyse the data, and how strong is this as evidence of climate change?
Image - https://unsplash.com/@kurokami04
David Mackay includes an interesting Bayesian exercise in one of his books. It’s introduced as a situation where a Bayesian approach is much easier and more natural than equivalent frequentist methods. After mulling it over for a while, I thought it was interesting that Mackay only gives a passing reference to what I would consider the obvious ‘actuarial’ approach to this problem, which doesn’t really fit into either category: curve fitting via maximum likelihood estimation.
On reflection, I think the Bayesian method is still superior to the actuarial method, but it’s interesting that we can still get a decent answer out of the curve fitting approach.
The book is available free online (link at the end of the post), so I’m just going to paste the full text of the question below rather than rehashing Mackay’s writing:
I received an email from a reader recently asking the following (which, for the sake of brevity and anonymity, I’ve paraphrased quite liberally):
I’ve been reading about the Poisson Distribution recently and I understand that it is often used to model claims frequency. I’ve also read that the Poisson Distribution assumes that events occur independently. However, isn’t this a bit of a contradiction, given that the policyholders within a given risk profile are clearly dependent on each other?
It’s a good question; our intrepid reader is definitely on to something here. Let’s talk through the issue and see if we can gain some clarity.
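One way to see what the reader is getting at is a quick simulation. The sketch below is my own illustration and not necessarily the argument the post goes on to make. It compares claim counts from a plain Poisson with counts where a shared 'good year / bad year' factor scales the claim rate for every policyholder at once. The base rate of 5 claims a year and the gamma-distributed shared factor are assumptions chosen for illustration.

```python
import math
import random

def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def variance_to_mean(counts):
    """Sample variance divided by sample mean (close to 1 for a Poisson)."""
    m = sum(counts) / len(counts)
    var = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    return var / m

rng = random.Random(42)
n_years, base_rate = 20000, 5.0   # hypothetical portfolio: 5 claims a year on average

# Independent case: the claim rate is the same every year
independent = [poisson_sample(rng, base_rate) for _ in range(n_years)]

# Dependent case: a shared factor (gamma with mean 1, variance 0.5)
# scales the rate for the whole portfolio in a given year
mixed = [
    poisson_sample(rng, base_rate * rng.gammavariate(2.0, 0.5))
    for _ in range(n_years)
]

ratio_independent = variance_to_mean(independent)
ratio_mixed = variance_to_mean(mixed)
print(f"independent: {ratio_independent:.2f}, shared factor: {ratio_mixed:.2f}")
```

With independent claims the variance-to-mean ratio sits close to 1, as the Poisson requires. Once a common driver is introduced the ratio rises well above 1, which is the kind of overdispersion that a pure Poisson frequency model would miss.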
Financial Year 2020 results have now been released for the top 5 reinsurers and on the face of it, they don’t make pretty reading. The top 5 reinsurers all exceeded 100% combined ratio, i.e. lost money this year on an underwriting basis. Yet much of the commentary has been fairly upbeat. Commentators have downplayed the top line result, and have instead focused on an ‘as-if’ position, how companies performed ex-Covid.
We’ve had comments like the following (anonymised because I don’t want to look like I’m picking on particular companies):
"Excluding the impact of Covid-19, [Company X] delivers a very strong operating capital generation"
“In the pandemic year 2020 [Company Y] achieved a very good result, thereby again demonstrating its superb risk-carrying capacity and its broad diversification.”
Obviously CEOs are going to do what CEOs naturally do - talk up their company, focus on the positives - but is there any merit in looking at an ex-Covid position, or is this a red herring and should we instead be focusing strictly on the incl-Covid results?
I actually think there is a middle ground we can take which tries to balance both perspectives, and I’ll elaborate on that method below.
The term exposure inflation can refer to a couple of different phenomena within insurance. A friend mentioned a couple of weeks ago that he was looking up the term in the context of pricing a property cat layer and he stumbled on one of my blog posts where I use the term. Apparently my blog post was one of the top search results, and there wasn’t really much other useful info, but I was actually talking about a different type of exposure inflation, so it wasn’t really helpful for him.
So as a public service announcement, for all those people Googling the term in the future, here are my thoughts on two types of exposure inflation:
In which we correct our label encoding method from last time, try out a new algorithm - Gradient Boosted Regression - and finally manage to improve our score (by quite a lot, it turns out).
An Actuary learns Machine Learning - Part 9 - Cross Validation / Label Encoding / Feature Engineering
In which we set up K-fold Cross Validation to assess model performance, spend quite a while tweaking our model, use hyper-parameter tuning, but then end up not actually improving our model.
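For anyone who hasn't met K-fold cross validation before, here is a minimal pure-Python sketch of the mechanics. It is not the code from the post: the helper names `k_fold_indices` and `k_fold_score` are hypothetical, the 'model' simply predicts the training mean, and the toy numbers are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def k_fold_score(xs, ys, fit, score, k=5):
    """Average the validation score over k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores)

# Toy 'model': ignore the features and predict the mean of the training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mean_abs_error(model, val_x, val_y):
    return sum(abs(y - model) for y in val_y) / len(val_y)

toy_x = list(range(10))                      # made-up features
toy_y = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]    # made-up targets
cv_error = k_fold_score(toy_x, toy_y, fit_mean, mean_abs_error, k=5)
print(f"5-fold mean absolute error: {cv_error:.2f}")
```

Note that this version splits the data in its original order. In practice you would usually shuffle the rows first, or use a library implementation that does it for you.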
An Actuary learns Machine Learning - Part 8 - Data Cleaning / more Null Values / more Random Forests
In which we deal with those pesky null values, add additional variables to our Random Forest model, but only actually improve our score by a marginal amount.
In which we plot an excessive number of graphs, fix our problems with null values, re-run our algorithm, and significantly improve our accuracy.
In which we start a new Kaggle challenge, try out a new Python IDE, build our first regression model, but most importantly - make these blog posts look much cleaner.
In which we take our final stab at the Titanic challenge by ‘throwing the kitchen sink’ at the problem, setting up another 5 different machine learning models and seeing if they improve our performance (hint: they do not, but hopefully it's still interesting).
I work as a pricing actuary at a reinsurer in London.