Colin Priest is a Fellow of the Institute of Actuaries of Australia and has worked in a wide range of actuarial and insurance roles, including Appointed Actuary, pricing, reserving, risk management, product design, underwriting, reinsurance, relationship management, and marketing. Over his career, Colin has held a number of CEO and general management roles, where he has championed data science initiatives in financial services, healthcare, security, oil and gas, government and marketing. He frequently speaks at various global actuarial conferences.
Colin is a firm believer in data-based decision making and applying machine learning to the insurance industry. He is passionate about the science of healthcare and does pro-bono work to support cancer research.
One of the most frustrating tasks that actuaries face is cleaning data in preparation for statistical analysis. And one of the most frustrating data problems is missing values.
You see, the statistical techniques most commonly used by actuaries weren’t originally designed for insurance. They were intended for data collected from carefully designed scientific experiments, where all inputs and outputs are controlled, measured, and recorded. For example, when non-life actuaries price auto/motor insurance, they often use generalized linear models (GLMs), which are a compulsory component of the actuarial education syllabus. There’s no doubt that GLMs remain popular with actuaries, despite the development of newer and better techniques. To give you an idea of just how popular this technique is in the field, the Actuaries Institute even recently published an article entitled “In Praise of the GLM”.
The actuarial community is going through a time of change. Over the past few years, I’ve been travelling the world teaching actuaries about newer modelling techniques that have more predictive power, and make their jobs easier and more fun. Some of my presentations have become part of the Actuaries Institute’s mandatory readings in their Commercial Actuarial Practice education course. One of those presentations, “Beyond GLMs”, explains the shortcomings of GLMs and introduces newer, better techniques.
GLMs have a glaring blind spot: missing values. If you give a GLM algorithm data that has missing values, it will either throw an error or automatically delete every row that contains a missing value! Here’s what happens in R when I have missing values in my auto/motor pricing data:
There are only two ways that I can fix this error:
- Remove the rows or columns that have missing values, or
- Fill in the missing values
If I remove the rows with missing values, then I lose 176,821 rows from my dataset, and all of the valuable information contained within those rows. If I remove the columns with missing values, then I lose important pricing factors, such as driver age.
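To make the cost of removal concrete, here is a minimal Python sketch on a tiny made-up portfolio. The column names and values are hypothetical and are not the pricing data described above; the point is only to show how quickly rows or columns disappear when anything incomplete is dropped.

```python
# Hypothetical toy portfolio; None marks a missing value.
policies = [
    {"driver_age": 34,   "vehicle_value": 18000, "claims": 0},
    {"driver_age": None, "vehicle_value": 22000, "claims": 1},
    {"driver_age": 51,   "vehicle_value": None,  "claims": 0},
    {"driver_age": None, "vehicle_value": 9000,  "claims": 2},
    {"driver_age": 27,   "vehicle_value": 15000, "claims": 0},
]

def drop_incomplete_rows(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_columns(rows):
    """Keep only the columns that are fully populated in every row."""
    complete = [c for c in rows[0] if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in complete} for r in rows]

rows_kept = drop_incomplete_rows(policies)
cols_kept = drop_incomplete_columns(policies)

print(len(policies) - len(rows_kept), "of", len(policies), "rows lost")
print("columns kept:", list(cols_kept[0]))
```

In this toy example, dropping rows discards most of the portfolio, and dropping columns leaves only the claim count, with no rating factors left to model it.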
If I fill in the missing values, I first need to know an appropriate value to use. But what if I don’t have an appropriate value? Then I could skew the analysis and build an incorrect model.
And what if the fact that a value is missing is important? What if the fact that a value is missing is more important than the value when one does exist? For example, I may have data that contains the date of the policyholder’s last claim. When this date is blank, it tells me that the policyholder has never had a claim, and that’s very useful information to know. I don’t want to treat these customers the same as customers who have had a claim, but I also don’t want to exclude them from the analysis.
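One common way to keep that information is to turn the blank into a feature of its own. The sketch below is an illustration in Python, not a prescribed method: the function name, the feature names, and the choice of 0.0 as a placeholder are all assumptions made for this example.

```python
from datetime import date

def last_claim_features(last_claim, as_of):
    """Turn a possibly blank last-claim date into two model inputs.

    A blank date (None) is treated as "never claimed" and gets its own
    flag, so these policyholders stay in the data without being mixed
    up with policyholders who have claimed.
    """
    if last_claim is None:
        # 0.0 is only a placeholder; the flag carries the real signal.
        return {"never_claimed": 1, "years_since_claim": 0.0}
    years = (as_of - last_claim).days / 365.25
    return {"never_claimed": 0, "years_since_claim": round(years, 2)}

as_of = date(2024, 1, 1)  # hypothetical valuation date
print(last_claim_features(None, as_of))
print(last_claim_features(date(2022, 1, 1), as_of))
```

Because the flag is a separate input, a model can assign never-claimed policyholders their own effect rather than inferring it from an arbitrary fill-in date.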
It’s no wonder people tell me that data cleansing takes up so much of their time!
Thank goodness there’s a better way than manually cleaning up missing values or removing valuable data from the analysis. Some modern machine learning models can automatically detect missing values and use them to better predict the future. And while many machine learning algorithms still don’t accept missing values, here at DataRobot we make it easy to use a wide range of modern algorithms in the presence of missing values. DataRobot is an expert machine learning automation platform that is smart enough to know how to deal with common problems such as missing values. For each machine learning algorithm, DataRobot builds a blueprint that optimizes the data for that specific algorithm.
Here’s what DataRobot does automatically:
- Detects missing values
- Knows which algorithms don’t work with missing values, and therefore which algorithms need extra data pre-processing
- Finds the optimal value to substitute for missing values
- Uses the fact that a value is missing as a predictor in its own right, so that missing values can lead to different predictions than known values
It’s easy, and it saves me the time that I would otherwise have spent on data cleansing.
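For readers curious what substituting a value and flagging the gap could look like when done by hand, here is a minimal Python sketch. It is not DataRobot’s blueprint logic, which this article does not describe in detail; it assumes median imputation as a simple stand-in for “an appropriate value”, and the function and variable names are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the median of the observed values and record
    where each fill happened, so the missingness itself stays visible."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # assumption: median as the substitute value
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

driver_age = [34, None, 51, None, 27]  # hypothetical rating factor
imputed, flag = impute_with_indicator(driver_age)
print(imputed)
print(flag)
```

The imputed column keeps every row usable by algorithms that reject missing values, while the indicator column lets the model treat “age unknown” differently from any particular age.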