A Kaggle Competition: Allstate Claims Severity

Tushar Tiwari
Published in Nerd For Tech · 6 min read · Apr 29, 2021


Allstate logo ("You're in good hands.")

About the dataset

Allstate is an American insurance company that organized a recruitment Kaggle competition in October 2016. At that time, Allstate was developing automated methods for predicting the cost of insurance claims.

A tabular dataset was provided, with around 188k rows, where each row represents an insurance claim [Kaggle dataset].

Business Objective

Given a claim record, the task is to predict the loss value of that claim as accurately as possible.

Mapping Business problem to Machine learning

Allstate can face losses in two ways:

  1. Undercharging the premium: if the premium is set too low, Allstate may not be able to pay out the claim amount when an actual claim occurs.
  2. Overcharging the premium: if the premium is set too high, the client will switch to another insurance company, and Allstate loses the customer and the potential business.

Selection of Performance metric

As our task is to predict the loss value as accurately as possible, we select the Mean Absolute Error (MAE) as our performance metric.

Mean Absolute Error (MAE) is the mean of the absolute differences between the actual and predicted values.

Credit: statisticshowto.com
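
For concreteness, here is a minimal sketch of how MAE is computed; the actual and predicted loss values below are made up purely for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical actual and predicted claim losses
y_true = np.array([2213.18, 1283.60, 3005.09])
y_pred = np.array([2000.00, 1500.00, 2800.00])

print(np.mean(np.abs(y_true - y_pred)))     # MAE computed by hand
print(mean_absolute_error(y_true, y_pred))  # same value via scikit-learn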

Exploratory Data Analysis

In the dataset, we have 116 categorical and 14 continuous-valued features.
In the training dataset there are 188318 claim records, and in the test dataset there are 125546 claim records. For the test records, we do not have the target variable (loss).

There are no missing values in the entire dataset.

Below is the distribution of the number of classes in each categorical feature.

Distribution of categorical features for number of classes

Most of the categorical variables are dominated by one majority class, as can be seen from the pie charts below.

Distribution among the Categorical features.

Below is the distribution of the target variable (loss). We can see that it is power-law distributed (heavily right-skewed), so we converted it to an approximately normal distribution using a Box-Cox transformation. At a later stage, this transformation gave a significant boost to the performance of the model.

Transforming the Target variable loss to normal Distribution.
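
A minimal sketch of this step, assuming the training data is loaded into a DataFrame named train with the target column loss (the file name is an assumption; scipy's boxcox requires strictly positive values, which holds for claim losses):

import pandas as pd
from scipy.stats import boxcox
from scipy.special import inv_boxcox

train = pd.read_csv('train.csv')  # assumed file name

# Box-Cox estimates the power parameter (lmbda) that makes the data most normal
loss_transformed, lmbda = boxcox(train['loss'])

# models are trained on loss_transformed; predictions are mapped back
# to the original scale with the inverse transform
loss_back = inv_boxcox(loss_transformed, lmbda)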

As there are many features, analyzing each one individually is not practical, so I have selected a few important ones.

Box plot for cat80, cat79, cat87

For cat80, claims in classes A and C do not have significantly higher loss values, while class B has a higher loss compared to the other classes.
For cat79, claims in classes A and C again have much smaller loss outliers than the other classes. Class D has a higher loss than the other classes, and class B has less deviation around the median value.
For cat87, class A has a smaller loss than the other classes.

Box plot for cat10, cat12, cat57

For cat10, class B has a higher loss and more variance around the median than class A.
For cat12, class B again has a higher loss than class A, and also more variance.
For cat57, the variance around the median for class A is much smaller than for class B, and class B has a higher loss in comparison.

Box plot for cat7, cat81, cat89

For cat7, class B has a higher loss and a wider spread of losses than class A.
For cat81, the majority of the losses for classes D and A are smaller in comparison to classes B and C.
For cat89, classes E and G have small losses without any outliers, while class H has significantly higher losses than the other classes.

Correlation Heatmap for continuous features

Correlation Heatmap

Observation:

There is a very strong correlation between the following feature pairs:

  1. cont11 and cont12
  2. cont1 and cont9

There are some negatively correlated pairs of features as well:

  1. cont13 and cont3
  2. cont9 and cont3
  3. cont1 and cont3
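
A small sketch of how such a heatmap can be produced, assuming the training data is in a DataFrame named train (assumed file name) with continuous columns cont1 to cont14:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')  # assumed file name
cont_cols = [c for c in train.columns if c.startswith('cont')]

# Pearson correlation between the 14 continuous features
corr = train[cont_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()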

Feature Engineering and Encoding

I have tried various ways to encode the categorical data, such as label encoding and one-hot encoding. With one-hot encoding, the number of dimensions increased to a great extent. In the end, encoding based on lexicographic order resulted in the highest performance gain.
For the numeric features, transformations such as log1p, log2, square, and square root were applied.
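
A sketch of the lexicographic encoding and the numeric transforms, under the assumption that the categorical values are letter codes such as 'A', 'B', ..., 'AA' (as in this dataset); the file name and the choice of cont2 below are illustrative assumptions:

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')  # assumed file name

def lexi_encode(value):
    # 'A' -> 1, 'B' -> 2, ..., 'Z' -> 26, 'AA' -> 27, 'AB' -> 28, ...
    result = 0
    for ch in value:
        result = result * 26 + (ord(ch) - ord('A') + 1)
    return result

cat_cols = [c for c in train.columns if c.startswith('cat')]
for c in cat_cols:
    train[c] = train[c].map(lexi_encode)

# examples of the numeric transforms tried on the continuous features
train['cont2_log1p'] = np.log1p(train['cont2'])
train['cont2_sqrt'] = np.sqrt(train['cont2'])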

PCA : Principal component analysis

The goal here is to find out whether any single component contains significantly more variance than the other components.

import numpy as np
from sklearn.decomposition import PCA

# X_con: continuous features, X_cat: encoded categorical features
X_pca = np.hstack((X_con, X_cat))
pca = PCA(n_components=20)
pca.fit(X_pca)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)
# output
[2.35154011 2.0085094 1.33360039 1.07081709 0.93789653 0.85301854
0.78585513 0.73253702 0.57646378 0.52337216 0.50127435 0.47667501
0.4375088 0.42108877 0.39689408 0.32907901 0.31447881 0.30535877
0.28352583 0.26869016]
[0.08142671 0.0695486 0.04617854 0.03707915 0.03247652 0.02953745
0.02721178 0.02536554 0.01996119 0.01812279 0.01735761 0.01650581
0.0151496 0.01458103 0.01374324 0.01139501 0.01088945 0.01057365
0.00981764 0.00930393]

We can see that the explained variance of the top 2 principal components is above 2, and only the top 4 components have an explained variance above 1. There is no principal component with a dominant share of the variance; the topmost component explains only 8.14% of the total variance.

Part 2 : Modeling

1. Baseline model

This model predicts the median loss value of the training data for every input.

Mean absolute error on test data for the baseline model: 1796.9614807242938
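
A minimal sketch of this baseline, assuming an 80/20 split of the (untransformed) loss values; the file name and random seed are assumptions, so the exact MAE printed may differ slightly from the number above:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train = pd.read_csv('train.csv')  # assumed file name
y_train, y_test = train_test_split(train['loss'], test_size=0.2, random_state=42)

# predict the median training loss for every test claim
median_loss = np.median(y_train)
y_pred = np.full(len(y_test), median_loss)
print(mean_absolute_error(y_test, y_pred))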

2. Custom Ensemble Model

The total training data is divided into two datasets, D1 and D2. D1 contains 80% of the training data and D2 contains the remaining 20%; D2 is a holdout set and is later used for testing the performance of the final custom ensemble model. From the D1 set, we sample (with replacement) N different datasets, which are used for training N base regressors (decision trees). Using the predictions of the N base models, a meta-regression model is trained which predicts the final loss for a data point. The performance of this meta-model is finally tested on the hold-out set D2.

Flow chart of custom ensemble model
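
A simplified sketch of this custom ensemble, assuming X and y are the encoded features and the target as NumPy arrays; the value of N, the tree depth, and the linear meta-regressor are illustrative choices, not necessarily the exact configuration used:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# D1 (80%) to build the ensemble, D2 (20%) held out for final evaluation
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.2, random_state=42)

N = 10  # number of base regressors (illustrative)
base_models = []
for i in range(N):
    # sample (with replacement) a dataset from D1 for each base decision tree
    idx = np.random.choice(len(X_d1), size=len(X_d1), replace=True)
    model = DecisionTreeRegressor(max_depth=8, random_state=i)
    model.fit(X_d1[idx], y_d1[idx])
    base_models.append(model)

# predictions of the N base models become the features of the meta-regressor
meta_X_d1 = np.column_stack([m.predict(X_d1) for m in base_models])
meta_model = LinearRegression().fit(meta_X_d1, y_d1)

# final evaluation on the hold-out set D2
meta_X_d2 = np.column_stack([m.predict(X_d2) for m in base_models])
print(mean_absolute_error(y_d2, meta_model.predict(meta_X_d2)))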

Comparison of Models

Here I have trained 6 models, out of which the custom ensemble model outperformed the others. I used decision trees as the base models, and with the custom ensemble the performance improved significantly over the individual base models.

Model Comparisons.

Kaggle submission Result

The submission was made after the Kaggle competition had ended.

It achieved a public score of 1203 and a private score of 1216.24.

Future Work

  1. Trying out different stacking architectures and reducing the MAE further.
  2. Exploring different feature engineering approaches.
  3. Hyperparameter tuning using libraries like Optuna.
  4. Trying out neural networks.

References

  1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. https://notendur.hi.is/heh89/Allstate_Claims_Severity.pdf
  3. http://cs229.stanford.edu/proj2017/final-reports/5242012.pdf
  4. https://www.kaggle.com/sharmasanthosh/exploratory-study-on-ml-algorithms

Codebase and Deployment

GitHub repository containing the full code of this project: [Allstate]

Website for trying out the model: [herokuapp]

Contact: LinkedIn ||| Email
