Customer Propensity Model for retail bank

Introduction

Amid economic uncertainty, companies in the banking sector are focused on taking decisive steps to optimize their businesses and improve their financial performance. With the recent decline in profitability, leading banks are also adopting new operating models and technologies to address challenges around customer experience and compliance. In this sector the customer is king, and in an era of heightened customer expectations and new market entrants it is pivotal for banks to stay relevant in the marketplace. To address these concerns, leading banks are leveraging customer segmentation solutions. Customer segmentation refers to proactively profiling customers based on their preferences. Potential customers can be identified in three different ways: by their needs, their buying characteristics, and their value. With changing consumer preferences, banks need to clearly define and target their best prospects and satisfy customer expectations.

The problem we solve in this article is as follows: a banking institution wants to identify which of its existing customers are likely to buy its term deposit product. Term deposits let a bank hold onto a deposit for a fixed period, so the bank can invest in higher-yield financial products to make a profit. Banks also stand a better chance of persuading term deposit clients to buy other products, such as funds or insurance, to further increase their revenue.

In this article we walk through the process of analysing the available data, from exploratory data analysis and statistical analysis to machine learning models, to identify prospects who are likely to buy the term deposit product.

Exploratory Data Analysis (EDA)

We use an open-source data set from the internet that contains customer attribute information from a bank.

The data set consists of 45,211 observations of 21 columns. It contains customer demographics (age, job, marital, education, default, housing, loan) and information related to the last contact of the current campaign (contact, month, day_of_week, duration, campaign).

Based on our understanding of the banking domain, we plot various graphs relating different attributes. Graphs and associated observations are provided below.

[Figure: term deposit subscription by marital status]

The above graph depicts term deposit subscription across marital status; the married group accounts for the largest number of non-subscribers.

[Figure: term deposit subscription by job role]

The above graph depicts term deposit subscription across job roles; blue-collar workers account for the largest number of non-subscribers.

[Figure: customers targeted by education background]

The above graph depicts customers targeted across education backgrounds; the secondary and tertiary education groups were targeted most.

[Figure: customer eligibility by job role]

The above graph depicts customer eligibility across job roles; blue-collar workers show the highest eligibility.

[Figure: credit default by marital status]

The above graph depicts credit default across marital status; the married group has the most credit defaults.

[Figure: customers contacted versus call duration]

The above graph depicts customers contacted versus call duration; 64.8% were contacted for the longest durations.

[Figure: outcome of the previous marketing campaign]

The above graph depicts the outcome of the previous marketing campaign: 81.7% are unknown, 10.8% are failures, and only 3.3% are successes.

[Figure: customer responses by marital status]

The above graph depicts customer responses across marital status: 60.8% of responses came from the married group, followed by singles at 27.6%.

[Figure: contacts performed before the campaign by age group]

The above graph depicts contacts performed before the campaign across age groups: 40.2% of contacts were in the 30-39 age group, followed by 25.9% in the 40-49 group.

[Figure: personal loans by job role]

The above graph depicts personal loan details across jobs; the blue-collar group has the most loans, while the management group has the fewest.

While exploratory data analysis gives some insight into how each attribute relates to a customer being a potential prospect, it tells us nothing about the relative influence of these attributes, and it does not provide a model to assess the probability of a customer's propensity.

Summary

  1. Outcome of the previous marketing campaign: 81.7% unknown, 10.8% failure, and only 3.3% success
  2. Customer responses across marital status: 60.8% of responses came from the married group, followed by singles at 27.6%
  3. Contacts performed before the campaign: 40.2% of contacts were in the 30-39 age group, followed by 25.9% in the 40-49 group
  4. The blue-collar group has the most personal loans, while the management group has the fewest

Developing a Machine Learning Algorithm for potential prospect

Listed below are the key data analytics techniques used for this purpose. Together, these techniques help create a robust propensity prediction solution.

Predictive Analytics

Machine learning algorithms use historical data (where the value of the outcome variable is known) to build analytical models that can predict the value of the outcome variable in new data where it is not known. A good machine learning model accurately predicts the outcome variable and thus supports quick decisions in the process workflow. Predictive analytics uses an outcome variable, which in our case is the term deposit subscription indicator, to build the predictive model.

Feature Name   Description
age            customer's age
job            type of job
marital        marital status
education      education level
default        has credit in default?
housing        has housing loan?
loan           has personal loan?
contact        contact communication type
month          last contact month of year

As a first step, it is important to understand what data is useful and available for building the predictive model. Normally, loan details, customer profile data, credit default status, term deposit subscription, and customer response data are used for this purpose. The model building process goes through several steps: assessing data quality, understanding the variables and the relationships between them, selecting the best predictor variables, and model building and validation.
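As a sketch of this first step, the snippet below illustrates the basic quality checks and a stratified train/test split with pandas and scikit-learn. Since the actual file is not bundled with this article, it uses a tiny synthetic sample with the article's column names; the values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic sample mirroring a few columns of the bank data set.
df = pd.DataFrame({
    "age":     [34, 45, 29, 52, 41, 38, 60, 27],
    "job":     ["admin.", "blue-collar", "technician", "management",
                "blue-collar", "admin.", "retired", "services"],
    "marital": ["married", "married", "single", "divorced",
                "married", "single", "married", "single"],
    "y":       ["no", "yes", "no", "no", "yes", "no", "yes", "no"],
})

# Basic data-quality checks: missing values and class balance.
print(df.isna().sum().sum())                 # count of missing cells
print(df["y"].value_counts(normalize=True))  # subscription rate

# Hold out a test set, stratified on the outcome so both splits
# keep roughly the same subscription rate.
train, test = train_test_split(df, test_size=0.25, stratify=df["y"],
                               random_state=42)
print(len(train), len(test))
```

Stratifying on the outcome matters here because subscribers are a small minority of the data set, and an unstratified split can leave the test set with too few positive cases.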

There are several statistical and machine learning algorithms used for classification to identify potential prospects. Below are the algorithms used in our model building process:

  1. Logistic Regression
  2. Random Forests
  3. Gradient Boosting Method
  4. Feed Forward Network
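As an illustration of how these four model families can be fitted side by side, here is a minimal sketch using scikit-learn on synthetic numeric data (the real pipeline would first encode the categorical bank features; model settings here are illustrative, not the tuned values from the appendix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the prepared bank data, with the minority
# class rate loosely echoing the low subscription rate.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

models = {
    "Logistic Regression":  LogisticRegression(max_iter=1000),
    "Random Forest":        RandomForestClassifier(n_estimators=300,
                                                   random_state=0),
    "Gradient Boosting":    GradientBoostingClassifier(n_estimators=200,
                                                       random_state=0),
    "Feed Forward Network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                          max_iter=500, random_state=0),
}

# Fit each model and score it on the hold-out set with the F-score,
# the summary metric used later in this article.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))

for name, f1 in scores.items():
    print(f"{name}: F1 = {f1:.3f}")
```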

Feature Engineering

Success in machine learning algorithms depends on how the data is represented. Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model performance on unseen data. Domain knowledge is critical in identifying which features might be relevant, and the exercise calls for close interaction between a banking domain specialist and a data scientist.

Importance of Feature Engineering:

  1. Better features result in better performance: the features used in the model capture the classification structure and lead to better performance
  2. Better features reduce complexity: even a weaker model with better features tends to perform well, because the features expose the structure of the data
  3. Better features and better models yield high performance: if well-engineered features are used in a model that performs reasonably well, there is a greater chance of a highly valuable outcome
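To make this concrete, below is a small illustrative sketch of three common transformations on columns from this data set: binning age into the decade groups used in the EDA, flagging the pdays sentinel, and one-hot encoding job. The sentinel value shown (999) is an assumption; some versions of this data set mark never-contacted customers with -1 instead.

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [23, 35, 47, 58, 31],
    "job":   ["student", "admin.", "blue-collar", "management", "technician"],
    "pdays": [999, 3, 999, 7, 999],  # assumed sentinel: 999 = never contacted
})

# 1. Bin age into the decade groups used in the EDA section.
df["age_group"] = pd.cut(df["age"], bins=[0, 29, 39, 49, 59, 120],
                         labels=["<30", "30-39", "40-49", "50-59", "60+"])

# 2. Turn the pdays sentinel into an explicit binary flag.
df["contacted_before"] = (df["pdays"] != 999).astype(int)

# 3. One-hot encode the categorical job column for models that need numbers.
df = pd.get_dummies(df, columns=["job"], prefix="job")

print(df.head())
```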

Feature Selection:

Statistical Significance: Univariate feature selection works by selecting the best features based on univariate statistical tests. Our data set has both numerical and categorical variables. To assess statistical significance, we use the chi-square test for categorical features and ANOVA for numerical features. These tests give the p-value of each independent variable against the dependent variable. The table below lists the features and their p-values (chi-square for categorical features, ANOVA otherwise).

Feature Name P Value
month 0.17
job 0.00
loan 0.21
housing 0.03
previous 0.00
emp.var.rate 0.00
cons.price.idx 0.00
nr.employed 0.00
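The two tests can be reproduced along these lines; the sketch below runs a chi-square test of independence for a categorical feature and a one-way ANOVA for a numerical feature, using SciPy on synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "job": rng.choice(["admin.", "blue-collar", "management"], size=n),
    "age": rng.normal(40, 10, size=n),
    "y":   rng.choice(["no", "yes"], size=n, p=[0.85, 0.15]),
})

# Categorical feature vs categorical target: chi-square test of independence
# on the contingency table of counts.
contingency = pd.crosstab(df["job"], df["y"])
_, p_chi2, _, _ = chi2_contingency(contingency)

# Numerical feature vs categorical target: one-way ANOVA across the classes.
groups = [grp["age"].values for _, grp in df.groupby("y")]
_, p_anova = f_oneway(*groups)

print(f"job (chi-square) p = {p_chi2:.3f}")
print(f"age (ANOVA)      p = {p_anova:.3f}")
```

A small p-value suggests the feature is associated with the outcome; features with large p-values (such as month and loan in the table above) are candidates for dropping.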

The performance of the base models on the test data set is given below:

Metric      Logistic Regression   Random Forest   Gradient Boosting   Feed Forward Network
Recall      90%                   94.3%           83.3%               87.6%
Precision   73%                   92.0%           94.3%               95.4%
F-Score     80%                   93.1%           88.5%               91.3%
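The three metrics in the table can be computed with scikit-learn as in the sketch below (the labels here are illustrative, not the article's actual predictions):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative predictions on a small hold-out set (1 = subscribed).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

recall    = recall_score(y_true, y_pred)     # share of actual subscribers found
precision = precision_score(y_true, y_pred)  # share of flagged customers who subscribed
f_score   = f1_score(y_true, y_pred)         # harmonic mean of the two

print(f"Recall={recall:.2f}  Precision={precision:.2f}  F-Score={f_score:.2f}")
# → Recall=0.80  Precision=0.80  F-Score=0.80
```

For a propensity campaign, recall controls how many real prospects are missed, while precision controls how many contacted customers are wasted calls; the F-score balances the two.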

Importance of Features

Based on the Random Forest algorithm, the factors below are the top 10 in importance, as they play a significant role in predicting potential prospects.

Variable Name Relative Importance
duration 100%
euribor3m 64%
nr.employed 58%
age 56%
job 56%
day_of_week 48%
education 48%
pdays 44%
month 42%
poutcome 42%
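The relative-importance scaling used in the table (strongest feature = 100%) can be reproduced from a fitted Random Forest, as sketched below on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names are placeholders.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
feature_names = [f"f{i}" for i in range(6)]

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Scale impurity-based importances so the strongest feature reads 100%.
rel = 100 * rf.feature_importances_ / rf.feature_importances_.max()
for name, imp in sorted(zip(feature_names, rel), key=lambda t: -t[1]):
    print(f"{name}: {imp:.0f}%")
```

Note that impurity-based importances tend to favour high-cardinality features; duration ranking first also reflects that call length is only known after the call, so it should be used with care in a deployable targeting model.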

Appendix:

Below are various parameter settings used while building models:

Random Forest

Parameter Value
ntrees 300
stopping_metric logloss
nbins 5
min_rows 10
categorical_encoding label_encoder
max_depth 15
stopping_rounds 5

Gradient Boosting

Parameter Value
ntrees 500
stopping_metric logloss
nbins 10
max_depth 30
categorical_encoding label_encoder
stopping_rounds 5

Neural Network

Parameter Value
Hidden Layers 500,500,300,200
epochs 1000
epsilon 0.00000001
initial_weight_distribution UniformAdaptive
stopping_rounds 5
adaptive_rate TRUE
rho 0.99
rate_annealing 0.000001
huber_alpha 0.9
score_interval 5
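The parameter names in this appendix follow H2O's API. For readers working in scikit-learn instead, an approximate mapping of the random forest settings might look like the sketch below; the correspondence is rough, and the early-stopping options have no direct scikit-learn random forest counterpart.

```python
from sklearn.ensemble import RandomForestClassifier

# Approximate scikit-learn equivalents of the random forest settings above:
#   ntrees=300    -> n_estimators=300
#   max_depth=15  -> max_depth=15
#   min_rows=10   -> min_samples_leaf=10
# stopping_metric/stopping_rounds and nbins have no direct counterpart here;
# categorical encoding is handled before fitting in scikit-learn.
rf = RandomForestClassifier(n_estimators=300, max_depth=15,
                            min_samples_leaf=10, random_state=0)
print(rf.get_params()["n_estimators"])  # → 300
```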