Telecom Customer Churn Analysis

Introduction

With the rapid development of the telecom industry, service providers are increasingly focused on expanding their customer base. To stay competitive in the telecom market, retaining existing customers is crucial, and it is one of the biggest challenges. Industry surveys indicate that acquiring a new customer costs more than retaining an existing one. By making use of the data available to service providers, we can predict whether a valuable customer is likely to leave the company.

Calculating how much to spend on acquisition and retention is something of a black art. If too many customers are lost, revenues will plummet. If too much is spent, margins will suffer unduly. But big money is at stake. Research by Tefficient shows that the average service provider in a mature market typically spends 15-20% of service revenues on acquisition and retention activities. That’s pretty staggering. To put it into context, McKinsey says average CAPEX spending on infrastructure (networks and IT) is 15% of revenues.

In this article, we’ll walk through an end-to-end data analytics approach to the customer churn problem.

Data Set

We will use a telecom customer churn data set that is publicly available on Kaggle. Below is the description of the data set.

[Table: description of the data set]

Exploratory Data Analysis

The insights below were prepared based on telecom domain knowledge.

Overall customer churn across all states is 14.5%.

New Jersey and California have the highest churn rates.

West Virginia has the largest customer base (106 customers).

The churn rate among customers who opted for the international plan is high (42.4%).
The churn rate among customers without a voice mail plan is also high (16.7%).
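As a rough illustration of how group-level churn rates like these can be computed, here is a minimal Python sketch. The record layout and the field names `international_plan` and `churn` are assumptions made for this example, and the four records are toy data, not rows from the Kaggle data set.

```python
from collections import defaultdict


def churn_rate_by_group(records, group_key, churn_key="churn"):
    """Return {group value: churn rate in percent} for a list of dict records."""
    totals = defaultdict(int)   # customers per group
    churned = defaultdict(int)  # churned customers per group
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[churn_key]:
            churned[group] += 1
    return {g: 100.0 * churned[g] / totals[g] for g in totals}


# Toy records (illustrative only, field names are hypothetical).
customers = [
    {"international_plan": "yes", "churn": True},
    {"international_plan": "yes", "churn": False},
    {"international_plan": "no", "churn": False},
    {"international_plan": "no", "churn": False},
]
rates = churn_rate_by_group(customers, "international_plan")
print(rates)
```

The same function can be pointed at any categorical field, such as the state or the voice mail plan, by changing `group_key`.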




There is no notable difference in account tenure between active customers and customers who left.

There is a clear pattern across all types of call charges: customers who left the service provider paid more for every type of call except night calls.



There is a similar pattern across all types of call durations: for every type of call except night calls, customers who left the service provider had higher call durations than active customers.



Summary

  1. New Jersey and California have the highest churn rate (26.5%), while West Virginia has the largest customer base (106 customers)
  2. Customers who subscribe to the international plan leave the service provider at a high rate (42.4%)
  3. Churn is consistently higher among customers who use the service intensively

While Exploratory Data Analysis gives some insight into how each attribute is related to customer churn, it tells us nothing about the relative influence of these attributes, i.e. which attributes influence churn significantly and which have little or no influence. It also does not provide a model to assess the probability of an existing customer leaving the service provider.

Statistical techniques and Machine Learning algorithms help us address these limitations as explained in the next section.

Developing a Machine Learning Algorithm to predict customer churn

Machine learning algorithms build analytical models from historical data in which the value of the outcome variable is known (labelled data). The resulting model can then predict the outcome variable for new data where that value is not known. A good machine learning model predicts the outcome accurately and so supports quick decisions in the process workflow. In churn prediction, the outcome variable is the churn indicator.

Building a machine learning model typically involves the following steps:

  1. Gathering data
  2. Data preparation
  3. Choosing a model
  4. Training
  5. Evaluation

Gathering Data

The first step in creating a data analytics platform is building an appropriate database. Data can be collected from internal as well as external sources. In this case, the data will have features such as:

  1. Internal Data
    1. Customer Demographics (Age, Gender, Marital Status, Location, etc.)
    2. Call Statistics (Length of calls like Local, National & International, etc.)
    3. Billing Information (what the customer paid for)
    4. Voice and Data Product
    5. Credit History
    6. Customer Satisfaction survey (CSAT)
    7. Call center data
  2. External Data
    1. Competitor information
    2. External survey data

We have used publicly available data for this case study.

Data Preparation

Data preparation includes tasks such as:

Selecting correct sample data

Once the data is collected, it’s time to assess its condition, including looking for outliers, exceptions, and incorrect, inconsistent, missing, or skewed information. This is important because the source data informs all of the model’s findings, so it is critical to be sure it does not contain unseen biases. For example, if we are looking at customer behavior nationally but only pull in data from a limited sample, we might miss important geographic regions. This is the time to catch any issues that could skew the model’s findings, on the entire data set and not just on partial or sample data sets.

Formatting data to make it consistent

The next step in data preparation is to ensure the data is formatted in a way that best fits the machine learning model. If data is aggregated from different sources, or if the data set has been manually updated by more than one stakeholder, anomalies in how the data is formatted are likely (e.g. USD5.50 versus $5.50). In the same way, values within a column should be standardized. Consistent data formatting removes these errors so that the entire data set uses the same input formatting protocols.
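To make the USD5.50 versus $5.50 case concrete, here is a small Python sketch that parses both spellings into one representation. The function name `normalize_amount` and the prefix mapping are assumptions for illustration, and only these two prefixes are handled.

```python
import re

# Assumed mapping for this sketch; extend it for the currencies in your data.
_CURRENCY_PREFIXES = {"USD": "USD", "$": "USD"}


def normalize_amount(raw):
    """Parse strings such as 'USD5.50' or '$5.50' into (currency, amount)."""
    match = re.fullmatch(r"\s*(USD|\$)\s*([0-9]+(?:\.[0-9]+)?)\s*", raw)
    if match is None:
        # Surface unknown formats instead of guessing.
        raise ValueError(f"unrecognized amount format: {raw!r}")
    prefix, number = match.groups()
    return _CURRENCY_PREFIXES[prefix], float(number)


print(normalize_amount("USD5.50"))
print(normalize_amount("$5.50"))
```

Both calls yield the same pair, so downstream code sees a single format. Raising on unrecognized input is a deliberate choice here, because silently passing odd values through tends to hide data quality problems.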

Improving data quality

Here, start by having a strategy for dealing with erroneous data, missing values, extreme values, and outliers. Self-service data preparation tools can help if they have intelligent facilities built in to match data attributes from disparate datasets and combine them. For instance, if one dataset has columns for FIRST NAME and LAST NAME and another dataset has a column called PHYSICIAN NAME that seems to hold a first and last name combined, intelligent algorithms should be able to determine a way to match these and join the datasets to get a single view of the customer.
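The name-matching idea can be sketched in a few lines of Python. This is a simplified, exact-match version under stated assumptions: the keys `first_name`, `last_name` and `full_name` are hypothetical column names, and real tools would typically add fuzzy matching for typos and reordered names.

```python
def name_key(full_name):
    """Build a case- and whitespace-insensitive key from a full name."""
    return " ".join(full_name.split()).casefold()


def join_on_name(split_rows, combined_rows):
    """Join rows that store first and last name separately with rows
    that store a single combined name."""
    by_key = {name_key(row["full_name"]): row for row in combined_rows}
    joined = []
    for row in split_rows:
        key = name_key(f'{row["first_name"]} {row["last_name"]}')
        match = by_key.get(key)
        if match is not None:
            joined.append({**row, **match})  # merge the two records
    return joined


# Toy rows (illustrative only).
billing = [{"first_name": "Ada", "last_name": "Lovelace", "plan": "intl"}]
survey = [{"full_name": "  ada  LOVELACE", "csat": 4}]
merged = join_on_name(billing, survey)
print(merged)
```

Rows without a counterpart are dropped in this sketch; whether to keep them is a design decision that depends on the analysis.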

Feature Engineering

Success in machine learning algorithms is dependent on how the data is represented. Feature engineering is a process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model performance on unseen data. Domain knowledge is critical in identifying which features might be relevant and the exercise calls for close interaction between a domain specialist and a data scientist.

Feature Selection:

Out of all the attributes in the dataset, those that are relevant to the domain and improve model performance are kept, and those that degrade model performance are removed. This process is called Feature Selection or Feature Elimination.

There are several methods used for selecting appropriate features for optimal model performance. Following are some of the most commonly used methods.

Trial & Error: Start with features known from domain knowledge, keep adding other features one at a time, and observe the model performance. Keep the features that improve the performance and drop those that don’t improve it or that degrade it. This approach is called Forward Selection. The other approach is to start with all features, eliminate one feature at a time, and observe the performance. Again, keep only the features whose removal would degrade the performance. This approach is called Backward Elimination, and it tends to be the better choice when working with tree-based models.
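Forward Selection can be sketched as a greedy loop. The version below is one common variant, not necessarily the exact procedure used in this case study: at each round it adds the single feature that improves the score most and stops when nothing helps. The scoring function `toy_score` and the feature names are hypothetical stand-ins for a real validation score.

```python
def forward_selection(candidates, score_fn, start=()):
    """Greedily add the feature that most improves score_fn until nothing helps."""
    selected = list(start)
    best_score = score_fn(selected)
    remaining = [f for f in candidates if f not in selected]
    while remaining:
        # Score each remaining feature when added to the current set.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(features):
    """Hypothetical score: a base value plus a fixed gain per feature."""
    gains = {"intl_plan": 0.10, "day_charge": 0.05, "area_code": -0.02}
    return 0.70 + sum(gains[f] for f in features)


features, score = forward_selection(
    ["area_code", "day_charge", "intl_plan"], toy_score
)
print(features, round(score, 2))
```

In practice `score_fn` would train a model on the given features and return its validation metric, which makes each round expensive. Backward Elimination follows the same pattern in reverse, starting from the full set.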

Dimensionality Reduction (PCA): PCA (Principal Component Analysis) is used to project higher dimensional data onto a lower dimensional space. It reduces the number of dimensions by keeping the components that explain most of the dataset’s variance (in this case, 99% of the variance). A convenient way to choose the number of components is to plot the cumulative explained variance against the number of components.
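The variance-threshold rule can be illustrated without a full PCA implementation. The sketch below assumes the component variances (eigenvalues of the covariance matrix) have already been computed by a PCA routine; the function name and the example values are made up for illustration.

```python
def components_for_variance(eigenvalues, threshold=0.99):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Illustrative eigenvalues only; they sum to 10.
toy_eigenvalues = [5.0, 3.0, 1.5, 0.45, 0.05]
print(components_for_variance(toy_eigenvalues))        # 99% threshold
print(components_for_variance(toy_eigenvalues, 0.90))  # 90% threshold
```

With these toy values, four components are needed to reach 99% of the variance and three to reach 90%, which is the trade-off the cumulative variance plot makes visible.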

Statistical Significance: Univariate feature selection picks the best features based on univariate statistical tests. Our data set has both numerical and categorical features. We use the chi-square test for categorical features and ANOVA for numerical features. Each test gives the p-value of an independent variable against the dependent variable. Below is the table listing the features and their p-values.
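For a binary feature against the binary churn flag, the chi-square test reduces to a 2x2 table, which is small enough to compute by hand. The sketch below covers only that 2x2 case, applies no continuity correction, and uses made-up counts; larger tables and the ANOVA case need a statistics library.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Illustrative counts: rows = plan yes/no, columns = churned/stayed.
toy_table = [[30, 70], [10, 90]]
statistic, p_value = chi_square_2x2(toy_table)
print(statistic, p_value < 0.001)
```

For the toy counts the statistic is 12.5, and the p-value falls below 0.001, so the feature would be treated as significantly related to churn.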

[Table: features and their p-values]

In a later part, we’ll see the importance of including the variables that have a significant relationship with the target variable.

Choosing the model

There are many supervised learning algorithms available. Each differs in nature and produces different results on a given data set. We have to choose appropriate algorithms according to the problem to be solved and the nature of the data. Below are the algorithms used in our model building process:

  1. Logistic Regression
  2. Random Forests
  3. Gradient Boosting
  4. Neural Networks
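To give a feel for the simplest of these, here is a from-scratch sketch of Logistic Regression trained by batch gradient descent. It is an illustration only: a real project would normally use a library implementation, and the single feature (number of customer service calls), the learning rate and the epoch count are assumptions chosen for this toy example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(x, weights, bias):
    """Predicted probability of churn for one feature vector."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = churn_probability(x, weights, bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(xs)
    return weights, bias


# Toy data: one feature (service calls), label 1 = churned.
calls = [[0], [1], [2], [5], [6], [7]]
churned = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(calls, churned)
print(churn_probability([0], weights, bias) < 0.5)
print(churn_probability([7], weights, bias) > 0.5)
```

The output of `churn_probability` is the kind of per-customer score that the exploratory analysis alone cannot provide.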

Training Models and Evaluation

Model building involves constructing machine learning models that learn from historical data and make predictions or decisions on unseen data. The detailed modelling process follows.

Once the models are built, they are trained with the training dataset, validated with the validation dataset while fine-tuning hyperparameters, and finally tested with the test dataset. At each stage, the chosen performance metric is monitored until the desired performance level is reached.

Training the models

The following steps summarize the process of the model development:

  1. Once the dataset is obtained, it is processed for better quality and then divided into Training, Validation and Test sets in a 70:20:10 ratio. This ratio can vary depending on the overall size of the available data set
  2. A particular algorithm is chosen, features are engineered for it, and the model is trained on the training data until we get the desired performance
  3. The model is then evaluated on the validation data. If the performance is not good enough, we go back to the training step, tune some of the hyperparameters, and evaluate again on the validation set. This is repeated until we are satisfied with both training and validation accuracy
  4. Finally, the model is tested on the test data set. If we get the desired result here, the model is deployed for production use
  5. If we don’t get the desired results with this algorithm, we try other suitable algorithms and repeat the process
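The 70:20:10 split from the first step can be sketched with the standard library alone. The function name, the fixed seed and the choice to give any rounding remainder to the test set are assumptions of this example.

```python
import random


def train_val_test_split(rows, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle rows and split them into training, validation and test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing several algorithms on the same validation and test sets. For an imbalanced target such as churn, a stratified split is often preferable; it is left out here to keep the sketch short.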

Model Evaluation

A classification model’s performance can be evaluated using a confusion matrix. A confusion matrix counts the correct and incorrect predictions and breaks them down by class. It shows the ways in which the classification model is confused when it makes predictions.
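For a binary churn model, the confusion matrix has four cells, and the usual summary metrics follow directly from them. The sketch below uses toy labels, and the function names are chosen for this example.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating True as 'churned'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1  # predicted churn, customer stayed
        elif a:
            counts["fn"] += 1  # predicted stay, customer churned
        else:
            counts["tn"] += 1
    return counts


def summary_metrics(c):
    """Derive accuracy, precision and recall from confusion counts."""
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0,
        "recall": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0,
    }


# Toy labels (illustrative only).
actual = [True, True, False, False, False]
predicted = [True, False, True, False, False]
counts = confusion_counts(actual, predicted)
metrics = summary_metrics(counts)
print(counts, metrics)
```

Because churners are a minority of customers, accuracy alone can look good even for a model that misses most of them, so recall and precision on the churn class are worth reading alongside it.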

The performance of the base models on the test data set is given below:

[Table: performance of the base models on the test data set]

The evaluation results show that Gradient Boosting gives the best predictions on this data set.

Importance of Features

Based on the Gradient Boosting model, the factors below are the top 10 most important, as they play a significant role in predicting customer churn.

[Figure: top 10 features by Gradient Boosting importance]

Below are visualizations of the top factors and their statistical significance results, which will help the service provider make decisions based on these insights.

This chart shows that customers whose total call duration is greater than 300 leave the service provider more often. The chi-square test shows that this result is statistically significant (p = 0.001).

This chart shows that customers who make more service calls leave the service provider more often. The chi-square test shows that this result is statistically significant (p = 0.001).


Customers who make more evening calls leave at a higher rate, possibly because they are unhappy with the service. The result is statistically significant (p = 0.001).

New Jersey has a higher customer churn rate. The chi-square test shows that this result is statistically significant (p = 0.001).


Customers who pay more for the service leave the service provider at a higher rate. The result is statistically significant (p = 0.001).

Customers who opted for the international plan have a higher churn rate. The result is statistically significant (p = 0.001).


Customers who pay more for evening calls leave the service provider at a higher rate. The result is statistically significant (p = 0.001).

Customers who make more international calls appear to leave at a higher rate, possibly because they are unhappy with the service. However, with p = 0.08 the result is not statistically significant at the conventional 5% level.


This result is not statistically significant (p = 0.13).

The churn rate is higher among customers who haven’t opted for the voice mail option, and the result is statistically significant (p = 0.001).