June 2022

Abstract

Churn is a measure of how many customers stop using a service or product, often evaluated over a specific period of time. One of the biggest difficulties in the telecommunications industry is retaining customers and preventing churn.

Customers in the telecom industry can choose from a variety of service providers and actively switch from one to the next. Acquiring new customers is not only more difficult, but also much more costly to companies than maintaining existing customer relationships. Therefore, many Customer Churn Prediction (CCP) models have been implemented.

In this study, a churn level prediction process is carried out using machine learning. A dataset with over 7000 customers of a telecom company is used. An action plan for the company is designed based on the results.

The complete dataset and auxiliary files can be found in the project repository on GitHub [7]. The full project notebook, with all the code, is also available on Deepnote.

In this article, all the discussion, results, and outputs of the Jupyter Notebook used to develop the study will be shown. The complete code can be seen in the repository [7].

Contextualization


The telecom industry, while in a consolidation phase in developed markets, is booming in emerging markets [1]. With the increasing variety of service providers, telecom churn has emerged as one of the most important causes of revenue erosion for telecom operators [2]. Thus, predicting churners from the demographic and behavioral data of customers has been of great interest. Acquiring a new subscriber can cost up to 6 times more than retaining an existing one [3].

Churn management involves, among many others:

  • predicting customers likely to churn
  • preventive actions
  • running marketing campaigns

But why do customers churn? [4]

There is no unique answer. However, we can think of a few likely causes:

  • customer no longer values the product
  • motivating factors to use the product no longer exist
  • customer frustrated with product user experience
  • the product lacks a mandatory capability required by the user
  • value to the customer does not justify the expense
  • the customer has switched to an alternative solution
  • damage to product reputation (e.g., cybersecurity issue, performance, etc.)

And how can a company reduce churn? [4]

Again, there is no silver bullet. But increasing the perceived value proposition of the service to current users is obviously a key point. This can be achieved in several ways, among them:

  • make sure customers get the most out of the product
  • recruit the right kind of customers
  • price based on value
  • continually add value without breaking what already works
  • don’t take customers for granted

In this study, an IBM dataset will be analyzed. The Telco customer churn data contains information about a telecom company named Telco that provides home phone and internet services to 7043 customers [5]. It indicates which customers have left, stayed, or signed up for the service. Multiple important demographics and services are included for each customer, totaling 20 features. The aim is to predict behavior in order to retain customers.

A cross-industry standard process for data mining (CRISP-DM) [6] approach will be used as shown in Figure 1.

Figure 1. Proposed methodology flow diagram

Each step will be detailed in the appropriate section.

Loading libraries

The following libraries were used. A Conda environment file is available in the project repository on GitHub [7] so that the exact development environment can be replicated by anyone.

Versions of the packages:

-------------------- | ----------
      Package        |  Version  
-------------------- | ----------
Imbalanced-Learn     |      0.9.1
Matplotlib           |      3.5.1
NumPy                |     1.22.3
Pandas               |      1.4.2
Scikit-Learn         |      1.1.1
Seaborn              |     0.11.2

Python version: 3.9.12
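
For reference, a minimal import block consistent with these versions (the aliases follow the usual conventions; the full list is in the repository notebook):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Scikit-Learn and Imbalanced-Learn pieces used throughout the study
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_validate)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler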


Summarizing data and understanding the problem at hand


Data dictionary

Each row represents a customer, and each column contains the customer’s attributes, as described below. The dataset includes information about:

  • Customers who left within the last month:
    • Churn: Yes = the customer left the company within the last month. No = the customer remained with the company.
  • Customers’ demographic info:
    • gender: customer’s gender: Male, Female
    • SeniorCitizen: customer is 65 or older: 1, 0 (meaning Yes and No, respectively)
    • Partner: customer is married: Yes, No
    • Dependents: customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.
  • Services that each customer has signed up for:
    • PhoneService: customer subscribes to home phone service with the company: Yes, No
    • MultipleLines: customer subscribes to multiple telephone lines with the company: Yes, No, No internet service
    • InternetService: customer subscribes to Internet service with the company: No, DSL, Fiber Optic
    • OnlineSecurity: customer subscribes to an additional online security service provided by the company: Yes, No, No internet service
    • OnlineBackup: customer subscribes to an additional online backup service provided by the company: Yes, No, No internet service
    • DeviceProtection: customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No, No internet service
    • TechSupport: customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No, No internet service
    • StreamingTV: customer uses their Internet service to stream television programming from a third-party provider: Yes, No, No internet service
    • StreamingMovies: customer uses their Internet service to stream movies from a third-party provider: Yes, No, No internet service
  • Customer account information:
    • tenure: total number of months that the customer has been with the company.
    • Contract: customer’s current contract type: Month-to-Month, One Year, Two Year.
    • PaperlessBilling: customer has chosen paperless billing: Yes, No
    • PaymentMethod: how the customer pays their bill: Electronic check, Credit Card, Mailed Check, Bank transfer
    • MonthlyCharges: customer’s current total monthly charge for all their services from the company
    • TotalCharges: customer’s total charges, calculated to the end of the quarter
  • Finally, each customer has a CustomerID, a unique ID that identifies the customer.

What problem do we have, and which metric should we use?

Based on the data dictionary, this is a classification problem.

  • Churn is the target variable
  • It is a binary (yes or no) classification problem

We will start our study with data wrangling to:

  • structure and organize the data
  • clean the data
    • wrong data types
    • remove duplicates
  • enrich the data
    • decide how to deal with empty entries (if any)

Then, we will perform an exploratory data analysis (EDA) to:

  • identify categorical and non-categorical features
  • visualize the distribution of each feature
  • identify correlations
  • identify outliers
  • check the need for data transformations
  • check the balance of the target variable
    • as we will see, the target is imbalanced

After the EDA, we are going to evaluate some machine learning algorithms to see which one gives the best churn prediction. Scikit-Learn pipelines will be used with preprocessing steps (standardization and encoding). Some algorithms may have problems with imbalanced targets [8], so some strategies to deal with the imbalance will be evaluated and added to the preprocessing steps. After the preprocessing steps, each model will be added to the pipeline and evaluated through cross-validation.

Imbalanced data also affect the choice of evaluation metrics. Although the dataset is not heavily imbalanced, some known problems arise when dealing with such cases, mainly the accuracy paradox [9] and the unreliability of ROC curves. Based on the literature, the chosen metric is the area under the precision-recall curve [10]. A more detailed explanation will be given in the appropriate section.

With the chosen metric, a grid search will be performed to tune the hyperparameters of selected algorithms, maximizing that metric’s score. Finally, a critical analysis of the results will be made, with some proposed actions to address the features that correlate most with churn.

We can now enhance our methodology flow chart with all these steps, as shown in Figure 2:

Figure 2. Methodology flow diagram

Basic info and data wrangling

Let’s get some basic info about our dataset to get familiar with it. The following table shows the first 5 entries of the dataset:

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Let’s check the size of the dataset:

Number of instances (rows):       7043
Number of attributes (columns):    21

The dataset has the following problems that were solved (see notebook on GitHub for details):

  • inconsistent representation of the categorical column SeniorCitizen
  • wrong data type (object) in TotalCharges, caused by spaces (' ') that should have been null entries (see the sketch below)
    • the spaces were converted to NaN, revealing 11 null entries
    • these null entries were filled with the median value of the column
    • the final data type of the column was float64, as expected for a numeric column
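
A minimal sketch of that fix, assuming the DataFrame is named df:

# Spaces in TotalCharges prevent a numeric dtype; coerce them to NaN first
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Fill the 11 resulting nulls with the column median; dtype is now float64
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())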

After the due wrangling of the dataset, we can generate descriptive statistics for all numeric columns:

        tenure  MonthlyCharges  TotalCharges
count  7043.00         7043.00       7032.00
mean     32.37           64.76       2283.30
std      24.56           30.09       2266.77
min       0.00           18.25         18.80
25%       9.00           35.50        401.45
50%      29.00           70.35       1397.47
75%      55.00           89.85       3794.74
max      72.00          118.75       8684.80

As expected, TotalCharges has a broad range of values.

Now, we can check the unique values for each column. This is a great way to understand categorical features:
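
In Pandas this is a one-liner, assuming the DataFrame is named df:

# Number of distinct values per column; small counts flag categorical features
df.nunique()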

customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64

The demographic columns have two categories each, while most of the service-related columns have three. We can extract more information from these columns:

customerID gender SeniorCitizen Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod Churn
count 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043
unique 7043 2 2 2 2 2 3 3 3 3 3 3 3 3 3 2 4 2
top 7590-VHVEG Male No No No Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check No
freq 1 3555 5901 3641 4933 6361 3390 3096 3498 3088 3095 3473 2810 2785 3875 4171 2365 5174

Now we know that most customers:

  • are male
  • are younger than 65 years
  • have no partner and no dependents
  • have phone service, with a single line
  • have fiber optic internet service
  • do not have online services (security, backup, device protection)
  • do not have tech support
  • do not have streaming services
  • have monthly contracts
  • have paperless billing and pay with electronic check

The target feature Churn has No as the most frequent value, occurring 5174 times.

It’s time to get visual.

Data visualization

We are interested in the churn rate, so let’s start with the target column. Let’s find out the proportion of churn:

[Figure: proportion of churn vs. no churn]

A little more than a quarter of the customers left the company in the last month: a high rate that we need to understand.

First, we will look at the numerical features, starting with histograms to search for patterns in the distributions:

[Figure: histograms of tenure, monthly charges, and total charges]

The tenure distribution has an interesting shape. Most customers have been with the company for just a few months, but many others have stayed for about 72 months (the maximum value for tenure). This is probably related to the different contract types, something we will check soon. The high number of customers with only a few months of tenure suggests that a marketing campaign was run recently to capture new customers.

We can see that most customers pay low monthly charges, but a sizeable fraction pays medium values. Since most customers have been with the company for just a few months, the total charges plot shows most customers with low values.

Let’s check if the tenure distribution is related to the contract type:

[Figure: tenure distribution by contract type]

As can be seen, most of the month-to-month contracts last for a few months, while the two-year contracts tend to last for years, with a great increase towards the highest tenure values in this dataset. This implies that customers with a greater commitment at the beginning, like a two-year contract, tend to stay with the company for a longer period of time. Long-term contracts usually carry contractual fines, so customers may have to wait until the end of the contract to churn; it is not clear whether that is the case here. Time-series data would be better suited to study this.

As seen before, we have numerical and categorical columns. Let’s start by seeing how our target variable relates to our numerical features.

[Figure: churn by tenure and monthly charges]

The above plot shows that short tenure (recent) customers have higher churn rates. Moreover, the higher the monthly charge, the higher the churn rate.

[Figure: boxplots of tenure, monthly charges, and total charges by churn]

The boxplots show that the churn rate is higher among customers with low tenure and high monthly charges. In detail:

  • the median tenure for customers who have left is around 10 months, while it is around 40 months for those who have stayed with the company
  • the median monthly charge for customers who have churned is around 80, while it is around 65 for those who have not churned
  • since most customers who have churned spent less time with the company, they have low total charges compared with those who have stayed
    • There are many outliers in the total charges boxplot of customers who have churned. The cause is not clear, but it could be wrong billing or expensive services that drove the customers away from the company.

Now the categorical features. Let’s find out the proportion of each category:

[Figure: proportion of each category in the categorical features]

Some highlights:

  • the dataset is almost equally distributed in terms of gender
  • 55.0% of the customers have month-to-month contracts
  • 21.7% of the customers do not have internet service
  • 90.3% of the customers have phone service
  • only 16.2% are senior citizens; thus, most customers are younger than 65
  • 48.3% have a partner, but only 30% have dependents

We see that some categorical features have ‘No’ and ‘No internet service’ (or ‘No phone service’) as categories. Maybe all of them can be labeled as ‘No’ if the extra categories provide no additional information. We can check this by plotting the churn rate by category:

[Figure: churn rate by category]

Features that seem to be positively correlated with churn:

  • month-to-month contracts
  • absence of online backup, online security, and device protection services
  • absence of tech support
  • being a senior citizen
  • paperless billing
  • payment by electronic check
  • fiber optic internet service

Features that seem to be negatively correlated with churn:

  • two-year contracts
  • absence of internet service
  • having a partner or dependent

We will quantify these correlations soon. First, let’s try to interpret the findings.

Both genders behave similarly when it comes to migrating to another service provider.

It is interesting to see that, for each service with a “No internet service” category, that category has a much lower churn rate. Maybe the internet service provided by the company has connectivity problems, particularly the fiber optic one. It could also be that the setup is not easy, so those who opted not to have tech support may not be able to use the services, and that would be more severe among senior customers. While there seem to be issues with the fiber optic internet, the DSL service has a much lower churn rate despite being a slower connection.

Since the “No internet service” category provided insights, it will not be merged with the “No” category.

We can explore more details about the internet service:

[Figure: churn and monthly charges by internet service type]

It’s interesting that customers with DSL and higher monthly charges have lower churn rates.

The data suggest that the more people in the customer’s household, the lower the churn. Customers with partners or dependents have lower churn rates, probably because more people are involved in the decision to leave, making it more difficult.

For marketing reasons, we can see how many customers who have partners also have dependents. Assuming that, in most cases, “dependents” mean children, it is more likely that customers with partners will also have dependents. Let’s check this assumption.

[Figure: dependents among customers with and without partners]

Almost half of the customers with partners have dependents. Again, assuming that in most cases “dependents” mean children, this means that marketing campaigns which aim at avoiding churn may focus on single people.

To quantify correlations, we need to convert the categorical variables into indicators; the get_dummies Pandas method was used for the conversion.
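
A minimal sketch of this conversion, assuming the DataFrame is named df (the Churn_Yes and Churn_No column names are what get_dummies produces from the Yes/No target):

# Indicator variables for every categorical column; numeric columns pass through
df_dummies = pd.get_dummies(df.drop(columns=["customerID"]))

# Correlation of every feature with the churn indicator, sorted for plotting
corr_with_churn = (
    df_dummies.corr()["Churn_Yes"]
    .drop(["Churn_Yes", "Churn_No"])
    .sort_values()
)

With the indicators, we can plot the correlation of each categorical and numerical feature with the target: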

[Figure: correlation of each feature with churn]

The correlation plot confirms the trends we have seen before:

  • strong negative correlation with churn:
    • tenure
    • two years contract
    • no internet service
  • strong positive correlation with churn:
    • month-to-month contract
    • no online services (security, tech support, backup, device protection)
    • fiber optic internet
    • electronic check payment

We could use these pieces of information to perform feature selection [11]. We are not going down this path here, but it is something to explore in a future work. Feature selection can be used to reduce the feature set, improving scores or boosting performance on very high-dimensional datasets. The dataset at hand is not that large, so the feature selection path is left for a future study.

Data transforms


Before evaluating some machine learning algorithms, we need to do some data transforms as follows:

  • the numerical features will be standardized so they share the same scale; this benefits some algorithms
    • we are going to use StandardScaler from Scikit-Learn
  • the categorical features will be encoded to be a suitable input to the algorithms
    • we are going to use OneHotEncoder from Scikit-Learn
  • the target column will be encoded
    • we are going to use LabelEncoder from Scikit-Learn

The StandardScaler and the OneHotEncoder will be part of a preprocessing step on a Scikit-Learn pipeline in all following procedures.
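
A sketch of this preprocessing step, assuming df is the wrangled DataFrame and X holds its feature columns (customerID and Churn excluded); the transformer names are chosen to match the std_scaler__ and ohe__ prefixes that appear in the feature-importance tables later in this article:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

numeric_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = [c for c in X.columns if c not in numeric_features]

# Scale the numeric columns, one-hot encode everything else
preprocessor = ColumnTransformer(transformers=[
    ("std_scaler", StandardScaler(), numeric_features),
    ("ohe", OneHotEncoder(), categorical_features),
])

# Target encoding: No -> 0, Yes -> 1
y = LabelEncoder().fit_transform(df["Churn"])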

After the transforms, the dataset will be split in two, one for features and the other for the target.

Full details with code can be seen on the GitHub repository.

Evaluate algorithms


We need a reference to begin our model evaluation. A DummyClassifier instance will be used to create a frame of reference. The classifier was configured to always predict the minority class, 1 (‘Churn’) in our case, a common baseline configuration for imbalanced datasets.
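
A minimal sketch of that baseline (the variable name is illustrative):

from sklearn.dummy import DummyClassifier

# Always predict class 1 ("churn"); precision then equals the churn
# proportion and recall equals 1.0, matching the reference table below
dummy = DummyClassifier(strategy="constant", constant=1)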

In order to test the efficiency of a given classifier, we need to:

  • split the data into train and test sets
    • the model will be evaluated on the test set (data it has not seen before)
    • in an imbalanced dataset, the split should be stratified to maintain the same class distribution in each subset
    • the splits will be repeated with random samples
  • perform cross-validation to avoid overfitting (a sketch of this loop is shown below)
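
A sketch of this evaluation loop, assuming X_train and y_train come from a stratified split and dummy is the baseline above; the scoring dictionary covers the 7 metrics reported in the tables:

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Stratified folds keep the churn proportion in every split;
# repeats redraw the folds with different random samples
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "f2": make_scorer(fbeta_score, beta=2),
    "auroc": "roc_auc",
    "auprc": "average_precision",
}

scores = cross_validate(dummy, X_train, y_train, cv=cv, scoring=scoring)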

Since we are going to do these steps with our real models, they will be used with our reference too. The results for the dummy classifier:

    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.265    |   0.000   
Precision       |   0.265    |   0.000   
Recall          |   1.000    |   0.000   
F1 score        |   0.419    |   0.000   
F2 score        |   0.644    |   0.000   
AUROC           |   0.500    |   0.000   
AUPRC           |   0.265    |   0.000   

There are 7 metrics in the score table above. The accuracy value is exactly the proportion of churn in the dataset, since we adopted a constant strategy. The recall score is also consistent with the constant strategy. But which metric is the most suitable for our case?

Which metric to use?

Since we have an imbalanced dataset, accuracy might not be the best metric. This is known as the accuracy paradox [9]. In our dataset, if a model predicts every example as zero (no churn), it will have an accuracy of 73.5%, which seems high but “predicts” based only on the majority class.

A common metric to compare models is the area under the ROC curve. However, as described in the literature [10], it is also a metric that can be misleading with imbalanced data.

There are many approaches described in the literature. We don’t have any information regarding costs (marketing costs, revenue lost due to churn) in this dataset, so we can’t make a cost analysis. That leaves the metrics built on precision and recall: the metrics themselves, the F-beta score, and the area under the precision-recall curve (AUPRC). In this study, recall is considered more important than precision, but we don’t want low precision scores either. That is, we consider it more important to detect real churn, minimizing false negatives, while avoiding too large an increase in false positives. Given these constraints, the candidates are the F-beta score, with beta greater than one, and AUPRC. The values of all metrics will be shown, but the chosen one is AUPRC.

The Scikit-Learn docs [12] state that the average_precision_score function from the package can be used as an AUPRC estimate, so it was the function chosen in this study.

Below are the models that will be tested. For the initial screening, default parameter values will not be modified, except for those models that have an option to be set as a binary classifier. Models with random behavior were seeded with a fixed value for reproducibility.

[('LR', LogisticRegression(max_iter=10000)),
 ('LDA', LinearDiscriminantAnalysis()),
 ('KNN', KNeighborsClassifier()),
 ('CART', DecisionTreeClassifier(random_state=42)),
 ('NB', GaussianNB()),
 ('SVC', SVC(random_state=42)),
 ('RF', RandomForestClassifier(random_state=42)),
 ('SGD', SGDClassifier(loss='modified_huber', random_state=42)),
 ('LGBM', LGBMClassifier(objective='binary', random_state=42)),
 ('XGB',
  XGBClassifier(tree_method='hist', objective='binary:logistic'))]

As previously stated, each model will be added to a pipeline after the preprocessing step (standardization and encoding).

Full details with code can be seen on the GitHub repository.

Results:

LR
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.804    |   0.010   
Precision       |   0.657    |   0.022   
Recall          |   0.552    |   0.027   
F1 score        |   0.599    |   0.022   
F2 score        |   0.570    |   0.025   
AUROC           |   0.724    |   0.014   
AUPRC           |   0.482    |   0.020   

LDA
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.798    |   0.009   
Precision       |   0.637    |   0.019   
Recall          |   0.555    |   0.026   
F1 score        |   0.593    |   0.020   
F2 score        |   0.570    |   0.023   
AUROC           |   0.720    |   0.013   
AUPRC           |   0.472    |   0.017   

KNN
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.763    |   0.009   
Precision       |   0.558    |   0.017   
Recall          |   0.525    |   0.025   
F1 score        |   0.541    |   0.020   
F2 score        |   0.531    |   0.022   
AUROC           |   0.687    |   0.013   
AUPRC           |   0.419    |   0.015   

CART
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.724    |   0.009   
Precision       |   0.481    |   0.016   
Recall          |   0.500    |   0.022   
F1 score        |   0.490    |   0.017   
F2 score        |   0.496    |   0.020   
AUROC           |   0.652    |   0.011   
AUPRC           |   0.373    |   0.011   

NB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.696    |   0.010   
Precision       |   0.460    |   0.010   
Recall          |   0.843    |   0.023   
F1 score        |   0.595    |   0.011   
F2 score        |   0.722    |   0.016   
AUROC           |   0.743    |   0.011   
AUPRC           |   0.430    |   0.010   

SVC
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.801    |   0.011   
Precision       |   0.670    |   0.028   
Recall          |   0.493    |   0.028   
F1 score        |   0.568    |   0.026   
F2 score        |   0.521    |   0.027   
AUROC           |   0.703    |   0.015   
AUPRC           |   0.465    |   0.022   

RF
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.789    |   0.007   
Precision       |   0.636    |   0.020   
Recall          |   0.483    |   0.023   
F1 score        |   0.549    |   0.018   
F2 score        |   0.507    |   0.021   
AUROC           |   0.692    |   0.011   
AUPRC           |   0.444    |   0.014   

SGD
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.762    |   0.020   
Precision       |   0.612    |   0.099   
Recall          |   0.458    |   0.222   
F1 score        |   0.471    |   0.157   
F2 score        |   0.458    |   0.196   
AUROC           |   0.665    |   0.071   
AUPRC           |   0.406    |   0.053   

LGBM
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.797    |   0.010   
Precision       |   0.642    |   0.021   
Recall          |   0.528    |   0.031   
F1 score        |   0.579    |   0.025   
F2 score        |   0.547    |   0.028   
AUROC           |   0.711    |   0.016   
AUPRC           |   0.465    |   0.020   

XGB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.783    |   0.009   
Precision       |   0.607    |   0.020   
Recall          |   0.515    |   0.033   
F1 score        |   0.557    |   0.025   
F2 score        |   0.531    |   0.030   
AUROC           |   0.697    |   0.016   
AUPRC           |   0.442    |   0.019   

[Figure: metric comparison of the evaluated models]

All models performed better than our dummy reference on the AUPRC metric. However, the values are still low, so there is probably room for improvement. Many machine learning algorithms tend to underperform on imbalanced datasets, and we can also tune each model’s hyperparameters.

Regarding imbalance, there are many strategies to deal with this problem:

  • Data sampling
    • Undersampling: delete or select a subset of examples from the majority class
    • Oversampling: duplicate examples in the minority class or synthesize new examples from the ones in the minority class
  • Select an algorithm that is less affected, or not affected at all, by imbalance
  • Collect more data, if possible
  • Select a model that penalizes classification errors in the minority class
  • Select a metric that has more focus on the minority class

Since we have a static dataset, we can’t collect more data. However, we can try all the other strategies. We have already discussed the metric that will be used. In the next sections, sampling strategies will be evaluated to deal with the imbalance problem, and a grid search for hyperparameter tuning will be performed on selected models.

Undersampling

Our first strategy uses RandomUnderSampler from the Imbalanced-Learn library. This sampler applies random undersampling, which randomly deletes examples from the majority class. The sampler will be added to the pipeline, after the preprocessing and before the model. Importantly, the change to the class distribution is applied only to the training dataset, since the intent is to influence the fit of the models. The sampling is not applied to the test or holdout datasets used to evaluate the performance of a model.
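
A sketch of such a pipeline, assuming preprocessor is the ColumnTransformer from the data transforms section and model is any classifier from the list above; Imbalanced-Learn ships its own Pipeline for exactly this purpose:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler

# imblearn's Pipeline runs the sampler only during fit, so test and
# holdout folds keep the original class distribution, as required
pipeline = ImbPipeline(steps=[
    ("preprocessing", preprocessor),
    ("undersampling", RandomUnderSampler(random_state=42)),
    ("model", model),
])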

Full details with code can be seen on the GitHub repository.

Results:

LR
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.745    |   0.006   
Precision       |   0.512    |   0.008   
Recall          |   0.804    |   0.020   
F1 score        |   0.625    |   0.010   
F2 score        |   0.721    |   0.014   
AUROC           |   0.763    |   0.009   
AUPRC           |   0.464    |   0.009   

LDA
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.741    |   0.007   
Precision       |   0.508    |   0.009   
Recall          |   0.805    |   0.021   
F1 score        |   0.622    |   0.010   
F2 score        |   0.720    |   0.015   
AUROC           |   0.761    |   0.009   
AUPRC           |   0.460    |   0.010   

KNN
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.704    |   0.008   
Precision       |   0.466    |   0.008   
Recall          |   0.794    |   0.019   
F1 score        |   0.587    |   0.009   
F2 score        |   0.696    |   0.013   
AUROC           |   0.733    |   0.009   
AUPRC           |   0.425    |   0.008   

CART
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.679    |   0.012   
Precision       |   0.433    |   0.014   
Recall          |   0.682    |   0.032   
F1 score        |   0.530    |   0.018   
F2 score        |   0.611    |   0.025   
AUROC           |   0.680    |   0.016   
AUPRC           |   0.380    |   0.014   

NB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.686    |   0.009   
Precision       |   0.451    |   0.009   
Recall          |   0.854    |   0.021   
F1 score        |   0.590    |   0.011   
F2 score        |   0.725    |   0.015   
AUROC           |   0.739    |   0.011   
AUPRC           |   0.424    |   0.010   

SVC
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.743    |   0.006   
Precision       |   0.510    |   0.007   
Recall          |   0.794    |   0.020   
F1 score        |   0.621    |   0.011   
F2 score        |   0.714    |   0.015   
AUROC           |   0.759    |   0.010   
AUPRC           |   0.460    |   0.010   

RF
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.740    |   0.009   
Precision       |   0.507    |   0.011   
Recall          |   0.764    |   0.021   
F1 score        |   0.610    |   0.014   
F2 score        |   0.694    |   0.017   
AUROC           |   0.748    |   0.012   
AUPRC           |   0.450    |   0.013   

SGD
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.686    |   0.070   
Precision       |   0.472    |   0.076   
Recall          |   0.727    |   0.192   
F1 score        |   0.546    |   0.037   
F2 score        |   0.633    |   0.110   
AUROC           |   0.699    |   0.029   
AUPRC           |   0.402    |   0.022   

LGBM
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.740    |   0.008   
Precision       |   0.507    |   0.010   
Recall          |   0.783    |   0.026   
F1 score        |   0.615    |   0.013   
F2 score        |   0.706    |   0.019   
AUROC           |   0.754    |   0.011   
AUPRC           |   0.454    |   0.011   

XGB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.729    |   0.009   
Precision       |   0.494    |   0.010   
Recall          |   0.761    |   0.022   
F1 score        |   0.599    |   0.011   
F2 score        |   0.687    |   0.016   
AUROC           |   0.739    |   0.010   
AUPRC           |   0.439    |   0.010   

[Figure: metric comparison of the models with random undersampling]

As can be seen, recall values increased in all models, with a decrease in precision values, since there is a trade-off between these metrics. Our models became better at classifying the minority class, with fewer false negatives but more false positives.

We have chosen the AUPRC metric, so we are going to select the 4 models with the highest AUPRC to tune their hyperparameters: LR, LDA, SVC, and LGBM.

Tuning hyperparameters - random undersampling

A grid search will be performed to find the best combination of hyperparameters for each model.
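
As an illustration, a search over the logistic regression pipeline might look like the sketch below; parameters of a pipeline step are addressed with the step__parameter syntax, and pipeline, cv, X_train, and y_train are the objects from the earlier sketches:

from sklearn.model_selection import GridSearchCV

# Illustrative grid for the logistic regression pipeline; the saga solver
# supports both l1 and l2 penalties (the actual grids are in the repository)
param_grid = {
    "model__penalty": ["l1", "l2"],
    "model__C": [0.01, 0.1, 1, 10],
    "model__solver": ["saga"],
}

grid = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=cv)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)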

Full details with code can be seen on the GitHub repository.

Results:

LR_l1
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.744    |   0.005   
Precision       |   0.511    |   0.006   
Recall          |   0.805    |   0.020   
F1 score        |   0.625    |   0.009   
F2 score        |   0.722    |   0.014   
AUROC           |   0.763    |   0.008   
AUPRC           |   0.463    |   0.008   

LR_l2
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.743    |   0.005   
Precision       |   0.511    |   0.006   
Recall          |   0.805    |   0.019   
F1 score        |   0.625    |   0.008   
F2 score        |   0.721    |   0.013   
AUROC           |   0.763    |   0.008   
AUPRC           |   0.463    |   0.008   

LDA
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.742    |   0.010   
Precision       |   0.509    |   0.012   
Recall          |   0.794    |   0.023   
F1 score        |   0.621    |   0.013   
F2 score        |   0.714    |   0.017   
AUROC           |   0.759    |   0.012   
AUPRC           |   0.459    |   0.013   

SVC
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.705    |   0.014   
Precision       |   0.469    |   0.015   
Recall          |   0.829    |   0.020   
F1 score        |   0.599    |   0.014   
F2 score        |   0.718    |   0.015   
AUROC           |   0.745    |   0.012   
AUPRC           |   0.434    |   0.014   

LGBM
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.752    |   0.007   
Precision       |   0.523    |   0.009   
Recall          |   0.777    |   0.029   
F1 score        |   0.625    |   0.013   
F2 score        |   0.708    |   0.020   
AUROC           |   0.760    |   0.012   
AUPRC           |   0.465    |   0.011   

[Figure: metric comparison of the tuned models (random undersampling)]

As can be seen, LGBM performed slightly better than LR on the AUPRC metric. Given that LR is usually easier to explain to stakeholders and faster to implement, it may be the better option for production. The LR algorithm gives a slightly higher recall (and slightly lower precision). Here, we will stay with LGBM.

Let’s take a look at the 10 most important features according to the tuned LGBM classifier:
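
One way to recover these, assuming grid is the fitted GridSearchCV for the LGBM pipeline (analogous to the earlier sketch) and the step names used throughout:

import pandas as pd

# best_estimator_ is the refitted winning pipeline from the grid search
best_pipeline = grid.best_estimator_

names = best_pipeline.named_steps["preprocessing"].get_feature_names_out()
importances = best_pipeline.named_steps["model"].feature_importances_

# Ten largest importances, labeled with the transformed feature names
print(pd.Series(importances, index=names).nlargest(10))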

[Figure: top 10 feature importances of the tuned LGBM classifier]

We see that all listed features are among the ones thoroughly discussed during the EDA performed in the summarizing section. It is interesting to notice that tenure is by far the most important feature. And, as detailed in the EDA, fiber optic internet is definitely a service the company should take a closer look at, being the second most important feature to our model.

Will oversampling be a better strategy?

Oversampling

Now, we are going to use SMOTE from the Imbalanced-Learn library. This sampler applies the Synthetic Minority Oversampling Technique (SMOTE), which selects examples that are close in the feature space and creates new samples along the lines between them. Specifically, a random example from the minority class is chosen first; then k of its nearest neighbors are found. A randomly selected neighbor is picked, and a synthetic example is created at a randomly selected point between the two examples in feature space.

As with undersampling, the sampler will be added to the pipeline after the preprocessing and before the model, and the change to the class distribution is applied only to the training dataset, never to the test or holdout datasets used to evaluate the performance of a model.
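
A sketch, mirroring the undersampling pipeline with only the sampler step swapped (preprocessor and model as before):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Same structure as the undersampling pipeline; only the sampler changes
pipeline = ImbPipeline(steps=[
    ("preprocessing", preprocessor),
    ("oversampling", SMOTE(random_state=42)),
    ("model", model),
])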

The same models as before will be tested. For the initial screening, default parameter values will not be modified, except for those models that have an option to be set as a binary classifier.

Full details with code can be seen on the GitHub repository.

Results:

LR
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.748    |   0.006   
Precision       |   0.517    |   0.008   
Recall          |   0.799    |   0.023   
F1 score        |   0.627    |   0.010   
F2 score        |   0.720    |   0.016   
AUROC           |   0.764    |   0.009   
AUPRC           |   0.466    |   0.009   

LDA
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.745    |   0.007   
Precision       |   0.513    |   0.008   
Recall          |   0.795    |   0.024   
F1 score        |   0.623    |   0.011   
F2 score        |   0.716    |   0.017   
AUROC           |   0.761    |   0.010   
AUPRC           |   0.462    |   0.010   

KNN
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.689    |   0.010   
Precision       |   0.447    |   0.011   
Recall          |   0.724    |   0.019   
F1 score        |   0.553    |   0.012   
F2 score        |   0.644    |   0.015   
AUROC           |   0.700    |   0.011   
AUPRC           |   0.397    |   0.010   

CART
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.722    |   0.010   
Precision       |   0.480    |   0.018   
Recall          |   0.545    |   0.033   
F1 score        |   0.510    |   0.023   
F2 score        |   0.531    |   0.029   
AUROC           |   0.666    |   0.016   
AUPRC           |   0.383    |   0.016   

NB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.701    |   0.010   
Precision       |   0.465    |   0.010   
Recall          |   0.838    |   0.022   
F1 score        |   0.598    |   0.012   
F2 score        |   0.722    |   0.016   
AUROC           |   0.745    |   0.011   
AUPRC           |   0.433    |   0.011   

SVC
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.764    |   0.009   
Precision       |   0.541    |   0.012   
Recall          |   0.729    |   0.029   
F1 score        |   0.621    |   0.015   
F2 score        |   0.682    |   0.021   
AUROC           |   0.753    |   0.012   
AUPRC           |   0.466    |   0.013   

RF
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.779    |   0.010   
Precision       |   0.590    |   0.020   
Recall          |   0.556    |   0.029   
F1 score        |   0.572    |   0.023   
F2 score        |   0.562    |   0.026   
AUROC           |   0.708    |   0.015   
AUPRC           |   0.446    |   0.019   

SGD
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.720    |   0.034   
Precision       |   0.489    |   0.042   
Recall          |   0.792    |   0.093   
F1 score        |   0.600    |   0.032   
F2 score        |   0.700    |   0.057   
AUROC           |   0.743    |   0.025   
AUPRC           |   0.440    |   0.027   

LGBM
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.785    |   0.008   
Precision       |   0.595    |   0.013   
Recall          |   0.598    |   0.033   
F1 score        |   0.596    |   0.021   
F2 score        |   0.597    |   0.028   
AUROC           |   0.725    |   0.015   
AUPRC           |   0.462    |   0.017   

XGB
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.779    |   0.009   
Precision       |   0.586    |   0.017   
Recall          |   0.571    |   0.030   
F1 score        |   0.578    |   0.022   
F2 score        |   0.574    |   0.027   
AUROC           |   0.713    |   0.015   
AUPRC           |   0.449    |   0.018   

[Figure: metric comparison of the models with SMOTE oversampling]

As can be seen, compared to no sampling, recall values increased in all models, with a decrease in precision values, since there is a trade-off between these metrics. The models became better at classifying the minority class, with fewer false negatives but more false positives.

We have chosen the AUPRC metric, so, again, we are going to select the 4 models with the highest AUPRC to tune their hyperparameters: LR, LDA, SVC, and LGBM.

Tuning hyperparameters - oversampling

A grid search will be performed to find the best combination of hyperparameters for each model. The same hyperparameter grids evaluated for random undersampling (RUS) will be evaluated here.

Full details with code can be seen on the GitHub repository.

Results:

LR_l1
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.751    |   0.008   
Precision       |   0.520    |   0.010   
Recall          |   0.798    |   0.022   
F1 score        |   0.629    |   0.011   
F2 score        |   0.720    |   0.016   
AUROC           |   0.766    |   0.010   
AUPRC           |   0.468    |   0.011   

LR_l2
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.748    |   0.006   
Precision       |   0.517    |   0.008   
Recall          |   0.799    |   0.023   
F1 score        |   0.627    |   0.010   
F2 score        |   0.720    |   0.016   
AUROC           |   0.764    |   0.009   
AUPRC           |   0.466    |   0.009   

LDA
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.747    |   0.009   
Precision       |   0.515    |   0.011   
Recall          |   0.787    |   0.023   
F1 score        |   0.622    |   0.013   
F2 score        |   0.712    |   0.018   
AUROC           |   0.760    |   0.012   
AUPRC           |   0.462    |   0.013   

SVC
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.725    |   0.017   
Precision       |   0.490    |   0.018   
Recall          |   0.816    |   0.023   
F1 score        |   0.612    |   0.016   
F2 score        |   0.720    |   0.016   
AUROC           |   0.754    |   0.013   
AUPRC           |   0.449    |   0.016   

LGBM
    Metric      |    Mean    |  Std Dev  
----------------------------------------
Accuracy        |   0.788    |   0.007   
Precision       |   0.587    |   0.012   
Recall          |   0.681    |   0.034   
F1 score        |   0.630    |   0.017   
F2 score        |   0.660    |   0.026   
AUROC           |   0.754    |   0.014   
AUPRC           |   0.485    |   0.014   

[Figure: metric comparison of the tuned models (SMOTE oversampling)]

LGBM performed better on the AUPRC metric. However, this result came from a high precision score and only a fair recall score. If more weight on recall is needed, logistic regression, the close second by the metric, seems a better choice, with the bonus of being a faster and easier-to-understand algorithm. As discussed earlier, we do not have cost data to support a choice between weighting precision or recall more heavily; that is why we chose AUPRC as the metric, to keep some balance between the two.

The oversampling strategy did not improve the metrics compared with undersampling. Since it takes longer, and the SMOTE algorithm is more complex to explain to stakeholders, it is reasonable to choose undersampling, with the LGBM or the LR model, for production.

So, undersampling is better. Just for the sake of completeness, and since the LR algorithm is simpler and faster, let’s take a look at the most important features according to this model combined with oversampling. We will choose the model with the saga solver, since both LR models shown before performed almost equally on the metric. This solver is a variation of gradient descent and incremental aggregated gradient approaches that uses a random sample of previous gradient values. Since it is fast for big datasets, and oversampling results in a larger training set than undersampling, it seems a reasonable choice.

The way to get feature importance information from logistic regression differs from what we saw for LGBM in the undersampling section. To understand it, we need to recall some math.

Representing as \(p_+(x)\) the model’s estimate of the probability of class membership of a data item represented by feature vector x, we have the following equation that specifies that the log-odds of the class is equal to a linear function \(f(x)\):

\[\log \left( \frac{p_+(x)}{1 - p_+(x)} \right) = f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots\]

where \(w_i\) are the coefficients, or weights, of each feature. Solving the equation for \(p_+(x)\) yields the logistic function [13]:

\[p_+(x) = \frac{1}{1+\exp(-f(x))}\]

This is a classification problem with classes 0 (no churn) and 1 (churn). The logistic regression model has a coef_ attribute that contains the coefficient found for each feature. These coefficients can be positive or negative: positive scores indicate a feature that predicts class 1, whereas negative scores indicate a feature that predicts class 0. We can sort these values in ascending order and look at the first 5 features (negative, predicting 0) and the last 5 (positive, predicting 1):
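
A sketch of how these coefficients can be paired with the transformed feature names, assuming lr_pipeline is the fitted logistic regression pipeline with the step names used earlier:

import pandas as pd

# One coefficient per transformed feature; coef_ has shape (1, n_features)
coefs = pd.Series(
    lr_pipeline.named_steps["model"].coef_[0],
    index=lr_pipeline.named_steps["preprocessing"].get_feature_names_out(),
).sort_values()

print(coefs.head())  # most negative: push the prediction towards class 0
print(coefs.tail())  # most positive: push the prediction towards class 1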

Feature Importance
0 std_scaler__tenure -1.545057
5 ohe__Contract_Two year -0.889923
1 std_scaler__MonthlyCharges -0.444953
11 ohe__InternetService_DSL -0.432502
14 ohe__MultipleLines_No -0.252727
Feature Importance
41 ohe__TechSupport_No 0.295541
29 ohe__PaymentMethod_Electronic check 0.379205
12 ohe__InternetService_Fiber optic 0.544658
3 ohe__Contract_Month-to-month 0.782041
2 std_scaler__TotalCharges 0.896876

The features listed are essentially the same ones that the LGBM model ranked among the top 10 most important features in the undersampling section. We can see that tenure has the largest absolute value, making it the feature with the most weight.

We have seen that, mathematically, there is a relationship between these coefficients and odds. In fact, we can convert these coefficients to odds, making them easier to interpret, by simply exponentiating the values [14]:
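
With the coefs Series from the sketch above, the conversion is a single NumPy call:

import numpy as np

# exp(w) turns a log-odds coefficient into an odds ratio
odds = np.exp(coefs).sort_values(ascending=False)
print(odds)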

Feature odds
2 std_scaler__TotalCharges 2.451932
3 ohe__Contract_Month-to-month 2.185929
12 ohe__InternetService_Fiber optic 1.724018
29 ohe__PaymentMethod_Electronic check 1.461123
41 ohe__TechSupport_No 1.343853
... ... ...
... ... ...
14 ohe__MultipleLines_No 0.776680
11 ohe__InternetService_DSL 0.648884
1 std_scaler__MonthlyCharges 0.640854
5 ohe__Contract_Two year 0.410687
0 std_scaler__tenure 0.213300

Verbally, we can say that, for every one-unit increase in TotalCharges, the odds that the observation is in class 1 (churn) are 2.45 times as large as the odds that it is not, when all other variables are held constant. The same interpretation holds for every value greater than 1. The top 5 are: TotalCharges, Contract_Month-to-month, InternetService_Fiber optic, PaymentMethod_Electronic check, and TechSupport_No. It is interesting to see TotalCharges here: we saw in the visualization section that there are outliers in the churn class for this feature. Since the Scikit-Learn StandardScaler used in the preprocessing is sensitive to outliers [15], it might have some influence on this result. All the other features were discussed before as features that indicate churn.

For odds less than 1, we can take 1/odds to make better sense of them. So, for every one-unit increase in tenure, the odds that the observation is NOT in class 1 (churn) are 1/0.21, or 4.76, times as large as the odds that it is. The top 5, considering the inverse value, are: tenure, Contract_Two year, MonthlyCharges, InternetService_DSL, and MultipleLines_No. All of these features were discussed in the EDA as features that do not indicate churn.

We can visualize the odds through a colorized bar chart as follows:

[Figure: odds of churn for each feature (bar chart)]

Conclusions and action plan


Customer churn is a critical issue that needs to be analyzed and predicted. Therefore, a model that can predict customer churn and handle a large amount of data is valuable. In this study:

  • an extensive EDA was performed
    • detailed insights were gained from the data regarding important features
  • pipelines were used with the following pattern: preprocessing (scaling, encoding); resampling; model
    • the pipelines were then fed into cross-validation with repeated stratified K-fold (5 folds, 3 repeats)
  • resampling strategies were analyzed and compared
    • random undersampling and SMOTE had similar results
    • random undersampling is easier to explain to stakeholders and outputs a smaller dataset, resulting in faster training times when models are applied
      • it might be preferable in a production scenario
  • AUPRC was the main metric used to compare models due to its literature background on imbalanced datasets
    • the top 4 classifiers were chosen for hyperparameter tuning
  • the LGBM classifier had the best scores with both resampling strategies
    • LR had scores close to LGBM
      • since it is easier to explain and faster to train, it might be preferable in a production scenario
      • if higher recall values are desired, LR is the best model
  • the important features of both LGBM and LR were among the ones detailed in the EDA

The results suggest that the Telco company should:

  • give more attention to technical support
  • improve the fiber optic service
  • invest in marketing strategies targeting
    • customers with short-term contracts, trying to move them to long-term contracts
    • customers without online services, offering these services
    • single customers, since their churn rate is higher than that of those who have partners and dependents
  • guide customers towards simple payment methods like paper billing and credit card

The results shown here give some background and ideas for future works exploring:

  • feature selection
  • class_weight in models that support it or some similar parameter
  • a scaler less sensitive to outliers than StandardScaler
  • combination of random undersampling and oversampling
  • tuning more hyperparameters for the selected models
  • other models, like the ones used in the cited papers

I hope to have provided interesting insights and a valuable project. Should you have any comments, questions, or suggestions, don’t hesitate to contact me through the project repository on GitHub [7].

References

  1. N. Modani, K. Dey, R. Gupta, and S. Godbole, “CDR Analysis Based Telco Churn Prediction and Customer Behavior Insights: A Case Study,” in Web Information Systems Engineering – WISE 2013, vol. 8181, X. Lin, Y. Manolopoulos, D. Srivastava, and G. Huang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 256–269. doi: 10.1007/978-3-642-41154-0_19.
  2. S. Agrawal, A. Das, A. Gaikwad, and S. Dhage, “Customer Churn Prediction Modelling Based on Behavioural Patterns Analysis using Deep Learning,” in 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), Shah Alam, Jul. 2018, pp. 1–6. doi: 10.1109/ICSCEE.2018.8538420.
  3. I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, “A Churn Prediction Model Using Random Forest: Analysis of Machine Learning Techniques for Churn Prediction and Factor Identification in Telecom Sector,” IEEE Access, vol. 7, pp. 60134–60149, 2019, doi: 10.1109/ACCESS.2019.2914999.
  4. “Churn.” https://www.productplan.com/glossary/churn/ (accessed Jun. 22, 2022).
  5. IBM. “Telco customer churn”. https://www.kaggle.com/datasets/blastchar/telco-customer-churn (accessed Jun. 1, 2022)
  6. C. Shearer, “The CRISP-DM Model: The New Blueprint for Data Mining,” Journal of Data Warehousing, vol. 5, pp. 13–22, 2000.
  7. F. L. S. Bustamante, Customer churn prediction, 2022. https://github.com/chicolucio/customer-churn-prediction
  8. J. Burez and D. Van den Poel, “Handling class imbalance in customer churn prediction,” Expert Systems with Applications, vol. 36, no. 3, pp. 4626–4636, Apr. 2009, doi: 10.1016/j.eswa.2008.05.027.
  9. Wikipedia. “Accuracy paradox”. https://en.wikipedia.org/wiki/Accuracy_paradox
  10. T. Saito and M. Rehmsmeier, “The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets,” PLoS ONE, vol. 10, no. 3, p. e0118432, Mar. 2015, doi: 10.1371/journal.pone.0118432.
  11. Scikit-Learn. “Feature selection”. https://scikit-learn.org/stable/modules/feature_selection.html
  12. Scikit-Learn. “Model evaluation”. https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics
  13. Provost, Foster, and Tom Fawcett. Data Science for Business: what You Need to Know About Data Mining and Data-analytic Thinking. Sebastopol, Calif.: O’Reilly, 2013
  14. J. Benton, “Interpreting Coefficients in Linear and Logistic Regression,” Medium, Jul. 22, 2020. https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1 (accessed Jun. 21, 2022).
  15. Scikit-Learn. “Compare the effect of different scalers on data with outliers”. https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#