Customer retention is a critical aspect of the telecom industry. With the high cost of acquiring new customers, it is often more cost-effective to retain existing customers than to attract new ones. Machine learning can play a crucial role in predicting customer churn and helping devise strategies to retain customers. This article provides a comprehensive guide to understanding and implementing a machine learning project for customer retention in telecom, complete with end-to-end code.
Understanding the Domain: Telecom
The telecom industry is characterized by high competition, with multiple service providers vying for the same customer base. Customer churn, or the rate at which customers stop doing business with an entity, is a significant concern. Factors contributing to customer churn can include service quality, customer care, pricing, and more.
Machine learning can help identify customers at risk of churn by recognizing patterns in customer behavior and usage data. In this article, we’ll focus on building a model to predict customer churn based on their usage data.
To tackle a customer churn problem in the telecom industry, we have employed machine learning techniques on a dataset to predict which customers are most likely to stop doing business with the company.
Data Preprocessing
We started by loading the customer data from a CSV file, which contained a variety of information about each customer including their tenure with the company, monthly charges, total charges, and whether or not they had churned.
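For reference, the loading step can look like the minimal sketch below; the file name 'telco_churn.csv' is an assumption here, so replace it with the actual path to your copy of the dataset.

import pandas as pd

# Load the customer data (file name is assumed; adjust to your dataset's path)
data = pd.read_csv('telco_churn.csv')

# Take a first look at the data
data.head()

The dataset contains the following columns: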
- customerID: A unique identifier for each customer
- gender: The customer’s gender
- SeniorCitizen: Indicates whether the customer is a senior citizen or not
- Partner: Indicates whether the customer has a partner or not
- Dependents: Indicates whether the customer has dependents or not
- tenure: The number of months the customer has stayed with the company
- PhoneService: Indicates whether the customer has a phone service or not
- MultipleLines: Indicates whether the customer has multiple lines or not
- InternetService: The customer’s internet service provider
- OnlineSecurity: Indicates whether the customer has online security or not
- OnlineBackup: Indicates whether the customer has online backup or not
- DeviceProtection: Indicates whether the customer has device protection or not
- TechSupport: Indicates whether the customer has tech support or not
- StreamingTV: Indicates whether the customer has streaming TV or not
- StreamingMovies: Indicates whether the customer has streaming movies or not
- Contract: The contract term of the customer
- PaperlessBilling: Indicates whether the customer has paperless billing or not
- PaymentMethod: The customer’s payment method
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (the target variable)
Before we can build a model, we’ll need to preprocess the data. This will involve:
- Encoding the categorical variables
- Converting the target variable (Churn) into a binary format
- Handling any missing values
# Check the data types and missing values
data.info()

# Convert 'TotalCharges' to numeric and handle errors by coercing them to NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Check for missing values again
data.isna().sum()
The ‘TotalCharges’ column was initially read as an object data type because it contains some non-numeric values. After converting ‘TotalCharges’ to a numeric type, we found 11 missing (NaN) values.
Let’s handle these missing values by replacing them with the median of ‘TotalCharges’. Using the median is a common practice for dealing with missing values as it is less sensitive to outliers than the mean.
Next, we will convert the ‘Churn’ column into a binary format, where ‘Yes’ is represented as 1 and ‘No’ as 0.
Lastly, we need to encode the categorical variables. Machine learning models require numerical input, so we can’t directly input the categorical data from columns like ‘gender’, ‘Partner’, etc. We’ll use a method called one-hot encoding to transform these categorical variables into a format that works better with classification algorithms.
# Handle missing values in 'TotalCharges' by replacing them with the median
data['TotalCharges'].fillna(data['TotalCharges'].median(), inplace=True)

# Convert 'Churn' into binary format
data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

# Drop 'customerID' as it doesn't provide any useful information for the model
data.drop('customerID', axis=1, inplace=True)

# Perform one-hot encoding on categorical features
data_encoded = pd.get_dummies(data)

# Display the first few rows of the preprocessed dataframe
data_encoded.head()
The preprocessing steps have been successfully applied to the data:
- Missing values in the ‘TotalCharges’ column were replaced with the median value.
- The ‘Churn’ column was transformed into a binary format, with ‘Yes’ represented as 1 and ‘No’ as 0.
- The ‘customerID’ column was dropped, as it doesn’t provide useful information for the model.
- One-hot encoding was applied to the categorical variables, creating new binary columns for each category.
Now the dataset contains only numerical values. Let’s visualize this data.
Data Visualization
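The plots described below can be reproduced with something along the following lines. This is a sketch assuming matplotlib and seaborn are available; the exact styling of the original figures may differ. It uses the data DataFrame from before one-hot encoding, since that version still holds the readable categorical columns such as Contract and PaymentMethod.

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of churn (0 = stayed, 1 = churned)
sns.countplot(x='Churn', data=data)
plt.title('Distribution of Churn')
plt.show()

# Distributions of tenure, monthly charges, and total charges, split by churn
for col in ['tenure', 'MonthlyCharges', 'TotalCharges']:
    sns.histplot(data=data, x=col, hue='Churn', bins=30)
    plt.title(f'Distribution of {col}')
    plt.show()

# Churn rate by contract type and by payment method
for col in ['Contract', 'PaymentMethod']:
    data.groupby(col)['Churn'].mean().plot(kind='bar')
    plt.ylabel('Churn rate')
    plt.title(f'Churn Rate by {col}')
    plt.tight_layout()
    plt.show()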
Distribution of Churn: This plot shows the distribution of churn in the dataset and helps us understand the class imbalance. It shows that the dataset is imbalanced, with more customers not churning (0) than churning (1). This is a common situation in churn datasets. Understanding this imbalance is important because it influences how we evaluate our model. For example, if 80% of customers don’t churn, a model that always predicts “no churn” would be 80% accurate, but it wouldn’t be very useful because it fails to identify the customers who do churn.
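To put a number on this imbalance, the class proportions can be checked directly (a small sketch, not from the original notebook):

# Proportion of customers in each churn class (0 = stayed, 1 = churned)
data['Churn'].value_counts(normalize=True)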
Distribution of Tenure: This is a histogram that shows the distribution of the number of months that customers have stayed with the company (tenure). This visualization can help us understand the typical customer lifespan in the company. We might find that a significant number of customers have a short tenure, indicating that many customers churn shortly after signing up for the service. This could suggest a need to improve the onboarding process or customer service in the early stages of the customer lifecycle.
Distribution of Monthly Charges: This is a histogram that shows the distribution of the amount that customers are charged monthly. It shows that customers who churn tend to have higher monthly charges. This might suggest that pricing is a key factor influencing churn, and the company could consider revising its pricing strategy or offering more flexible plans.
Distribution of Total Charges: This is a histogram that shows the distribution of the total amount that customers have been charged. This gives us insights into how much revenue the company generates from individual customers. If the churn is higher among customers who have a higher total charge, it indicates that the company is losing its potentially most valuable customers. Strategies could be devised to retain these high-revenue customers.
Churn Rate by Contract Type: This is a bar plot that shows the churn rate separated by contract type (Month-to-month, One year, Two year). Customers with a month-to-month contract have a higher churn rate. This could suggest that these customers feel less committed to the company and are more likely to switch to a competitor. The company could consider incentives to encourage customers to commit to longer contracts.
Churn Rate by Payment Method: This is a bar plot that shows the churn rate separated by payment method (Electronic check, Mailed check, Bank transfer, Credit card). The plot shows that the churn rate varies with the payment method. Customers who pay with an electronic check have a higher churn rate. This could be due to a variety of factors, such as convenience, security concerns, or personal preference, and would be worth investigating further.
Data Modelling
Now we are ready for data modelling.
Let’s continue with splitting the data into features (X) and target (y), and creating training and test sets. Then, we’ll build the Random Forest classifier model, evaluate it and interpret the model.
# Import the model and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Split the data into features and target
X = data_encoded.drop('Churn', axis=1)
y = data_encoded['Churn']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the evaluation metrics
accuracy, precision, recall, roc_auc
The evaluation metrics for the model are:
- Accuracy: 0.796
- Precision: 0.660
- Recall: 0.474
- ROC AUC: 0.693
These values indicate that the model’s performance is decent, but there is definitely room for improvement. The recall, in particular, is relatively low, which suggests that the model is not catching all of the churn cases. Precision indicates that when the model predicts a customer will churn, it’s correct about 66% of the time.
Remember that different business contexts might require optimizing for different metrics. For example, if the cost of falsely identifying a customer as likely to churn (and perhaps giving them unnecessary incentives to stay) is high, we’d want to increase precision. If missing customers who are likely to churn is a bigger problem, we’d want to focus on recall.
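As one illustration of this trade-off (not part of the original analysis), the classifier’s decision threshold can be lowered using the predicted churn probabilities, which typically raises recall at the cost of precision; the threshold value below is purely illustrative.

# Predicted probability of churn for each customer in the test set
y_proba = model.predict_proba(X_test)[:, 1]

# Flag a customer as a churn risk whenever the probability exceeds the (lowered) threshold
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)

# Compare precision and recall at the lower threshold
print(precision_score(y_test, y_pred_low), recall_score(y_test, y_pred_low))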
Let’s interpret the model by examining the feature importances.
# Get the feature importances
importances = model.feature_importances_

# Create a DataFrame to display the features and their importances
importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
})

# Sort the DataFrame by importance in descending order
importances_df = importances_df.sort_values(by='Importance', ascending=False)

# Display the DataFrame
importances_df
The feature importances indicate how much each feature contributes to the model’s predictions. The most important features for predicting customer churn, according to this model, are:
- TotalCharges: The total amount charged to the customer
- tenure: The number of months the customer has stayed with the company
- MonthlyCharges: The amount charged to the customer monthly
- Contract_Month-to-month: Whether the customer has a month-to-month contract
- PaymentMethod_Electronic check: Whether the customer’s payment method is an electronic check
These results make intuitive sense. Customers who have been with the company for a shorter period of time, have higher charges, and have a month-to-month contract are likely at a higher risk of churning.
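For readers who prefer a visual summary, the top features can be drawn as a quick bar chart along these lines (a sketch; the plotting choices are assumptions, not part of the original notebook):

import matplotlib.pyplot as plt

# Plot the ten most important features, largest bar at the top
top_features = importances_df.head(10).iloc[::-1]
plt.barh(top_features['Feature'], top_features['Importance'])
plt.xlabel('Importance')
plt.title('Top 10 feature importances')
plt.tight_layout()
plt.show()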
Conclusion
In conclusion, our model offers a useful tool for identifying customers at risk of churning. However, machine learning is an iterative process, and we can improve the model with more feature engineering and hyperparameter tuning. Furthermore, it’s important to remember that preventing churn is not just about predicting it, but also understanding why customers churn and taking action to improve customer satisfaction and loyalty.
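As a starting point for the hyperparameter tuning mentioned above, a simple grid search could look roughly like this; the parameter grid is an assumption, not a tuned configuration from the original project.

from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters to explore (values are illustrative)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

# Optimize for recall, since missing likely churners is often the costlier error
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='recall',
    cv=5,
)
grid_search.fit(X_train, y_train)

# Best hyperparameters and the corresponding cross-validated recall
print(grid_search.best_params_, grid_search.best_score_)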
You can find the Jupyter notebook and dataset here.