Mastering Customer Retention with Causal Machine Learning: A Hands-On Guide Using Python

Introduction

Imagine you've been a loyal customer at your bank for years, but recently their app has become overly complex. You’re contemplating leaving when, out of the blue, a bank employee reaches out to help you navigate the app. At the same time, your daughter, also a customer, is considering leaving due to high trading fees. Just before she unsubscribes, she receives a voucher for free trading for a month. How does this happen?

This scenario highlights a powerful strategy: uplift modeling. Rather than treating every at-risk customer the same way, uplift modeling tailors retention efforts to the customers most likely to respond positively, letting you target the right audience and maximize the effectiveness of your retention strategy.

Curious about how this works? In this article, we’ll explore how to implement uplift modeling to enhance customer retention. We’ll cover everything from defining your use case to evaluating model results, providing you with the knowledge and tools to apply this technique to your organization’s needs.

What We’ll Cover

  • Defining the use case: Understanding churn, target customers, and retention actions.
  • Data Acquisition: Identifying and collecting the right data for uplift modeling.
  • Modeling Techniques: Exploring various models used in uplift modeling.
  • Applying Knowledge: Implementing a single retention action—an email campaign.
  • Advanced Implementation: Handling multiple treatments and personalized user strategies.

Defining Our Use Case: Improving Customer Retention

To apply uplift modeling effectively, we first need to understand the concept of churn in our business context. What does churn mean for us? Who are we targeting, and what actions will we implement to retain customers? Addressing these questions is crucial for setting up a successful model.

Understanding Churn

Our first task is to precisely define churn. This involves understanding how and when customers stop engaging with our services. The definition of churn varies based on the business model:

  • In a transaction-based company, churn might be indicated by a drop in transaction frequency or account activity.
  • For subscription-based companies, churn can be seen in unsubscribed users or those who have stopped using the product.
  • A tech company might define churn as a lack of transactions over a set period, such as 90 days, while a mobile app might use a 30-day inactivity window.

The complexity of this definition will depend on your specific business needs and the metrics you choose to consider. The goal is to create clear, understandable thresholds for identifying churn.

Churn Prediction Window

Next, we need to determine the timeframe we aim to impact. Do we want to prevent churn within the next 15 or 30 days? The prediction window influences data organization and retention actions. Longer timeframes can complicate model performance and impact measurement, so it's wise to choose a practical window that balances model effectiveness and actionable insights.
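
To make this concrete, here is a minimal sketch of how such a label could be built, assuming a hypothetical activity log with user_id and timestamp columns (the table and column names are illustrative, not part of our dataset):

import pandas as pd

# Hypothetical activity log: one row per user action
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-01-20", "2024-02-10"]),
})

observation_date = pd.Timestamp("2024-03-15")
churn_window_days = 30  # the prediction window we chose

# A user is flagged as churned if their last activity is older than the window
last_activity = transactions.groupby("user_id")["timestamp"].max()
days_inactive = (observation_date - last_activity).dt.days
churn_label = (days_inactive > churn_window_days).astype(int)
print(churn_label)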

Selecting Target Users

Deciding whether to target a specific user segment is another key consideration:

  • Targeting segments with increased churn rates.
  • Focusing on high-value customers for a better return on investment.
  • Engaging new customers for sustained activation.
  • Addressing customers likely to churn soon.

In our example, we'll target clients with a higher likelihood of churn to maximize our retention efforts.

Defining Retention Actions

Choosing the right retention actions is crucial. Collaborating with business stakeholders can help in this process. We’ll explore four possible actions:

  • Personalized emails.
  • In-app notifications highlighting new features.
  • Direct calls to customers.
  • Special offers or discounts.

Our uplift model will help identify the most effective action for each customer.

Data Gathering: The Foundation of Uplift Modeling

Building an effective uplift model starts with gathering the right data, including both existing user information and experimental data.

Leveraging Existing User Data

Begin by examining the customer data available to you:

  • Customer demographics (age, geography, etc.).
  • Product specifics (subscription dates, tiers, etc.).
  • Transaction details (frequency, value, total spend, etc.).
  • Engagement metrics (login frequency, feature usage, etc.).

To enhance the model, analyze how these metrics change over time. Use techniques like moving averages, percentage changes, and aggregations at different time scales to create a comprehensive user profile.
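
As an illustration, here is a small sketch of such time-based features, assuming a hypothetical daily engagement table (names are illustrative):

import pandas as pd

# Hypothetical daily engagement table: one row per user per day
daily = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1],
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "logins": [3, 2, 0, 1, 4, 2],
})
daily = daily.sort_values(["user_id", "date"])

# 3-day moving average of logins per user
daily["logins_ma_3d"] = (
    daily.groupby("user_id")["logins"].transform(lambda s: s.rolling(3, min_periods=1).mean())
)
# Day-over-day percentage change
daily["logins_pct_change"] = daily.groupby("user_id")["logins"].pct_change()
# Aggregation at a coarser time scale: weekly totals
weekly_logins = daily.groupby(["user_id", pd.Grouper(key="date", freq="W")])["logins"].sum()
print(daily.head())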

Gathering Experimental Data

Next, collect data related to your retention actions. Ideally, use randomized experiments to test different actions and their impacts. If experiments aren’t feasible, observational data can still provide valuable insights, although it may require additional steps to address biases.
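
As a sketch of one common debiasing step, inverse propensity weighting, here is what it could look like with scikit-learn; the arrays below are stand-ins for your confounding features and treatment assignments:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in arrays: X = confounding features, treatment_flag = who received the action
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
treatment_flag = rng.integers(0, 2, size=1000)

# Estimate each user's probability of receiving the treatment (the propensity score)
propensity = LogisticRegression().fit(X, treatment_flag).predict_proba(X)[:, 1]

# Inverse propensity weights rebalance treated and control groups,
# which can then be fed into downstream effect estimation
weights = np.where(treatment_flag == 1, 1 / propensity, 1 / (1 - propensity))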

Generating Synthetic Data

If needed, generate synthetic data using tools like the causalml package from Uber. Here’s a brief example:

import pandas as pd
from causalml.dataset import make_uplift_classification

# Define parameters for uplift effects
n_uplift_increase_dict = {"email_campaign": 2, "in_app_notification": 3, "call_campaign": 3, "voucher": 4}
n_uplift_decrease_dict = {"email_campaign": 1, "in_app_notification": 1, "call_campaign": 2, "voucher": 1}
delta_uplift_increase_dict = {"email_campaign": 0.05, "in_app_notification": 0.03, "call_campaign": 0.08, "voucher": 0.10}
delta_uplift_decrease_dict = {"email_campaign": 0.02, "in_app_notification": 0.01, "call_campaign": 0.03, "voucher": 0.02}
n_uplift_increase_mix_informative_dict = {"email_campaign": 1, "in_app_notification": 2, "call_campaign": 1, "voucher": 2}
n_uplift_decrease_mix_informative_dict = {"email_campaign": 1, "in_app_notification": 1, "call_campaign": 1, "voucher": 1}

# Generate the dataset
df, feature_names = make_uplift_classification(
    n_samples=20000, 
    treatment_name=['email_campaign', 'in_app_notification', 'call_campaign', 'voucher'],
    y_name='retention',
    n_classification_features=20,
    n_classification_informative=10,
    n_uplift_increase_dict=n_uplift_increase_dict,
    n_uplift_decrease_dict=n_uplift_decrease_dict,
    delta_uplift_increase_dict=delta_uplift_increase_dict,
    delta_uplift_decrease_dict=delta_uplift_decrease_dict,
    n_uplift_increase_mix_informative_dict=n_uplift_increase_mix_informative_dict,
    n_uplift_decrease_mix_informative_dict=n_uplift_decrease_mix_informative_dict,
    positive_class_proportion=0.7,
    random_seed=42
)
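
Before going further, a quick sanity check on the generated dataset never hurts; for instance, comparing group sizes and retention rates per treatment group:

import numpy as np

# Retention rate and sample size per treatment group
print(df.pivot_table(values='retention', index='treatment_group_key',
                     aggfunc=[np.mean, np.size], margins=True))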

Preparing Data for Modeling

After gathering and encoding your data, the next step is to prepare it for modeling. We need to split the dataset into training and testing sets, and handle both mono-treatment and multi-treatment scenarios. Here’s how you can achieve that:

from sklearn.model_selection import train_test_split
import numpy as np

def prepare_data(df, feature_names, y_name, test_size=0.3, random_state=42):
    # Create a binary treatment column (1 = any treatment, 0 = control)
    df['treatment_col'] = np.where(df['treatment_group_key'] == 'control', 0, 1)
    
    # Split data into train and test sets
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=random_state)
    
    # Create mono-treatment subsets (email campaign vs. control only)
    df_train_mono = df_train[df_train['treatment_group_key'].isin(['email_campaign', 'control'])]
    df_test_mono = df_test[df_test['treatment_group_key'].isin(['email_campaign', 'control'])]
    
    # Features, treatment, and target for the full (multi-treatment) dataset
    X_train = df_train[feature_names].values
    X_test = df_test[feature_names].values
    treatment_train = df_train['treatment_group_key'].values
    treatment_test = df_test['treatment_group_key'].values
    y_train = df_train[y_name].values
    y_test = df_test[y_name].values
    
    # Features, treatment, and target for the mono-treatment dataset
    X_train_mono = df_train_mono[feature_names].values
    X_test_mono = df_test_mono[feature_names].values
    treatment_train_mono = df_train_mono['treatment_group_key'].values
    treatment_test_mono = df_test_mono['treatment_group_key'].values
    y_train_mono = df_train_mono[y_name].values
    y_test_mono = df_test_mono[y_name].values
    
    return {
        'df_train': df_train, 'df_test': df_test,
        'df_train_mono': df_train_mono, 'df_test_mono': df_test_mono,
        'X_train': X_train, 'X_test': X_test,
        'treatment_train': treatment_train, 'treatment_test': treatment_test,
        'y_train': y_train, 'y_test': y_test,
        'X_train_mono': X_train_mono, 'X_test_mono': X_test_mono,
        'treatment_train_mono': treatment_train_mono, 'treatment_test_mono': treatment_test_mono,
        'y_train_mono': y_train_mono, 'y_test_mono': y_test_mono
    }

# Example usage
data = prepare_data(df, feature_names, y_name='retention')

# Print shapes for verification
print(f"Full test set shape: {data['X_test'].shape}")
print(f"Mono-treatment test set shape: {data['X_test_mono'].shape}")

Understanding Uplift Modeling Approaches

Uplift modeling differs from traditional methods by predicting incremental impact (or uplift) rather than just predicting outcomes. Here’s an overview of common approaches:

  • Two-Model Approach: Train separate models for the treated and control groups. The uplift is calculated as the difference between predicted probabilities.
  • Difference in Probabilities: Compute the difference in outcome probabilities between the treated and control groups.
  • Uplift Trees: Use decision trees to predict uplift directly.

These methods can be explored using Python libraries such as scikit-learn or causalml.

Implementing the Two-Model Approach

Here’s how you can implement the two-model approach in Python:

from sklearn.ensemble import RandomForestClassifier
from causalml.metrics import plot_gain, plot_qini

# Unpack the mono-treatment arrays prepared earlier
X_train_mono, y_train_mono = data['X_train_mono'], data['y_train_mono']
X_test_mono, y_test_mono = data['X_test_mono'], data['y_test_mono']
treatment_train_mono = data['treatment_train_mono']
df_test_mono = data['df_test_mono']

# Train separate models on the control and treated sub-populations
control_mask = treatment_train_mono == 'control'
control_model = RandomForestClassifier(n_estimators=100, random_state=42)
treatment_model = RandomForestClassifier(n_estimators=100, random_state=42)
control_model.fit(X_train_mono[control_mask], y_train_mono[control_mask])
treatment_model.fit(X_train_mono[~control_mask], y_train_mono[~control_mask])

# Predict retention probabilities for each test user under both scenarios
control_preds = control_model.predict_proba(X_test_mono)[:, 1]
treatment_preds = treatment_model.predict_proba(X_test_mono)[:, 1]

# Uplift = difference between the two predicted probabilities
uplift = treatment_preds - control_preds

# plot_gain and plot_qini expect a dataframe holding the score, outcome, and treatment flag
df_two_model = df_test_mono[['retention', 'treatment_col']].copy()
df_two_model['two_model_uplift'] = uplift
plot_gain(df_two_model, outcome_col='retention', treatment_col='treatment_col')
plot_qini(df_two_model, outcome_col='retention', treatment_col='treatment_col')

Handling Multiple Treatments

When dealing with multiple treatments, consider these approaches:

  • One-vs-All: Create separate models for each treatment compared to the control group (sketched after this list).
  • Multi-class Models: Use models capable of handling multiple classes directly.
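
To give an idea of the one-vs-all setup, here is a minimal sketch using the two-model approach inside a loop, one model pair per action (a simplified illustration, assuming df, feature_names, and data from the earlier steps):

from sklearn.ensemble import RandomForestClassifier

actions = ['email_campaign', 'in_app_notification', 'call_campaign', 'voucher']
uplift_scores = {}

for action in actions:
    # Restrict training data to this treatment and control: the one-vs-all setup
    subset = df[df['treatment_group_key'].isin([action, 'control'])]
    treated = (subset['treatment_group_key'] == action).values
    X_sub = subset[feature_names].values
    y_sub = subset['retention'].values

    # Two-model approach inside the loop: one model per sub-population
    m_t = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_sub[treated], y_sub[treated])
    m_c = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_sub[~treated], y_sub[~treated])

    # Estimated uplift of this action for every test user
    uplift_scores[action] = (
        m_t.predict_proba(data['X_test'])[:, 1] - m_c.predict_proba(data['X_test'])[:, 1]
    )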

Evaluating Model Results

After building your uplift models, evaluate their effectiveness by analyzing metrics like Gain and Qini curves. These metrics help determine how well your model targets customers and optimizes retention actions.

By applying uplift modeling techniques, you can tailor your retention strategies to those customers who are most likely to benefit, significantly improving your organization’s retention rates.

A Closer Look at Meta-Learners

Before implementing our models, let’s quickly understand the meta-learners we’ll be using!

1. S-Learner (Single-Model)

S-Learner diagram — source: causalml documentation

The S-Learner is the simplest meta-learner of all. Why? Because it simply uses a single traditional machine learning model that includes the treatment indicator as an input feature. While simple to implement, it may struggle if the treatment variable’s importance is low relative to the other features.
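
Conceptually, a hand-rolled S-Learner looks like the sketch below (toy data; in practice we’ll rely on causalml’s BaseSRegressor): we fit one model with the treatment flag as an extra feature, then score everyone twice, with the flag switched on and off.

import numpy as np
from xgboost import XGBRegressor

# Toy data: features X, binary treatment t, outcome y with a small treatment effect
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
t = rng.integers(0, 2, size=500)
y = X[:, 0] + 0.1 * t + rng.normal(scale=0.1, size=500)

# S-Learner: a single model with the treatment indicator as an extra feature
model = XGBRegressor(n_estimators=50, max_depth=3).fit(np.column_stack([X, t]), y)

# Uplift = prediction with the treatment switched on minus switched off
uplift = (
    model.predict(np.column_stack([X, np.ones(len(X))]))
    - model.predict(np.column_stack([X, np.zeros(len(X))]))
)
print(uplift.mean())  # should land near the true effect of 0.1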

2. T-Learner (Two-Model)

“The T-Learner tries to solve the problem of discarding the treatment entirely by forcing the learner to first split on it. Instead of using a single model, we will use one model per treatment variable. In the binary case, there are only two models that we need to estimate (hence the name T).” — Source [3]

Each of these approaches has its pros and cons. How well they work will depend on your data and what you’re trying to achieve.

In this article, we’ll try out all three: an Uplift Random Forest Classifier, an S-Learner, and a T-Learner, and compare their performances when it comes to improving our company’s retention.

Single-Treatment Uplift Model Implementation with CausalML

Model Training

Now let’s train our models. We’ll start with our direct uplift model, the Uplift Random Forest Classifier. Then we’ll train our meta-models using an XGBoost regressor. Two things to note here:

  • The choice of base algorithm for your meta-models will obviously impact the final model performance, so select it carefully.
  • Yes, we’re selecting regressors rather than classifiers as meta-models, mainly because they provide more flexibility, outputting a continuous effect estimate rather than a class.

Here are the different steps you’ll find in the code below:

  • We initialize our results dataframe.
  • Then we train each model on our training set.
  • Finally, we predict our treatment effects on the test sets before saving the results.

from causalml.inference.meta import BaseSRegressor, BaseTRegressor
from causalml.inference.tree import UpliftRandomForestClassifier
from xgboost import XGBRegressor

# Save results in a dataframe
df_results_mono = df_test_mono.copy()

# Initialize and train an Uplift Random Forest Classifier
rfc = UpliftRandomForestClassifier(control_name='control')
rfc.fit(X_train_mono, treatment_train_mono, y_train_mono)

# Initialize and train S-Learner
learner_s = BaseSRegressor(
    learner=XGBRegressor(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42
    ),
    control_name='control'
)

learner_s.fit(X_train_mono, treatment_train_mono, y_train_mono)

# Initialize and train T-Learner
learner_t = BaseTRegressor(
    learner=XGBRegressor(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42
    ),
    control_name='control'
)

learner_t.fit(X_train_mono, treatment_train_mono, y_train_mono)

# Predict treatment effects
df_results_mono[["mono_S_learner"]] = learner_s.predict(X=X_test_mono)
df_results_mono[["mono_T_learner"]] = learner_t.predict(X=X_test_mono)
df_results_mono["random_forest_learner"] = rfc.predict(X_test_mono)

display(df_results_mono[["mono_S_learner", "mono_T_learner", "random_forest_learner"]].mean())

df_mono_results_plot = df_results_mono[["mono_S_learner","mono_T_learner", "random_forest_learner","retention","treatment_col"]].copy()

Model Evaluation

How do we evaluate and compare our models’ performances? That is a great question! We’re predicting something we cannot observe: each customer either received the treatment or was in the control group, so we never see the effect of the treatment on an individual customer. This means we cannot use classic evaluation metrics. Fortunately, there are other ways:

  • The Gain Curve: The gain curve offers an easy way to visualize our model’s performance. The idea behind gain is simple:
    • We compute the estimated effect for each of our customers and order them from the largest effect to the smallest.
    • From here, we move point by point. At each point, we compute the cumulative uplift: the difference between the average outcome of the treated and control customers targeted so far, scaled by the number of customers targeted (see the hand-computed sketch after this list).
    • We do that for both our model’s ordering and a random ordering, simulating random selection, and compare both curves!
    • This tells us what improvement our model would bring versus a random selection.
  • The AUUC Score: The AUUC (Area Under the Uplift Curve) score summarizes the gain curve in a single, easy-to-compare number by measuring the area under our model’s gain curve, which we can compare with that of the random model.
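
For intuition, here is how a single point of the gain curve could be computed by hand; a minimal sketch assuming numpy arrays of uplift scores, observed outcomes, and treatment flags:

import numpy as np

def gain_at(frac, score, y, t):
    """Cumulative gain when targeting the top `frac` share of users by predicted uplift."""
    order = np.argsort(-score)              # highest predicted uplift first
    top = order[: int(frac * len(score))]
    treated = t[top] == 1
    # (average treated outcome - average control outcome) * number of users targeted
    return (y[top][treated].mean() - y[top][~treated].mean()) * len(top)

# Toy check with random scores: the gain should hover near zero
rng = np.random.default_rng(0)
print(gain_at(0.5, rng.normal(size=1000), rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)))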

In the following code, we calculate these metrics:

import matplotlib.pyplot as plt
from causalml.metrics import plot_gain
from causalml.metrics import auuc_score

# AUUC score
auuc_normalized = auuc_score(df_mono_results_plot, outcome_col='retention', treatment_col='treatment_col', normalize=True, tmle=False)
print(f"AUUC Score Normalized: {auuc_normalized}")

# Plot gain curves for all three models
plot_gain(df_mono_results_plot, outcome_col='retention', treatment_col='treatment_col')
plt.title('Gain Curves - Mono-Treatment Models')
plt.show()

Here are the AUUC scores we obtained. Higher is better, of course.

  • T-Learner: ~6.4 (best performer)
  • S-Learner: ~6.3 (very close second)
  • Random Forest: ~5.7 (good, but not as good as the others)
  • Random Targeting: ~0.5 (baseline)

What do these results mean?

Well, all our models perform far better than random targeting, which is reassuring: they’re roughly 12 times more effective! We’ll see what this means in terms of impact shortly.

We also see from these AUUC scores that, while all models perform quite well, the T-Learner is the best performer.

Now let’s take a look at the gain curve.

Gain Curve

How to read a gain curve:

  • X-Axis (Population): This represents the share of the population you’re targeting, starting from the individuals with the highest predicted uplift (on the left) to the lowest (on the right).
  • Y-Axis (Gain): This shows the cumulative gain, which is the improvement in your outcome (e.g., increased retention).

Gain Curve Interpretation

The gain curve shows us the benefit, in our original unit (here, “people retained”), of targeting the population using our uplift model versus targeting it randomly.

In this case, it seems that if we reached out to the whole population with our emails, we would retain approximately 100 additional users. This is our baseline scenario. Note that every curve ends at this value, which is expected given how gain is defined.

So how do we interpret this? Looking at the curve, we can say that using our model and reaching out to only 50% of the population, we would save 600 additional users: six times more than by reaching out to everyone. How is that possible? By targeting only the users likely to react positively to our outreach, while ignoring those for whom the email might actually trigger churn, for instance.

Time for a small disclaimer: we’re using synthetic data here, and results this clean are extremely unlikely in the real world, but they illustrate the approach well.

In this case, our models enable us to do more with less. This is a good example of how uplift modeling lets us optimize our resources: by targeting a smaller share of the population, and hence limiting operating costs, we still obtain a large share of the results. A kind of Pareto effect, if you like.

Multi-Treatment Model: Personalization

Let’s now restart our analysis, considering all our retention strategies described above:

  • Email campaign
  • Call campaign
  • In-app notification
  • Vouchers

To achieve this, we need experimental results from either a single multi-treatment experiment covering all these actions, or aggregated results from multiple separate experiments. The better the experimental data, the better the predictive output we’ll get. However, setting up such experiments can take time and resources.

Let’s use our previously generated data, keeping in mind that obtaining this data in the first place is probably the biggest challenge of this approach!

Model Training

Let’s start by training our models. We’ll keep the same model types as before: an Uplift Random Forest Classifier, an S-Learner, and a T-Learner.

However, these models will now learn to differentiate between the effects of our four distinct treatments.

# Save results in a dataframe (arrays come from the prepare_data output)
df_results_multi = data['df_test'].copy()
X_train, X_test = data['X_train'], data['X_test']
treatment_train, y_train = data['treatment_train'], data['y_train']

# Define treatment actions
actions = ['call_campaign', 'email_campaign', 'in_app_notification', 'voucher']

# Initialize and train Uplift Random Forest Classifier
rfc = UpliftRandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_leaf=50,
    min_samples_treatment=10,
    n_reg=10,
    control_name='control',
    random_state=42
)
rfc.fit(X_train, treatment_train, y_train)

# Initialize and train S-Learner
learner_s = BaseSRegressor(
    learner=XGBRegressor(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42
    ),
    control_name='control'
)

learner_s.fit(X_train, treatment_train, y_train)

# Initialize and train T-Learner
learner_t = BaseTRegressor(
    learner=XGBRegressor(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42
    ),
    control_name='control'
)

learner_t.fit(X_train, treatment_train, y_train)

Predictions

Now that our models are trained, let’s generate our predictions for each treatment. For each user, we’ll get the uplift of every treatment. This will enable us to choose the most effective treatment for each user, if any treatment has a positive uplift; otherwise, we simply won’t reach out to that person!

def predict_multi(df, learner, learner_name, X_test):
    """
    Predict treatment effects for multiple treatments and determine the best treatment.
    """

    # Predict treatment effects (prediction columns follow the alphabetically
    # sorted treatment names, which here matches the order of `actions`)
    cols = [f'{learner_name}_learner_{action}' for action in actions]
    df[cols] = learner.predict(X=X_test)
    
    # Determine the best treatment effect
    df[f'{learner_name}_learner_effect'] = df[cols].max(axis=1)
    
    # Determine the best treatment
    df[f"{learner_name}_best_treatment"] = df[cols].idxmax(axis=1)
    df.loc[df[f'{learner_name}_learner_effect'] < 0, f"{learner_name}_best_treatment"] = "control"
    
    return df

# Apply predictions for each model
df_results_multi = predict_multi(df_results_multi, rfc, 'rf', X_test)
df_results_multi = predict_multi(df_results_multi, learner_s, 's', X_test)
df_results_multi = predict_multi(df_results_multi, learner_t, 't', X_test)

For each model, we obtain the estimated uplift of every treatment for every user, the best expected effect, and the corresponding best treatment. In other words, we can pick the best action for each user!
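
For example, a quick look at how often each action is recommended by the T-Learner (column names follow the predict_multi convention above):

# Distribution of the best treatment per user according to the T-Learner
print(df_results_multi["t_best_treatment"].value_counts())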

Model Evaluation

Now let’s look at our approach evaluation. As we have multiple treatments, it is slightly different:

  • For each user, we select the best treatment.
  • Then we order our users based on their best treatment effect.
  • And look at what really happened: whether each user actually stayed or left.

Following this rationale, it’s easy to see how we can outperform random targeting by only contacting a small share of our whole population.

From here, we can plot our gain curve and compute our AUUC. Easy, right? The code below does exactly that, still leveraging causalml.

# Build the evaluation dataframe for the T-Learner: its best estimated
# effect per user, plus the actual outcome and binary treatment flag
df_t_learner_plot_multi = df_results_multi[['t_learner_effect', 'retention', 'treatment_col']].copy()

# AUUC scores
auuc_normalized = auuc_score(df_t_learner_plot_multi, outcome_col='retention', treatment_col='treatment_col', normalize=True, tmle=False)
auuc_non_normalized = auuc_score(df_t_learner_plot_multi, outcome_col='retention', treatment_col='treatment_col', normalize=False, tmle=False)
print(f"AUUC Score Normalized: {auuc_normalized}")
print(f"AUUC Score: {auuc_non_normalized}")

# Plot Gain Curve
plot_gain(df_t_learner_plot_multi, outcome_col='retention', treatment_col='treatment_col')
plt.title('Gain Curve - T-Learner')
plt.show()

Results Interpretation

  • T-Learner: ~1.45 (best performer)
  • S-Learner: ~1.42 (very close second)
  • Random Forest: ~1.20 (good, but not as good as the others)
  • Random Targeting: ~0.52 (baseline)

What this means:

Once again, all our models outperform random targeting, and once again the T-Learner is the best performer. However, the gap is smaller than in the single-treatment case. Different reasons could explain this, one being the setup itself: we’re considering a bigger population here than in our first experiment. It could also mean that our models don’t perform as well on multi-treatment problems, in which case we would need to iterate to improve their performance.

But let’s look at our gain curve to better understand our performance.

Interpretation of the Multi-Treatment Gain Curve

As we can see, if we were to target 100% of our population — 30,000 users — we would retain an additional 850 users (approximately). However, using our models, we are able to retain 1,600 users while only contacting 33% of the total population.

Finally, we notice that past 40% of the population, all curves start to decrease, indicating that there is no value in contacting those customers.

We made it! We successfully built a model that enables us to effectively personalize our retention actions to maximize ROI. Based on these results, our company decided to put the model into production, saving millions by not wasting resources reaching out to everyone, and by focusing the right type of effort on the right customers!

Putting such a model into production is a challenge in itself, because we need to ensure its performance over the long term and keep retraining it when possible. A framework to do that (sketched after this list) would be to:

  • Generate inference with your model on 80% of your target population
  • Keep 10% of your target population intact: Control
  • Keep an additional 10% of your population for ongoing experimentation, to gather fresh training data for the next period (month/quarter/year, depending on your capabilities)
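
As a sketch, assigning users to these serving groups could look like this (the population dataframe is hypothetical):

import numpy as np
import pandas as pd

# Hypothetical target population of 10,000 eligible users
target_pop = pd.DataFrame({"user_id": range(10_000)})

# 80% scored by the model, 10% untouched control, 10% fresh experimentation
rng = np.random.default_rng(42)
target_pop["serving_group"] = rng.choice(
    ["inference", "control", "experiment"], size=len(target_pop), p=[0.8, 0.1, 0.1]
)
print(target_pop["serving_group"].value_counts(normalize=True))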

We might look into this later on!

Conclusion

If you made it this far, thank you! I hope this was interesting and that you learned how to create an uplift model and how to evaluate its performance.

If I did a good job, you now know that uplift models are an incredible tool and that they can lead to great, direct, and measurable impact. You may also have understood that uplift models enable us to target the right population with the right treatment, but require strong, exploitable experimental data to be trained on. Getting this data is often the biggest challenge of such projects. The approach is also applicable to historical/observational data, but one would then need to add specific cleaning and debiasing steps to ensure the data is unbiased.

So what’s next? While we’re deep-diving into the world of causal machine learning, I want to make sure you are heard. If there are specific topics you think you could apply in your own company and would like to learn more about, let me know and I’ll do my best. Let’s keep learning from each other! Until next time, happy modeling!

Source

Unless otherwise noted, all images are by the author
