Improving E-Commerce Conversion Rates with Machine Learning

Date: Jan. 27, 2023

Problem introduction

Conversion rate is a measure used in digital marketing to evaluate how effectively a campaign or website converts visitors into customers or achieves a specific goal. It is an important metric to track and optimize because it helps businesses understand how effective their marketing campaigns are and how they can improve their website or landing pages to generate more conversions. The objective of this project is to use machine learning to understand and predict conversion rates across different user and advertiser segments. The project also aims to provide actionable recommendations that help businesses optimize their marketing strategies and improve user conversion rates.

Exploratory data analysis

The dataset we have contains the following attributes:

  1. country: user country based on the IP address
  2. age: user age (self-reported at sign-up)
  3. new_user: whether the user created the account during this session or already had an account and simply came back to the site
  4. source: marketing channel source
  5. total_pages_visited: number of total pages visited during the session. This can be seen as a proxy for time spent on site and engagement
  6. converted: this is our label. 1 means they converted within the session, 0 means they left without buying anything. The percentage of converted users is 3.23%.

Among the attributes, country and source are categorical, while age, new_user, total_pages_visited, and converted are numerical (new_user and converted are binary 0/1 flags).

We perform exploratory data analysis to enhance our understanding of the user profile and how features are associated with user conversion.

The distributions of user age and total_pages_visited are both positively skewed, and the total_pages_visited distribution has a long right tail.
User distributions for categorical features (new_user, country, and source)

Our findings about the e-commerce user profile: 

  • Most of the users are under 50 years old and have visited fewer than 15 pages.
  • The new users to existing users ratio is 7 to 3.
  • Half of the users are from the US, while only 4 percent are from Germany.
  • Half of the users come from the ‘Seo’ marketing channel.
  • About 3% of the users converted, so the data is highly imbalanced.

The correlation matrix for numerical features age, total_pages_visited, new_user, and converted

 

User age vs. conversion rate: the conversion rate decreases monotonically as user age increases, up to age 60.

To answer some important questions: 

  1. What is the association between conversion rate and user profile? Age and being a new user are negatively associated with user conversion, while the total number of pages visited is positively associated with it. Chinese users also have a much lower conversion rate than users from the UK, US, and Germany. Marketing channel source does not appear to affect whether users convert.
  2. What are some key indicators for a high/low conversion rate? Key indicators for a high conversion rate are a high number of pages visited, an existing account, and a younger age. Key indicators for a low conversion rate are being from China, a low number of pages visited, being a new user, and an older age.
  3. What would a group of users with a high conversion rate look like and what group could be the best target audience for a marketing campaign? Younger users who have already visited a number of pages, have existing accounts, and live in Germany, the UK, or the US.
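
The segment comparisons above come down to computing the share of converted sessions within each group. Below is a minimal sketch of that calculation in plain Python. The helper name conversion_rate_by and the toy rows are illustrative assumptions, not taken from the project notebook or its data.

```python
from collections import defaultdict


def conversion_rate_by(rows, key):
    """Return {segment: share of converted sessions} for the given column."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        converted[row[key]] += row["converted"]
    return {segment: converted[segment] / totals[segment] for segment in totals}


# Toy rows for illustration only; these are NOT rows from the project dataset.
sessions = [
    {"country": "US", "new_user": 0, "converted": 1},
    {"country": "US", "new_user": 1, "converted": 0},
    {"country": "China", "new_user": 1, "converted": 0},
    {"country": "China", "new_user": 1, "converted": 0},
    {"country": "Germany", "new_user": 0, "converted": 1},
    {"country": "UK", "new_user": 0, "converted": 0},
]

rates_by_country = conversion_rate_by(sessions, "country")
rates_by_new_user = conversion_rate_by(sessions, "new_user")
```

If the data is loaded into a pandas DataFrame named df, the same per-segment rate can be obtained with df.groupby("country")["converted"].mean().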

Link to Jupyter Notebook file: https://github.com/aprilzhizhou/machine_learning_projects/blob/main/conversion_rate/1_exploratory_data_analysis.ipynb

Machine learning models

Here we build machine learning models to predict conversion based on user profiles. Our goal is to make accurate predictions and identify the features that matter most for user conversion. The classification models we choose are logistic regression, random forest, decision tree (for result interpretation), and XGBoost.

To optimize the performance of each model, we conduct hyperparameter tuning with grid search and train the model with the optimal hyperparameters. Since our main focus is to correctly identify users who are likely to convert, we prioritize minimizing false negatives, where converted users are incorrectly predicted not to convert. Thus, we use “recall” as our evaluation metric when scoring both cross-validation and the trained models.
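
A grid search scored on recall can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the synthetic data, the parameter grid, the 5-fold split, and the use of class_weight="balanced" are placeholders chosen for this example, not the settings used in the project notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the label depends on pages visited
# plus noise, so positives are the minority class.
rng = np.random.default_rng(0)
n = 2000
pages = rng.integers(1, 25, size=n)
age = rng.integers(17, 65, size=n)
new_user = rng.integers(0, 2, size=n)
X = np.column_stack([pages, age, new_user])
y = (pages + rng.normal(0, 2, size=n) > 20).astype(int)

# Stratify so that both splits keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; class_weight="balanced" is an assumption of this sketch.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="recall",  # candidates are ranked by recall on the positive class
    cv=5,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_recall = recall_score(y_test, search.predict(X_test))
```

Because GridSearchCV refits the best candidate on the full training split by default, search.predict uses the tuned model directly.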

We build a decision tree model for result interpretation

The logistic regression model achieved a recall of 0.93, while the random forest model had a lower recall of 0.84. The XGBoost model performed best, with a recall of 0.94. These results indicate that the logistic regression and XGBoost models are effective in identifying converted users.
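
To make the metric concrete, here is a small sketch of how recall is computed and why accuracy alone can mislead when only about 3% of sessions convert. The labels below are toy values for illustration, not model outputs from this project.

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels: 3 converted sessions out of 100, mirroring the roughly 3% rate.
y_true = [1] * 3 + [0] * 97

# A degenerate classifier that never predicts a conversion.
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)  # high, despite finding no converters
rec = recall(y_true, always_negative)    # zero: every converter is missed
```

A model that never predicts a conversion is right 97% of the time here yet finds none of the converted users, which is why recall is the more informative score for this goal.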

Grid search cross-validation results for XGBoost model.

To identify the key features that have a high impact on user conversion, we conduct feature importance analyses. The results confirm our earlier observations that a higher total number of pages visited has a positive effect on the conversion rate, while marketing channel source has little impact. Additionally, we find that the total number of pages visited outweighs the other user features by a large margin in terms of its impact on user conversion. This suggests that the business should focus its marketing efforts on users with high page visits, as they are much more likely to make purchases. Age is also a significant factor: younger users are more likely to convert than older users.
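
One common way to run such an analysis is to read the impurity-based importances from a fitted tree ensemble. The sketch below uses scikit-learn's RandomForestClassifier on synthetic data; the data, the model settings, and the choice of impurity-based importance are assumptions for illustration and may differ from the analysis in the project notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): the label depends only on
# pages visited plus noise; age and new_user carry no signal here.
rng = np.random.default_rng(0)
n = 2000
feature_names = ["total_pages_visited", "age", "new_user"]
pages = rng.integers(1, 25, size=n)
age = rng.integers(17, 65, size=n)
new_user = rng.integers(0, 2, size=n)
X = np.column_stack([pages, age, new_user])
y = (pages + rng.normal(0, 2, size=n) > 20).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_feature = ranked[0][0]
```

Impurity-based importances tend to favor features with many distinct values, so a permutation-based check (sklearn.inspection.permutation_importance) is a reasonable cross-check before drawing conclusions.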

Link to Jupyter Notebook file: https://github.com/aprilzhizhou/machine_learning_projects/blob/main/conversion_rate/2_ML_models.ipynb

Recommendations:

  1. To improve user conversion rates, the marketing team should prioritize users who have shown interest by visiting many pages but have not yet made a purchase. One approach could be to send targeted email advertisements or coupons to these users, encouraging them to make purchases.
  2. The marketing team should consider expanding their marketing channels to target younger users, who have higher conversion rates. For example, they could focus on social platforms that are popular among young people. It is also important to investigate why the website is unpopular among older users. Conducting surveys to identify issues such as an unfriendly UI or a difficult-to-navigate digital payment system could provide insights into how to improve the website’s appeal to older users.
  3. The marketing team should also consider expanding their marketing channels in Germany, given that German users have the highest conversion rate despite accounting for only 4.1% of the total users. They should also address the low conversion rate among Chinese users, who make up around a quarter of the total users. This may involve investigating potential problems with the Chinese site’s UI, payment system, or brand reputation, such as poor translation, and making the necessary improvements. Resolving these issues could lead to significant growth in the Chinese market, which is very large.
  4. Given that users with existing accounts are more likely to convert than new users, the marketing team could focus their advertising efforts on users who have accounts but haven’t converted yet.