Predicting Hotel Demand using Previous Hotel Bookings, Air Passenger Traffic Volume, Air Shopping Data, Holidays, and Seasonality

Authors: Richard Hathaway, Rufei (Lilian) Wu, Ruidi (Rachel) Zhao, Qingyang (Jojo) Zhou 

 Executive Summary  

This project attempts to predict weekly total room nights for hotel bookings up to four weeks into the future, using previous hotel bookings, air passenger traffic volume, air shopping data, holidays, and seasonality. Three different model types, Prophet, SARIMAX, and Random Forest, were used to make these predictions. The Random Forest model gave the lowest average MAPE and performed on average 53% better than a baseline model, although different model types performed better for different hotels. We found that the most important predictors vary across hotels, so a stepwise feature selection algorithm was implemented to find the best combination of predictors. A Python model pipeline was created to allow the selection of the best model for each hotel's predictions.  

Introduction  

Sabre is a software and technology company that partners with airlines, hoteliers, agencies, and other travel partners to retail, distribute and fulfill travel. The company is investing in a technology evolution to create a new marketplace for personalized travel in a cloud-based environment. Sabre is committed to helping customers operate more efficiently, increase revenue, and offer personalized traveler experiences through artificial intelligence, machine learning, and real-time data and analytics. This technology is used as the intelligence behind mobile apps, airport check-in kiosks, online travel sites, airline and hotel reservation networks, travel agent terminals, and scores of other travel solutions.  

 

The ability to predict future hotel bookings has many benefits in the travel industry. Hotels can use predictions to aid pricing and revenue management strategies. However, little work has been done in this area, and previous work in predicting hotel bookings has mostly focused on data from within the hotel industry.  

 

This project seeks to incorporate data from the airline industry with the hotel industry data to augment the strength of these predictions. Furthermore, we strive to understand what factors affect fluctuations in hotel bookings and if they are consistent or different across different hotels.  

 

We attempted to predict the weekly total room nights for a given hotel for up to four weeks into the future. The response variable of our models, room nights, is calculated as the length of stay of a reservation multiplied by the number of rooms booked for the reservation. The room nights of all reservations for a hotel for each week were summed, assigning reservations to the week in which the reservation began (arrival date).  

 

The present work contributes to the understanding of the inner workings of the dynamic and complex travel industry that connects people from all over the world. 

 

Data Description 

This project used one primary data source and three supplementary ones. The primary data source was hotel booking data, where each record in the data represents one hotel booking. Each record contains the following attributes: booking date, arrival date, length of stay, number of rooms booked, advance purchase, and room nights (length of stay * number of rooms booked). 

 

The hotel booking data covers 36 hotels across four different cities and spans January 2019 – August 2021. Hotels are anonymized with labels of the form ‘City 1 Hotel 6’. Some hotels are airport hotels, while others are in the city center or on the outskirts of the city; however, the data do not indicate which hotels are of which type, and the exact city names were not disclosed. Due to significant disruptions in travel caused by COVID-19, analysis was limited to data from 2018-12-31 to 2020-03-01. 

 

Additionally, the following three datasets were used: 

  1. Sabre’s air passenger traffic data, which contains the number of passengers that arrived by air travel to each city for each month, along with projections of air passenger traffic for future months.  
  2. Sabre’s air shopping data, which contains the number of shops made in Sabre’s GDS for air travel to each city, where each row represents one combination of departure date and booking date. For example, the dataset contains the number of shops made on 2019-01-01 for a departure on 2019-03-01 to City 1, the number of shops made on 2019-01-02 for a departure on 2019-03-01 to City 1, and so on for all combinations of shop date and departure date.  
  3. A list of US Federal holidays from 2011 to 2030, which was scraped from the U.S. Office of Personnel Management website [1]. 

Exploratory Data Analysis (EDA) 

EDA was performed on all the available datasets at the city level to understand the existing patterns. There were several unexpected spikes in the hotel booking data that are addressed as outliers in the preprocessing steps. Strong seasonality is observed in the air passenger traffic data but is not clear in the air shopping data due to the limited data range. Significant disruptions are observed in all datasets around April 2020 due to the impact of COVID-19. 

Hotel Booking Data 

Figure 1 plots the total room nights by arrival week for each city. Around April 2020, total room nights dropped significantly due to the impact of COVID-19. An unexpected spike in City 2 around August 2020 is also observed, with three weeks having total room nights larger than 10,000. Investigation into the data reveals that there were 5 bookings in these three weeks with total room nights ranging from 2,000 to 7,000. These are examples of the influential outliers that were handled during the preprocessing stage. 

Figure 1. Total room nights by arrival week for each city 

To investigate whether monthly seasonality exists, total room nights by day of the month for each city was plotted. Figure 2 suggests that day of the month might be an influential predictor for total room nights due to fluctuations in values for each day. However, no obvious pattern of seasonality can be observed in the graph. 

Figure 2. Total room nights by Day of Month for each city 

Air Shopping Data 

As with total room nights, a sharp disruption in air shops was observed around April 2020 due to the impact of COVID-19. However, it is unclear from the plot whether any seasonality exists, because the dataset contains only one year of data before the pandemic and thus does not have enough seasonal cycles to show the pattern. 

Air Passenger Traffic Data 

Figure 3 shows the monthly passenger air traffic for each city. Again, a sharp decline in air passenger traffic occurred around April 2020 because of COVID-19. Air passenger traffic began to recover starting in February 2021 but still had not reached pre-pandemic levels. Before April 2020, similar patterns repeated several times, which suggests that strong seasonality exists.  

Figure 3. Total air passenger traffic by month for each city 

Preprocessing 

Following EDA, preprocessing steps were performed to prepare the data for modeling and to eliminate extreme outliers that introduce noise into the model learning process. The preprocessing was divided into four stages: weekly aggregation, data scope adjustment, outlier removal and imputation, and generation of additional predictors. These four stages produced a well-prepared data frame that can be fed to the models for training.  

Weekly Aggregation  

The original hotel booking data is booking-level data, where each record corresponds to one reservation at the hotel. In order to generate a time series that could be modeled, the hotel booking data was aggregated to the weekly level. For each hotel, all records were grouped by the Monday of their arrival week, and total room nights were summed for each Monday. We focused on weekly predictions because there was too much noise in the daily predictions for the model to capture a clear trend. Additionally, tuning models on daily data was too computationally expensive, and aggregating to the weekly level significantly reduced the runtime.  
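A minimal pandas sketch of this aggregation step is given below; the file name and column names ('arrival_date', 'room_nights') are illustrative assumptions rather than the actual schema.

import pandas as pd

# Booking-level data; file and column names are assumed for illustration.
bookings = pd.read_csv("hotel_bookings.csv", parse_dates=["arrival_date"])

# Map each arrival date to the Monday of its arrival week
# (weeks anchored to end on Sunday, so start_time is the Monday).
bookings["arrival_week"] = bookings["arrival_date"].dt.to_period("W-SUN").dt.start_time

# Sum room nights over all reservations arriving in the same week.
weekly = (bookings.groupby("arrival_week")["room_nights"]
          .sum()
          .rename("total_room_nights")
          .reset_index())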

Data Scope Adjustment 

While performing EDA, two problems were discovered in the hotel booking data. The first problem is that some hotels (such as City 2 Hotel 3, City 3 Hotel 2, and City 3 Hotel 6) have large gaps in record dates between bookings at the beginning of the time series. For example, the data for City 3 Hotel 2 begins with one booking on 2019-03-21 and then suddenly jumps to 2019-04-02. One potential explanation was that these nonconsecutive records might be test bookings automatically made by the computer when the hotel first partnered with Sabre. Therefore, to eliminate the impact of these abnormal test bookings, they were removed in the preprocessing stage.  

 

The second problem is that sharp declines in values are observed in several datasets, including the hotel booking, air shopping, and air passenger traffic data, due to the impact of COVID-19. Because COVID-19 made both the data and the future of the travel industry unpredictable, it would be very challenging for the models to capture a meaningful and stable trend that could be used for predictions. Thus, the project only focused on data before 2020-03-02, an approximate date for when COVID-19 disruptions began. This scope limitation eliminated the unpredictable trends brought by COVID-19; however, future work could investigate whether there were any predictable trends during and after the COVID-19 pandemic.  

Outlier Removal and Imputation 

Outlier Removal Methods 

The significant outliers in the data were removed to eliminate noise that could mislead the models. Outliers might exist in the data due to unobservable events, such as conferences, local events near the hotel, or temporary closures of hotel rooms. Three methods were used to detect outliers: Low Pass Filter, Median Absolute Deviation, and Isolation Forest.  

Low Pass Filter (LPF) 

Low Pass Filter [2] removes outliers by computing rolling statistics of the time series over a fixed-size, centered window that slides across the data. Within each window, the data point at the center of the window is considered an outlier if it falls outside the region bounded by the lower and upper bounds. For each rolling window, the bounds are calculated as follows: 

  • Lower bound: Q1 - k * IQR 
  • Upper bound: Q3 + k * IQR 

where Q1 and Q3 are the 1st and 3rd quartiles of the window, IQR = Q3 - Q1, and k stands for the number of IQRs away from the 1st/3rd quartile. This is a parameter that can be tuned. 
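A minimal pandas sketch of this rule, with the window size and k as illustrative defaults:

import pandas as pd

def lpf_outliers(series: pd.Series, window: int = 4, k: float = 1.5) -> pd.Series:
    # Rolling quartiles over a fixed-size, centered window.
    q1 = series.rolling(window, center=True).quantile(0.25)
    q3 = series.rolling(window, center=True).quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Points at the series edges have no centered window (NaN bounds)
    # and are therefore never flagged here, as noted in the text.
    return (series < lower) | (series > upper)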

Median Absolute Deviation (MAD) 

Median Absolute Deviation [3] is a robust way to identify outliers, as it replaces the mean with the median and the standard deviation with the median absolute deviation in the common outlier removal criterion. Using the median reduces the effect that a single large outlier can have on the removal criterion itself. Similar to the LPF, it uses a fixed-size, centered window that slides over the data. Within each window, the data point at the center of the window is considered an outlier if it is larger than the upper bound or smaller than the lower bound. For each rolling window, the rolling MAD is calculated first: the rolling median is the median of the values in the window, the rolling absolute deviations are the absolute differences between each value and the rolling median, and the rolling MAD is the median of those absolute deviations. The bounds are then calculated as Lower bound: rolling median - k * rolling MAD and Upper bound: rolling median + k * rolling MAD, where k represents the number of MADs away from the rolling median, which is a parameter that can be tuned.  
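A corresponding sketch of the MAD rule; again, the window size and k are tunable and the defaults below are only illustrative:

import pandas as pd

def mad_outliers(series: pd.Series, window: int = 4, k: float = 3.0) -> pd.Series:
    rolling_median = series.rolling(window, center=True).median()
    # Rolling MAD: median of absolute deviations from the window median.
    rolling_mad = series.rolling(window, center=True).apply(
        lambda w: (w - w.median()).abs().median(), raw=False)
    lower = rolling_median - k * rolling_mad
    upper = rolling_median + k * rolling_mad
    return (series < lower) | (series > upper)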

Isolation Forest 

Isolation Forest is an unsupervised extension of the popular random forest algorithm. When an isolation forest is built, the algorithm splits each individual data point off from all other data points. The easier it is to isolate a single point from all other points, the more likely it is an outlier, because a data point that is not an outlier will be closely surrounded by other data points and thus be more difficult to isolate. It is used as a black-box model taken directly from the scikit-learn package [4]. The parameter to tune is the maximum fraction of outliers detected, with a default of 0.01 in this project.  
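A small sketch of this global outlier check with the scikit-learn class cited in [4]; the default contamination of 0.01 mirrors the project default, and the reshaping assumes a univariate weekly series:

import numpy as np
from sklearn.ensemble import IsolationForest

def isoforest_outliers(values, contamination=0.01, random_state=0):
    X = np.asarray(values, dtype=float).reshape(-1, 1)  # univariate series as a column
    labels = IsolationForest(contamination=contamination,
                             random_state=random_state).fit_predict(X)
    return labels == -1  # -1 marks points the forest isolates as outliers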

Outlier Removal Rule 

The three outlier removal methods each have their own advantages and disadvantages. LPF and MAD are generally better at removing outliers than Isolation Forest, since they can detect local outliers using a sliding window. However, they are not able to detect outliers at the beginning or end of the time series because a centered window is used. For example, when using a window size of four weeks, the first two weeks and the last two weeks of data do not have a window centered on them, so they will never be detected as outliers. On the other hand, although Isolation Forest only detects global outliers, it covers all the data points. Therefore, the final outlier removal rule combines the advantages of these three methods: 

  • For the first half-window-size weeks and the last half-window-size weeks, Isolation Forest was used to detect outliers. 
  • All the other weeks in the middle use LPF or MAD; the choice between them is a tunable parameter. 

For instance, when a four-week centered window is used, Isolation Forest was used on the first two weeks and the last two weeks, while LPF/MAD were applied to all the remaining weeks in the middle (Figure 4). 

Figure 4. Sample data from Jan 2019 to March 2020 

Outlier Imputation Rule 

After an outlier is removed, a replacement value needs to be imputed; how that value is calculated depends on which method removed the outlier.  

LPF/MAD 

If the outlier is removed by LPF or MAD, then it is imputed as either the lower bound or upper bound generated by LPF or MAD, depending on the original value of the outlier. If the original value is lower than the lower bound, then the outlier is imputed with the lower bound; if the original value is higher than the upper bound, then the outlier is imputed with the upper bound.  

Isolation Forest 

If the outlier is removed by Isolation Forest, the new value is imputed as the mean value across all sample weeks in the entire time series. 

Additional Predictors 

Besides the historical hotel booking data, three more predictors were added to the model to further improve its performance: holiday week, air shops, and air passenger traffic.  

Holiday Week 

This is a binary variable that indicates whether a week is a holiday week or not. A week is a holiday week if it meets one of the following three conditions: 

  • Contains one or more federal holidays 
  • Previous week’s Saturday or Sunday is a federal holiday 
  • Next week’s Monday is a federal holiday. 

 

This strategy allows the models to account for travelers who travel over a long weekend and arrive during the week before the holiday (for a Monday holiday) or depart during the week after the holiday (for a Saturday/Sunday holiday). The list of federal holidays from 2011 to 2030 was scraped from the United States Federal Government Office of Personnel Management’s website [1]. On average, around one third of the weeks in each year are marked as holiday weeks, because most federal holidays fall on Mondays. We did not consider other special days, such as Black Friday, because their effect might already be reflected in the air shopping data. 
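A sketch of this holiday-week rule for a week starting on a Monday; 'holidays' is assumed to be a set of federal-holiday dates.

import pandas as pd

def is_holiday_week(week_start, holidays):
    week_start = pd.Timestamp(week_start)  # Monday of the week
    week_days = {(week_start + pd.Timedelta(days=i)).date() for i in range(7)}
    prev_weekend = {(week_start - pd.Timedelta(days=1)).date(),   # previous week's Sunday
                    (week_start - pd.Timedelta(days=2)).date()}   # previous week's Saturday
    next_monday = (week_start + pd.Timedelta(days=7)).date()
    return bool(week_days & holidays          # contains a federal holiday
                or prev_weekend & holidays    # previous week's Sat/Sun is a holiday
                or next_monday in holidays)   # next week's Monday is a holiday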

Air Shops 

A serious problem with using air shopping data for time series forecasting is that, at the time of prediction, the number of shops is incomplete: between the time of prediction and the future arrival date the model is trying to predict, more shops will be made. Unlike the air passenger traffic data, which Sabre can project, air shops are difficult to forecast. Thus, to solve this issue and to ensure that no unobtainable data is used in the model, all shop records with advance purchase (AP) less than or equal to the prediction horizon are dropped. For example, with a prediction horizon of four weeks, all shops with AP less than or equal to 28 days are dropped.  

 

Finally, because the raw air shopping data is at the daily level, we performed weekly aggregation by departure date to get the weekly total air shops. 
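The two steps can be sketched as follows; the file and column names ('shop_date', 'departure_date', 'shops') are assumptions for illustration.

import pandas as pd

horizon_days = 28  # four-week prediction horizon

shops = pd.read_csv("air_shops.csv", parse_dates=["shop_date", "departure_date"])
shops["ap_days"] = (shops["departure_date"] - shops["shop_date"]).dt.days

# Keep only shops that would already be observable at prediction time.
observable = shops[shops["ap_days"] > horizon_days]

# Aggregate to weekly totals, keyed by the Monday of the departure week.
observable = observable.assign(
    departure_week=observable["departure_date"].dt.to_period("W-SUN").dt.start_time)
weekly_shops = observable.groupby("departure_week")["shops"].sum().reset_index()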

Air Passenger Traffic 

The air passenger traffic dataset contains the number of passengers arriving in each city each month, with a few future projections provided beyond the current date. To reconcile the monthly data with the weekly predictions, the air passenger traffic value for each week is calculated as follows (a short sketch of this calculation appears after the list): 

  • Situation 1: If an entire week is in the same month, we used the air passenger traffic data for that month. For example, for the week 03/23/2020 – 03/29/2020, the monthly air passenger traffic in March 2020 is used. 
  • Situation 2: If a week spans two months, we used the weighted average of the air passenger traffic values from the two months. For example, for the week 03/30/2020 – 04/05/2020, the monthly air passenger traffic value for this week is: 2/7 * monthly air passenger traffic in March 2020 + 5/7 * monthly air passenger traffic in April 2020. 
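A sketch of this conversion, assuming 'monthly' maps (year, month) to the monthly passenger traffic value:

import pandas as pd

def weekly_passenger_traffic(week_start, monthly):
    days = [pd.Timestamp(week_start) + pd.Timedelta(days=i) for i in range(7)]
    # Each day contributes 1/7 of its month's traffic value, which reduces to
    # the plain monthly value in Situation 1 and the weighted average in Situation 2.
    return sum(monthly[(d.year, d.month)] for d in days) / 7.0

For the week of 2020-03-30, for example, this returns 2/7 of the March 2020 value plus 5/7 of the April 2020 value.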

Models 

Pipeline Workflow 

Since hotels differ in booking volumes, best model type, and best predictor combinations, a model pipeline (Figure 5) is implemented for each hotel separately. The entire pipeline consists of two stages: training and prediction. 

 

There are three main use cases of the pipeline: “train-only”, “train+predict”, and “predict-only”. Before making any predictions for a hotel with a given model type, users need to train at least once on that hotel’s data using the same model type to obtain a hyperparameter JSON file. To do that, users can run either “train-only” or “train+predict”; the latter automatically predicts the following horizon after the test period. Finally, users also have the option to run “predict-only” if they have a saved JSON file from a previous run and want to apply it to new samples. 

 

Training 

During the training stage, the model pipeline begins by taking user inputs, which specify the path to the hotel input data, the model type (one or all), the error metric to use, the prediction horizon, and the external variables to use. The next step is data preprocessing, including weekly aggregation of total room nights, outlier removal and imputation, splitting the data into training and testing datasets, and generating external variables (holiday, passenger traffic, and air shops; seasonality does not have to be generated). Then, the pipeline enters the model training stage. For each model type specified, the program begins with hyperparameter tuning on the full model to obtain the best hyperparameter combination. Using the selected parameters, it then performs the backward stepwise feature selection process, which returns the best predictor combination with the lowest test error metric. The model object returned from the feature selection is saved in a JSON file for future reference and prediction. Finally, the pipeline selects the best model type among all those executed by comparing the test error metrics. 

Prediction 

Users can also do only the prediction on a hotel, if they have a saved model json file from previous training attempts and want to apply it to the new data on the same hotel with the same model type. 


Figure 5. Model pipeline 

Input Data Requirements 

The input of the pipeline should include the following data files: 

  • historical hotel booking data 
  • historical air shopping data 
  • historical and predicted values of passenger traffic data 
  • holiday data (same for all hotels).  

 

Three different model types, Prophet, SARIMAX, and Random Forest, were used to make the predictions. 

 

Prophet  

Prophet is open-source software developed by Facebook’s Core Data Science team for large-scale, automated time series forecasting [5, 6]. Since it is designed for business use cases, it is more intuitive and flexible than most traditional models. It does not require much effort in data preprocessing, because it is robust to outliers, missing data, and dramatic changes in historical trends. It provides options for tuning human-interpretable parameters based on business knowledge. It also allows manually adding changepoints (i.e., sudden changes in the trend), which is especially useful when known business changes might affect the trend, since analysts can incorporate those changes manually to help fit the model. 

 

Prophet is an additive model that decomposes a time series into three components: trend, seasonality, and holidays. Trend function g(t) estimates the non-periodic changes in the series. Seasonality s(t) represents any periodic changes, which could be daily, weekly, yearly, or any custom periods. Holiday h(t) represents the effects of holidays that happen irregularly.  

y(t) = g(t) + s(t) + h(t) + error 
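A minimal Prophet sketch with one external regressor is shown below; the data are synthetic and the settings are illustrative, not the tuned configuration used in the project.

import numpy as np
import pandas as pd
from prophet import Prophet  # package import name for recent Prophet releases [5]

weeks = pd.date_range("2019-01-07", periods=52, freq="W-MON")
rng = np.random.default_rng(0)
df = pd.DataFrame({"ds": weeks,
                   "y": 500 + rng.normal(0, 40, len(weeks)),            # synthetic weekly room nights
                   "is_holiday_week": rng.integers(0, 2, len(weeks))})  # synthetic holiday flag

m = Prophet(weekly_seasonality=False, yearly_seasonality=True)
m.add_regressor("is_holiday_week")   # external variables enter as extra regressors
m.fit(df)

future = m.make_future_dataframe(periods=4, freq="W-MON")
future = future.merge(df[["ds", "is_holiday_week"]], on="ds", how="left").fillna(0)
forecast = m.predict(future)         # 'yhat' holds the predicted weekly totals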

SARIMAX 

SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with Exogenous variables) is a traditional statistical time series model built on the ARIMA model. The model utilizes values of the target variable at previous time steps (auto-regressive terms) and the errors at previous time steps (moving average terms) to forecast future values. If the time series is not stationary with respect to the mean, the model eliminates this shift by using the difference between consecutive time steps to fit the model. The SARIMAX model adds seasonal auto-regressive and seasonal moving-average terms, where the model uses the values and errors of the target at past time steps that are multiples of the seasonal period, s. For example, if s = 12, P = 2, and Q = 2, the SARIMAX model would analyze the time series at 12 and 24 time steps prior. The SARIMAX model also allows for seasonal differencing if the seasonal pattern of the time series is not stationary. Finally, exogenous variables are added to the model as additional regressors. The SARIMAX model can be expressed as (p, d, q), (P, D, Q, s), where the first set of hyperparameters is the non-seasonal order and the second set is the seasonal order. 
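A small statsmodels sketch of a SARIMAX fit with one exogenous regressor [10]; the data are synthetic, and the (p, d, q) and seasonal orders are placeholders rather than the hyperparameters tuned in the project.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

weeks = pd.date_range("2019-01-07", periods=60, freq="W-MON")
rng = np.random.default_rng(1)
y = pd.Series(500 + rng.normal(0, 40, len(weeks)), index=weeks)
holiday = pd.Series(rng.integers(0, 2, len(weeks)), index=weeks, dtype=float)

train_y, train_x = y[:-4], holiday[:-4]
model = SARIMAX(train_y, exog=train_x,
                order=(1, 1, 1),               # non-seasonal (p, d, q)
                seasonal_order=(0, 0, 0, 0))   # seasonal terms omitted for brevity;
                                               # s would be 52 for yearly seasonality on weekly data
result = model.fit(disp=False)

# Exogenous values must be supplied for the four-week forecast horizon.
pred = result.forecast(steps=4, exog=holiday[-4:])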

Random Forest 

Random forest is a popular ensemble method widely used for regression and classification tasks by bagging multiple decision trees during the training stage. The prediction results for the regression problem are averaged over all the decision trees to enhance accuracy and avoid overfitting. This project used the RandomForestRegressor function in the scikit-learn python library [12].  

 

Random forest can be used for time series forecasting by transforming the original time series dataset into a format suitable for a supervised learning problem. In particular, the time series data needs to be restructured using a sliding-window representation. Lagged values (i.e., values from previous time steps) are used as input variables, and additional external variables such as seasonality (i.e., month of the year) and holiday can also be added as features. Table 1 shows a sample data frame in the desired format for the Random Forest model; a short restructuring sketch follows the table.  

 

Table 1: Sample Data for Random Forest 

Total_RoomNights  Lag1  Lag2  Lag3  Lag4  Seasonality 
459  256  317  287  345  01 
424  459  256  317  287  02 
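A sketch of this restructuring and a Random Forest fit, assuming the 'weekly' frame from the aggregation sketch above (columns 'arrival_week' and 'total_room_nights'):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_lag_frame(weekly, n_lags=4):
    out = weekly.copy()
    for lag in range(1, n_lags + 1):
        out[f"Lag{lag}"] = out["total_room_nights"].shift(lag)
    out["Seasonality"] = out["arrival_week"].dt.month   # month of the year
    return out.dropna()   # the first n_lags weeks lack a complete lag set

frame = make_lag_frame(weekly)
X = frame.drop(columns=["total_room_nights", "arrival_week"])
y = frame["total_room_nights"]

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
print(dict(zip(X.columns, rf.feature_importances_)))   # feature importance measure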

 

Model Comparison 

As presented in Table 2, each model type has clear advantages and disadvantages.  

 

Table 2: Model comparison 

 

  Prophet  SARIMAX  Random Forest 
Avg. MAPE (stepwise feature selection)  25%  22%  12.2% 
Run Time per hotel  2 minutes (50 random search iterations)  8 minutes (grid search, 256 iterations); 1 minute (auto-arima)  4 minutes (50 random search iterations) 
Initial Required Training Data  12 weeks  20 weeks  4 weeks 
Unique Features  Options to tune the weight of predictors based on business knowledge  Traditional regression-style output and statistics such as p-values  Feature importance measures for model interpretability 
Challenges  More than 9 parameters to tune; trying all combinations could take over 30 hours  Can be challenging to manually select the model parameters  3 weeks of initial training data lost due to the need to create lagged variables 

 

Prophet offers many options to tune hyperparameters and the weights of variables, but trying all parameter combinations can be very slow. SARIMAX offers the fastest training times and interpretable output summary tables with helpful regression-style outputs. However, it is difficult to manually select the model hyperparameters, even with pre-existing business knowledge about the data. Finally, Random Forest offers a feature importance measure, which helps users understand what the most important factors in the model are. However, one minor drawback of using a traditional machine learning model is that a few weeks of initial training data are lost in feature engineering to create lagged variables. 

 

Each model type also has different requirements for the length of the initial training data. Prophet does not have a formal minimum number of weeks, but it is advisable to have enough initial training data so that the model can begin to learn the pattern in the data before cross-validation begins; in this project, 12 weeks of initial training data were used for Prophet. The required initial training period for SARIMAX is determined by the range of hyperparameters being tested, because testing additional lags, seasonal lags, or seasonal differencing operations requires increasing the size of the training sample so that the model can learn the time series. Given the range of hyperparameters tested for SARIMAX in this project, at least 20 weeks of initial training data were required. On the other hand, a machine learning model such as Random Forest requires only enough training data to supply the lags used as predictors, which in this project is four weeks. We used the TimeSeriesSplit class from the scikit-learn package [13] to set up the walk-forward cross-validation for Random Forest. 

 

Compared to SARIMAX, Prophet sacrifices some inferential benefits for flexibility and interpretability. Prophet finds the optimal fit to the historical data to use for prediction but does not model the relationship between time steps in a time series, whereas SARIMAX explicitly assumes correlations between time steps. As a result, Prophet cannot capture unstable trends as well as other models do. In addition, Prophet works better with time series that have strong seasonality and several seasons of historical data [6]. 

Cross Validation 

To select the best model hyperparameters, we implemented walk-forward cross-validation. The initial dataset was split into a training dataset that began with the week of 12/31/2018 and ended with the week of 12/30/2019, and a test dataset that ran from the week of 1/6/2020 through the week of 2/24/2020.  

 

We implemented walk-forward cross-validation on the training set. This involved first fitting the model on the first few weeks of training data (the number of weeks determined by the model's requirement) and then making predictions on the next several weeks of data (the number of weeks determined by the horizon). For example, a model is first trained on weeks 1-12 and cross-validated on weeks 13-16. Then, we "walk forward" four weeks, training the model on weeks 1-16 and cross-validating on weeks 17-20. We continued this process until the four weeks being cross-validated were the last four weeks in our training data. For each walk-forward step, we calculated the cross-validation error metrics, which were then averaged across steps as the final cross-validation metric. Walk-forward cross-validation [14] was implemented as presented in Figure 6; a simplified sketch using TimeSeriesSplit follows the figure. 

Figure 6. Implementation of walk-forward cross validation 
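A simplified sketch of the walk-forward scheme with scikit-learn's TimeSeriesSplit [13], reusing the lagged X and y from the Random Forest sketch; the 12-week initial window and 4-week steps mirror the example above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

# Expanding training window; each validation fold is the next four weeks.
n_splits = (len(X) - 12) // 4          # leaves roughly 12 weeks of initial training data
tscv = TimeSeriesSplit(n_splits=n_splits, test_size=4)

scores = []
for train_idx, val_idx in tscv.split(X):
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[val_idx])
    scores.append(mean_absolute_percentage_error(y.iloc[val_idx], preds))

cv_mape = float(np.mean(scores))       # averaged across walk-forward steps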

Backward Stepwise External Variable Selection 

We hypothesized that the importance of different external variables might differ between hotels. Therefore, we implemented an algorithm to dynamically select the best combination of these external variables. However, to reduce the runtime, we applied the variable selection after the best model hyperparameters were chosen in cross-validation using all the external variables specified by the user (the full model). Once the hyperparameters had been selected and the full model had been fit, we tried dropping each external variable in the model to determine if dropping any variable improved the model, and if so which one. If the full model performed better than all of the partial models with one variable dropped, we stopped the algorithm. On the other hand, if any of the partial models performed better than the full model, we set the best partial model as the new full model and repeated the process of dropping each variable. 
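The algorithm can be sketched as below; 'evaluate' is a hypothetical callable that refits the model with a given list of external variables (hyperparameters fixed) and returns its test error metric, e.g. MAPE.

def backward_stepwise(evaluate, variables):
    current = list(variables)            # start from the full model
    current_score = evaluate(current)
    while current:
        # Try every partial model that drops exactly one variable.
        candidates = [[v for v in current if v != dropped] for dropped in current]
        scores = [evaluate(c) for c in candidates]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= current_score:
            break                        # no partial model beats the current full model
        current, current_score = candidates[best], scores[best]
    return current, current_score

With the Table 3 example below, the first pass would drop air passenger traffic (0.21 < 0.26), and the second pass would stop because neither remaining partial model beats 0.21.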

 

For example, take a model fit in cross-validation using the variables holiday, air passenger traffic, and air shops. Three partial models are tested, each with a different variable dropped from the full model, as presented in Table 3: 

 

Table 3: Variable selection 

Model  Variables In the Model  MAPE 
Full Model  Holiday, Air Passenger, Air Shops  0.26 
Partial Model (Drop Holiday)  Air Passenger, Air Shops  0.31 
Partial Model (Drop Air Passenger)  Holiday, Air Shops  0.21 
Partial Model (Drop Air Shops)  Air Passenger Traffic, Holiday  0.27 

 

The partial model that dropped air passenger traffic had a lower MAPE than the full model, so this partial model becomes the new full model. 

 

Table 4: Selection of best model based on variable selection 

Model  Variables In the Model  MAPE 
Full Model  Holiday, Air Shops  0.21 
Partial Model (Drop Holiday)  Air Shops  0.22 
Partial Model (Drop Air Shops)  Holiday  0.23 

 

In the next iteration, no partial model outperforms the full model with holiday and air shops, so this model is selected as the best model (Table 4). 

 

This stepwise algorithm relies on one main assumption. The algorithm stops when no partial model with one fewer variable outperforms the current full model; therefore, it assumes that no combination with even fewer variables would outperform it either. Thus, it is not guaranteed that the best variable combination will be found. However, performing feature selection in this way reduces the runtime, especially when a user wants to consider additional external variables. 

Key Findings 

We performed preliminary analysis on the prediction results on a sample of eight hotels by running them through our pipeline. Two hotels were selected from each city – one of a higher volume and the other of a lower volume – to investigate the relationship between hotel specific characteristics and predictability. We compared the MAPE of the best models for each hotel (i.e., models with the best hyperparameter and predictor combinations), using a horizon of four weeks. The prediction period was the week of 2020-01-06 (inclusive) to the week of 2020-03-02 (exclusive), and the training period was the week of 2018-12-31 to the week of 2019-12-30. For SARIMAX, we used the results from grid search (and not auto-arima) to ensure consistency in selection metrics between model types, because auto-arima is tuned using AIC instead of MAPE. 

 

In this project, the baseline model used for comparison simply takes the weekly total room nights observed at the time of prediction as the prediction. For example, with a horizon of four weeks, the predicted total room nights for the week of 2021-02-22 would be the actual total room nights of the week of 2021-01-25.  
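A sketch of this baseline and its MAPE over a weekly series, with a four-week horizon:

import numpy as np

def baseline_mape(weekly_values, horizon_weeks=4):
    values = np.asarray(weekly_values, dtype=float)
    predicted = values[:-horizon_weeks]   # value observed at the time of prediction
    actual = values[horizon_weeks:]       # value realized 'horizon_weeks' later
    return float(np.mean(np.abs((actual - predicted) / actual)))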

Overall Performance  

Table 5 presents the MAPE scores of the baseline model and of our models for each hotel. The percentage difference between the best model's MAPE and the baseline model's MAPE is calculated so that the more negative the value, the better the best model performs relative to the corresponding baseline. In all eight hotels, our best models were at least 40% better than the baseline. On average, they had a MAPE of 0.12, which was 53.4% better than the baseline predictions. 

 

We also observed that except for City 3, low-volume (yellow) hotels on average had higher MAPE than their high-volume (green) counterparts in the same city.  

 

Table 5. MAPE of Baseline and the Best Performing Model for Each Hotel 

  Baseline  Our model  Percentage Difference 
City 1 Hotel 6  0.32  0.16  -51.1% 
City 1 Hotel 9  0.20  0.06  -70.1% 
City 2 Hotel 7  0.23  0.11  -51.4% 
City 2 Hotel 1  0.14  0.06  -54.9% 
City 3 Hotel 4  0.33  0.10  -68.9% 
City 3 Hotel 1  0.24  0.14  -43.4% 
City 4 Hotel 8  0.35  0.20  -42.3% 
City 4 Hotel 5  0.24  0.13  -47.2% 
Average  0.26  0.12  -53.7% 

* Percentage Difference calculated as (best model MAPE – baseline MAPE)/baseline MAPE 

Comparison between Model Types 

Table 6 shows the percentage differences between each model type and the baseline model for each hotel. The bolded values correspond to the model type with the lowest percentage difference and thus the highest performance relative to the baseline. In this sample, Random Forest outperformed the other two model types in 7 out of the 8 hotels. On average, it was 53% better than the baseline, around 46% better than Prophet, and around 30% better than SARIMAX. 

Even in the hotel where SARIMAX and Prophet failed to beat the baseline (i.e., City 4 Hotel 8), Random Forest still performed 47% better than the baseline. For this reason, Random Forest is used as the default model type in our package. 

 

Table 6. Percentage Differences of MAPE from Baseline  

  C 1 H 6  C 1 H 9  C 2 H 7  C 2 H 1  C 3 H 4  C 3 H 1  C 4 H 8  C 4 H 5  Average 
Baseline  0.32  0.20  0.24  0.14  0.34  0.24  0.35  0.24  0.26 

Prophet  2%  -29%  -25%  -29%  10%  -36%  46%  7%  -7% 
SARIMAX  -36%  -19%  -51%  -31%  -46%  -36%  104%  -46%  -20% 
RF  -51%  -70%  -48%  -55%  -69%  -43%  -42%  -47%  -53% 

* Percentage Difference calculated as (MAPE of each model type – baseline MAPE) / baseline MAPE 

 

There are several hypotheses as to why Random Forest performed better. First, due to the limitations of our sample time scope, the time series models (SARIMAX and Prophet) might lose their advantage in detecting trends. Since there is only slightly more than one year of data in our training dataset, there are not enough cycles to allow the models to capture yearly seasonal trends. Especially for Prophet, which relies heavily on seasonal patterns, the models might not have enough information to make inferences based on seasonality. Similarly, historical trends might not be obvious with limited training data, which could mislead the time series models into making inaccurate forecasts. These problems are not an issue for Random Forest, because it does not rely on historical trends when making predictions. If these hypotheses are true, the issue could be mitigated by including more years of data in the training sample. 

Comparison between Predictors 

Table 7 shows the average rank of feature importance across the 8 hotels, using the feature importance measure returned by Random Forest. A higher rank means the predictor is more influential relative to the other predictors. On average, historical hotel bookings in the past two weeks were important predictors, which is within expectations. To our surprise, air passenger traffic and air shops also ranked highly, even though air passenger traffic is monthly data and air shops are incomplete at the time of prediction. This observation hints at a relationship between hotel demand and a hotel's attractiveness to air travelers, driven by characteristics such as distance to the airport, distance to downtown, and conference room availability. 

 

Table 7. Average Rank of Feature Importance Generated by Random Forest Across 8 Hotels 

Rank  1  2  3  4  5  6  7  8  9 
Variable  Lag 1  Passenger traffic  Lag 2  Shops  Seasonality  Lag 4  Lag 3  Is holiday week  Is not holiday week 

 

It was also observed that whether a week is a holiday week was the least important predictor of all. There are two potential hypotheses as to why this might be the case. First, it is possible that our definition of holiday week was too general and included weeks that might not affect hotel bookings, which introduced noise and misled the models. As discussed, we marked a normal week as a holiday week if a holiday fell at its beginning or end, to account for travel over long weekends. This definition is more reasonable for major holidays like Christmas, when people travel days before the actual holiday. However, for holidays where travel is less common, such as Columbus Day or Veterans Day, this assumption might not be appropriate. Another possible reason is that not all the hotels analyzed in this project are in major tourist destinations or near airports, so these hotels may not receive large numbers of holiday bookings. 

 

Additionally, even though the predictors have different relative importance, we observed that the best predictor combinations varied between hotels, and there was not one predictor that was important to all hotels. Table 8 illustrates the average MAPE across three model types of models with different predictor combinations. The bolded values correspond to the best predictor combination for each hotel. As observed, there was not a predictor combination, or even a single predictor, that consistently returned the lowest error. Similarly, simply including all predictors in the model could introduce noise and result in large errors. Thus, this observation supported our decision that the model pipeline should dynamically find the best predictors with the stepwise feature selection algorithm. 

 

Table 8. Average MAPE across Three Model Types of Models with Different Predictors Combinations 

Predictor Combinations  C 1 H 6  C 1 H 9  C 2 H 7  C 2 H 1  C 3 H 4  C 3 H 1  C 4 H 8  C 4 H 5 
Hotel Booking History (History)  0.31  0.38  0.13  0.15  0.26  0.36  0.37  0.19 
History, Seasonality  0.34  0.22  0.14  0.16  0.22  0.37  0.44  0.33 
History, Seasonality, Holiday  0.31  0.27  0.15  0.11  0.20  0.27  0.42  0.31 
History, Seasonality, Passenger  0.27  0.13  0.15  0.12  0.14  0.17  0.34  0.22 
History, Seasonality, Holiday, Passenger  0.26  0.14  0.15  0.13  0.15  0.17  0.39  0.24 
History, Seasonality, Holiday, Shops  0.27  0.18  0.15  0.19  0.18  0.21  0.56  0.22 
All Predictors  0.28  0.15  0.15  0.10  0.14  0.18  0.54  0.18 
Difference between Best and Worst Combination  8ppt  26ppt  2ppt  10ppt  11ppt  20ppt  22ppt  15ppt 

*  In the last row, ppt refers to “percentage points,” calculated as (max MAPE – min MAPE) for each hotel. E.g., for City 1 Hotel 6, it is 0.34 – 0.26 = 0.08, i.e., 8 ppt. 

Comparison between Low-Volume and High-Volume Hotels 

Finally, we compared model performance on hotels with different average booking volumes. We selected two hotels from each city with differing average total room nights over the sample period, as presented in Table 9. 

 

Table 9. Average Weekly Total Room-Nights from 2018-12-31 to 2020-03-01 

Low Volume  High Volume 
City 1 Hotel 6: 560  City 1 Hotel 9: 950 
City 2 Hotel 7: 620  City 2 Hotel 1: 2220 
City 3 Hotel 4: 1220  City 3 Hotel 1: 3320 
City 4 Hotel 8: 190  City 4 Hotel 5: 1250 

 

Table 10 presents the MAPEs of the two types of hotels in each city and the percentage difference between them. The percentage difference was calculated so that the more positive the value, the better the model performance for the high-volume hotel compared to its low-volume counterpart. In all cities, the average MAPE across the three model types for high-volume hotels was at least 30%, and on average 67.8%, better than for low-volume hotels. The pattern was consistent within each single model type, except for Random Forest in City 3. We are unsure of the reason for this exception and would need more information about each city to form a hypothesis. 

 

There are some hypotheses as to why high-volume hotels were more predictable. Higher-volume hotels are likely more recognizable hotels with more stable demand and historical booking trends that models can capture. Larger hotels might also have specific characteristics that help with predictability. For example, high-volume hotels might be located near downtown or tourist sites and attract more air travelers during holidays, whose effects on bookings can be captured by the holiday and air passenger predictors in the model. The source of guests at smaller hotels might be less obvious (e.g., non-air travelers) and thus less predictable. 

 

Table 10. Comparison of MAPE for Low- and High-Volume Hotels in Each City 

  Model  Low volume  High volume  Perc Diff 
City 1  Prophet  0.33  0.14  80.3% 
SARIMAX  0.21  0.16  26.0% 
RF  0.16  0.06  90.7% 
Average  0.23  0.12  63.4% 
City 2  Prophet  0.18  0.10  55.9% 
SARIMAX  0.11  0.10  15.3% 
RF  0.12  0.06  64.5% 
Average  0.14  0.09  45.6% 
City 3  Prophet  0.37  0.10  114.5% 
SARIMAX  0.18  0.16  14.8% 
RF  0.10  0.14  -28.1% 
Average  0.22  0.15  36.4% 
City 4  Prophet  0.37  0.25  36.3% 
SARIMAX  0.72  0.13  139.4% 
RF  0.20  0.13  47.6% 
Average  0.48  0.17  95.6% 

* Percentage Difference calculated as (low-volume MAPE – high-volume MAPE) / AVG(low-volume MAPE, high-volume MAPE) 

Conclusions and Next Steps 

In this project, we successfully built machine learning models for hotel demand prediction using previous hotel bookings, air passenger traffic volume, air shopping data, holidays, and seasonality. Our models show significant improvement over the baseline. For the sample of eight hotels we analyzed in depth, Random Forest beat the baseline model by 53% on average and had the best performance for seven hotels. In addition, recent historical hotel bookings, air passenger traffic, and air shops were the three most important predictors overall. However, there was no external variable combination that worked well for all hotels, and thus a stepwise algorithm was implemented to select the best external variables. Finally, hotels with different average weekly booking volumes were compared, and models for high-volume hotels performed on average 67.8% better than models for low-volume hotels in the same cities. 

 

The findings of this project suggest many potential avenues of further research. First, including additional data sources, such as city-specific features regarding city type (tourist city or not) and local events (e.g., conferences, sports events, etc.), would help in better understanding and predicting hotel demand. In addition, although the given models were built using only pre-COVID data, it is worth exploring more recent post-COVID data to obtain a more comprehensive picture of hotel demand. Future work should also apply other supervised machine learning models as well as deep learning models, which have proven to work well on time series data; for instance, recurrent neural networks (RNNs) are deep learning models designed to handle sequence dependence. Finally, given that the best model type varies across hotels, it could be valuable to cluster hotels and see whether there are similar characteristics among hotels that perform best with a certain set of external variables. 

References 

[1] Office of Personnel Management, 2022. “Policy, Pay, and Oversight, Pay and Leave”. United States Federal Government. https://www.opm.gov/policy-data-oversight/pay-leave/federal-holidays/#url=2022 

 

[2] Kirsten Perry, 2019. “Unsupervised Machine Learning Approaches for Outlier Detection in Time Series, using Python”. Towards Data Science. https://medium.com/towards-data-science/unsupervised-machine-learning-approaches-for-outlier-detection-in-time-series-using-python-5759c6394e19 

 

[3] A. Tayip Sakka, 2019. “Median Absolute Deviation tutorial from Medium”. Towards Data Science. 

https://towardsdatascience.com/practical-guide-to-outlier-detection-methods-6b9f947a161e 

 

[4] Documentation for the scikit-learn Isolation Forest function: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html 

 

[5] Documentation for Prophet Package 

https://facebook.github.io/prophet/ 

 

[6] Sean J. Taylor and Benjamin Letham, 2017. Forecasting at scale. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3190v2 

 

[7] Documentation for Prophet Hyperparameter Tuning 

https://facebook.github.io/prophet/docs/diagnostics.html#hyperparameter-tuning 

 

[8] Documentation for Stan Package 

https://mc-stan.org/ 

 

[9] Documentation for Multiprocessing Package 

https://docs.python.org/3/library/multiprocessing.html 

 

[10] Documentation for the statsmodels SARIMAX function: https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html 

 

[11] Documentation for the pmdarima auto_arima function: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html 

 

[12] Documentation for the Random Forest model: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html 

 

[13] Documentation for the scikit-learn TimeSeriesSplit function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html 

 

[14] Audhi Aprillant, 2021. “Walk-Forward Optimization — Cross-Validation Technique for Time-Series”. Medium. https://audhiaprilliant.medium.com/walk-forward-optimization-cross-validation-technique-for-time-series-data-61739f58f2c0  

 

 

Prediction of demand elasticity for different markets based on average airfare change

Yiyue Xu, Anthony Yang, Ke Xu, Wencheng Zhang, Yuwen Meng, Hao Liang 

Problem Description

The ability to accurately predict changes in demand in response to fare alterations is of paramount importance in the travel industry. It allows airlines to optimize their pricing strategies and revenue management systems. Historically, demand estimation considered only passenger volume and average fare across different combinations of cabin and number of connections. In this project, we have broadened the scope of these predictions, weaving together diverse data sets to generate more robust and comprehensive demand elasticity models. Understanding the key drivers of demand fluctuations, and whether they are universal or specific to different markets, has been a crucial facet of this research.  

We aimed to construct a model that could predict how changes in average fares impact the demand for different markets. Our work examined these relationships on a market, cluster, and global level, enabling us to tailor our recommendations to the specific characteristics and requirements of different markets and clusters. By leveraging a blend of data sets, including historical booking data and exogenous data sources, we were able to develop a nuanced understanding of the multifaceted factors that drive demand. This comprehensive approach has led to the development of a model that not only predicts demand elasticity but also provides actionable insights for airlines to strategize their market entry and response to new competitors.  

Data Description 

Internal data 

The internal dataset provided by Sabre Corporation for this project comprises a rich assortment of information across several key fields, which are critical to understanding demand elasticity in various markets. Each row in the dataset represents a unique booking, with specific details about the booking and the market in which it falls. The dataset represents bookings for 18 different markets and spans a considerable time frame, with the departure date ranging from ‘2015-01-02’ to ‘2023-12-15’. However, it is important to note that there is a gap in the data for the entire year of 2020 due to COVID-19, where the booking date information is missing. This gap poses a challenge for the analysis, and the strategy to account for this missing data is a part of our methodology. 

External Data 

The external data for this project consists of a variety of metrics that provide additional context and insights to complement the internal booking data from Sabre Corporation. This data includes both air travel volumes and key economic indicators. The project also incorporates several key economic indicators gathered from official sources, including CPI (Consumer Price Index), PMI (Purchasing Managers’ Index) and Exchange Rate that could affect the affordability of travel and international travel demand. We also incorporated additional external data points focusing on geographic1 and economic2 3 factors, as well as tourism4 data to better cluster the markets. The integration of these additional external data sources into our analysis provides a more nuanced and comprehensive understanding of the similarities among markets and helps categorize the markets into distinct clusters. 

Exploratory Data Analysis  

By visualizing and summarizing the data, we can gain insights, identify trends, and detect potential issues that may inform our modeling and prediction efforts. Image 1 presents a global view of our nine selected routes, comprising three domestic US and six international round-trip flights. Image 2 reveals that the markets most significantly impacted by the pandemic are domestic flights within the US. Image 3 displays airline market shares for nine round-trip routes, aggregating airlines with less than 20% market share into “Others”. The visualization reveals that for most of these routes, one or two airlines predominantly dominate the market. 

Image 1. The view of nine selected routes. 

Image 2. Markets impacted by the pandemic. 

Image 3. Airline market shares 

Image 4. Passenger volume. 

The passenger volume shows significant variation between the pre-COVID period and the time during and after the pandemic (Image 4). This discrepancy is partially due to the unavailability of booking data for 2020 and the inherent changes in travel patterns caused by the COVID-19 pandemic.  

Preprocessing 

Data Cleaning 

The data cleaning process performed three core tasks: converting data types, removing negative values, and eliminating outliers, which can substantially distort statistical metrics and lead to misleading conclusions. Outliers were identified as the top 10% of ‘AvgFare’ by cabin for each market. This approach allows us to account for inherent variability between different markets and cabin types. We aimed to strike a balance between retaining as much data as possible for robust analysis and curbing the influence of extreme values that could distort our findings. 

Feature Engineering 

Feature engineering is an indispensable process that involves creating new features from existing data to improve the performance of models. By extracting more information from the data, we can provide the models with additional signals, enhancing their capacity to learn and make predictions. A set of new features from our cleaned dataset were generated by transforming and categorizing existing features in ways that highlight the factors most relevant to our analysis. By doing this, we enabled our models to capture more complex patterns and relationships within the data, ultimately leading to more robust and accurate predictions. The engineered features are chosen to represent a broad range of factors, from timing and logistics to market competition, allowing us to model the intricacies of the airline industry comprehensively. 

Data Aggregation 

In this project, data aggregation plays a pivotal role in aiding us to examine and model our data at multiple levels of granularity. Various versions of the dataset were created according to a set of predefined feature selection strategies and aggregation methods. 

Data Grouping Process 

We define a grouping as a combination of a feature selection strategy and an aggregation method. Hence, with three feature selection strategies and four aggregation methods, we form twelve distinct groupings. Each grouping provides a unique view of the data, capturing different levels of detail and different combinations of factors. 

This systematic and flexible approach to data grouping offers a way to generate a multifaceted understanding of our data. We can scrutinize how different factors influence airline fares under various grouping methods, leading us to more accurate and robust predictive models. By analyzing results across different groupings, we can identify the most informative level of data aggregation and discover key drivers of airline fares, offering valuable insights for strategic decision-making. 

Data Scope Adjustment: Pre-Covid and Full Data Options 

To enhance the flexibility and applicability of our model, we introduce a data scope adjustment feature that allows users to tailor the data inputs according to their specific needs or analytical objectives. This feature enables users to choose between the full dataset and the pre-Covid subset for their modeling process. By offering these data scope adjustment options, we ensure that our model can cater to a wide range of analytical goals and research questions, from understanding long-term industry trends to investigating the impacts of significant global events like the Covid-19 pandemic. 

Cross Validation 

The nature of our project requires us to carefully consider the unique characteristics of our dataset when implementing model validation techniques. Cross-validation, a widely used method for assessing the robustness and generalizability of models, is a crucial part of our analysis. However, we can’t use it in its traditional form due to the temporal structure of our data. Here, time plays a vital role, and the order of the data points cannot be shuffled as it may lead to data leakage – using future data to predict the past. 

To circumvent this, we resort to Time Series Cross-Validation, a specialized form of cross-validation designed for time-dependent data. This technique respects the temporal order of observations, which is crucial for our airline booking data. The primary advantage of this method is that it prevents overfitting and provides a more accurate measure of model generalizability. 

In the Time Series Cross-Validation process, the data is divided into a number of folds. Unlike traditional cross-validation, the folds are not created by random sampling but by moving a time window across the dataset. For each iteration, the model is trained on the data within the time window and validated on the data following this window. 

This approach aligns with the inherent diversity and complexity of our data, reflecting the variable patterns and trends present across various granularity levels of data. By tuning and validating our models in this way, we are better equipped to understand the data’s underlying structure and dynamics and can provide more accurate and insightful forecasts. 

 

Model Summary 

  1. Linear Regression: Linear regression is a basic predictive analytics technique. It is used to explain the relationship between one dependent variable (the variable you're predicting) and one or more independent variables (the variables used to make the prediction). The output of a linear regression model is a straight line that best fits the distribution of the provided data. Linear regression is a good baseline model as it is simple, interpretable, and fast to train. 
  2. Poisson Regression: Poisson regression is a type of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable has a Poisson distribution and uses a log link function. 
  3. Random Forest: Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. This model is known for its flexibility, ease of use, and relative robustness to overfitting. The "randomness" in its name comes from two aspects: it randomly samples data points to build trees, and it randomly samples features for each split during tree building. Random Forest can capture non-linear relationships and provides a good balance between simplicity and predictive power. It also handles missing values and categorical features well. 
  4. XGBoost: eXtreme Gradient Boosting, or XGBoost, is a powerful machine learning algorithm based on the gradient boosting framework, which combines multiple weak prediction models, typically decision trees, into a stronger model. XGBoost is designed to be highly efficient, flexible, and portable. It provides parallel tree boosting and has a variety of regularization parameters that help prevent overfitting. XGBoost can potentially provide more predictive power than Random Forest. 
  5. Neural Network: Neural networks are a class of models within the general machine learning literature. They are inspired by biological neurons and the connections between them. Neural networks are composed of layers of interconnected nodes, or "neurons", with each layer using the output from the previous layer as its input. This composition allows neural networks to model complex, non-linear relationships. Neural networks have been used successfully in a variety of applications, from image and speech recognition to natural language processing and airline ticket price prediction. Neural networks, especially deep learning models, are the most flexible and can potentially capture complex patterns and interactions in the data, but they are more computationally intensive to train and require more data. 
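As a rough sketch, these five model types could be instantiated with common Python libraries as follows (the class choices and hyperparameters are illustrative stand-ins, not the tuned configurations used in the pipeline):

```python
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    "linear":        LinearRegression(),
    "poisson":       PoissonRegressor(alpha=1.0),             # log link, count target
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "xgboost":       XGBRegressor(n_estimators=500, learning_rate=0.05),
    "neural_net":    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000),
}

# fitted = {name: m.fit(X_train, y_train) for name, m in models.items()}
```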

 

Performance Analysis 

As mentioned in the model pipeline section, all five model types are trained on 13 different group aggregations of the data. The models are then trained at every granularity level (full/precovid, cluster/market/global), and three performance statistics (RMSE, MAPE, and SMAPE) are calculated to analyze the models' performance. 

Formulas for the Performance Metrics 

The main evaluation criterion used here is SMAPE, because it mitigates the asymmetry and large-value bias that come with MAPE and remains relatively interpretable when values are small. 
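For reference, with $A_t$ the actual value and $F_t$ the forecast for period $t$, the standard definitions assumed here are (SMAPE has several variants in the literature; the symmetric form shown is consistent with the 0-2 range of the values reported below):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^2}, \qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|, \qquad
\mathrm{SMAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|F_t - A_t\right|}{\left(\left|A_t\right| + \left|F_t\right|\right)/2}$$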

 

Overall Performance Analysis 

All model results from all three pipelines are recorded for comparison and analysis. Different grouping methods feed different datasets to different models; consequently, model results are only comparable within, not across, grouping methods. To avoid comparing model results across differently grouped datasets, the average SMAPE was computed for every model and grouping-method combination, and the combination with the lowest average SMAPE was taken as the best performer. 

The three different pipelines not only break the problem down into all possible granularities but also shed light on the best starting point for this specific problem. When all pipeline results are taken as a whole, the final results are surprisingly consistent: applying no grouping at all (i.e., leaving the time dimension of the dataset as is) gives steady and far superior performance compared with all other grouping methods. The best performers in all three pipelines have SMAPE averages around 0.45, with the cluster-level Neural Network being the best-performing model overall (SMAPE 0.43), the individual-level XGBoost the second best (SMAPE 0.46), followed by the global-level Neural Network (SMAPE 0.48). Model-level differences vary across markets/clusters; they can be as large as roughly 0.2 (for example, SIN_SFO Random Forest at 0.44 vs. Poisson at 0.26) or too small to notice. 

Image 5. Features Used in No Grouping 

Image 6. Clustering Model Performance 

When considering the efficiency, adaptability, and practicality of putting the pipeline into production, the tradeoff between model accuracy and generalizability is one of the main factors to weigh. A one-model-fits-all setup offers the highest generalizability and the easiest implementation, yet the cost in accuracy could negate some of its benefits. On the other hand, if every market were given its own best-performing model and grouping combination, implementation would be extremely costly and nearly impossible to maintain. Our experiment reflects this tradeoff well: if each individual market had its own model and grouping combination, SMAPEs could be as low as 0.2; if all markets shared a single model, the average SMAPE would more than double. Consequently, our experiment underscores the possibility of clustering markets and deploying models in a flexible, less costly manner. In fact, clustering markets not only balances the tradeoff but also has no negative impact on model accuracy. Regardless of the final position on this tradeoff, the pipeline provides immense flexibility in determining the ideal granularity and generalization mechanism, laying the groundwork for future research and production work. 

One of the major findings from the experiment is that grouping the data by different time granularities might not be as beneficial as one would expect. Although the time-series nature of the problem naturally suggests preprocessing steps such as aggregating over various time-dimension granularities, the experiment shows that it is best to leave the dataset as it is. One possible explanation is that hyperparameter tuning for all models was performed on the ungrouped dataset (mainly for computational cost, as mentioned in the model pipeline section). When the same hyperparameters are reused on new data structures, the models may fail to learn the new underlying patterns, leading to weaker results for all grouping variations. Nevertheless, restructuring the data by time-dimension granularity remains a valid and relevant approach for future research; with enough computational resources to tune each model on each data structure, one could potentially reach new conclusions about best practices in time-dimension granularity. 

 

PED Analysis  

Price Elasticity of Demand Formula 
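For reference, the standard point-elasticity form assumed in this analysis is

$$\mathrm{PED} = \frac{\%\,\Delta Q}{\%\,\Delta P} = \frac{(Q_{\mathrm{new}} - Q_{\mathrm{old}})/Q_{\mathrm{old}}}{(P_{\mathrm{new}} - P_{\mathrm{old}})/P_{\mathrm{old}}}$$

where $Q$ is the (predicted) passenger demand and $P$ is the ticket price.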

To understand the underlying trend in PED changes, each market/cluster is examined while controlling for the analysis scope (precovid), number of stops (0), and tktAl (high demand), and varying the cabin (economy vs. business), original price (the average price for each market), and new price (a 20% increase over the respective original price). For example, the average price of economy tickets from ATL to BOS is $200, so the economy-cabin PED was calculated from an original price of $200 and a new price of $240. With 2023 as the focus of the study, the monthly trend in PED is plotted for all 12 booking months and their corresponding departure dates spanning the following 12 months. For example, the PED plot for January 2023 was created from 12 possible departure months ranging from January 2023 to December 2023. A PED of 0.1 for a January 2023 booking and January 2023 departure means that businesses could expect a 0.1% increase in the number of passengers booking in January 2023 for a January 2023 departure if the price increases by 1%. The same approach and logic also apply to the cluster-level analysis. This approach provides a clear visualization of how the PED for a given book date changes with how far the departure date lies in the future, and it keeps the price change constant so that PED ranges are directly comparable across markets/clusters. 
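For illustration, the ATL to BOS economy example above can be computed as follows; the demand quantities here are hypothetical placeholders, since in the study they come from the model's predicted passenger counts at each price:

```python
def price_elasticity(q_old: float, q_new: float, p_old: float, p_new: float) -> float:
    """Point elasticity: percent change in quantity divided by percent change in price."""
    pct_q = (q_new - q_old) / q_old
    pct_p = (p_new - p_old) / p_old
    return pct_q / pct_p

# Hypothetical predicted demand at $200 and at the 20% higher price of $240
ped = price_elasticity(q_old=1000, q_new=1002, p_old=200, p_new=240)
print(round(ped, 3))   # 0.2% / 20% = 0.01
```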

Regardless of market-level granularity, most PED trends for the business cabin are positive, while those for the economy cabin are mixed. The difference between cabins is expected: business travelers are generally less responsive to price changes, whereas leisure travelers weigh additional factors such as high season and vacation time windows. However, the drastic differences in PED ranges are still worth noting. For SFO_SIN, for example, the PED range for the economy cabin (Fig. 1a) is around 0.07 while that for the business cabin (Fig. 1b) is as large as 6. The same pattern appears in Cluster 1, Cluster 2, JFK_CDG, and other markets. 

Figure 1a. SFO_SIN Economy Cabin PED Trend 

Figure 1b. SFO_SIN Business Cabin PED Trend 

 

Cluster Level PED 

All four clusters show interesting trends in PED variation. Clusters 1 and 2 are analyzed in detail because they include diverse markets and are representative of cluster-level differences. Detailed cluster information is shown below. 

Cluster 0: LHR_SYD, SYD_LHR 

Cluster 1: ATL_BOS, BOS_ATL, DEN_DFW, DFW_DEN, LAX_SEA, SEA_LAX, CDG_JFK, JFK_CDG 

Cluster 2: FRA_SFO, SIN_SFO, MEX_IAH, GRU_MIA 

Cluster 3: IAH_MEX, MIA_GRU, SFO_FRA, SFO_SIN 

Cluster Information 

The most notable fact about Cluster 1, beyond its business-cabin PED range being much larger than its economy-cabin range, is that its PED is almost always positive regardless of cabin. A consistently positive PED indicates that travelers in Cluster 1 are insensitive to price changes, which is understandable for these eight popular markets. (Fig. 2) 

Figure 2a. Cluster 1 Economy Cabin PED Trend 

Figure 2b. Cluster 1 Business Cabin PED Trend 

Cluster 2 contains international flights arriving in the United States. Travelers are least responsive to price changes later in the year, regardless of the actual departure month (Fig. 3); a shift in travelers' mindset as the holiday season approaches may be one of the major reasons for this noticeable pattern. However, the mixed economy-cabin PED shows that markets in Cluster 2 are not as consistently popular as those in Cluster 1. The absence of a peak demand season further suggests an absence of leisure travel, which aligns with a common characteristic of the four destination cities within Cluster 2. 

Figure 3a. Cluster 2 Economy Cabin PED Trend 

Figure 3b. Cluster 2 Business Cabin PED Trend 

 

The advantage of clustering markets, as shown here, is that it preserves common characteristics across markets and summarizes essential traveler behavior patterns at the cluster level. In more complex cases, for example when more markets are present, clustering would require more market-level information to fully capture all common characteristics (i.e., both local and global underlying patterns across markets) and thereby produce clusters that are representative as well as holistic. 

 

Market Level PED 

At the market level, two markets (SFO_SIN, JFK_CDG) show clear PED trends for both the economy and business cabins, while the rest show clear trends for only one of the two cabins. 

An interesting finding is that travelers' booking behavior (i.e., responsiveness to price changes) for a specific departure month can change drastically throughout the year. This trend is especially noticeable for well-known tourist cities such as Sydney. Demand for flights to Sydney peaks in October, with travelers being least responsive to price changes when they book in April, July, and September; this booking behavior reflects the fact that Sydney's tourism high season starts in October (Fig. 4). A similar pattern appears in SEA_LAX, where travelers are most insensitive to price changes for July and August flights when booking in June (Fig. 5). 

Figure 4. LHR_SYD Economy Cabin PED Trend 

Figure 5. SEA_LAX Economy Cabin PED Trend 

Mexico City can also be considered a popular tourist destination, with the dry season being the best time to visit. Travelers are least responsive to price changes for flights in January and February, the peak of the dry season. Their booking behavior, however, differs from that for LHR_SYD: travelers are less responsive regardless of booking time (Fig. 6). Both Sydney and Mexico City are renowned tourist destinations. The consistently positive PED for IAH_MEX, in contrast to the more varied PED for LHR_SYD, can be attributed to the close proximity of the United States and Mexico, which facilitates frequent business interactions, labor transportation, and cultural exchange. 

Figure 6. IAH_MEX Economy Cabin PED Trend 

These findings about travelers' responsiveness to price changes at the individual market level indicate the need for a more market-specific, case-by-case approach. For known popular seasons in different markets, traveler behavior can be inconsistent (i.e., less responsive at certain times of the year) or surprisingly consistent. These subtle differences and delicate trends need to be coupled with further market research for businesses to accurately dissect and predict changes in traveler behavior. 

 

Conclusion and Next Steps  

Our experiment dissects the problem of predicting price elasticity of demand by exploring all possible combinations of research scope granularity, market-level granularity, and time-dimension granularity. With our exploratory study as a starting point, research on price elasticity of demand in the travel industry can bring immense value to businesses: by quantifying the responsiveness of demand to price changes, businesses can better understand traveler behavior, optimize pricing strategies, and reevaluate market segmentation. 

The results of our study show that while all market-level granularities reach similar model performance, the balance of accuracy and computational cost at the cluster level highlights a promising starting point for future research. Our research also suggests that time-dimension granularity might not be a vital factor in predicting price elasticity of demand, although limitations in both data and computational resources, rather than actual irrelevance, might be the underlying cause of this pattern. 

One of the most important findings from analyzing PED trends is that cabin has a significant impact on travelers' responsiveness to price changes. Business-cabin travelers are almost always insensitive to price changes, while economy-cabin travelers react differently depending on factors such as the holiday season, popular tourism seasons, and other market-specific trends. Another important discovery is that for known popular seasons (tourism or holiday), travelers may react differently to price changes throughout the year: they can be less responsive in certain months, and this responsiveness follows different patterns in different markets. Businesses could benefit from researching these traveler behavior patterns further and setting corresponding pricing strategies. 

Building on the insights gained from this research, several next steps could further enhance our understanding and application of price elasticity prediction in the airline industry. First and foremost, understanding market-level differences and similarities in order to better cluster markets could be of vital importance. Our study focuses mostly on market characteristics, yet factors requiring more domain knowledge, such as competition levels, could also contribute to more holistic clusters. Furthermore, finer time-dimension granularity designs, informed by domain knowledge and common practice, could be explored in more detail; while our study considers four of the most common time-granularity combinations, others might be worth considering. Similarly, the external data included in our study is limited to traffic and economic indicators; other data, such as common promotion periods, could be included to capture more factors that affect traveler behavior. Moreover, the variety of data structures produced by different grouping mechanisms calls for more research on best practices for comparing model performance across data structures. Beyond averaging evaluation metrics, we proposed using the rank-based Kruskal-Wallis test along with Dunn's post-hoc test with Bonferroni correction. 
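A hedged sketch of that proposal is shown below, assuming the optional scikit-posthocs package is available; the results frame, its column names, and the SMAPE values are illustrative placeholders:

```python
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp   # optional dependency: pip install scikit-posthocs

# Hypothetical SMAPE results per grouping method (illustrative values only)
results = pd.DataFrame({
    "grouping": ["none"] * 4 + ["weekly"] * 4 + ["monthly"] * 4,
    "smape":    [0.45, 0.46, 0.44, 0.48, 0.60, 0.58, 0.62, 0.59, 0.55, 0.57, 0.54, 0.56],
})

# Kruskal-Wallis: rank-based test of whether the grouping methods differ overall
groups = [g["smape"].values for _, g in results.groupby("grouping")]
stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.4f}")

# Dunn's post-hoc test with Bonferroni correction for pairwise comparisons
dunn = sp.posthoc_dunn(results, val_col="smape", group_col="grouping",
                       p_adjust="bonferroni")
print(dunn)
```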

In conclusion, our study of predicting price elasticity of demand in the travel industry highlights its value for businesses. By quantifying demand responsiveness and optimizing pricing strategies, businesses can better understand traveler behavior and market segmentation. Our findings suggest that clustering markets at the cluster-level granularity offers a favorable balance of accuracy and computational cost. Further research should explore market-level differences, finer time-dimension granularity, and additional external factors to enhance prediction accuracy and decision-making. Developing best practices for comparing model performance across different data structures is also recommended. 

Determining NHL player salaries and assessing General Manager effectiveness with Machine Learning

| By Louis-Charles Généreux and Allen Xu |

Problem definition

As avid sports fans, we have always been intrigued by the opaque world of player evaluation. In sports like ice hockey, where teams' salary spending is capped, signing the right players at the right price is essential to attaining team success. With most National Hockey League (NHL) teams spending close to all the money available under the salary cap, a team's competitive edge comes partly from its ability to identify "under-priced" players. General Managers' (GMs) responsibility is to build rosters of high-performing, fairly paid players to maximize their team's chances of success. In 2021, the NHL salary cap is $81.5M, meaning that no team can see its total player compensation exceed this figure; see the NHL's statement and a brief explanation of the cap's history.

Because hockey is a non-static sport with continuous flow, statistical analysis is not as straightforward as it is for baseball (where there are binary outcomes for each play). GMs are often retired players who rely more on subjective reports from a vast network of scouts than on data to make contractual decisions, which can bias those decisions. Research by Lanoue (2015) at the University of Windsor has identified player attributes that translate to salary premiums, all other factors held constant. For example, being a past Stanley Cup winner typically translates to a 19% salary premium, while being an enforcer (a player who engages in on-ice fist fights) leads to a 15% salary premium; these highly priced attributes might not actually translate to team success.

Acknowledging that GMs have biases that lead them to pay salary premiums to players who might not be the highest performers on their teams, we wondered: are these premiums worth it? To answer this question we set out on a mission to determine each NHL player's fair salary, or what each player should be paid based on his contribution to his team's success. In doing so, we:

  • Built a tool to evaluate each player’s performance based on team contribution metrics
  • Clustered players based on said metrics and calculated within-cluster average salaries for comparable players
  • Developed a framework to assess General Managers’ effectiveness based on tendencies to over/under pay players relative to their contribution

Before discussing our approach and findings in more details, we recommend that readers not familiar with the game of ice hockey watch the following short video. It succinctly explains the rules of the game while also showing game footage.

Sources of inspiration for this project

Sports analytics has become a hot topic in recent decades, especially after the public witnessed the success of analytics departments in professional sports, highlighted by the movie Moneyball (based on Michael Lewis' best-seller). As previously mentioned, hockey is a continuous sport, making it more difficult to model statistically. However, many researchers and fans have published work on the sport. While our review may not be fully exhaustive, we are fascinated by the work published by Ryder (2012), Gramacy (2013) and Lanoue (2015) and by statistics released on analytics websites such as Evolving Hockey. Below is a short summary of their findings and a discussion of how we intend to build on some of their work.

Historically, one of the most common metrics in hockey player evaluation has been plus-minus (the difference between goals scored for and against a player's team while that player is on the ice). While this metric is simple and does not require granular play-by-play data, Gramacy (2013) points out that it is largely dependent on the performance and strength of the whole team rather than the individual player. The same can be said of other shot-based metrics such as Fenwick and Corsi.

Ryder (2012) proposed the idea of marginal contribution, or crediting each goal and assist for a team to individual players with different weights based on factors like points produced, plus-minus, the strength of the opposing team and play situations. We believe this paper to be a major breakthrough in the hockey analytics space: Ryder attributed shares of each of a team’s goals (for and against) to players on the team, allowing his readers to compare offensive contribution with defensive liability. More recently, websites like Evolving Hockey have created some buzz in the hockey community with the publication of their Goal Above Replacement (GAR) metric, which aims to measure whether a player performs above “replacement level”.

Beyond the evaluation of player contribution, another common theme in prior research has been to determine whether or not players are economically valuable based on their contracts. Lanoue (2015) studied the relationship between winning a Stanley Cup and contract terms for players. Gramacy (2013) clustered players based on their position and created a probability density function based on performance across different salary brackets.

In this paper, we aim to highlight both the offensive and defensive prowess of each player, but with more granularity than Ryder. Rather than basing our analyses on season-long aggregated goals and assists data like Ryder (who computed metrics from one row of summary statistics per player), we build our metrics from shot attempts with many more nuances such as shot location, difficulty and context (thanks to the availability of play-by-play statistics). Building metrics on shot attempts rather than goals also allows us to compute expected goals, which can be compared with actual goals scored; this lets us explore the notion of "clutch" play, or conversion of opportunities, in this paper. We then perform player clustering, taking into account more dimensions than Gramacy, such as career stage, position, playing style and physicality. Finally, we evaluate GMs' dealmaking abilities, which to our knowledge has not yet been attempted in quantitative fashion.

Analysis

Development of player evaluation metrics

To evaluate players' performance, we analyzed each scoring opportunity of the 2019-2020 season and computed new player evaluation metrics, taking shot difficulty and play situation into account. These metrics provide more context on player contribution than simple scoring-related statistics (e.g., goals or assists produced, plus-minus), which are partially driven by luck and overall team strength. Compiling the number of opportunities created and assessing their likelihood of conversion to goals requires play-by-play breakdowns of games; more specifically, we need to know which players were on the ice for each shot taken and the location of the shot relative to the opposing team's net.

We obtained our detailed game data from the NHL’s official game-sheets, which report ~300 distinct plays per game (face-offs, shots, blocked shots, hits, goals, penalties). The game-sheets contain a list of all players present on the ice for every play, in addition to distance to goal for every shot or goal. In Figure 1, we show an official game sheet and the extracted data in pandas format after scraping. Play by play data for this project consists of:

  • Play description (shot, goal, face-off, shot block, play stop)
  • Distance from net, X and Y coordinates of a play
  • Strength (even strength: both teams have 5 players on the ice; power-play/penalty-kill: one team has a one-man advantage due to a penalty)
  • Name of all players on the ice for a play

While all of these fields can be extracted from scraping of game-sheets (as displayed in Figure 1), they can also be extracted with a bit of data cleansing using the various tables in the following data set (https://www.kaggle.com/martinellis/nhl-game-data).
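As a minimal sketch of pulling a game-sheet into pandas, using the sample game-sheet URL listed in the data sources (the exact table layout of the HTML report may require additional cleaning beyond what is shown here):

```python
import pandas as pd

URL = "http://www.nhl.com/scores/htmlreports/20192020/PL020227.HTM"

# read_html returns every <table> on the page; the play-by-play grid is one of them
tables = pd.read_html(URL)
plays = max(tables, key=len)           # heuristic: the longest table holds the plays
plays.columns = plays.iloc[0]          # heuristic: the report's header sits in the first row
plays = plays.iloc[1:].reset_index(drop=True)
print(plays.head())
```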

Figure 1. Example of game-sheet data leveraged for this project

With data on each shot taken during the 2019-2020 NHL campaign, we computed the expected % conversion of shots to goals, based on shot location and play situation. More specifically, we split the offensive zone in three zones:

  • Zone 1: Net proximity
  • Zone 2: Slot (good angle to net and reasonable distance to net)
  • Zone 3: All other shots taken

We then segmented all shots taken based on three possible game situations:

  • Even Strength: Both teams are playing with equal numbers of players
  • Power-Play: The attacking team (the team taking the shot) has an advantage of one or more players thanks to penalties taken by opposing players
  • Short-Handed: The attacking team (the team taking the shot) is down one or more players due to a penalty taken by a teammate

In Figure 2, we present 4 images:

  • On the top left, we show a heat map of shots per square foot in the NHL offensive zone during the 2019-2020 season. We see heavy shot volumes around the net and a reverse triangle (in zones where the shot angle to the net is advantageous)
  • On the top right, we show a heat map of goals scored per square foot in the NHL offensive zone during the 2019-2020 season. We see heavy goal volume in close proximity to the net, but minimal goals from elsewhere (indicating that conversion rates are not equal across the ice)
  • On the bottom left, we split the NHL ice in 3 zones, which we will use for our analysis (net proximity, slot, from distance)
  • On the bottom right, we compute the conversion rate of shots to goals based on shot zone (from the bottom left figure) and the game situation (even strength, power-play, short-handed)

These analyses confirmed general knowledge that:

  1.   the conversion rate of shots to goals is generally higher as distance to goal is reduced and
  2.   the conversion rate of shots from a given zone is higher in power-play situations
Figure 2. Representation of shot zones on NHL ice and conversion percentage of shots to goals per zone and situation
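A sketch of the conversion-rate computation, under the assumption that each shot attempt has already been labeled with a zone and a game situation (the DataFrame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical shot-level frame: one row per shot attempt
shots = pd.DataFrame({
    "zone":      ["net", "slot", "distance", "net", "slot", "net"],
    "situation": ["even", "even", "even", "power_play", "power_play", "even"],
    "is_goal":   [1, 0, 0, 1, 1, 0],
})

# Conversion rate = goals / shots for every zone-situation combination
conversion = (shots.groupby(["zone", "situation"])["is_goal"]
                   .mean()
                   .rename("conversion_rate"))
print(conversion)
```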

To compute player-specific metrics, we first summarized all shots taken and all shots given up to the opposing team while a player is on the ice, for each player-game combination during the 2019-2020 season (resulting in 40,000+ player-game entries). Based on this information and on the shot conversion rates displayed in Figure 2, we began computing a set of new metrics:

(1) Expected goals for and against, generated while player is on the ice:

  • Expected goals contributed to: the summation of (total number of shots taken by team while player is on the ice per zone-situation * shot conversion rate)
  • Expected goals given: the summation of (total number of shots taken by the opposing team while player is on the ice per zone-situation * shot conversion rate)

These metrics were then normalized per 60 minutes played for each player, allowing for "apples to apples" comparisons across players. For interpretation purposes, players with a high "expected goals contributed to" generate a high number of scoring chances for their teams, while players with a high "expected goals given" are defensive liabilities (they give up many scoring chances to opponents).

(2) Offensive contribution to defensive liability ratio:

  •  Expected goals contributed to / Expected goals given: the ratio of “expected goals contributed to” and “expected goals given”

A number above 1 indicates that the player generates more offensive chances than he gives up (positive contributor to the team).

(3) Player-specific ability to convert chances to points:

  • Clutch factor: Actual points produced (goals + assists) divided by expected goals contributed to

A high number can indicate “clutch” play, an ability to perform under pressure, or pure luck (leading to higher conversion). A low “clutch factor” indicates that a player converts opportunities to points at a lower rate than peers, all other factors considered.
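A simplified sketch of these three metrics, assuming shot counts have already been aggregated per player, zone and situation and joined with the conversion rates above (all names and values are placeholders, not actual 2019-2020 data):

```python
import pandas as pd

# Hypothetical per-player aggregates: shots taken by the team (for) and by opponents
# (against) while the player was on the ice, with the matching conversion rate
player_shots = pd.DataFrame({
    "player":          ["A", "A", "B", "B"],
    "zone_situation":  ["net_even", "slot_even", "net_even", "slot_even"],
    "shots_for":       [40, 60, 30, 45],
    "shots_against":   [35, 50, 45, 55],
    "conversion_rate": [0.18, 0.08, 0.18, 0.08],
})
toi_minutes = pd.Series({"A": 900.0, "B": 700.0})        # total time on ice
points = pd.Series({"A": 45, "B": 30})                   # goals + assists produced

xg = (player_shots
      .assign(xg_for=lambda d: d.shots_for * d.conversion_rate,
              xg_against=lambda d: d.shots_against * d.conversion_rate)
      .groupby("player")[["xg_for", "xg_against"]].sum())

per60 = xg.div(toi_minutes, axis=0) * 60                          # (1) expected goals per 60 min
per60["contribution_ratio"] = per60.xg_for / per60.xg_against     # (2) offense vs. liability
clutch_factor = points / xg["xg_for"]                             # (3) actual points / expected goals
print(per60, clutch_factor, sep="\n\n")
```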

We believe these metrics tell a much more complete story about actual player contribution than traditional NHL statistics. Traditional metrics, which are purely outcome-driven, are influenced by luck and team strength and do not factor in play difficulty or situational details. Our new metrics, on the other hand, let us see discrepancies between expected and actual offensive production, and also give insight into a player's defensive liability (which traditionally goes unmeasured and unreported). In Figure 3, we provide a definition and interpretation for these new metrics and show sample player profiles.

Figure 3. Details on new player evaluation metrics and example of player profiles produced by the tool

To highlight the potential value of our new player evaluation metrics to NHL executives, we studied the profiles of two 2019-2020 standouts who were both elite point-scorers but had completely different "clutch factors". Bo Horvat, whose calculated "expected total goals contributed to" was 81.5, produced 53 points (22 goals and 31 assists). Mika Zibanejad, whose "expected goals contributed to" was 69.2, produced 75 points (41 goals and 34 assists). Full player profiles for these two players are provided in Figure 4.

While fans and executives might jump to conclusions that Zibanejad is far superior to Horvat based on traditional statistics (22 more points produced over the season), the use of “expected goals contributed to” and “clutch factor” tells a more complete story.

Horvat managed to produce a first-tier offensive output with 53 points while sitting at the low end of the NHL's clutch-factor range (one might say he had bad luck). Zibanejad, on the other hand, was among the players with the highest "clutch factor" in the whole league, indicating that he likely benefited from lucky bounces throughout the campaign. In the long run, we suspect that luck might display some mean-reverting behavior.

Figure 4. Illustration of "clutch factor's" importance in understanding player performance

Player clustering and evaluation of fair salary

After having developed new metrics to compare players more fairly, we set out to cluster players in order to find “fair salaries” within each cluster. The dimensions used to cluster players were the following:

  • Net contribution per 60 minutes: the difference between goals contributed per 60 minutes and goals given per 60 minutes
  • Offensive skills: based on scaled expected goals contributed per 60 minutes
  • Physicality: based on scaled number of hits during the full season
  • Position: forward or defense-man
  • Career stage: veteran player (28 and older) or young player (27 and younger)

In Figure 5, we illustrate the full clustering and present some notable NHL stars who fit in each cluster.

At the highest level, all 671 regular NHL players from the 2019-2020 season can be split into those whose net contribution is positive (those who generate more scoring chances than they give up) or those whose net contribution is negative. We can then split these two groups again based on their offensive skills (those who generate above average offensive chances per 60 minutes played, versus those who don’t). Interestingly, it is possible for players to be offensive-threats but overall liabilities for their teams. It is also possible to not be offensive threats (i.e., produce less offense per 60 minutes than average NHLers) but be net positive contributors to the team.

For example, some high-end offensively skilled players such as Sam Bennett or Bobby Ryan (negative-net-contribution players who are offensive threats) generate more scoring chances per 60 minutes than their peers, yet are overall negative contributors to their teams: the quality chances they give up to opponents outweigh the scoring chances they generate! On the flip side, veterans such as Tyler Bozak and Derek Stepan (positive net contribution, all-rounders who do not figure among the league's scoring elite) are not flashy offensively but are highly effective; they give up fewer chances to opponents than they produce for their teams.

Running this clustering exercise allows us to identify value-adding players (positive net contributors) and the true elite of the NHL (positive net contributors, with high offensive skills: those who create high numbers of offensive opportunities). This clustering was not based on traditional output-based statistics, but rather on the quality of chances created and on the overall net contribution of players (taking into account the defensive aspect of the game). To determine fair salaries within a cluster, we make the assumption that based on their high similarities in performance metrics, players who are within a given cluster are comparable. We averaged the salary of all players within a cluster to estimate the “cluster fair salary”, for each player archetype. We see a few trends:

  • For equivalent profiles (similar net contribution, offensive skills, physicality and position), veterans tend to be paid more than younger players. This is partly because NHL entry-level salaries are capped for the first three years of players' careers.
  • On average, positive net-contribution players are better remunerated than negative net-contribution players.
  • Offensive threat, positive net-contribution players (the NHL’s real elite) are the highest remunerated.
Figure 5. Clustering of NHL players and evaluation of fair-salary within each cluster

Assessment of General Manager effectiveness

After having computed each player’s fair salary based on clustering, we set out to evaluate GM effectiveness. Our objective was to determine whether or not a GM has a tendency to over or under pay his players (relative to their fair salary).

The first step to running this analysis is assigning all NHL players to their respective teams and aggregated archetypes as displayed below in Figure 6. All players under contract by a given team and GM are classified based on their cluster, and are represented by a dot colored based on the comparison of their actual salary to their fair salary.

Figure 6. Count of players per aggregated archetype, by team

After assigning players to their teams and simplified clusters, we summed the difference between total salaries paid and total fair salaries for each team and simplified cluster. We then summed the over/under-spend amount across all clusters to determine the team-wide amount saved or overspent relative to fair salaries. The amount over/under-spent by a team for a given archetype is displayed in Figure 7.
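A sketch of that aggregation, assuming a player-level frame with actual salary, the cluster fair salary, and team and archetype labels (the column names and dollar amounts are hypothetical):

```python
import pandas as pd

# Hypothetical player-level salary data (in $M)
players = pd.DataFrame({
    "team":        ["MIN", "MIN", "VGK", "VGK"],
    "archetype":   ["elite", "all_rounder", "elite", "all_rounder"],
    "salary":      [9.0, 4.5, 7.0, 3.0],
    "fair_salary": [7.5, 3.5, 8.5, 4.0],
})

players["over_under"] = players["salary"] - players["fair_salary"]

# Over/under-spend per team and archetype, then the team-wide total (>0 means overpaying)
by_archetype = players.groupby(["team", "archetype"])["over_under"].sum()
by_team = players.groupby("team")["over_under"].sum()
print(by_archetype, by_team, sep="\n\n")
```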

During the 2019-2020 season, the Minnesota Wild franchise overpaid its players by $13.7M relative to their performance, while the Vegas Golden Knights underpaid their players by $14.2M.

Figure 7. Analysis of fair salaries versus actual salaries paid, by team

In closing, we mapped each NHL team based on two key GM characteristics:

  • To what extent does the GM spend the money available to him (how much money below the salary cap is left at the end of the season)?
  • How strong is the GM’s deal making ability (does he tend to over/under spend on player salaries versus their fair salaries)?

In doing so, we discovered that signing advantageous contracts is not enough to guarantee success in the NHL; winning GMs sign better contract deals than their peers, but also spend more in absolute terms. GMs who have constructed successful teams in 2019-2020 have spent close to all money available under the salary cap ($81.5M maximum), resulting in very deep teams. For example in Figure 8, we see that:

  • 8 of the top 10 teams in the 2019-2020 season spent all the money available under the salary cap.
  • 7 of the top 10 teams displayed both high spending on salaries combined with fair contracts (paying players what they should be paid).
Figure 8. Mapping of NHL teams (including end of season rank) based on GM decisions

Conclusion and discussion on possible future improvements

As we discussed in our short literature review, many have ventured into advanced player evaluation and forecasting of fair salaries for NHL players. We believe that we have made new contributions to the hockey analytics space through our novel player evaluation framework. More specifically:

  • Our entire analysis is based on the generation of player profiles based on expected contribution per unit of ice-time, rather than on player output (points produced or metrics like plus/minus). This allows us to compare and evaluate players fairly, irrespective of their ice-time, their team’s strength or luck.
  • We computed “clutch” factors for each player, allowing us to compare true offensive production to expected production (highlighting both luck and the ability to perform under pressure).
  • Our clustering of 600+ NHL players into clusters or archetypes (which takes into account offense, defense and physicality) introduces the notion of net contribution to the team (whether or not a player generates more offensive opportunities than he gives up to the opposing team). Such knowledge (as opposed to just evaluating offensive results) could allow GMs to make data driven claims like “the following players are offensive-threats but overall liabilities for our team”.
  • Our salary predictions are based on expected contribution metrics rather than on regressions using non-contribution-related metrics. Some statisticians have built predictive regression models trained on historical contract data; these reflect manager biases and are not necessarily a reflection of player performance. Our approach, on the other hand, is based solely on comparison of players with similar performance metrics.

We believe that no player clustering or General Manager evaluation projects of this kind have been released to date. The release of player profiles and clusters could be valuable to curious fans, but especially to players' agents and GMs, who are trying to understand the "true" value of their players. Moreover, the release of GM effectiveness metrics (the evaluation of over/under spending relative to player fair salaries) could be very valuable to franchise owners trying to assess whether or not their GMs are making sound investments of their money.

We hope to continue working on this project in the near future, with a particular focus on:

  • Dynamically collecting player data (to update player profiles and clusters monthly rather than annually).
  • Taking more game situation scenarios into account when producing expected contribution metrics (e.g., a shot taken in zone 1, in a power-play situation, during the final minutes of a tied game might in reality carry more weight than the same shot in the first period of the game).
  • Because one season might be a short period to assess player or GM performance, metrics and analyses developed in this paper could be observed over 3-5 year time spans to confirm early hypotheses regarding relative performance.

References

Ryder, A. (2012, March 15). Player Contribution. Hockey Analytics. http://hockeyanalytics.com/2004/03/player-contribution/.

Gramacy, R., Jensen, S., & Taddy, M. (2013). Estimating player contribution in hockey with regularized logistic regression. Journal of Quantitative Analysis in Sports, 9(1), 97–111. https://doi.org/10.1515/jqas-2012-0001

Derek Lanoue. (2015). “Does it pay to win the Stanley Cup?,” Working Papers 1502, University of Windsor, Department of Economics. http://web2.uwindsor.ca/economics/RePEc/wis/pdf/1502.pdf

Links to data sources

Sample NHL game-sheet: http://www.nhl.com/scores/htmlreports/20192020/PL020227.HTM

Kaggle NHL dataset: https://www.kaggle.com/martinellis/nhl-game-data

NHL Salary cap details: https://www.capfriendly.com/archive

Player by player salary details: https://www.capfriendly.com/teams/coyotes/cap-tracker/2020

 

Making of a Sole Survivor

| By Uygar Sozer |

 

Figure 1: Three tribes of Season 28 getting ready for a challenge

Motivation

As I was winding down after the finals week watching Survivor: Africa, which marked the 22nd season I have seen of the reality TV competition series Survivor, I was once again theorizing about who among the cast was going to make it to the top this time around. As a fan of this show, I get very invested in trying to figure out the logic that underlies the personal relationships and strategic moves in the game, all for the sole purpose of making it to the end and getting a shot at winning the $1 million cash prize.

The reason why I love this show so much is that the game environment exposes so much about our psyche, and as with anything involving human psychology, what happens in the game is always so unpredictable. My personal goal embarking on this project was to determine which of the laundry list of "good qualities" of a winner are most influential in snatching the title at the end.

All code for this project can be found at https://github.com/usozer/survivor-project.

I. Survivor US

Survivor is a popular show and it's easy to understand why. The rules of the game are very simple, it's a proven formula, and it keeps the fans of the show coming back for more. As with any other TV series finishing up its 40th season, there are only a handful of aspects of the game left that show producers haven't tried to tinker with over the course of the last 20 years to keep things fresh. But the base rules are the same.

Sixteen to twenty castaways are placed in a remote location and divided up into tribes. Every three days, the tribes compete in immunity challenges. The losing tribe goes to tribal council and votes somebody off. Around the mid-point, tribal divisions are dissolved and all castaways merge to live at one camp. Immunity challenges become individual, and the goal is to make it through all tribal councils to have a seat at the final 2 or final 3 on day 39. The last castaways standing then try to win the votes of the jury, which consists of the last 7 to 10 players voted out of the game. The finalist who gets the most jury votes wins the title of Sole Survivor and the $1 million cash prize that goes with it.

Figure 2: Final tribal council of Season 32

Over the course of the show, many player archetypes have emerged. Some are better at individual immunity challenges, and some place their trust in a small alliance they forge. Some rely on a social game, some choose to play strictly strategically. All the while, show producers are throwing every twist and turn they can think of at the contestants. Dodging all the obstacles to make it to the final is difficult, regardless of the season. Season 13 winner Yul Kwon explains it best: this game is about "maximizing the good luck, and mitigating the bad luck."

In this project I decided to solely focus on the finalists of the game, the two or three survivors standing at the end with different (or similar) styles of game play. I tried to answer these two central questions:

  1. What are the quantitative measures that can describe how a finalist played the game?
  2. At the final, which few of those measures can predict who the winner is going to be?

II. Descriptive Statistics, Feature Engineering

Data was obtained from the survivoR package in R, available at the public GitHub repo doehm/survivoR. The main dataset I used was the castaways dataset, which includes a row for each contestant and some basic information about them, such as age, placement in the season, and challenge wins. The other important dataset was voting_history, which details every vote cast at tribal councils throughout each season.

Feature selection was a crucial part of deciding what my approach was going to be. An exciting example from 2017 attempts to encode each castaway by “behavioral vectors” to predict the winner, and focuses on the entirety of the season cast.

Objective judgment of behavior and emotions is difficult if not impossible. I might think a castaway was particularly "lazy" and "boastful" in an episode, while another person could watch the same episode and come to the opposite conclusion. Also, the way castaways are depicted is highly dependent on the editing of the show: the portrayals of the eventual winner and of an early exit are ultimately decisions made in the post-production edit suite. If we were to interpret behavior through the lens that the show producers provide, our independent variables would most likely not be so independent.

To answer the two questions I outlined earlier, I decided to go with show-historical data (i.e. challenges won, votes cast, age, tribe composition etc.) rather than the subjective approach, and created variables as described in the next section.

Primary Features

Here are the primary features that I considered:

  • Age: Instead of using raw age, I took Z-scores of ages within each season. So it would be more accurately called ‘relative age within season’.
  • Immunity idols won (necklaces): This was one of the variables I was personally curious about, because there have been winners that were very dominant in the challenges and won lots of immunity idols, but many winners famously have won without getting a single immunity idol (one particular castaway managed it twice). “Necklace” is just a synonym for individual immunity idol here (as they are usually referred to as on the show).
Figure 3: Age and idol distribution

The age distribution is identical for finalists and jury, but the distribution of immunity idols suggests that most finalists won the idol at least once, whereas the majority of jury members never did. This type of model-free evidence might not mean much, but as a viewer it made sense to me: players who win challenges are generally more adept at the game, but finalists also participate in more challenges simply by virtue of making it through the full 39 days, so they have more opportunities to win than the jury. Age doesn't seem to be very influential here.

  • Appearance: Many previous contestants return to play in other seasons, and being a returning player might influence the jury's decision to award you the title, depending on whether they view it positively or not. Especially in seasons where half of the cast is new and half has played before, previous experience, both in survival and in social situations, could be another factor that affects the game in a major way.
  • Date of final tribal council: I had access to the filming dates of each season. The exact date of day 39 of each season, which is when the jury votes for a winner, could be used to control for seasonality, but I did not attempt that here.

Derived Features

  • Original tribe representation (prevtribe_jury): This feature looks at the importance of day 1 bonds; it is the percentage of jury members who were in the same tribe as the finalist. In season 1 of Survivor, after the tribe merger, the former Tagi tribe fully eliminated the opposing Pagong tribe to become the final 5 standing. This was the first instance (duh) of what fans now call "Pagonging". In a Pagonged season, all finalists will have the same value for this variable, since they come from the same tribe. But what happens when opposing tribe members make it to the end?

The next two features incorporate finalists' voting histories. Using the voting history table included in the survivoR package, I extracted a matrix of "vote vectors" for each jury member and finalist in each season.

Here is what these matrices look like for episodes 8 through 14 of Season 6: The Amazon, for the two finalists, Jenna and Matthew.

Table 1.1: Votes cast by the finalists in Season 6

We have the same table for the jury members as well.

Table 1.2: Votes cast by the jury in Season 6

The following features compare the finalists with the jury matrix in two different ways.

  • Jury similarity index (jury_simil): This variable quantifies how similarly the finalist voted compared to the jury members. The idea is that if the similarity index is high, the finalist was likely in alliances with the jury members or simply had aligned interests with them. How do clashing or matching voting histories affect the jury's decision? In calculation, jury_simil is the average Jaccard index between a finalist and each jury member (a small sketch of this calculation follows after this list).
  • Majority vote percentage (majorityvote): The idea for this variable came from the final tribal council of Season 33, when a finalist was asked by a jury member how many times she was part of the majority vote. My interpretation is that if your vote was with the majority most of the time, you likely had a good grasp of the game, command over tribe mates and successful alliance management. How does the jury value being the dominant force versus the underdog?
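The project itself works in R with the survivoR vote tables, but a small language-agnostic sketch of the average-Jaccard calculation, with made-up vote histories, looks like this:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of (episode, vote-target) pairs."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical vote histories: who each person voted for at each tribal council
finalist_votes = {(8, "Roger"), (9, "Dave"), (10, "Alex"), (11, "Christy")}
jury_votes = [
    {(8, "Roger"), (9, "Dave"), (10, "Heidi"), (11, "Christy")},
    {(8, "Jenna"), (9, "Dave"), (10, "Alex"), (11, "Butch")},
]

# jury_simil = average Jaccard index between the finalist and each jury member
jury_simil = sum(jaccard(finalist_votes, j) for j in jury_votes) / len(jury_votes)
print(round(jury_simil, 3))
```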

III. Clustering

Just for the fun of it, since I have all this normalized data now, I decided to see if I can come up with a cluster solution that could help me gain some insight as to how I can group these finalists together.

I used the k-means clustering algorithm and chose the number of clusters according to the highest pseudo-F statistic, which occurred at K = 4.
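The clustering itself was done in R; a rough Python equivalent of this selection step, using scikit-learn's Calinski-Harabasz score (which corresponds to the pseudo-F criterion) on placeholder features, might look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))        # placeholder for the normalized finalist features

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)   # pseudo-F / Calinski-Harabasz

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```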

Figure 4: Profiling the clusters

Here’s my attempt at profiling each of these clusters.

  1. This is a cluster of young finalists who came to the final collecting tons of individual immunities. Example: Kelly from Borneo, Fabio from Nicaragua
  2. These finalists were probably in the majority most of their time and as a result, voted very much like the jury members who were probably in the finalist’s alliance. Example: Amanda & Parvati from Micronesia, Rob from Redemption Island
  3. Players who didn’t win many individual immunities and relied on their alliances to carry them through. Example: Courtney from China, Sandra from Pearl Islands
  4. Older returning players who made tons of big moves against the majority and voted against most of the jury members. Example: Russell & Parvati from Heroes v Villains
Table 2.1: Summary statistics from the cluster solution

Surprisingly, or unsurprisingly, the real-life data says clusters 1 and 3 were more successful in their quest to win the show. It is interesting that these two clusters have the highest and the lowest mean number of individual immunities won, respectively.

Another interesting statistic is the "mean season number" for each cluster, which can serve as an estimate of which player type is more old school versus modern. Older returning players in cluster 4 are the "most recent" type, whereas we do not run into players like those in cluster 2 as much anymore.

One other immediate question that popped into my head was how often the 2 or 3 finalists of each season end up belonging to the same cluster.

Table 2.2: Final 2/3 composition by clusters

Out of 40 seasons of the show, only 9 seasons featured a final 2 or 3 of the same cluster. Neat! An indication that finalists usually do have different strategies from each other.

Just to test the validity of my solution, I did dimensionality reduction through PCA, plotted the PC scores, and colored the points by cluster. Think of these coordinates as representing everybody by a "distilled score" of 3 numbers, each representing a different aspect of their data. These scores are centered at 0, so 0 means average.

Figure 5: Visualizing the clusters
Figure 6: 3D visualization of the clusters

2D plots are kind of like looking at two different sides of a cube, and then the 3D plots offer another perspective. Clusters 3 and 4 (in blue and purple) are well-clustered, while 1 and 2 appear highly variable.

It could be interesting to include clusters we found in the predictive model, but for now I’ll trust that any information we glean from the cluster solution is already included in the raw data. Now, onto the exciting stuff.

IV. Modeling

All statistical learning models were selected and tuned through n-fold (leave-one-season-out) cross validation. Here was the algorithm: the model is trained on finalists from all seasons but one; the predicted probabilities of winning for the left-out season's finalists are obtained from the trained model, and the person with the highest probability in the final 2 or final 3 is declared the winner.
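The models in the project were fit in R (nnet, caret); purely as an illustration of the leave-one-season-out scheme, a Python sketch with placeholder data and an MLP standing in for the neural network could look like:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 6))                  # placeholder finalist features
y = np.tile([1, 0, 0], 30)                    # exactly one winner per season (placeholder)
season = np.repeat(np.arange(30), 3)          # 30 seasons, 3 finalists each

correct = 0
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=season):
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000).fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])[:, 1]
    predicted_winner = test_idx[np.argmax(probs)]   # highest win probability in the final
    correct += y[predicted_winner] == 1

print(f"correct winner predicted in {correct} of 30 held-out seasons")
```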

I considered various models but eventually landed on a neural network with logistic outputs. I ran into issues with nonlinear models like boosted trees and k-nearest neighbors where the model was yielding equal win probabilities and hence predicting multiple winners per season. Neural network predictions did not have the same issue. I used the nnet package to fit the model.

Now, what can we learn from it? We can look at relative feature importance to see which variables have the largest contribution to predictions. I'll use the caret package, which uses Recursive Feature Elimination (RFE) to determine the importance measures.

Table 3: Variable importance measure from the neural network model

And we can also use partial dependence plots (I use ALE plots here) to see the isolated effects of each variable.

Figure 7: Individual contribution of each variable to winning likelihood

It is slightly annoying to interpret the variables since the scales are normalized, but we can look at the curves to provide some explanations.

  • Age follows a roughly linear trajectory, meaning younger finalists are more likely to win the show.
  • majorityvote has a very interesting curve. Win probability is maximized at a little below the average and sharply declines as majorityvote approaches 1. The way I see it is that players who didn’t always vote with the majority are favored, but not to the extent where the percentage is extremely low (if a finalist never knew what was going on and always voted against the crowd, that’s viewed negatively.) But when the percentage is too high, close to 100%, then the finalist is probably viewed by the jury as cunning and devious which also works against the finalist’s favor.
  • Number of immunity idols (necklaces) was probably the most surprising variable: having few idols didn't have a positive effect (on the x_2 (necklaces) scale, the main effect on the y-axis is 0 at the lower end), but winning more idols actually hurt the finalists' chances.
  • Appearance didn’t appear to be one of the more important variables, and it is most likely heavily influenced by one particular data point for Season 22 when Rob “Boston Rob” Mariano won Survivor on his 4th try.

V. Bootstrap Estimates

Now that I have my model tuned, I can repeat this a bunch of times! Bootstrapping essentially means that I resampled the data points with replacement to re-fit the model many, many times, then aggregated the predictions from each model.

For each season, I generated 10000 predictions of who the winner is going to be, with a model trained on a resampling with replacement of the 39 other seasons. That is a whopping 400,000 neural networks fit! Thankfully I was able to set up this code to run on the MSiA servers overnight.
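A condensed sketch of that resampling loop, again in Python rather than the original R and with far fewer iterations than the 10,000 used per season in the project (all data below are placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 6))                  # placeholder finalist features
y = np.tile([1, 0, 0], 30)                    # one winner per season (placeholder)
season = np.repeat(np.arange(30), 3)

target = 0                                    # season whose winner we want to predict
test_mask = season == target
other = np.unique(season[~test_mask])

n_boot = 100
votes = np.zeros(test_mask.sum())
for _ in range(n_boot):
    sampled = rng.choice(other, size=len(other), replace=True)   # resample seasons
    idx = np.concatenate([np.where(season == s)[0] for s in sampled])
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000).fit(X[idx], y[idx])
    probs = clf.predict_proba(X[test_mask])[:, 1]
    votes[np.argmax(probs)] += 1              # finalist with the highest win probability

print(votes / n_boot)                         # predicted% for each finalist of the season
```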

Table 4.1: Highest prediction rates for actual winners
Table 4.2: Lowest prediction rates for actual winners

How should we interpret these results? The way I look at it is: since the predictions for each season are made using a model trained on the other 39, a high predicted% indicates that the finalist played "a textbook game". The model says Chris from Vanuatu or Ethan from Africa had game styles similar to those we have seen before, and previous contestants who played like they did usually won.

For Jeremy and Yul, the opposite holds. They were possibly in the finals alongside a textbook contestant, and the model saw Jeremy winning in only 450 out of 10,000 simulations of the season. Against all odds, these winners were able to claim the title in the end.

Another interesting table to look at is the non-winners with the highest predicted%.

Table 4.3: Highest (false) prediction rates for finalists who did not win

The model says these shoo-ins could not convince the jury. Kelly and Colby came to the end with lots of immunity idols won along the way, but they lost. For Spencer and Woo, it is hard to tell. Woo was likely favored because he was in the majority vote most of the time, and Spencer was relatively young at 23 when he played in Season 31: Cambodia – Second Chance, a returning player season with most contestants in their 30s and 40s.

Conclusions

This exercise was less about building the best predictive model to help me hit the jackpot at bets.com for season 41, and more about coming to understand various intricacies of gameplay and strategy within the realm of my favorite TV show. Normally, I don’t care much to interpret what a statistical model predicts for a historical datapoint, but in the case of this particular dataset, one datapoint is a 13-episode season with story arcs, big personalities and iconic challenges, so reviewing the output of my n-fold cross validation was actually enjoyable for once. I feel like I gained some understanding of what set some past winners apart from the rest, and of how each of these winners fit in (and did not fit in) with the others.

How well did I do on the two questions I set out to answer? Since there were no previous analysis efforts (as far as I could search on the Internet), feature engineering and selection was difficult. But the trial and error involved in things like choosing an appropriate distance metric to measure how similar a finalist’s voting history was to the jury’s taught me a lot about data integrity and the dangers of boiling an entire season’s worth of effort down to a single number. As far as trying to understand how each of my variables comes into play, I was not entirely thrilled with the rather low correct classification rate of my model, but it gave me some interesting insight into how age and immunity idols in particular affect finalists’ chances of winning.

Two big missing variables for me here are race and gender. In my personal opinion, demographics undoubtedly play a huge role in the personal relationships that the castaways form, just like in real life. I would love to repeat the analysis in the future with some demographic data as well to see if it improves the correct classification rate at all.

In any case, you can find me watching Survivor in my Engelhart apartment until I finish all 40 seasons.

References

Bertelsen, B. (2013, December 9). Validity Index Pseudo F for K-Means Clustering. Cross Validated. https://stats.stackexchange.com/questions/79097/validity-index-pseudo-f-for-k-means-clustering.

Brownlee, J. (2019, August 8). A Gentle Introduction to the Bootstrap Method. Machine Learning Mastery. https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/.

CBS. (2020, February 5). Survivor (Official Site). https://www.cbs.com/shows/survivor/.

Garbade, D. M. J. (2018, September 12). Understanding K-means Clustering in Machine Learning. Medium. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1.

scikit-learn. (n.d.). Feature selection – RFE. https://scikit-learn.org/stable/modules/feature_selection.html#rfe.

Singh, A. (2019, September 15). Dimensionality reduction and visualization using PCA(Principal Component Analysis). Medium. https://medium.com/@ashwin8april/dimensionality-reduction-and-visualization-using-pca-principal-component-analysis-8489b46c2ae0.

Everybody Eats! Understanding COVID-19 & Food Insecurity in the Evanston Community

| By Aditya “Adi” Tyagi, Xiaohan (Aria) Wang, Mengyang (Doris) Cao |

1. Introduction

The COVID-19 pandemic has led to a drastic increase in food insecurity. The U.S. Department of Agriculture (USDA) defines food insecurity as “a lack of consistent access to enough food for an active, healthy life.” It is measured via the following two questions posed to respondents:

  1. “We worried whether our food would run out before we got money to buy more.” Was that often true, sometimes true, or never true for your household in the last 12 months?
  2. “The food we bought just didn’t last, and we didn’t have money to get more.” Was that often true, sometimes true, or never true for your household in the last 12 months?

A response of “often true” or “sometimes true” to either question implies the presence of food insecurity.

The national food insecurity rate in April 2020 was predicted to be 10.1% following the historical trend. However, the actual national food insecurity rate reached 22.8% in April 2020, according to the COVID Impact Survey (Schanzenbach and Pitts, 2020). The food insecurity situation is even more severe for vulnerable populations across the country. Schanzenbach and Pitts (2020) reported that food insecurity rates in the U.S. tripled in April 2020 for families with children, as compared to the level from just a month earlier.

2. Project overview

This project was conducted to provide the City of Evanston and its many partner organizations with a greater understanding of the structural factors of community food insecurity as well as how the COVID-19 pandemic has affected food insecurity in Evanston.

Some of the contributions of our project are:

  • Help the City of Evanston COVID-19 Food Insecurity Task Force make more informed decisions about how food assistance is currently delivered in Evanston
  • Identify areas with high-impact opportunities for investment of tax and philanthropic dollars
  • Enhance partner collaboration and potentially optimize the use of local food supplies
  • Provide a reproducible framework for other city governments looking to use data analytics to drive change in their communities

In fact, our work was recognized by the Alliance to End Hunger, where we presented our findings as part of the Alliance’s Hunger Free Communities Summit 2020, which took place on November 18-19, 2020.

Thanks to Professor Karen Smilowitz, Professor of Industrial Engineering and Management Science at Northwestern University, we were introduced to this project led by Maura Shea, former VP of Innovation at Feeding America. We collaborated with the food security task force at the City of Evanston on this project.

Evanston is an affluent suburb of Chicago, located on the North Shore of Lake Michigan. It is famous for being the home of Northwestern University, as well as the headquarters of the National Merit Scholarship Corporation. It has a diverse but predominantly white population, a history of racial segregation, and progressive politics. Its population is generally older, with a strong student and youth presence due to Northwestern University. Recently, the Alderman for the 5th Ward, Robin Rue Simmons, has been leading a reparations program addressing slavery and redlining, with the objective of channeling up to $10M over 10 years from marijuana sales tax revenue into the program. On the surface, Evanston appears to be a low-risk area for food insecurity; however, it is suspected that there are pockets of the city where food insecurity is prevalent.

We assembled a constellation of datasets (proprietary and free/online). We then conducted data analysis, developed initial concepts for data visualization, drafted a collective impact toolkit framework, and provided recommendations on future data collection and management. Through this project, we aim to help the city and the task force better understand the needs of our food insecure neighbors and of food assistance providers, show how the system of food assistance works in the City of Evanston, and help each stakeholder participate in the charitable food system from a more informed position.

3. Methodology

With the help of our partners, we collected a number of different datasets from various sources. Below are the data that we used in this project:

  1. 2020 Food insecurity rates by census tracts and ZIP Codes in Cook County from Feeding America (Data: https://www.feedingamerica.org/research/map-the-meal-gap/by-county – 2018)
  2. Food capacity breakdown and information about distributions for the food providers in Evanston (Data: Proprietary from the City of Evanston)
  3. Measures of food access by census tracts in Evanston from the Economic Research Service (Data: https://www.ers.usda.gov/data-products/food-access-research-atlas/ – 2015)
  4. Free & Reduced Lunch program eligibility for schools in Evanston from Illinois State Board of Education (Data: https://www.isbe.net/Pages/Nutrition-Data-Analytics-Maps.aspx – 2020)
  5. Various demographic and geographical data by census tracts in Evanston (Data: PolicyMap’s https://www.policymap.com/maps)
  6. The monthly number of individuals/households receiving Supplemental Nutrition Assistance Program (SNAP) benefits by ZIP Codes in Cook County from the Greater Chicago Food Depository (July 2018 – June 2020)
  7. Proprietary data from each Producemobile distribution (2016-2018), a pantry partner of the Greater Chicago Food Depository (Data: Proprietary from https://www.cityofevanston.org/Home/Components/Calendar/Event/11993/250?curm=5&cury=2019)
  8. COVID Impact Survey data for Chicago-Naperville-Elgin, Illinois-Indiana-Wisconsin (Wozniak et al., 2020), supplied by the National Opinion Research Center at the University of Chicago.

Using these data, we analyzed food insecurity rates both in Cook County and in Evanston, and created prototypes for an interactive map, a Chatbot tool and a prediction model for food distributions, which the city could consider for further development in the future. Our suite of analytical tools involved:

  • Hypothesis testing (t-test) of differences in geo-characteristics between high food security and low food security areas.
  • Time Series Analysis (Holt-Winters Forecasting, decomposition) of various aspects of Producemobile’s operations
  • Inferential Modelling (Random Forests, Linear, Logistic Regression) to better understand the relationship between various background characteristics and food insecurity at various granularities and scopes.

The goal of each type of analysis was to better understand which background variables are associated with food insecurity at different temporal and geographical levels and at different granularities.

3.1. Data analysis

In carrying out the data analysis, our basic philosophy was “not to predict, but to infer.” We sought to first understand the factors affecting food insecurity in our community, rather than use them to predict food insecurity outright. This informed our choice of model and approach, in favor of simple and highly interpretable models.

Secondly, our guiding rationale behind modelling food insecurity in Evanston was to break down factors into two types:

  • Structural Factors: Structural factors are those that were present before the pandemic and stem from Evanston’s historic social and economic policies.
  • COVID-19 Impact Factors: COVID-19 impact factors are those that have come into focus in the Evanston community since the onset of the pandemic.

The rationale behind following this two-tier approach is that it gives decision makers a framework to understand food insecurity in their communities.

To understand structural factors, we limited our time scope to 2014-2018 data and built:

  • Cook County ZIP Code model
  • Evanston Census Tract model
  • Evanston Census Tract Hypothesis Testing

To understand the COVID-19 impact factors, we analyzed data from April to August 2020:

  • SNAP Benefit Distribution in Cook County at ZIP Code Level
  • ProduceMobile Food Distribution at Evanston location
  • COVID-19 Impact Data

3.1.1. Inference of structural food insecurity

The goal of digging into structural food insecurity is to identify the underlying geographical or demographic factors that may explain the variation in food insecurity across locations. With Feeding America’s 2018 data on the size of the food insecure population in each geographical location, we applied predictive models (multivariate linear regression and Random Forest) at (1) the ZIP Code level in Cook County and (2) the Census Tract level in Evanston. The first approach aimed to give us a complete view of the food insecurity situation in Cook County, while the second allowed us to identify the factors that differentiate food secure tracts from food insecure tracts in Evanston and distinguish Evanston from other cities in the county. We also ran independent hypothesis tests (two-sample unpaired t-tests) comparing the most and least food insecure zones of Evanston to identify risk factors.

A. Random Forest model on food insecurity population size by ZIP Code in Cook County

The key factors according to the Random Forest model are median household income, the percentage of families in poverty, life expectancy, the number of households receiving SNAP benefits, the percentage of households with any type of computer, and the number of households without a vehicle.
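For illustration, a Random Forest of this kind and its feature importances could be obtained roughly as follows; the file and column names are hypothetical stand-ins, not the project’s actual data.

```python
# Hedged sketch: fit a Random Forest on ZIP-Code-level data and inspect importances.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

zips = pd.read_csv("cook_county_zip_data.csv")        # hypothetical file
features = ["median_household_income", "pct_families_in_poverty", "life_expectancy",
            "households_receiving_snap", "pct_households_with_computer",
            "households_without_vehicle"]
X, y = zips[features], zips["food_insecure_population"]

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)
print(importances)
```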

As shown in Figure 1 below, a lower median household income or a higher percentage of families in poverty corresponds to a larger food insecure population in the ZIP Code area. These two variables can both be taken as indicators of the economic condition of the location. As the model results show, wealthier neighborhoods have relatively less severe food insecurity.

Figure 1. Median household income and percentage of families in poverty

According to Figure 2, areas with a life expectancy around 70 tend to have a larger food insecure community than other areas. For ZIP Codes with a life expectancy above 70, higher life expectancy corresponds to a smaller food insecure population.

Figure 2. Life Expectancy

From Figure 3, we can see that ZIP Codes where more households have access to computers and a vehicle have relatively fewer people experiencing food insecurity. These two factors can be interpreted as measures of people’s access to food. Households with vehicles are able to reach grocery stores that are farther away, and therefore have more choices of where and how to get food. Households with computers have more access to information and opportunities: if they are food insecure or at risk of becoming food insecure, they can more easily learn about the support provided by local organizations, making it easier for them to escape food insecurity.

Figure 3. Computer access

B. Multivariate linear regression model on food insecurity population size by Census Tract in Evanston

To understand the factors affecting food insecurity in Evanston in a simple and intuitive way, we built a multiple regression model. Each observation represented a census tract in Evanston. The dependent variable was the number of food insecure persons for each census tract in 2018. Factors covered included demographics, property values, etc. from the pre-pandemic era. The final dataset was formed by joining the PolicyMap dataset and the Feeding America food insecurity data on census tract. Mean/Median value imputation was carried out to address missing values.
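As a minimal sketch of how such a tract-level regression could be set up (hypothetical file and column names; statsmodels stands in for whatever tooling the team actually used):

```python
# Hedged sketch: join tract-level datasets, impute missing values with the median,
# and fit a multiple linear regression. Names below are illustrative placeholders.
import pandas as pd
import statsmodels.formula.api as smf

policy = pd.read_csv("policymap_evanston_tracts.csv")       # hypothetical file
feeding = pd.read_csv("feeding_america_tracts.csv")         # hypothetical file
tracts = policy.merge(feeding, on="census_tract")
tracts = tracts.fillna(tracts.median(numeric_only=True))    # median imputation

model = smf.ols("food_insecure_persons ~ median_age + social_vulnerability_index"
                " + theil_segregation_index", data=tracts).fit()
print(model.summary())
```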

Tracts with larger numbers of food insecure persons have:

  • A younger median age
  • A higher level of social vulnerability
    • The CDC Social Vulnerability Index measures a community’s resilience (at the census tract level) when confronted by external stresses on human health, stresses such as natural or human-caused disasters, or disease outbreaks.
  • A lower level of racial segregation
    • The Theil Racial Segregation Index measures a community’s level of racial segregation. Higher index values imply higher levels of racial segregation.

For further details, the following variables feed into the CDC Social Vulnerability Index:

  • Below poverty
  • Unemployed
  • Income
  • No High School Diploma
  • Aged 65 or Older
  • Aged 17 or Younger
  • Civilian with a Disability
  • Single-Parent Households
  • Minority
  • Speaks English “Less than Well”
  • Multi-Unit Structures
  • Mobile Homes
  • Crowding
  • No Vehicle
  • Group Quarters

Figure 4 below shows the 15 component factors of the Social Vulnerability Index grouped into four themes:

Figure 4. Component factors of the Social Vulnerability Index

C. Two-sample unpaired t-test on the most vs. least food secure tracts in Evanston

To provide another perspective on food insecurity at the census tract level in Evanston, we applied a hypothesis testing approach. We compared the five most food secure tracts (hereafter “high food security tracts”) with the five least food secure tracts (hereafter “low food security tracts”) in Evanston across a range of factors. In Figure 5 below, we have depicted this division by coloring high food security tracts GREEN and low food security tracts RED.

We used a two-sample (unpaired) t-test since the population standard deviations were unknown. Most of the data came from PolicyMap and covered 2014 to 2018. For most hypotheses, we used a significance level of 10%; for a few, we used a significance level of 15%.
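A single comparison of this kind might look like the following sketch (Welch’s version of the two-sample t-test is used here as a common default; the file and column names are hypothetical):

```python
# Hedged sketch: compare one variable between high and low food security tracts.
import pandas as pd
from scipy import stats

tracts = pd.read_csv("evanston_tracts.csv")            # hypothetical file
high = tracts.loc[tracts["food_security"] == "high", "median_household_income"]
low = tracts.loc[tracts["food_security"] == "low", "median_household_income"]

t_stat, p_value = stats.ttest_ind(high, low, equal_var=False)   # Welch's t-test
alpha = 0.10
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, reject H0 at 10%: {p_value < alpha}")
```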

Compared to high food security tracts, low food security tracts in Evanston have:

  • a lower household income
  • a lower median age
  • higher social vulnerability
  • a lower median home loan amount
  • a lower median home value
  • a larger nonwhite population in general
  • a greater non-English-speaking proportion
  • a larger Hispanic population
  • less racial segregation
  • lower proportions of people with bachelor’s degrees
  • lower computer access
  • lower average travel times to work

In addition to finding differences, we also noted similarities between the two sets. In statistical testing terminology, these are factors for which we failed to reject the null hypothesis (i.e., that the two sets of tracts are similar with respect to the variable concerned). High food security and low food security tracts in Evanston are similar when it comes to:

  • proportion of families in poverty
  • number of jobs
  • number of housing units
  • life expectancy
  • proportion of students in public school
  • median leverage ratios
  • proportion of disabled people
  • average household size
  • proportion of men
Figure 5. Evanston tracts classified by food security level

3.1.2. Impact of COVID-19 on food insecurity

Besides learning about the underlying factors behind long-term food insecurity, we also hoped to capture the short-term impact of COVID-19 on food insecurity. Collecting data related to COVID-19’s impact proved difficult: the desired dataset would be updated on a regular basis and at a granular geographical level, whereas most public, reliable sources are either collected only once a year or capture food insecurity for Cook County as a whole. However, thanks to the Greater Chicago Food Depository and Producemobile, we got access to monthly data on individuals/households receiving SNAP benefits by ZIP Code in Cook County and to data on each Producemobile distribution. By analyzing the time trends in these data, we aimed to get an idea of the food insecurity situation during the COVID-19 pandemic. To understand the food insecure population during the pandemic, we also did predictive modeling (multivariate logistic regression) using data from the COVID Impact Survey.

A. The population receiving SNAP benefits by ZIP Code in Cook County over time

As shown in Figures 6, 7, and 8 below, the average number of households and individuals receiving SNAP benefits and the average allocated funding follow similar trends across ZIP Codes in Cook County: all show a sharp, sudden increase beginning in April 2020, reflecting the closure of businesses under social isolation measures and the resulting layoffs.

Figure 6. The average number of households receiving SNAP benefits
Figure 7. The average number of individuals receiving SNAP benefits
Figure 8. The average funding for SNAP benefits

The trend also shows up in Evanston ZIP codes for the number of households/individuals receiving SNAP and the allocated funding (Figure 9). This indicates that COVID-19 has led to more people/households receiving SNAP benefits and more people facing food insecurity.

Figure 9. The number of households receiving SNAP benefits (2018-2020)

Compared with the level in March 2020, the number of individuals receiving SNAP in April 2020 increased by 7% on average across ZIP Codes in Cook County. Among the ZIP Codes in Evanston, the highest increase was 13.38% in 60203, and the lowest was 5.26% in 60204. This is depicted in Figure 10 below.

Figure 10. Percentage point change in number of individuals receiving SNAP benefits by ZIP Code (2020)

The data show a large increase in the number of people in need of SNAP due to the pandemic, and there are potentially even more people experiencing food insecurity or at risk of it. SNAP does not cover households’ full monthly food needs, so even among people supported by SNAP there is a gap between their households’ food purchasing needs and the amount they actually receive. There is also a gap between the households that are food insecure or at risk of food insecurity and the households that actually receive SNAP benefits. Unfortunately, there are currently no data on how wide these gaps are or whether they widened or narrowed due to COVID-19. Further work is needed to understand the nature of these “SNAP gaps” and how city governments may best identify and serve the households affected.

B. The population served by Producemobile distributions over time

According to the data provided by Producemobile, there was a drastic increase in the number of participants at distributions in April and June 2020. Before the pandemic, Producemobile distributed food once a month, but since April 2020, once social distancing measures were put in place, the organization has distributed food to people in need twice per month. Before April 2020 (except for the period between March and September 2019), the number of guests signed in at each monthly distribution was relatively stable, between 200 and 300. The increase in the unemployment rate has led more people to turn to Producemobile distributions for food assistance.

Similar to the gap in needs left unfulfilled by the SNAP program, distributions by food providers such as Producemobile could not serve all the needs of the entire food insecure population. Further studies are necessary to truly understand the actual needs of our food insecure neighbors across the city of Evanston, beyond what is provided by Producemobile. However, the data, visualized in Figure 11, allow us to take a peek into the situation.

Figure 11. The number of guests signed for Producemobile distributions

C. COVID-19 Impact Model on Food Insecurity

To get an idea of the risk factors associated with food insecurity during the volatile COVID-19 pandemic timeframe, we used the latest data from the COVID Impact Survey. The survey was conducted nationally in three weekly waves during the height of the pandemic in the United States. It is managed by the National Opinion Research Center (NORC) at the University of Chicago and sponsored by the Data Foundation. NORC also produces representative samples for various metropolitan areas in the United States, one of which is the Chicago MSA (metropolitan statistical area). The Chicago MSA covers the area informally known as Chicagoland and includes Evanston within its boundaries. The survey asked questions pertaining to economics, social life, physical health, healthcare, demographics, and food insecurity. Though we recognized that the dynamics of food insecurity may differ between Chicago and Evanston, this was the best available approach for understanding risk factors during the pandemic.

Our basic approach aims to infer the relevant risk factors and their effect on food insecurity. The tool we use is a logistic regression model of the probability of a “food insecure response” from a survey participant, built using a backward feature elimination process (a minimal code sketch of this procedure appears after the lists below). The following risk factors associated with food insecurity were identified:

Conclusive Risk Factors:

  • Not applying/receiving Supplemental Social Security in the past seven days
  • Not being covered by a health insurance policy from your employer/union
  • Not being covered by Medicaid, Medical Assistance/other form of government assistance plans for low-incomes/disability
  • Not being covered by any health insurance plan at all
  • Being elderly (age bracket 65-74+)
  • Being Black (non-Hispanic), Hispanic, or another non-Hispanic minority race/ethnicity
  • An annual household income between $10,000 and $20,000 (i.e., low household income)
  • Having children in the household aged 6-12

Risk Factors where the researchers suspect a strong presence of lurking/confounding variables:

There were some factors where we suspected the presence of lurking/confounding variables. Confounding variables are essentially hidden (i.e., lurking) variables that simultaneously affect two other observable variables, leading researchers to the spurious conclusion that the two observable variables are related. For example, in 2013, researchers noted that countries with higher per capita chocolate consumption won more Nobel prizes (depicted in Figure 12 below). That conclusion is fallacious: a competing explanation is that countries with higher chocolate consumption are in general wealthier and hence able to spend more on the advanced scientific research and development that is often necessary to win Nobels. We suspect a similar dynamic in some of our findings, listed after Figure 12.

Figure 12. Nobel Laureates per 10 Million Population vs. Chocolate Consumption

  • Those who communicate electronically with friends/family a few times a week have a lower risk of food insecurity compared to those who communicate with friends/family every day (lurking variable: unemployment)
  • A higher number of hours worked per week prior to the start of COVID-19 (lurking variable: working multiple jobs)
  • Increased expectation of being employed 30 days from now (implicit: present unemployment)
  • Increased expectation of being employed 3 months from now (implicit: present unemployment)
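For reference, here is the minimal sketch of a backward-elimination loop around a logistic regression mentioned in the modeling description above. The response and predictor names are placeholders, and the p-value threshold is illustrative; the team’s exact criteria are not specified in this write-up.

```python
# Hedged sketch: backward feature elimination with a logistic regression.
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, response: str, alpha: float = 0.10):
    features = [c for c in df.columns if c != response]
    while features:
        X = sm.add_constant(df[features])
        fit = sm.Logit(df[response], X).fit(disp=0)
        worst = fit.pvalues.drop("const").idxmax()      # least significant predictor
        if fit.pvalues[worst] <= alpha:
            break                                        # everything left is significant
        features.remove(worst)
    return features

# Hypothetical usage (assumes numeric / dummy-coded survey columns):
# survey = pd.read_csv("covid_impact_chicago_msa.csv")
# kept = backward_eliminate(survey, response="food_insecure")
```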

3.2. Interactive map

We also built an interactive map prototype with multiple layers using Kepler.gl. The objective of this tool is to make food assistance information more visible. Its target audiences are food insecure individuals within the Evanston community, food providers, and the City of Evanston. From the map, neighbors in need and service providers can learn:

  • What food services are available and where
  • What additional capacity a provider might have
  • Where there are gaps in service compared to level of need and what additional investments might be needed

The data used for the map toolkit are mainly drawn from the Evanston Food Resources, Meals Capacity Breakdown, and Free & Reduced Lunch files.

There are three layers in the current map (the map can be expanded to include additional functionality if necessary):

  • The first layer (Figures 13,14) shows distribution of food providers in Evanston.
    • Each dot on the map represents a provider, while the color of the dot and the color of its outline represent, respectively, the number of meals currently being provided at that location per week and the maximum meal capacity. This helps identify which providers have relatively higher or lower resource utilization and could potentially provide more if needed.
    • Clicking a dot opens an information box with more detailed information about the provider (contact info, type of food provided, providing frequency, etc.).
    • Another functionality in this layer is the “Geocoder.” After a user types an address into the search box at the top right corner, a red icon appears on the map representing the user’s location. This gives the user a sense of the relative distance between their location and each food provider and supports better informed decisions when choosing the provider that best serves their needs.
Figure 13. Detailed Provider information on demand
Figure 14. Food Availability at each provider
  • The second layer (Figure 15) shows existing connections among providers. Each dot on the plot represents the location of a provider. If two dots are connected by a line, then it indicates that those two providers are currently connected.
  • The third layer (Figure 16) shows the Free & Reduced Lunch (FRL) eligibility rate for each school in Evanston’s District 65 (D65). The goal is to use the FRL eligibility rate as an indicator of the food insecurity rate in the surrounding neighborhoods and to help understand whether the needs of food insecure people are being met.
    • Each dot on the map represents an elementary school or a high school in Evanston. The color of the dot indicates the FRL eligibility rate; darker colors correspond to higher rates.
Figure 15: Existing Connections between Evanston food providers
    • The southern part of Evanston (the Dawes and Oakton school areas) has a relatively higher FRL eligibility rate than other regions.
Figure 16: Free & Reduced Lunch Eligibility for each Evanston school

 

3.3. Chatbot

Besides the interactive maps, a chatbot tool (Figure 17) has been developed in Azure QnA Maker so that our neighbors in need can receive recommendations based on their needs. It follows a simple decision tree structure (Figure 19). Users can start the chatbot service by typing “Hi” in the text box. A series of questions is then asked, covering:

  • Information about SNAP application
  • Preference for grocery items or prepared meals
  • Identification of need and priority for delivery

One question is asked at a time, with options that users can click on. Eventually an information box (Figure 18) shows all the details on the food providers that best match the user’s needs.

Right now, English is the only language supported by the chatbot service, but more languages can easily be added if needed. One potential future functionality is an “alert” function that would allow service providers to contact each other when they have extra resources to share.

Figure 17. Chatbot in operation
Figure 18: Example of detailed information provided by chatbot on demand
Figure 19. Decision tree structure

3.4. Producemobile: A Mini-Case Study

A brief spotlight on the Evanston Producemobile operation is presented in this section. This is an example of how data mindfulness can lead to effective community responses to food insecurity.

3.4.1. Organization Description

Producemobile is a collaboration between the Interfaith Action Committee of Evanston and the Greater Chicago Food Depository (GCFD). It operates at two locations in the city. The food truck (fresh fruits, vegetables, a selection of meats, and milk/dairy products) is provided by the GCFD, while the volunteers are provided by the Interfaith Action Committee. Distributions take place on the second Tuesday of each month.

3.4.2. Data Collection & Management Best Practices

At each Producemobile operation, there is a rigorous and disciplined approach to data management. For instance, the variables in Table 1 are conscientiously recorded at each distribution.

Table 1. Records of food distribution


3.4.3. The Operational Problem & Present Solution

The key problem faced by Producemobile was a planning one: how much food should be brought to a given distribution location to minimize excess and shortages? Answering this requires prior knowledge of how many individuals will show up to a particular Producemobile food distribution, a classic problem faced by many retailers.

To address this problem, Shawn Iles, a veteran volunteer in charge of the operation, essentially uses a combination of guesswork and intuition built over a long time working at the site. However, Shawn recognizes that there are numerous inefficiencies in this approach:

  • For one, the operation is highly dependent on him and his intuition.
  • Secondly, various trends that are beyond human detection are left unexplored.
    • This is especially problematic in the volatile COVID-19 era.

3.4.4. A Potential Data-Driven Solution

One of the big advantages of Producemobile is the wealth of data it has collected about each distribution from 2016 to the present. From a data management perspective, two practices followed by Producemobile are key:

  • Consistently formatted and disciplined data collection at each distribution event
  • Data storage in an easily accessible location (For Producemobile, this took the form of simple Excel sheets)

With marginal effort, this procedure allows a data-driven solution to a classic business problem: forecasting via time series analysis. Through simple time-series methods, we can detect the following in the data:

  • Trend over time
  • Seasonal effects
  • Latent cyclic behavior

Using this knowledge, forecasting the number of attendees at a particular Producemobile distribution becomes possible. This advance knowledge allows Producemobile to better serve its community by avoiding shortages and excesses.

However, the real power of disciplined data collection and management lies beyond solely forecasting demand at a particular location: a virtuous cycle occurs that simultaneously improves Producemobile’s forecasting models and also enhances its ability to better serve its constituents. With each data collection, model parameters can be updated based on the divergence of their forecast from the actual number of attendees that arrived at that distribution. Over time, the forecasting model becomes stronger and so does Producemobile’s ability to meet demand at its location. Such a scientifically sound approach also makes Producemobile’s case stronger when requesting food from GCFD or funding from other agencies. The approach can also be applied to other aspects of Producemobile’s operations such as requisitioning volunteers and so on.

3.4.5 Forecasting in Action

To illustrate the actionability of the above recommendation, our team at Everybody Eats built a time-series-based forecasting model. The data consisted of monthly measurements from 2016 through August 2020. The model used was a Holt-Winters model, with seasonal decomposition and linear interpolation used to impute null values.
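A minimal sketch of such a Holt-Winters forecast, using statsmodels and a hypothetical monthly attendance file (the report does not specify the exact model settings), might look like this:

```python
# Hedged sketch: Holt-Winters forecast of monthly attendance with simple interpolation.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

attendance = pd.read_csv("producemobile_attendance.csv",     # hypothetical file
                         index_col="month", parse_dates=True)["guests"]
attendance = attendance.asfreq("MS").interpolate()           # fill missing months linearly

model = ExponentialSmoothing(attendance, trend="add",
                             seasonal="add", seasonal_periods=12).fit()
print(model.forecast(4))                                     # e.g. September-December 2020
```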

We analyzed the following three key aspects of Producemobile’s operations:

  • Number of individuals served at each distribution
  • Amount of food received from Greater Chicago Food Depository (GCFD) for each distribution
  • Number of volunteers expected to arrive at each distribution

Here are some key insights:

  • For number of individuals served at each distribution:
    • Tends to peak during: March, May, September, November
    • Tends to fall during: January, February, April, October, December
  • For the amount of food received from GCFD for each distribution:
    • Tends to peak during: January, March, May, July, September, November
    • Tends to fall during: February, April, June, October, December
  • For the number of volunteers at each distribution:
    • Tends to peak during: January, February, March, April, September, November
    • Tends to fall during: May, June, July, August, October, December

Figures 20, 21, and 22 below visualize our predictions for the above three metrics through the end of 2020:

Figure 20. Prediction for number of individuals
Figure 21. Prediction for the amount of food received from GCFD
Figure 22. Prediction for the number of volunteers

Some Key Takeaways from the above forecasts:

  • The number of individuals is expected to rise till the end of 2020 (December)
  • The amount of food received from GCFD is expected to stay level till the end of 2020
  • The number of volunteers attending is expected to fall till the end of 2020.

The key implication of the above is that an expected shortage of volunteers coincides with an increase in the number of individuals attending Producemobile’s distributions and a leveling off of the food arriving from GCFD.

Finally, we can have a look at the decomposition (Figures 23, 24, 25) of the three time series from 2016 to the present (August 2020). This allows us to get a sense of the trend and the seasonality of each time series.
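The decompositions themselves can be produced with a classical seasonal decomposition; a hedged sketch using the same hypothetical attendance series as above follows.

```python
# Hedged sketch: classical additive decomposition of a monthly series into
# trend, seasonal, and residual components (file and column names are placeholders).
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

attendance = pd.read_csv("producemobile_attendance.csv",     # hypothetical file
                         index_col="month", parse_dates=True)["guests"]
attendance = attendance.asfreq("MS").interpolate()

decomposition = seasonal_decompose(attendance, model="additive", period=12)
decomposition.plot()            # trend, seasonal, and residual panels
```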

1. Number of Individuals Arriving at distributions from 2016-August 2020

Figure 23: Time Series Decomposition for Number of Individuals Arriving at Food Distributions (2016-2020)

2. Amount of food received from GCFD at each distribution from 2016 – August 2020

Figure 24: Time Series Decomposition for Amount of Food received from GCFD at each distribution (2016-2020)

3. Number of volunteers attending at each distribution from 2016 – August 2020

Figure 25: Time Series Decomposition for Number of Volunteers at each food distribution (2016-2020)

4. Potential Improvements & Further Steps

4.1. Food insecurity survey for Evanston residents

One of the major obstacles in this project was collecting data that can help us understand the food insecurity situation in a dynamic and granular way. As mentioned earlier, most public, reliable datasets are either updated very infrequently (the US Census every 10 years) or, like the COVID Impact Survey, not collected at an appropriate level of granularity (ZIP Code or Census Tract). Data collected on a regular basis at granular levels would help the city and organizations understand food insecurity in Evanston dynamically, which is key to providing more efficient and prompt assistance to our food insecure neighbors. To achieve this goal, we suggest that the city conduct a food insecurity survey among Evanston residents. To design this survey, we used a mix of questions from the COVID Impact Survey and our own Evanston-specific questions. Through this data collection and analysis, we hope the city can better understand the food insecure community, predict future needs more accurately, and be fully prepared. A sample survey draft is provided here.

4.2. Distribution cost survey for food pantries and organizations

During one of our meetings with the Evanston COVID-19 Task Force, we learned that it is important for the city to figure out how much funding should be allocated to each family. Therefore, we designed a cost analysis survey for food pantries and organizations. This survey aimed to help the city understand the distribution cost at each food provider and make more data-driven decisions on the funding.

4.3. Recommendations for data management

We identified the following areas where significant improvement can occur in the City of Evanston’s data strategy:

  • Lack of data-driven awareness: We learned that food pantries and the City of Evanston do not collect demographic or geographical information about the community related to food insecurity.
  • Extensive information siloing: Organizations do not work collaboratively on food distributions, which makes it difficult for the city and organizations to:
    • Remain informed of the fast-changing situation during this pandemic
    • Decide how to deliver assistance in response to increasing food insecurity.

Organizations may have different rules for collecting personal and sensitive information from the food insecure community. However, we think that key organizations and food providers should aim to build a connected, shared database to better serve the community. We therefore listed some of our recommendations for data management for the city and organizations. These recommendations are accessible here.

5. Conclusion

By using data analytics tools, we looked into the underlying factors that might determine structural food insecurity or lead to higher risks of food insecurity during the pandemic in Evanston. We also created prototypes for a toolkit that would be applicable and helpful in delivering better food assistance to our food insecure neighbors. Despite the constraints of limited publicly available data, we set examples of how organizations can apply analytics to make more informed decisions.

References

Wozniak, A., Willey, J., Benz, J., & Hart, N. (2020). COVID Impact Survey: Version # [dataset]. Chicago, IL: National Opinion Research Center.

Economic Research Service (ERS), U.S. Department of Agriculture (USDA). Food Access Research Atlas, https://www.ers.usda.gov/data-products/food-access-research-atlas/

PolicyMap. https://www.policymap.com/maps

Schanzenbach, D. W., & Pitts, A. (2020). Estimates of food insecurity during the COVID-19 crisis: Results from the COVID Impact Survey, Week 1 (April 20–26, 2020). Institute for Policy Research Rapid Research Report. https://www.ipr.northwestern.edu/news/2020/food-insecurity-triples-for-families-during-covid.html

Acknowledgments:

We would like to thank Maura Shea, former VP of Innovation at Feeding America, and Prof. Karen Smilowitz, James N. and Margie M. Krebs Professor in IEMS and Co-Director of the Center for Engineering and Health. Their domain knowledge, advice, and mentorship were invaluable in bringing this project to fruition. We also would like to thank Rita Bailey and Shawn Iles for extending their cooperation to us during this project; they were our contacts at Producemobile and graciously provided us with the relevant data. Finally, we would like to acknowledge the help and support of our research facilitator, Dr. Borchuluun Yadamsuren, who patiently and carefully read our drafts and was instrumental in producing a high-quality final output.

 

 

Navigating the Ranking Forest: How Analytics Helps Understand University Ranking Outcomes

|by Aditya “Adi” Tyagi and Aditya Gudal|

1. Introduction

When you first considered going to graduate or undergraduate school, you were no doubt quickly made aware of each school’s ranking. Every school competes to be ranked higher across ranking systems (QS World University Rankings, Times Higher Education, and U.S. News) through marketing and advertising of courses, student recruitment and diversity, and faculty engagement. In fact, the screenshot above (from our very own MSiA program website) demonstrates how prominently academic programs advertise their rankings in order to shape student perceptions. Indeed, rankings have become an indelible part of the modern educational landscape.

The list of rankings is remarkably diverse. Each ranking may prioritize different measures of excellence, and this, in turn, may produce different ‘top schools’. For example, here is a sampling of some of the rankings produced by US News & World Report:

  • Top National Universities
  • Top Global Universities
  • Top A+ Schools for B Students
  • Top Schools with the best “First Year Experiences”
  • Top Schools with the best Co-ops/Internships
  • Top schools with the best Study Abroad
  • Top Schools with the most Fraternity/Sorority Members
  • Top Schools with the most International Students

As you can see, each ranking emphasizes a different aspect of a student’s higher education experience. A major benefit is that students are free to pick a university based on the aspect they find most valuable. However, the diversity of available rankings also has consequences for universities: each university must decide exactly which institutional aspects to prioritize based on its educational mission. This is a decision problem that is amenable to data analytics.

2. Motivation

Since each ranking has its own methodology, it is often unclear, from an institutional perspective, which aspects one should focus on to maximize impact. For example, some metrics emphasize a university’s relationship with industry; others value an international outlook, and yet others prioritize research output. The result is a broad range of metrics with plenty of tradeoffs that each university must navigate. For example, many universities struggle to improve their international outlook because they emphasize domestic research, generally due to territorial competition within their own countries.

At Northwestern, the Office of Institutional Research (OIR) approached the MSiA program to analyze various ranking systems and explore discrepancies between them. Over the Winter Quarter of 2020, we collaborated with OIR on one of these discrepancies as a case study problem that Northwestern faces. We examined three prominent rankings, commonly referred to as the “Big Three”: the Times Higher Education (THE) ranking, the QS World University Rankings, and US News & World Report (USNWR). Some would also include the Academic Ranking of World Universities (ARWU), which we did not analyze in this study. The main goal of this project was to explore ways to improve ranking outcomes, with a focus on the Times Higher Education ranking.

3. Approach Overview

To understand the core of the potential problems and discrepancies among the various ranking systems, we categorized the contributing factors into two types: endogenous and exogenous. Exogenous factors are inherent to the ranking system itself, while endogenous factors can be controlled by the institution, such as Northwestern University. We placed a particular emphasis on Northwestern’s international profile, under the assumption that it could be a root cause of the university’s ranking position. The following two research questions guided this study:

  1. Do elite schools on the Times Higher Ed Ranking benefit from higher international collaboration rates?
  2. What effect (if any) does THE’s European origins have on the ranking outcome?

Having established the root cause of the ranking outcome, we aimed to provide recommendations to improve Northwestern’s ranking, with respect to endogenous factors.

During our nine-month study, we analyzed datasets from the following sources:

  1. Internal client provided data (THE DataPoints): THE DataPoints is a data product provided by THE to institutions that suggests areas of improvement and other analytics.
  2. Elsevier’s SCOPUS: SCOPUS is a bibliometric data product from Elsevier that allows researchers to get a sense of bibliographic information about their school (co-authorship rates, university collaboration on academic research, etc.).

4. Literature Review

A) General Overview

According to Bookstein et al. (2010), the THE rankings are unstable. For example, 95% of schools remain in the top 200 but experience drastic position changes from year to year, and the bottom two thirds of schools experience twice as many position changes as the top one third. Some of the reasons behind this instability are:

  • Power Law Distributions: Key components of the THE, such as peer review, employer review, and citations per faculty, follow power law distributions. Power law distributions describe phenomena where a small number of items clustered at the top (or bottom) of a distribution takes up 95% of the resources. In other words, small occurrences are common, while large occurrences are rare. A classic example is the distribution of income. Mathematically, one may express a power law as Y = kX^a, where k and a are constants determining the distribution.
  • Definition Changes: THE Ranking changes its definitions of how variables are measured over time. For example, when THE decided to measure student:staff ratio differently, the University of Copenhagen student to staff ratio went from 51 to 100. This gave the University of Copenhagen a 43 rank gain in a single year! It is unrealistic to think that the University of Copenhagen’s student to staff ratio actually changed so drastically in a single year; the ranking gain is most likely due to THE definition changes.
  • Sample Survey Changes: THE Rankings also include a peer review component. This peer review sample changes from year to year. For example, peer review sample size changed from 6534 in 2008 to 9836 in 2009.

The above-mentioned problems imply that there are certain random and exogenous factors in Northwestern’s ranking which are beyond the university’s control. In addition, the THE rankings are unstable, and an institution’s instability depends on its current position in the ranking: the higher up a given institution is on the THE ranking, the lower the instability.

Dobrota et al. (2016) investigated the ways to stabilize the QS World University Ranking system using the composite I-Distance Indicator method. A key implication of this method is that it can be used to generate new weights for the THE ranking system to make the ranking more stable to year to year fluctuations. This method allows researchers to evaluate a certain university’s ranking position independently of the entire system’s instability.

According to Bornmann & Anegón (2013), evaluating and quantifying research outcomes are crucial for universities. Citation impact, research reputation, and international outlook are particularly important. The authors proposed a key metric to quantify research impact using the percentile of papers cited by others as an excellence rate for research output. This metric is applied to a single paper and is defined as “the percentile rank of a paper when ordered in descending order of citations”. The authors suggested that the reference set be compiled to “include all papers published in the same field in the same year”. A natural way to compare institutions’ research impact is then to compare their respective excellence rates. Bornmann & Anegón analyzed the following two databases in their study:

  • SCImago (https://www.scimagojr.com/): provides information, such as number of publications, excellence rate, proportion of international collaboration.
  • Leiden Rankings (https://www.leidenranking.com/) : provides key metrics on the number of publications, mean citation score, mean normalized citation score, and excellence rate.

Daraio & Bonaccorsi (2017) provide a novel perspective on indicators based on their variability over the years. In the current system, most indicators are based on content, aggregation, methodology, and implications. The authors concluded that the resources available to each university vary by geography and by government funding and initiatives, and that the granularity of analysis ranges from territorial to institutional to disciplinary. Based on the case of European universities, they introduced a new evaluation approach with an articulated ontology-based data integration model. This novel approach permits the construction of new indicators.

B) Case Study Overview: Case Western

Bernal (2017) identified three sources that were holding Case Western Reserve University’s ranking down: institution name variations (e.g., Univ. of Northwestern vs. Northwestern University), faculty name and affiliation variations (e.g., J. Doe vs. Jon Doe vs. John Doe), and faculty impact.

Bernal used a bibliometrics approach to improve the ranking of the university. Bibliometrics is the use of statistical methods to analyze books, articles, and other publications; it uses both quantitative and qualitative indicators (peer review, funding received, number of patents, awards granted, etc.) to assess the quality and impact of research. Based on bibliometric analysis, Bernal increased the number of attributed publications from 89,000 to 110,000 in the following year. The author managed to improve the ranking of the university by identifying and fixing the core problems of variations in author names (e.g., J. Doe vs. Doe, John) and variations in affiliation names (e.g., Northwestern University vs. NW Univ.; Univ. of Calif., Berkeley vs. Berkeley University; etc.). These two issues were resolved by exhaustively enumerating author name variations for Case Western’s most productive faculty and correctly ‘claiming/attributing’ research output on SCOPUS; often, researchers’ work had been mis-attributed by SCOPUS based on visiting appointments and the like. Finally, Bernal also integrated ORCID iDs for each faculty member. ORCID iDs are unique identifiers that stay with a researcher permanently, even when they move between institutions, and they help keep publications correctly attributed to the respective institutions.

As a result of her initiative, Bernal managed to improve her university’s outcomes on the following rankings:

  • QS World University Rankings: 213 to 186 (↑ 27 positions)
  • THE World Rankings: 158 to 132 (↑ 26 positions)
  • CWTS Leiden Rankings: 143 to 57 (↑ 86 positions)

The gain on the CWTS (Center for Science & Technology Studies, a Dutch institute) Leiden Ranking is particularly important because that ranking is constructed solely from the bibliometric impact of a university. Bernal’s study illustrates the massive impact that correct bibliometric attribution can have on a university’s ranking.

5. Methodology

(A) Data Collection

As a reminder our two research questions are:

  1. Do elite schools on the Times Higher Education (THE) Ranking benefit from higher international collaboration rates?
  2. What effect (if any) does THE’s European origins have on the ranking outcome?

We distilled the above two questions into formal statistical hypotheses (explained further below). To formalize our approach, we developed a four-step strategy to ensure consistency and integrity of data collection:

  1. Data Collection Plan
  2. Data Acquisition
  3. Collaboration Classification
  4. Data Collation

Each of the above steps is explained in detail in the sections below.

(i) Data collection Plan:

We divided the schools into two roughly equally sized groups:

  • ELITES: Oxford, Caltech, Cambridge, Stanford, MIT, Princeton, Harvard, Yale, University of Chicago, Imperial College, University of Pennsylvania, Johns Hopkins, UC Berkeley
  • PEERS: Northwestern, Brown, Columbia, Cornell, Duke, ETH Zurich, EPFL, McGill, New York University (NYU), Notre Dame, Rice, USC, UCL, Vanderbilt, Washington University in St Louis.

Rationale: Schools near the top of the THE rankings were assigned to the ELITES group, while schools in the neighborhood of Northwestern (above and below) were assigned to the PEERS group. The PEERS group was also supplemented by schools suggested by the client. There were 16 peer schools and 13 elite schools.

Each researcher was assigned a group and then collected data exclusively from that group.

(ii) Data Acquisition

The key data acquisition step consisted of:

Figure 1. Document distribution of collaborating affiliations of a University in Scopus

Figure 1 illustrates the data acquisition for Northwestern as an example. Each circled number indicates the number of documents co-authored by scholars of Northwestern University and another institution, such as the University of Illinois at Chicago. Something important to notice is that collaborations are counted by affiliation, so Northwestern University and the Northwestern Feinberg School of Medicine can ‘collaborate’ even though Feinberg falls under Northwestern. A helpful way to think of an affiliation is as the name of the organization written under a researcher’s name on an academic paper. A sample of data collection is presented in Figure 2.

Figure 2. Data collection of total documents for a University

(iii) Collaboration Classification

The next step consisted of classifying each collaboration as international or domestic. Since there were about 200 collaborations per institution and 10 institutions assigned for each researcher, classifying them solely by hand would have been tedious and inefficient. To speed up this work, we developed a classification algorithm to identify where an institution was from, and classify whether a given collaboration was domestic or international.

The algorithm relied on a database that mapped each university's name to the country where it is located. The key idea was to look up each entry in this table and return the university's country. The approach was inspired by the MS Excel function "Fuzzy Lookup", which looks up an entry in a lookup table based on a "rough match" with a provided search key. A major challenge was the large number of spelling variations and mistakes for a given university in the data downloaded from SCOPUS; our algorithm largely solved this problem. We were able to automatically classify between 40% and 60% of the entries in the collaboration tables we scraped, and we classified the remaining data by hand. The end result was that dataset creation times were roughly halved. A brief sketch of how the algorithm operates is provided below, followed by a minimal code version.

Fuzzy Lookup Algorithm

  1. For a given target university, search in the lookup table for an exact match. If there is an exact match, then return the country;
  2. If there isn’t an exact match, then compute the Levenshtein distance/similarity metric between the target university and all other universities in the lookup table;
  3. Select the university in the lookup table with the highest similarity to the target university;
  4. If the similarity metric is higher than the threshold, then return that university’s country; else return NA.
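
Below is a minimal Python sketch of this fuzzy lookup, assuming a simple dictionary that maps canonical university names to countries and a similarity threshold of 0.85; both the example lookup table and the threshold are illustrative, not the exact values we used.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0-1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def fuzzy_lookup(target: str, lookup: dict, threshold: float = 0.85):
    """Return the country of the best-matching university, or None (NA)."""
    target = target.lower().strip()
    if target in lookup:                                      # step 1: exact match
        return lookup[target]
    best = max(lookup, key=lambda name: similarity(target, name))  # steps 2-3
    if similarity(target, best) >= threshold:                 # step 4
        return lookup[best]
    return None

# Hypothetical lookup table for illustration only.
lookup = {"northwestern university": "United States",
          "university of oxford": "United Kingdom"}
print(fuzzy_lookup("Northwestrn Univesity", lookup))  # -> "United States"
```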

(iv) Data Collation

Our final step consisted of data collation. This step essentially involved converting the raw collaboration data into proportions of international collaboration for each school. This meant taking multiple tables, such as those depicted in Figure 3, and producing a table for each group (elites, peers):

Figure 3. Collaboration data sample of Universities

The key column above is Column D, the proportion of internationally co-authored documents for a given university. This is the final dataset that we used to test our hypotheses.
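
As an illustration of this collation step, the sketch below computes the proportion of internationally co-authored documents per school with pandas; the column names and numbers are hypothetical stand-ins for the fields in our collaboration tables.

```python
import pandas as pd

# Hypothetical raw collaboration rows: one row per (school, collaborating affiliation).
raw = pd.DataFrame({
    "school": ["Northwestern", "Northwestern", "Brown", "Brown"],
    "collab_country": ["United States", "United Kingdom", "Canada", "United States"],
    "n_documents": [350, 120, 80, 200],
})

home_country = "United States"
raw["is_international"] = raw["collab_country"] != home_country

totals = raw.groupby("school")["n_documents"].sum()
intl = raw.loc[raw["is_international"]].groupby("school")["n_documents"].sum()

collated = pd.DataFrame({"total_docs": totals, "intl_docs": intl}).fillna(0)
collated["prop_international"] = collated["intl_docs"] / collated["total_docs"]
print(collated)
```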

(B) Data Points Analysis

We began our analysis with data from Data Points, an application programming interface (API) tool provided by Times Higher Education that contains metric scores and values, university groups, subject distributions, and benchmark statistics for institutions around the world. Access to the API was provided by Northwestern University's Office of Institutional Research (OIR). Based on these data, we analyzed Northwestern University and its peers (Harvard, MIT, Princeton, UPenn). A descriptive analysis was done using Python and Jupyter Notebooks to get a preliminary idea of how Northwestern's ranks change relative to its peers and across the Times Higher Education metrics.

We looked at data over five years (2016-2020). The data included the metric values and scores assigned by THE to each institution, subject distributions (including engineering, social sciences, arts, health, and life and physical sciences), and benchmark statistics in different subject areas. Other important metrics were doctorates to bachelor's degrees awarded, doctorates awarded to academic staff, teaching reputation, research reputation, institutional income to academic staff, students to academic staff, publications per staff, research income to academic staff, citation impact, industry income to academic staff, percentage of international staff, international co-authorship, and percentage of international students.

We followed up with exploratory data analysis, constructing box plots, line and scatter plots, and descriptive statistics to understand where Northwestern stands relative to its peers. Our insights are presented in the Findings & Insights section.

(C) Hypothesis Testing

We began hypothesis testing by condensing our questions into two core sub-research questions:

  1. Representation: How does Elsevier represent Northwestern to THE? Does it share authorship data on an institution basis (i.e. entire University, and all affiliation levels underneath) or an affiliation basis (i.e. affiliation must be explicitly Northwestern University as opposed to say, Feinberg School of Medicine)?
  2. International Collaboration: Why is Northwestern ranked low on the International scores section of the THE? (it has a score of 64.1)

The answer to the first question (representation) was easily found through qualitative analysis (i.e., without statistical hypothesis testing) of THE's written ranking methodology. Elsevier shares data with THE at an institutional level, so regardless of an author's individual school affiliation (McCormick School of Engineering, Bienen School of Music, Feinberg School of Medicine, etc.), all of their publications are attributed to Northwestern by Elsevier. We reached this conclusion based on our analysis of Elsevier's description of SCOPUS.

To answer the second research question (international collaboration), we ran statistical hypothesis tests. This approach required collecting data and then running statistical tests to evaluate the null hypothesis.

We developed and tested the following two hypotheses:

  1. ELITES HYPOTHESIS: Elite schools (i.e. those ranked highly on the THE rankings) have higher rates of international collaboration than Northwestern University’s Peer Schools.
  2. ORIGINS HYPOTHESIS: THE Rankings are European in origin and are thus designed to favor European schools (its top three global schools are Oxford, Caltech, and Cambridge).

In coming up with the Elites Hypothesis, our rationale was that authors worldwide are incentivized to co-author with researchers at elite schools; this results in elite schools maintaining their high international rankings in a cyclical fashion. A potential implication of this "preference for elite collaboration" is that when domestic US universities collaborate with the elites, that does not count as 'international collaboration', whereas when non-US institutions collaborate with the elites, it does. It is also disadvantageous that most "elite" universities are in the US. Thus, the nature of the modern academic landscape disadvantages US schools.

The Origins Hypothesis was based on our assumptions that rankings tend to favor the country that produced them. For example, USNWR is American, and lists the top three global schools as Harvard, MIT, Stanford (all three American). Similarly, THE is a British ranking, and lists Oxford University, CalTech, and Cambridge University as its top three universities.  However, there are exceptions to this dynamic: QS Rankings (a British ranking), lists the top three world universities as MIT, Stanford, Harvard (all three American). Meanwhile, Shanghai ARWU (a Chinese ranking) lists the top three Universities as Harvard, Stanford, Cambridge (two American, one British). If these assumptions are true, both of our hypotheses imply that NU’s poor ranking outcome on the THE Ranking is simply because it is an American school.

(D) Hypothesis Testing Process

Statistical hypothesis testing is used to assess whether a research observation is due to random chance or reflects a true underlying phenomenon. To test our Elites Hypothesis, we used a z-test for proportions. This test was chosen because we are testing whether there is a difference in proportions between two groups (elites and NU peers); the exact proportion being tested is the proportion of documents co-authored internationally. The two tests we conducted are formalized below:

  1. Test 1 (Z-test of proportions):
    Null Hypothesis: There is no difference between international collaboration rates between elite and peer schools.
    Alternative Hypothesis: The international collaboration rate of elite schools is (different from/greater than) that of NU peer schools.
  2. Test 2 (Z-test of proportions):
    Null Hypothesis: There is no difference between the international collaboration rates of European and US schools.
    Alternative Hypothesis: The international collaboration rate of European schools is higher than that of US schools.

Rationale

Our rationale for using a two-sample unpaired z-test for proportions was as follows:

  • Since we are comparing two groups of universities, it makes sense to use a two-sample z-test.
  • Moreover, there are not the same number of universities in both groups and this is not a before/after situation, hence the test needs to be unpaired.
  • Each paper published by the ELITES/PEERS group is considered a random Bernoulli trial, with success defined as 'producing an internationally co-authored paper'.
  • Under a binomial model for the number of internationally co-authored papers, the standard deviation is determined by the proportion itself rather than estimated separately, which removes the need for a t-test and supports a z-test.

Assumptions:

The assumptions for a two-sample unpaired z-test are satisfied (a minimal code sketch of the test follows the list below):

  • The expected number of successes np > 10 for each of the groups, where n is the number of trials, and p is the probability of success.
  • “Publishing a paper” can be reasonably viewed as a Bernoulli random variable. As a reminder, a Bernoulli random variable represents a binary outcome. In our case, this would be the “publish” “no publish” decision.
  • Each Bernoulli trial (to publish a paper) can be reasonably considered independent of the others.
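
The sketch below shows how such a two-proportion z-test can be run in Python with statsmodels; the document counts are hypothetical placeholders, not our actual group totals.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: internationally co-authored documents and total documents
# for the ELITES and PEERS groups (placeholders, not our real totals).
intl_docs = [45_300, 41_500]     # successes: elite, peer
total_docs = [100_000, 100_000]  # trials: elite, peer

# One-sided test: elites' international collaboration rate > peers' rate.
z_stat, p_value = proportions_ztest(count=intl_docs, nobs=total_docs,
                                    alternative="larger")
print(f"z = {z_stat:.2f}, p-value = {p_value:.3g}")
```
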
6. Findings & Insights

A) Data Points Analysis

Figure 4. Correlation plot of THE metrics in the Year 2020

We constructed the heatmap above to identify linear relationships among the THE metrics. As Figure 4 shows, several metrics are highly correlated; for example, we observed an almost perfect correlation (0.99) between teaching and research reputation.
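
A heatmap like this can be reproduced with a few lines of pandas and seaborn, as sketched below; the metric columns and values here are illustrative, since the actual Data Points export contains many more fields.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical slice of the Data Points export: one row per institution.
metrics = pd.DataFrame({
    "teaching_reputation": [92, 85, 78, 70],
    "research_reputation": [93, 86, 77, 71],
    "citation_impact": [99, 97, 95, 90],
    "international_coauthorship": [64, 80, 72, 66],
})

corr = metrics.corr()  # pairwise Pearson correlations between THE metrics
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of THE metrics")
plt.tight_layout()
plt.show()
```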

In addition to this analysis, we found that Times Higher Education distributes a long survey to score teaching and research reputation. The survey instrument appears to have numerous flaws and confusing questions. From our conversations with Prof. Grayson, we learned that many respondents enter the same institutions for both the teaching and research reputation survey questions, which is one possible explanation for the high correlation observed above.

Figure 5. Line plot showing comparison between Northwestern University and UCL with the peer universities

The line plot (Figure 5) demonstrates that Northwestern University (blue line) has been below the median in international co-authorship for the past five years; foreign universities, such as University College London (purple line), dominate here. Northwestern dominates in industry income and research income to academic staff, but it needs to be more inclusive in attracting international students and staff to stand out in the Times Higher Education ranking, which is a relative ranking.

Figure 6. Clustered Groups of Peer Universities with top metrics values highlighted

Figure 6 demonstrates which metrics are highest for each of the various groups, namely Technology Challengers, Old Stars, and International Powerhouses. These groups were determined by SCOPUS through a clustering analysis of the rankings and other similarities between institutions. Northwestern University is part of the International Powerhouses group.

International powerhouse universities dominate in publications per staff, research income and industry income.

Technology challengers dominate in international staff, students and co-authorship.

All groups have improved significantly in Doctorate to Bachelor awarded over the years and moderately in Institutional income.

Figure 7. Box-plots showing the spread of various THE metric rankings

As presented in Figure 7, the variability in Doctorate to Bachelor's awarded scores keeps changing over the years (2016-2020). Citation impact is one metric with little variability or change.

B) Hypothesis Testing

The elites hypothesis stated that elite schools find it easier to attract international collaboration than NU peer schools. In other words, elite schools have higher international collaboration rates than NU peer schools. This results in a “rich-stay-rich” dynamic, where elite schools attract more international collaborators, and thus maintain their elite status, which in turn attracts even more international collaborators.

The Elites Hypothesis was supported (Figure 8).

Null hypothesis: Elite schools and NU Peer Schools have the same rate of international collaboration.
Alternative hypothesis: Elite schools have a higher rate of international collaboration than NU Peer schools.

Figure 8. Distribution of the proportion of international collaboration for peer schools
Figure 9. Distribution of the proportion of international collaboration for elite schools

From the two distributions presented in Figure 8 and Figure 9, it seems that elites have more mass towards the right end of the horizontal axis (i.e. they have higher rates of international collaboration).

Sample Values: p_intl_coll_elites = 0.453 > p_intl_coll_peers = 0.415

Testing Results: p-value = 1e-16

Conclusion: At a significance level << 0.001, we conclude that the alternative is true: elite schools DO have a higher rate of international collaboration than NU Peer schools.

Origins Hypothesis

The origins hypothesis stated that the THE rankings, with its origins in the UK, is designed to favor European schools. In particular, it emphasizes international collaboration, which presumably is easier for European schools to demonstrate. Thus, the hypothesis we test is whether European schools have higher international collaboration rates than North American schools.

Null Hypothesis: European schools have the same international collaboration rate as do North American schools.
Alternative Hypothesis: European schools have a higher rate of international collaboration than American schools.

Sample Values: p_intl_coll_euro = 0.6295 > p_intl_coll_na = 0.4008

Testing Results: p-value = 1e-16

At a significance level << 0.001, we conclude that the alternative is true: European schools DO have a higher rate of international collaboration than North American schools.

7. Summative discussion

A) Data Points Analysis

Given Northwestern's comparatively poor performance (below the median) in international co-authorship for the past five years, it is clear that this is an area where improvement needs to happen. Based on our research findings, we suggest:

  • Encouraging faculty to attend foreign conferences and seek out opportunities for international collaboration.
  • Encouraging Northwestern Qatar to collaborate locally in the Middle East. This strategy might improve Northwestern’s international collaboration rate.
  • Advertising Northwestern’s programs internationally as well as recruiting foreign staff (perhaps as Visiting Faculty, visiting scholars) to stand out in the Times Higher Education ranking. This approach might require additional funding. However, since Northwestern outperforms its peers in terms of industry and research income to academic staff, these could be the areas where Northwestern could look for potential funding sources.
  • Northwestern’s membership in the “international powerhouse” group might be misleading. This is because International powerhouse universities typically dominate in publications per staff, research income and industry income. However, a grouping that may be of interest to Northwestern would be “Technology challengers”. These schools interestingly perform exceedingly well in international staff, students and co-authorship. An analysis of the qualitative characteristics of schools in the Technology Challengers category might yield some useful insights. For example, it may be worthwhile to examine administrative structures, international student recruitment efforts, academic strengths/weaknesses, etc.

B) Hypothesis Testing

Our results for the Elites Hypothesis indicate that elite schools have higher rates of international collaboration. However, a key limitation to acknowledge is that our analysis does not establish any causal relationship; as the well-known adage in statistics goes, "correlation does not equal causation," and that applies here. Nevertheless, we would like to conjecture two plausible causal relationships:

  • It is possible that elite schools (due to their higher status) are able to attract higher collaboration opportunities from international partners.
  • These schools are able to maintain their positions at the top because they are able to attract international collaboration.

Thus, it is evident that a “rich get richer/stay rich” dynamic is at work: elite schools maintain their positions at the top by collaborating internationally, and due to their resulting high positions, they are able to attract even more international collaboration. This creates a self-reinforcing dynamic.

Meanwhile, considering the results from the Origins Hypothesis, we can conjecture the following causes. Given the geography of the European continent (equally well-renowned schools spread across many small countries) vis-à-vis North America (well-renowned schools clustered mainly in the US, and to a lesser extent Canada and Mexico), it makes sense that European schools would have higher international collaboration rates. In other words, geography might be the hidden factor behind the ranking discrepancy.

It is interesting to note that this trend is also visible in the US-based US News & World Report (USNWR) rankings. In its Global 2020 rankings, the top five schools are Harvard, MIT, Stanford, Berkeley, and Oxford, and there are only two non-US schools in the global top 10. As would be expected, international collaboration is factored into the USNWR global rankings, but only relative to country. An offsetting factor is also introduced, namely global research reputation, which US schools presumably dominate. Perhaps the greatest insight to take from the above hypothesis testing results is how they affect Northwestern.

A valuable visualization, showing where Northwestern stands relative to its peer group, is presented in Figure 10.

Figure 10. Distribution of international collaboration proportion with Northwestern (purple horizontal line)

It demonstrates that Northwestern lies close to the bottom of its peer group when it comes to international collaboration. This is important because, in combination with the other findings, it shows that Northwestern's low position in the THE rankings is a function of both intrinsic and systemic factors. Some of these factors Northwestern cannot control, namely the systemic ones:

  • It is not an Elite school.
  • The THE rankings are European in origin.

Yet others it can control, such as where it stands with respect to its peer group, which is a fair comparison given the many similar characteristics NU shares with its peers.

8. Conclusion, Recommended Actions & Next Steps

In this paper, we demonstrated that elite schools do have a higher rate of international collaboration than Northwestern University's peer schools. This result could be hiding a "rich get richer" dynamic. Moreover, we demonstrated that European schools have higher international collaboration rates than North American schools, which results in European schools being ranked favorably by THE (a European ranking). We further conjectured that this is because of the geographical setup of the Continent. Beyond the above extrinsic factors contributing to North American schools' low international outlook scores, there are also intrinsic ones: many top North American schools sit in the bottom quartile of their peer groups when it comes to international collaboration. This conclusion is separately confirmed by the Data Points analysis.

Finally, correct author and institution attribution on Elsevier's SCOPUS is critical to any effort to improve rankings, especially in ranking systems that focus heavily on research output. As demonstrated by Bernal (2017), such efforts can yield tremendous gains. However, any effort to correctly attribute bibliometric information must be owned and directed by the University Library.

Another promising direction is to use the Composite I-Distance Indicator (Dobrota et al., 2016) to determine stabilizing weights for the THE Rankings. Having generated a 'stabilized ranking', subsequent analyses would provide greater insight into Northwestern's position independent of ranking instability. This strategy would give a clearer picture of what actions Northwestern, and most of the top North American universities, could take to improve their 'intrinsic position' in the Times Higher Education ranking.

9. References:

Bernal, L. (2017). Library Impact with International Rankings – One Library's Continuous Journey to Figure It Out. Retrieved from https://www.libraryassessment.org/wp-content/uploads/2019/09/17-Bernal-LibraryImpact.pdf

Bookstein, F., Seidler, H., Fieder, M., & Winckler, G. (2010). Too much noise in the Times Higher Education rankings. Scientometrics, 85(1), 295-299.

Bornmann, L., & de Moya Anegón, F. (2014). What proportion of excellent papers makes an institution one of the best worldwide? Specifying thresholds for the interpretation of the results of the SCImago Institutions Ranking and the Leiden Ranking. Journal of the Association for Information Science and Technology, 65(4), 732-736.

Dobrota, M., Bulajic, M., Bornmann, L., & Jeremic, V. (2016). A new approach to the QS university ranking using the composite I‐distance indicator: Uncertainty and sensitivity analyses. Journal of the Association for Information Science and Technology, 67(1), 200-211.

Daraio, C., & Bonaccorsi, A. (2017). Beyond university rankings? Generating new indicators on universities by linking data in open platforms. Journal of the Association for Information Science and Technology, 68(2), 508-529.

Acknowledgments:

We would like to thank the Office of Institutional Research, especially Amit Prachand and Jason Ripple, for leading this project, and Professor Matthew Grayson for making us think outside the box and for his continuous motivation, support, and encouragement. We would also like to thank our research advisor, Dr. Borchuluun Yadamsuren, for propelling us to research and understand the problem better and to achieve phenomenal results. Finally, we would like to thank Professor Diego Klabjan and the MSiA program staff for giving us this opportunity to collaborate with OIR, which was a great learning experience.

What Skills Do Data Scientists Need? — A Text Analysis of Job Postings

|By Yanmeng Song and Lang Qin|

Introduction
Whether in the past or at present, when trying to find your way into the data science world, you have probably had this question in mind: what skills should I equip myself with, and put on my resume, to increase my chances of getting an interview and being hired? This question keeps popping up as new skills spring up quickly. After spending long hours searching for a job online, you close your laptop with a sigh. There is a ton of information about how people define "data science" differently, and it appears to be an ongoing discussion.

Based on LinkedIn’s third annual U.S. Emerging Jobs Report, the data scientist role is ranked third among the top-15 emerging jobs in the U.S. As the data science job market is exploding, a clear and in-depth understanding of what skills data scientists need becomes more important in landing such a position. This blog attempts to provide insights into this question in a “data science” way.

Problem Statement
According to the Business-Higher Education Forum (BHEF) report, by 2021, nearly 70% of executives in the United States will prefer job candidates with data skills. As stated in the Dice 2020 Tech Job Report, the demand for data science jobs will increase by 38% over the next 10 years. Catering to this growing need for data scientists in the job market, the past few years have seen a rapid increase in new degrees in data science offered by many top-notch universities. The demand for data scientists is booming and will only continue to increase in the future.

Figure 1: Data Scientist Demand Trend (Source: Google Image)

Though the data scientist role has become one of the most sought-after positions, there exists no standardized definition of the role, and most people have an inadequate understanding of the knowledge and skills it requires. There are multiple other roles, such as data analyst, business analyst, data engineer, and machine learning engineer, that are usually thought of as similar but can differ considerably in their functions. Radovilsky et al. (2018) pointed out that "data science emphasizes computer systems, algorithms, and computer programming skills, whereas business data analytics has a substantial focus on statistical and quantitative analysis of data, and decision-making support." Furthermore, the required knowledge and skills are not static but dynamic, as new and emerging skills spring up in this era of rapid technological development, not to mention that the required skill sets may vary among business organizations for the exact same job title.

These situations pose great challenges for data science job seekers. What skills exactly do data scientists need to acquire? Just as resume parsers and resume extraction software are widely used by hiring organizations, the most direct approach to finding the answer is identifying skills from data science job postings, where qualifications and requirements for the role are usually clearly stated. However, a single job posting usually contains a great deal of information, and manually analyzing postings one by one would be very time-consuming and inefficient. This is exactly where natural language processing (NLP) comes into play, and it is what led to this project.

Goal
In this project, we aim to investigate knowledge domains and skills that are most required for data scientists. In other words, we want to identify the most frequently used keywords for skills in corresponding job descriptions. We also aim to develop a pipeline that starts with raw job postings and produces analysis and visualization. In this way, new data could be fed in and it is possible to explore the dynamics of top required skills.

Bridging the gap between job postings and user profiles would tremendously benefit job seekers in the data science field. On the one hand, they would understand the job market better and know how to market themselves for better matching. On the other hand, it provides opportunities for them to learn or advance skills that they are not proficient in yet but are in high demand by hiring organizations. They could then decide to pursue a higher-level degree, or take an online course, or learn by practice, etc, to be well prepared for landing good jobs.

Methodology
We performed text analysis on associated job postings using four different methods: rule-based matching, word2vec, contextualized topic modeling, and named entity recognition (NER) with BERT. In the first method, the top skills for “data scientist” and “data analyst” were compared. A complete pipeline was built to create word clouds with top skills from job postings. The other three methods focused on “data scientist” and enabled us to experiment with the state-of-the-art models in NLP.

Data Collection
There is no readily available dataset of data science job postings, so we collected our own through web scraping from three popular job search engines — Indeed, Glassdoor, and LinkedIn. The job postings were retrieved by entering "data scientist" and "data analyst" as job titles and "United States" as the location in the search bar.

Figure 2: A Sample Job Posting from Indeed

For each job posting, five attributes were collected: job title, location, company, salary, and job description. The job description is the desired information while the remaining four attributes were excluded from the analysis for this project.

We faced several challenges in the process of web scraping. Firstly, website scripts and structures are updated frequently, which means that the scraping code has to be constantly updated and maintained; the three job search engines we selected have different structures, so scripts need to be adjusted accordingly. Secondly, taking Indeed as an example (Figure 2), only the first bullet point or part of the detailed job description is displayed, so the scraping scripts must include a click action to obtain the full text. Similarly, the automatic scraping process can be interrupted by a pop-up window asking for a job alert sign-up, so a function to close that window is also needed. Furthermore, based on our experiments, Glassdoor detects the web scraper as a bot after a few hundred requests, so either a time delay should be embedded between requests or the scraper must wait a while before resuming.
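
A stripped-down sketch of the scraping loop is shown below, using requests and BeautifulSoup with a delay between requests. The URL template and CSS selector are hypothetical, and the production scripts additionally relied on browser automation to handle the click and pop-up issues described above.

```python
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical search-results URL template; real job boards change their
# markup and query parameters frequently, so this must be kept up to date.
SEARCH_URL = "https://www.example-job-board.com/jobs?q=data+scientist&l=United+States&start={offset}"

def scrape_descriptions(pages: int = 5, delay_seconds: float = 5.0) -> list[str]:
    descriptions = []
    for page in range(pages):
        url = SEARCH_URL.format(offset=page * 10)
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Assumed selector for the job-description container on each result card.
        for card in soup.select("div.job-description"):
            descriptions.append(card.get_text(separator=" ", strip=True))
        time.sleep(delay_seconds)  # be polite and reduce the chance of bot detection
    return descriptions
```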

With a single search, each of the three job search engines restricted us to scraping only 1,000 job postings. After removing postings without job descriptions and duplicates within a single dataset or across the three datasets, we obtained 2,147 entries for "data scientist" and 2,078 entries for "data analyst". Data cleaning was then applied to the job descriptions, including lower-case conversion and removal of special characters and extra white space.

Analysis
This section gives a detailed description of the four methods. Both collected datasets were used in the rule-based matching method for the purpose of comparison. Only the dataset of “data scientist” was used in the other three methods to explore and identify the associated skills. In the future, the analysis can be replicated easily on “data analyst” by changing the input dataset to the pipeline.

I. Rule-Based Matching
Based on our job search experiences with "data scientist" and "data analyst", we defined a dictionary of commonly required skills organized into ten categories: statistics, machine learning, deep learning, R, Python, NLP, data engineering, business, software, and other. Some examples under the machine learning category are regression, predictive modeling, clustering, time series, PCA, etc.

Using this predefined dictionary, we counted and ranked the occurrences of each skill in the two datasets. The corresponding word clouds were then generated, with greater prominence given to skills that appear more frequently in the job descriptions.
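
The counting and word-cloud step can be sketched as follows; the skill list shown is a tiny illustrative subset of the full ten-category dictionary, and the wordcloud package is assumed to be installed.

```python
from wordcloud import WordCloud

# Tiny illustrative subset of the predefined skill dictionary.
skills = ["python", "sql", "machine learning", "statistics", "communication", "excel"]

def count_skills(descriptions: list[str]) -> dict[str, int]:
    """Count how many job postings mention each skill at least once."""
    counts = {skill: 0 for skill in skills}
    for text in descriptions:
        text = text.lower()
        for skill in skills:
            if skill in text:
                counts[skill] += 1
    return counts

descriptions = ["Requires Python, SQL and machine learning experience.",
                "Strong Excel and communication skills; SQL a plus."]
freqs = count_skills(descriptions)

# Skills mentioned more often are rendered more prominently.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(freqs).to_file("data_scientist_skills.png")
```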

A complete pipeline was developed, from web scraping to word cloud generation. As job postings are updated frequently, new data can be scraped in the future and top skills identified from the word cloud through our pipeline. Additionally, the trend in top required skills can be captured by comparing data scraped at different points in time, in which we might see particular skills gain popularity in the industry over time.

Note that the predefined dictionary is editable and expandable, to account for the rapidly changing data science field. Another feature of this method is its flexibility. We focused on the data science job market in this project, but the method can be extended to other job positions or fields and tailored to any location of interest. To achieve this, only a new dictionary and new search URLs (for the new job title and location) are needed.

II. Word2Vec
The Word2Vec algorithm (Mikolov et al., 2013) uses a neural network model to learn word vector representations that are good at predicting nearby words. It is considered one of the biggest breakthroughs in the field of NLP. As the name suggests, Word2Vec takes a corpus of text as input and produces a vector space, typically of several hundred dimensions, as output. Each unique word in the corpus is assigned to a vector in the space. Word vectors are positioned so that words that share common contexts in the corpus are located close to one another in the space (Innocent, 2019). Thus, word2vec could be evaluated by similarity measures, such as cosine similarity, indicating the level of semantic similarity between words.

There are two model architectures for Word2Vec — Continuous Bag-Of-Words (CBOW) and continuous Skip-gram (SG). CBOW learns to predict a word given its context, while SG is designed to predict the context given a word.

Figure 3: CBOW and Skip-Gram (Source: Google Image)

In our case, Word2Vec can be leveraged to extract related skills for any set of provided keywords. Word-level tokenization and normalization (removing non-alphabetic characters, stop words, etc.) were applied as preprocessing steps. We experimented with both models and conducted hyperparameter tuning, including the embedding size and the window size.
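
A minimal gensim sketch of this step is shown below; the parameter names follow gensim 4.x (`vector_size` was called `size` in older releases), and the corpus here is a toy placeholder rather than our actual tokenized postings.

```python
from gensim.models import Word2Vec

# Toy placeholder corpus: each job description tokenized into a list of words.
tokenized_descriptions = [
    ["experience", "with", "python", "sql", "and", "machine", "learning"],
    ["familiarity", "with", "neural", "networks", "and", "deep", "learning"],
    ["python", "pandas", "numpy", "and", "scikit", "learn", "preferred"],
]

model = Word2Vec(
    sentences=tokenized_descriptions,
    vector_size=100,   # embedding dimension (tuned in our experiments)
    window=5,          # context window size (also tuned)
    sg=1,              # 1 = skip-gram, 0 = CBOW
    min_count=1,       # keep every token in this toy example
    epochs=50,
)

# Nearest neighbors by cosine similarity for a seed skill word.
print(model.wv.most_similar("python", topn=10))
```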

III. Contextualized topic modeling
Topic modeling is an unsupervised machine learning technique that is often used to extract words and phrases that are most representative of a set of documents. Different from traditional topic modeling techniques, such as Latent Dirichlet Allocation (Blei et al., 2003), contextualized topic modeling (Bianchi et al., 2020) uses a pre-trained representation of language together with a neural network structure, capable of generating more meaningful and coherent topics.

BERT (Bidirectional Encoder Representations from Transformers) was introduced in 2018 (Devlin et al., 2018). It is the latest language representation model and considered one of the most path-breaking developments in the field of NLP. It advances the state of the art for eleven NLP tasks. We used BERT as the pre-trained representation of language in this method.

The job descriptions were broken down into sentences, and each sentence serves as a training sample. The model diagram is shown in Figure 4 below. Raw sentences go through a BERT embedding and are combined with the Bag-of-Words representation. The concatenated result goes through a neural network framework, which approximates the Dirichlet prior using Gaussian distributions. The hidden layers were tuned to generate the topics. We chose the number of topics to be 20 on the assumption that job descriptions probably do not contain too many "topics."

Figure 4: Combined Topic Model Diagram
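
A sketch of this step using the contextualized-topic-models package by the same authors is shown below. The class and argument names follow a recent release of the package (they have changed across versions), and the sentence-embedding model name and toy inputs are assumptions; treat this as an outline rather than our exact training script.

```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# In practice these lists hold thousands of job-description sentences;
# three toy sentences are shown here only to keep the sketch self-contained.
sentences = ["experience with Python, SQL and machine learning",
             "competitive salary, dental and vision insurance",
             "familiarity with deep learning frameworks preferred"]
preprocessed = ["experience python sql machine learning",
                "competitive salary dental vision insurance",
                "familiarity deep learning frameworks preferred"]

prep = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")  # assumed SBERT model
training_dataset = prep.fit(text_for_contextual=sentences, text_for_bow=preprocessed)

ctm = CombinedTM(bow_size=len(prep.vocab),  # vocabulary size of the Bag-of-Words
                 contextual_size=768,       # dimension of the sentence embeddings
                 n_components=20)           # number of topics
ctm.fit(training_dataset)
print(ctm.get_topic_lists(10))  # top 10 words per topic
```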

IV. Named entity recognition with BERT
Named entity recognition (NER) is an information extraction technique that identifies named entities in text documents and classifies them into predefined categories, such as person names, organizations, locations, and more. BERT, briefly mentioned in the previous method, involves two steps in its framework: pre-training and fine-tuning. The pre-trained BERT model can be fine-tuned with just one additional output layer to create cutting-edge models for a wide variety of NLP tasks.

Figure 5: Overall Pre-Training and Fine-Tuning Procedures for BERT

Here we fine-tuned BERT for named entity recognition (Sterbak, 2018) to help identify skill keywords in job descriptions. The skills correspond to the entities that we want to recognize. If we highlight all the skills from the predefined dictionary in a sentence and feed it into the pre-trained BERT model, a more comprehensive set of skills can be obtained by analyzing the sentence structure. The input to the model is the set of sentences containing at least one skill from our dictionary; sentences not related to skills at all, such as company introductions and application instructions, are excluded from the analysis.

Each input sentence was first tokenized by the pre-trained tokenizer from the BERT implementation. We limited the sequence length to 50 tokens: sequences with fewer than 50 tokens were padded, and sequences with more than 50 tokens were removed (there were only very few of the latter). In terms of labels, tokens that match our dictionary were given the label "1" (skill) and otherwise "0" (non-skill), while tokens added for padding were labeled "2" to differentiate them from the rest. Similar to masking in Keras, an attention_mask is supported by the BERT model so that the padded elements in the sequence are ignored. The output of the model is a sequence of integers (0, 1, or 2) indicating whether each token belongs to a skill, a non-skill, or padding. An example from input to output is demonstrated in Figure 6.

Figure 6: Example of BERT Input and Output
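
The labeling scheme described above can be sketched as follows. The tokenizer comes from the Hugging Face transformers library, the skill set is a small illustrative subset of the dictionary, and a real script would also carry the matching attention mask into the model.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 50
skills = {"python", "sql", "statistics"}  # illustrative subset of the dictionary

def encode_sentence(sentence: str):
    """Tokenize one sentence and label tokens: 1 = skill, 0 = non-skill, 2 = padding."""
    tokens, labels = [], []
    for word in sentence.lower().split():
        pieces = tokenizer.tokenize(word)          # WordPiece sub-tokens
        tokens.extend(pieces)
        labels.extend([1 if word in skills else 0] * len(pieces))
    if len(tokens) > MAX_LEN:                      # sequences longer than 50 tokens were dropped
        return None
    attention_mask = [1] * len(tokens) + [0] * (MAX_LEN - len(tokens))
    labels = labels + [2] * (MAX_LEN - len(labels))
    input_ids = tokenizer.convert_tokens_to_ids(tokens) + [0] * (MAX_LEN - len(tokens))
    return input_ids, attention_mask, labels

print(encode_sentence("Experience with Python and SQL required"))
```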

Results
I. Rule-Based Matching

The output of the pipeline is two word clouds as well as two fully ranked lists of top skills with occurrence counts and percentages (i.e., count / total number of job postings), as shown in Figures 7, 8, and 9.

Figure 7: Data Scientist Top Skills
Figure 8: Data Analyst Top Skills

Figure 9 below illustrates the top ten identified skills, where the left one corresponds to data scientist and the right one corresponds to data analyst.

Figure 9: Left – Top 10 Data Scientist Skills; Right – Top 10 Data Analyst Skills

As we can see, Python, machine learning, and SQL are the top three for data scientists, while SQL, communication, and Excel are the top three for data analysts. Between the two top-ten lists, there are seven overlapping skills: Python, SQL, statistics, communication, research, project, and visualization. We were surprised that R is not even in the top-ten list for data analysts.

The above results are based on two datasets scraped in April 2020. We ran the whole pipeline again in September 2020 to test the functionality of the pipeline and investigate any potential changes in top skills. The results turned out to be very similar, given the relatively short time interval.

II. Word2Vec
The results from the CBOW model and the SG model are similar, and different model parameters affected the results only slightly. Overall, the word embeddings performed well in detecting closely related skills. We picked "python" and "neural" as candidate words and evaluated their closest neighbors in terms of cosine similarity (Figure 10). The top 10 closest neighbors of "python" captured other programming languages, libraries, software applications, and frameworks, while the top 10 closest neighbors of "neural" captured machine learning methods and probability-related concepts in statistics.

Figure 10: Top 10 Closest Neighbors

Starting from the whole list of skills from our dictionary, a more comprehensive list of related skills could be identified, potentially including new skills not defined in the dictionary.

III. Contextualized topic modeling
Following the original paper on the combined topic model (Bianchi et al., 2020), the results were evaluated by rank-biased overlap (RBO), which measures how diverse the topics generated by the model are. This measure allows disjointness between the topic lists and is weighted by the word rankings in the topic lists. A high value of RBO indicates that two ranked lists are very similar, whereas a low value reveals that they are dissimilar. We computed the rank-biased overlap diversity, the inverse of the standard RBO, on the top 10 keywords of the ranked lists. Correspondingly, a high value of this metric indicates that the topic lists are dissimilar, while a low value indicates the reverse. The result turned out to be 0.9937, demonstrating good topic diversity.

To identify the topic most closely related to the skill sets, a bar chart was plotted showing, for each topic, the percentage of its top 400 words that overlap with our predefined dictionary.

Figure 11: Percentage of Overlapped Words with Dictionary in Each Topic

Topic 13 has a significantly higher overlap percentage than the other topics. It is most likely to be the topic describing the skill sets, and this is validated by reviewing the top words in that topic (see Figure 12 for details). For comparison, topic 20, with a much lower overlap percentage, has its top 50 words listed. We can safely conclude that it is describing the benefits, as words like “insurance”, “vision”, “dental”, “coverage”, and “holiday” suggest.

Figure 12: Top 50 Words in Topic 13 and Topic 20

Topic 13 was selected for further analysis, and it will be called the skill topic for easy reference. We made a comparison between the words in the skill topic and those in the predefined dictionary. The objective is two-fold: (i) it provides a qualitative evaluation of the combined topic model, especially for the skill topic; (ii) it provides insight into the potential of the skill topic for identifying new skills not defined in the dictionary. The word clouds in Figure 14 present the results visually, and the annotations are explained through the Venn diagram in Figure 13.

Figure 13: Venn Diagram for Dictionary and Skill Topic

Uncaptured words are those defined in the dictionary but not captured by the skill topic. Examples like "communication", "management", and "network" are more general skills and might be captured in another topic of the model; they could appear in another part of the job description and thus not be representative of the sentences describing specific skills. However, examples like "statistics", "gbm", and "ai" might indicate a flaw in the model, since they are skills we expect to be captured. Overlapped words are those that appear in both the dictionary and the skill topic. Undefined words are those not defined in the dictionary but appearing in the skill topic. This category is interesting and deserves attention. Some words describe the level of expertise, such as "familiarity", "experience", and "understanding". Besides, words like "postgre", "server", "programming", and "oracle" indicate that the dictionary is not comprehensive enough. More importantly, this category is able to identify new and emerging skills we are not yet aware of, rather than being limited to a set of known skills. This is unachievable by the rule-based matching method without input of domain knowledge from experts, but it is of great importance for keeping up with the rapidly changing data science field.

Figure 14: Comparison between Dictionary and Skill Topic

A further quantitative evaluation was conducted on the discrepancy between the dictionary and the skill topic. Taking the predefined dictionary as ground truth, we define precision as the proportion of the top K words of the skill topic that appear in the dictionary, and recall as the proportion of all dictionary words that appear among the top K words of the skill topic. For instance, among the top 50 words in the skill topic, 21 of them (42%) appear in the dictionary, so the precision is 0.42; these 21 words account for 9.5% of all words in the dictionary, so the recall is 0.095. By varying K from 50 to 400, a precision-recall curve is plotted. The steeper slope at the beginning indicates that the proportion of overlapped words decreases as K increases; that is, the overlap concentrates in the top words of the skill topic. The slope flattens after 150 words, so K = 150 is a reasonable choice to capture enough skills while ignoring irrelevant words.
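
The sketch below reproduces this precision/recall computation for a given K, assuming the dictionary and the ranked skill-topic words are available as Python collections; the example values are placeholders.

```python
def precision_recall_at_k(topic_words, dictionary, k):
    """Precision = overlap / K; recall = overlap / |dictionary| (dictionary as ground truth)."""
    top_k = set(topic_words[:k])
    overlap = top_k & set(dictionary)
    return len(overlap) / k, len(overlap) / len(dictionary)

# Placeholder inputs for illustration.
topic_words = ["python", "sql", "experience", "statistics", "oracle"]  # ranked skill-topic words
dictionary = {"python", "sql", "statistics", "machine learning"}       # predefined skill dictionary

for k in (2, 5):
    p, r = precision_recall_at_k(topic_words, dictionary, k)
    print(f"K={k}: precision={p:.2f}, recall={r:.2f}")
```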

Figure 15: Precision-Recall Curve

IV. Named entity recognition with BERT
We randomly split the dataset into the training and validation set with a ratio of 9:1. The training data was fed into the BERT model for 3 epochs of fine-tuning. Cross entropy was used as the loss function and AdamW was used as the optimizer. The training process took around 7 hours using our own computer. Training loss and validation loss are plotted below. After 3 epochs, the training loss is 0.0023 and the validation loss is 0.0073.
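
A condensed sketch of this fine-tuning loop is shown below. It assumes the encoded tensors from the labeling step are already wrapped in a PyTorch DataLoader, and it follows the current transformers API (minor details differ across versions).

```python
import torch
from torch.optim import AdamW
from transformers import BertForTokenClassification

# Three label classes: 0 = non-skill, 1 = skill, 2 = padding.
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=3)
optimizer = AdamW(model.parameters(), lr=3e-5)

def train(model, train_loader, epochs=3):
    model.train()
    for epoch in range(epochs):
        for input_ids, attention_mask, labels in train_loader:
            outputs = model(input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=labels)          # cross-entropy loss over token labels
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"epoch {epoch + 1}: last-batch training loss {outputs.loss.item():.4f}")
```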

Figure 16: Learning Curve

Performance metrics for the validation set are summarized in Table 1. All four metrics have high values. However, such high predictive accuracy actually means a high degree of agreement with the rule-based matching method. We would prefer the model to have high recall but relatively low precision; in that case it would be extensible and could help us identify new skills not included in the dictionary, namely the false-positive part.

Table 1: Performance Metrics

Even with high precision, this method still finds some extra keywords, as shown in the figure below, such as “randomized grid search”, “factorization”, “statistical testing”, “Bayesian modeling” etc.

Figure 17: New Skills Identified by NER with BERT

Summary
We applied four different approaches of skills extraction from data science job postings. Data science job seekers could use identified knowledge domains and skills from these four approaches as a guide in their job search, not only to understand the job market and better market themselves but also to improve and/or learn new skills if necessary.

From a methodological point of view, the first method not only identified top required skills but also produced a complete pipeline that accommodates the changing nature of skills and enables exploring the trend of top required skills in the data science field. The other three methods are applications of both traditional and state-of-the-art models in NLP.

  • The rule-based matching method requires the construction of a dictionary in advance, so the identified top skills are limited to the predefined set of skills. This limitation is alleviated by our pipeline: as long as the dictionary is updated, new word clouds can be generated quickly, though updating requires domain expertise and is prone to subjectivity. The good thing is that no training is needed, and new data can easily be fed in by changing the website URL in the web scraping script. It is also possible to track the trend of top required skills in the data science field.
  • The word2vec method is able to find new skills. Using the dictionary as a base, a much larger list of skills could be identified. The key in this method is word embedding.
  • The contextualized topic modeling method is an unsupervised machine learning technique and is able to find new skills too. Our current evaluation is dependent on the dictionary. It is inevitable that some skills are distributed in other topics besides the most concentrated one, so the choice of topic numbers is important.
  • The named entity recognition with BERT method belongs to the supervised machine learning category and is able to identify new skills. The current labeling is imperfect due to its complete dependence on the dictionary; if equipped with better labeling, this method should be more powerful. Another unique advantage of this method is that it can capture multi-word phrases (two or more word n-grams).
Table 2: Comparison Between Four Approaches

Limitations and Future Work
Due to the limitations on the maximum number of job postings scraped with a single search, our data size is very small. Future work should consider combining data from more job boards or from several periodic scraping. A larger data size would be beneficial to all four methods and improve the results.

More text preprocessing and cleanup work could be done in the future to reduce noise. Stemming and word bigram might also be helpful.

The dictionary was defined by us and is certainly not exhaustive. The ability of the other methods to identify new skills would be improved with a more comprehensive dictionary.

In the NER with BERT method, it might be worth trying an iterative approach: the first iteration does labeling by matching against the dictionary, then the identified new skills, together with the dictionary, provide the labeling for the next iteration. This is a form of semi-supervision, and the skill set would gradually grow. Another alternative would be manual labeling, though that is very laborious. Note that BERT takes a while to train, so future work should consider training on a GPU.

We experimented with the long short-term memory (LSTM) architecture but it did not produce good results because of the small data size and skill versus non-skill imbalance. Deep learning methods are worth trying if these issues could be addressed.

References
Bianchi, F., Terragni, S., & Hovy, D. (2020). Contextualized Topic Models, GitHub repository, https://github.com/MilaNLProc/contextualized-topic-models

Bianchi, F., Terragni, S., & Hovy, D. (2020). Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974.

BHEF (2017, April). Investing in America’s data science and analytics talent. Retrieved from https://www.bhef.com/sites/default/files/bhef_2017_investing_in_dsa.pdf

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dice (2020). The Dice 2020 Tech Job Report. Retrieved from https://techhub.dice.com/Dice-2020-Tech-Job-Report.html

Innocent, A. (2019, September 29). Word Embeddings: Beginner’s In-depth Introduction. Retrieved from https://medium.com/@melchhepta/word-embeddings-beginners-in-depth-introduction-d8aedd84ed35

LinkedIn (2020). 2020 Emerging Jobs Report. Retrieved from https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/emerging-jobs-report/Emerging_Jobs_Report_U.S._FINAL.pdf

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

Radovilsky, Z., Hegde, V., Acharya, A., & Uma, U. (2018). Skills requirements of business data analytics and data science jobs: A comparative analysis. Journal of Supply Chain and Operations Management, 16(1), 82.

Sterbak, T. (2018, December 10). Named entity recognition with Bert. Retrieved from https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/

Relevant code is available here: https://github.com/yanmsong/Skills-Extraction-from-Data-Science-Job-Postings

Acknowledgments
We would like to express our very great appreciation to Dr. Borchuluun Yadamsuren for research guidance, feedback, and copyediting. Special thanks also to Dr. Emilia Apostolova for professional guidance and constructive suggestions.

Running a Marathon in 45 Seconds Flat

| By Chris Rozolis |

I should start by saying I have absolutely no long-distance training experience and have never completed a marathon, half-marathon, or even a 5K in my life — that being said, I do have experience running marathons…or at least simulating them in Python. Over the past 8 months I have had the pleasure of working with Dr. Karen Smilowitz and a team of computer science undergraduates to develop and deploy a data visualization system that aids marathon organizers in real time on event day.

 

Background
Large marathons such as the Bank of America Chicago Marathon have followed an operations plan known as the "Chicago Model" since 2009. This operations plan treats the entire race as a mass casualty event, as the number of medical treatments at course aid stations and medical tents historically reaches one to two thousand. The model is used because marathons are resource intensive yet predictable: a certain volume of runner injuries and aid requests is expected. The Chicago Model unifies operations staff in one location on race day, with the race organizers (Chicago Event Management), Chicago Police Department, Chicago Fire Department, Department of Homeland Security, American Red Cross, and many other organizations coexisting in one physical place. All represented groups in this Unified Command center have one common question: how is the race progressing?

Runner tracking data provides the information necessary to answer one simple question for every mile segment of the Chicago Marathon: how many runners are in this segment right now? While the answer to this question can help many different stakeholders within the Unified Command center, it is also needed in the event of an emergency. Should anything occur on race day that triggers an emergency response, the "Chicago Model" may dictate that runners shelter in place or move to shelter locations. This requires accurate estimates of the number of runners present in each mile segment at any given minute so that the correct resources can be dispatched proportionally.

One might think the Data Visualization System (DVS) could simply display real-time numbers of actual runners and map them to their correct mile segments. Well…this is where things get tricky.

Figure 1: Snapshot of the DVS deployed for the Houston Marathon/Half Marathon Event

 

The Problem
Runner tracking feeds can be inconsistent. The marathon tracking mats that determine runner results and record split times at each 5K marker are fantastic pieces of technology; however, they are best suited for post-race results. During the race, the mat-tracked numbers are manually adjusted for various reasons. Additionally, these mat-tracked runner counts are split by 5K distances rather than by mile marker, and mile-level information is a necessity for contingency planning. Using these feeds in real time does not provide the granular level of information needed for emergency planning and preparedness.

 

The Solution
Due to the real-time data issues, the original DVS team decided to build a marathon simulation. The idea is simple: create a collection of Python objects to represent a group of runners and have them "run" a marathon. Over the past five years, the simulation has steadily improved. Initially, the simulated runners had distances logged every 10 minutes of their race, meaning the simulation could give runner density counts in ten-minute intervals within the Unified Command center. Later iterations updated runner speed estimates from interpolation to piecewise regression models. Estimates narrowed to the two-minute level, but still left room for improvement in predicting numbers for contingency planning.

Today, I can simulate every single minute of the Chicago Marathon at every mile marker, and can map 10,000 simulated runners to 46,000 real marathon runners. This allows the DVS team to achieve the desired level of granularity needed for accurate runner locations and emergency preparedness. This recent iteration is the result of a major overhaul I implemented that now accounts for a large portion of the predictive and prescriptive power the DVS system has to assist marathon organizers.

 

Predicting a Marathon Runner’s Speed
One of my goals when redesigning the simulation was to have a grounded analytical approach to predicting runner speeds during the race. From historic runner results, I was able to build a multiple linear regression model that accurately predicted a runner's speed given their starting assignment, their location on course, and the course weather conditions. Marathon runners are assigned to "corrals" that determine when they start the race. These corrals are essentially groupings of runners by expected pace; for example, the Elite corral pace corresponds to finishing a marathon in under roughly 2 hours and 45 minutes. However, these corral paces are only expected means, and there is a lot of variability in runner speeds. This variability is what the simulation aims to capture and mimic.

I started by scraping a decade's worth of race data from official websites with posted marathon results, using Python and the BeautifulSoup4 library. As with any data science project, the scraped data was messy and required a lot of cleaning. One of the key components of cleaning the data was identifying which corral a runner belonged to, as this would help determine their expected pace. I used unsupervised methods, namely K-Means clustering and Gaussian Mixture Models, to cluster the historic runner data into corral groups before building a predictive model.
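
A small sketch of this corral-recovery step is below, clustering runners by average pace with scikit-learn; the single speed feature, the number of corrals, and the synthetic data are simplified assumptions standing in for the real historical results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic average speeds (miles per minute) standing in for scraped historical runners.
rng = np.random.default_rng(0)
speeds = np.concatenate([
    rng.normal(0.155, 0.005, 300),   # roughly elite-pace runners
    rng.normal(0.115, 0.008, 1500),  # mid-pack runners
    rng.normal(0.085, 0.010, 1200),  # slower corrals
]).reshape(-1, 1)

n_corrals = 3  # simplified; the real race has more corrals
kmeans_labels = KMeans(n_clusters=n_corrals, n_init=10, random_state=0).fit_predict(speeds)
gmm_labels = GaussianMixture(n_components=n_corrals, random_state=0).fit_predict(speeds)
```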


Figure 2: Runner corral historic average speeds and variances

Exploratory analysis confirmed differences in both speed and variance of speed across corrals (Figure 2), which I was able to account for when simulating a runner population. Because of these grouping differences, a runner's corral became a feature, along with temperature, humidity, and course position, in the multiple regression with runner speed as the response. I then performed 15 replicates of 10-fold cross-validation, which returned promising scores ranging from 0.75 to 0.85 (the Houston analysis returned 0.8462 and 0.8253).
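
A simplified version of this modeling step is sketched below with scikit-learn, using a one-hot-encoded corral label plus temperature, humidity, and course position as features; the synthetic data and the R² scoring are stand-ins for the historical data and metrics actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for historical runner observations.
rng = np.random.default_rng(1)
n = 5000
data = pd.DataFrame({
    "corral": rng.choice(["elite", "A", "B"], size=n),
    "temperature_f": rng.uniform(40, 75, size=n),
    "humidity_pct": rng.uniform(30, 90, size=n),
    "distance_mi": rng.uniform(0, 26.2, size=n),
})
base = data["corral"].map({"elite": 0.155, "A": 0.115, "B": 0.085})
speed = base - 0.001 * data["distance_mi"] + rng.normal(0, 0.005, size=n)

model = Pipeline([
    ("encode", ColumnTransformer([("corral", OneHotEncoder(), ["corral"])],
                                 remainder="passthrough")),
    ("regress", LinearRegression()),
])

cv = RepeatedKFold(n_splits=10, n_repeats=15, random_state=0)  # 15 replicates of 10-fold CV
scores = cross_val_score(model, data, speed, cv=cv, scoring="r2")
print(scores.mean())
```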

The trained model also produced intuitive results that matched real-world expectations. For example, the highest predicted speeds belonged to Elite runners, and the regression coefficients for distance into the race indicated that runner speeds decrease the further a runner progresses on the course. These coefficients also backed one of my starting assumptions: runners get a final adrenaline push at the very end. I found that for all runners, regardless of corral, being on the last 1.5 miles of the course leads to an increase in speed that reverses the gradual decline seen throughout the rest of the race (Figure 3).


Figure 3: An Elite runner’s predicted speed as a function of distance (holding all other parameters constant)

With a meaningful regression as input, the simulation mimics the various speeds runners may maintain during the race and also injects the appropriate variance based on the runner pacing distribution. One would expect that an Elite runner who can run a marathon in a little over two hours should not be simulated the same way that I can run a marathon (probably in 5 hours if I’m being generous). An exploratory analysis of historic Chicago Marathon data showed that runner speed variances had different spreads depending on a runner’s corral. These spreads become inputs when predicting speeds, and a combination of dynamic temperature and humidity estimates, runner variances, and runner distances is used to simulate the real-world distribution of runners. While runner speed variance is expected to differ by runner type (Elite vs. 5-hour pace), the probability of dropping out is also not consistent for every runner. One of the current ideas to improve the simulation and expand its capabilities is to calculate “survival probabilities” based on a runner’s type and race position to determine whether they will drop out.
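In simplified form, the variance injection works like this: the regression supplies an expected speed, and a corral-specific standard deviation widens or narrows the draw around it. The standard deviations and function names below are placeholders standing in for the spreads estimated from the historic EDA:

```python
import numpy as np

rng = np.random.default_rng(42)

# Corral-specific speed standard deviations (mph); assumed values standing in
# for the spreads estimated from the historic Chicago Marathon EDA.
CORRAL_SPEED_SD = {"Elite": 0.25, "A": 0.45, "B": 0.60, "C": 0.80}

def sample_speed(expected_speed_mph: float, corral: str) -> float:
    """Draw a runner's speed for the next simulated minute.

    `expected_speed_mph` is what the regression predicts for this runner at
    their current distance under current temperature and humidity; the draw
    adds corral-appropriate variability around that prediction.
    """
    sd = CORRAL_SPEED_SD[corral]
    return max(0.0, rng.normal(expected_speed_mph, sd))

# Example: an Elite runner predicted at 12.5 mph vs. a back-of-pack runner at 5.2 mph.
print(sample_speed(12.5, "Elite"), sample_speed(5.2, "C"))
```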

One of the fun characteristics of this project was combining my growing machine learning knowledge with my industrial engineering simulation knowledge. Some argue simulation itself is machine learning, but as I’ve learned, people will call anything machine learning to make it sound glamorous… and while my code is well commented and documented, it’s far from glamorous. Given all of the relevant inputs and the regression model trained on historic data, I can simulate the Chicago Marathon in 45 seconds and deliver accurate runner estimates to the Unified Command center on race day. The speed of the simulation allows for rapid re-modeling in the event of inclement weather, runners failing to show up, or any other unforeseeable circumstances during the race.

When my model and simulation were implemented at the 2017 Bank of America Chicago Marathon, we had highly positive results. At the end of the day, the predicted number of finishers was off by only 30 runners… a 0.04% error for the 45,000+ runner population. Consistently throughout the day, the simulation and regression model were fairly accurate in predicting speeds, off by only 1,000-1,500 runners at any given 5K marker at any given time (roughly 3.5% of the runner population). Put another way, 96.5% of the runner population was accurately matched to their locations, which is close to ideal when it comes to situational awareness.

 

Exploratory Work
Among the bells and whistles I have added to the updated simulation is the ability to input runner corral start times for the race, so that I mimic not only different runner speeds but also when different runners begin the race. This feature unintentionally enabled me to look into how the “runners on course” distribution smooths out and how a race comes to a close at the finish line. Large marathons with upwards of 46,000 runners presumably have runners finishing consistently throughout the day; however, the way runners are released onto the course in the morning significantly impacts how the pack approaches the finish line at the end of the day. Originally, I updated the simulation to inject more predictive power. Consequently, I now also have a prescriptive and visualization tool that can better inform race organizers when planning race-day logistics.

For example, the 46,000 runners in the 2017 Chicago Marathon started in three waves, released at 7:30 AM, 8:00 AM, and 8:35 AM. Previous marathons used 2-wave release systems. I simulated 2-wave and 3-wave release systems and compared the cumulative number of runners finishing the race over the course of the day. The newly implemented system appears to do a better job of spreading out runners at the finish line (indicated by the smoother dark blue curve in Figure 4).
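A stripped-down version of that comparison: assign each simulated runner a wave release time, add their simulated race duration, and compare the cumulative finisher curves. The wave shares and the duration distribution below are illustrative placeholders, not the calibrated simulation outputs:

```python
import numpy as np

rng = np.random.default_rng(7)
N_RUNNERS = 46_000

# Simulated race durations in minutes (a rough lognormal stand-in for the
# simulation's per-runner finish times).
durations = rng.lognormal(mean=np.log(270), sigma=0.2, size=N_RUNNERS)

def finish_times(wave_starts_min, wave_shares):
    """Assign runners to waves (start minutes after 7:30 AM) and return sorted finish times."""
    waves = rng.choice(len(wave_starts_min), size=N_RUNNERS, p=wave_shares)
    starts = np.asarray(wave_starts_min)[waves]
    return np.sort(starts + durations)

two_wave = finish_times([0, 30], [0.5, 0.5])
three_wave = finish_times([0, 30, 65], [0.35, 0.35, 0.30])   # 7:30, 8:00, 8:35

# Cumulative finishers by minute of day; a smaller peak in the minute-to-minute
# difference means a smoother flow of runners across the finish line.
minutes = np.arange(0, 600)
cum_two = np.searchsorted(two_wave, minutes)
cum_three = np.searchsorted(three_wave, minutes)
print("Peak finishers per minute, 2-wave vs 3-wave:",
      np.diff(cum_two).max(), np.diff(cum_three).max())
```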


Figure 4: Start and finish line cumulative distributions for the 2-Wave and 3-Wave start models

This “smoother finish line” result prompted me to look into how runners approached various mile markers on the course and to see how the three waves of runners ultimately smooth out as they approach the finish line. This curiosity led to a unique visualization that helped answer a common question in Unified Command: “where is the highest number of runners, like the bell of a curve?”. Figure 5 shows the number of runners that pass the starting line or a given mile marker at each point in time. These plots clearly show how runners start the race in three packs, and with this visualization in hand, when I answer “which bell curve are you asking about?” I am no longer met with blank stares and confused looks.


Figure 5: Simulated Runner Distributions (across time)

The same distribution plots were generated for the race’s midpoint mile markers to see how simulated runner speeds (with the corral-specific variance found in the historic EDA) change the distribution of runners on the course. Although these are snapshots at specific mile markers, comparing one distribution to another shows how the runner population moves and spreads as runners continue to race.
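These snapshots fall straight out of the simulation’s minute-by-minute position log. A minimal sketch, assuming each runner’s log is a sequence of (minute, miles completed) pairs as in the earlier simulation sketch:

```python
import numpy as np

def crossing_minutes(runner_logs, marker_miles):
    """First simulated minute in which each runner passes `marker_miles`."""
    crossings = []
    for log in runner_logs:                      # log: list of (minute, position) pairs
        for minute, position in log:
            if position >= marker_miles:
                crossings.append(minute)
                break
    return np.asarray(crossings)

def runners_per_minute(runner_logs, marker_miles, total_minutes=600):
    """Histogram of crossings per minute, i.e. the distribution at a mile marker."""
    crossings = crossing_minutes(runner_logs, marker_miles)
    counts, _ = np.histogram(crossings, bins=np.arange(total_minutes + 1))
    return counts

# e.g. runners_per_minute(logs.values(), marker_miles=13.1) gives the kind of
# halfway-point snapshot shown in Figure 6.
```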

 


Figure 6: Halfway mile markers

As depicted in Figure 6, the tri-modal runner distribution that started the race ends up flattening into a clunkier distribution with a wider spread by the time it passes the key midpoint mile markers. This completely changed the “where is the bell curve?” question and helped our DVS team decide to start providing median runner numbers to race organizers interested in the halfway points of runners. We also provide information about where the bulk of runners, or these peaks, are located, because the common “bell” or mean value may be misleading or simply unhelpful. Putting these distributions all on one plot fully illuminates just how much a runner population spreads.

Figure 7: Runner distribution by time at various mile markers on course

 

Conclusion
Visualizations like the distributions seen above came from the simple structure and output of the simulation. They can be used to estimate the number of runners an aid station may see at a certain time, or to determine where the median runner is. This type of marathon modelling is an untapped field with the potential for many analytics projects. Modelling anticipated resource usage, such as the water cups or blister tape needed throughout the day at aid stations, would be one direct-impact example. Other projects, such as modelling different start-line procedures and evaluating their impact on the whole course, are now possible too. Within my role in the research group, I am now beginning to investigate the Houston Marathon’s start-line system to evaluate its impact on the finish line and race operations.
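As a small illustration of the aid-station idea, the per-minute runner counts from the simulation translate directly into resource estimates; the cups-per-runner figure below is purely an assumed placeholder:

```python
# Rough aid-station sizing from simulated per-minute runner counts.
CUPS_PER_RUNNER = 1.5          # assumed average cups taken per passing runner

def cups_needed_per_hour(counts_per_minute):
    """Roll minute-level runner counts up into hourly cup estimates."""
    hours = len(counts_per_minute) // 60
    return [int(CUPS_PER_RUNNER * sum(counts_per_minute[h * 60:(h + 1) * 60]))
            for h in range(hours)]
```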

Throughout this project, I learned marathons require training. Whether it’s training data or waking up every morning at the crack of dawn and running 10 miles, it takes time. It took me 3 months of “training” over the summer, but the work paid off: I improved my running time from 18 minutes to a personal best of 45 seconds.*


*runtime is computational and unfortunately not superhuman physical speed

** This work was recently a finalist for, and won first place in, the 2018 INFORMS Innovative Applications in Analytics Award.


I’d like to thank Dr. Smilowitz for her help and mentorship in this project. I’d also like to thank the DVS team for seamlessly integrating the analytics and simulation into the real-time system. Additionally, I’d like to thank Dr. Barry Nelson, whose advice was crucial in decreasing simulation runtime and expanding the analytic capabilities of the system. Finally, I’d like to thank all who provided inputs and edits to this post!


This research has been supported by National Science Foundation grants 1640736 (PFI:AIR – TT: SAFE (Situational Awareness for Events): A Data Visualization System) and 1405231 (GOALI: Improving Medical Preparedness, Public Safety and Security at Mass Events).