The term “big data” has been a staple of corporate executives and analytics strategists in recent years, and with good reason. Big data has enabled numerous innovative uses of analytics, including Google’s use of search queries to predict flu outbreaks days ahead of the CDC. Yet a large volume of data alone is no guarantee of valuable insights or positive impact, because big data is inherently messy. Forms of data collection not associated with big data remain highly relevant, even in today’s interconnected, data-driven world.
Observational data can still play a significant role in guiding strategy and supporting research, even as the emphasis on big data grows. During the Chicago Marathon, a participating runner’s health information is recorded whenever he or she has an injury, whether as minor as a blister or as major as a heart attack. This field data is important to race organizers and operations researchers who seek to enhance the safety of the event through course optimization, and to health professionals who use the marathon as a proxy for understanding rapid-response health care in disaster relief scenarios. However, this data is recorded by hand, which increases the likelihood of human error and, in turn, of incorrect inferences.
In contrast, big data is usually collected in a regular manner, with consistent formatting from one time period to the next. A user’s clicks will always be collected in the same way, as will trading transactions on Wall Street, because these actions are recorded automatically by computers. This is rarely the case with field data. At the Chicago Marathon, the variables collected differ from year to year, even when they are meant to capture the same information. What’s more, variables that do align from year to year may be formatted differently, often with similar but still distinct sets of categories. This presents a unique challenge to researchers who want to compare the data across time or within a specific variable. The way to overcome this challenge is to understand how and why the data is collected in a particular way, not only when conducting analysis but also before data cleansing and aggregation.
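As a rough illustration of the year-to-year mismatch described above, the sketch below renames each year's raw fields to one shared schema before any comparison is attempted. The field names (`complaint_text`, `time_in`, `checkin_time`, and so on) are invented for the example; the article does not list the actual variables. In practice the mapping would be written only after learning from the people who collected the data what each field was meant to capture.

```python
# Hypothetical per-year field names mapped to a shared schema.
# None of these names come from the actual marathon data.
COLUMN_ALIASES = {
    2011: {"complaint_text": "chief_complaint", "time_in": "check_in"},
    2012: {"chief_complaint": "chief_complaint", "checkin_time": "check_in"},
}


def harmonize(record, year):
    """Rename a raw record's fields to the shared schema.

    Fields with no known mapping are kept under "unmapped" so they can be
    reviewed with someone who knows how the data was collected, rather
    than being silently dropped.
    """
    aliases = COLUMN_ALIASES[year]
    harmonized = {"year": year}
    unmapped = {}
    for key, value in record.items():
        if key in aliases:
            harmonized[aliases[key]] = value
        else:
            unmapped[key] = value
    harmonized["unmapped"] = unmapped
    return harmonized
```

Keeping unmapped fields visible is a deliberate choice here: a variable that does not line up across years is a prompt to ask about context, not something to discard automatically.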
As an example, two of the variables collected during a patient visit at the Chicago Marathon are check-in time and check-out time. These variables indicate when an injured runner entered an aid station for care and when that runner was released after treatment. As researchers, we may be interested in the total visit time of each runner at an aid station, and may hypothesize that injury severity is positively correlated with visit time. A runner can be treated more quickly if the injury is simply a blister, as opposed to knee pain or a laceration resulting from a fall.
However, pursuing this hypothesis naively would lead to misleading conclusions. We know from speaking with medical professionals on site that once injuries reach a certain level of severity, runners are transferred almost immediately to a local hospital. In addition, the check-out time for severe injuries is often recorded incorrectly because the medical professional is focused on the patient, which sometimes produces an inflated visit time.
Another issue to consider is a bottleneck at the medical tent. If multiple injured runners are waiting for treatment, visit time can misrepresent their true treatment time. Without this contextual information, we might have asserted that severely injured runners spend the most time at an aid station, or that visit time is directly correlated with treatment time. Yet we know from medical volunteers at the Marathon that the more severe cases usually have a visit time at or below average, since these injuries usually result in transfer to a local hospital. In this case, context gives us the knowledge needed to quickly identify outliers in the data, understand how the data is collected, and take appropriate action when conducting analyses.
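One way to carry this context into the analysis is to compute visit time but attach flags to records whose visit time should not be read as treatment time. The sketch below is illustrative only. It assumes times are recorded as `HH:MM` on a single day, and it assumes each record carries `transferred_to_hospital` and `waited_for_treatment` fields, which the article does not say exist. The 120-minute cutoff is an arbitrary placeholder, not a value taken from the marathon data.

```python
from datetime import datetime


def visit_minutes(check_in, check_out):
    """Return minutes between two same-day "HH:MM" time strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(check_out, fmt) - datetime.strptime(check_in, fmt)
    return delta.total_seconds() / 60


def annotate_visit(visit, long_visit_minutes=120):
    """Add visit_minutes and context flags to a visit record (a dict).

    The flags mark records whose visit time may not reflect treatment
    time; they do not correct the values.
    """
    minutes = visit_minutes(visit["check_in"], visit["check_out"])
    flags = []
    if minutes < 0:
        # A check-out earlier than check-in points to a recording error.
        flags.append("check_out_before_check_in")
    if visit.get("transferred_to_hospital"):
        flags.append("transferred")
        # Transfers are expected to be short; a long one suggests the
        # check-out time was written down late or incorrectly.
        if minutes > long_visit_minutes:
            flags.append("suspect_check_out")
    if visit.get("waited_for_treatment"):
        # Visit time includes queueing, so it overstates treatment time.
        flags.append("includes_wait")
    return {**visit, "visit_minutes": minutes, "flags": flags}
```

For example, a transferred runner with check-in `10:00` and check-out `13:30` would be flagged as both `transferred` and `suspect_check_out`, so the record can be reviewed or excluded before visit time is correlated with severity.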
Another example from the Chicago Marathon concerns the patient’s chief complaint on entering an aid station for treatment. Prominent chief complaints include knee pain, blisters, and muscle cramps, as shown in the graph below.
From 2012 through 2014, this data was collected in an organized fashion as categorical responses. However, the 2011 data is free-form text and varies dramatically. While the 2011 entries contain text that would fall into one or more of the 2012 through 2014 categories, there is no direct match between the two data fields. Here, context provided by a medical professional is essential. By speaking with health professionals who were on site and with health professionals familiar with chief complaints resulting from running, the 2011 data was transformed into medically sound categories consistent with the remainder of the data.
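A transformation of this kind can be sketched as a reviewed keyword table applied to the free-form text. The keywords below are invented for illustration and are not the mapping used for the marathon data; in practice the table would be drafted and checked with the medical professionals. Only the three category names come from the complaints mentioned above.

```python
# Hypothetical keyword table; a real one would be reviewed by clinicians.
KEYWORD_TO_CATEGORY = [
    ("blister", "blisters"),
    ("cramp", "muscle cramps"),
    ("knee", "knee pain"),
]


def categorize(free_text):
    """Map a free-form chief complaint to one or more categories.

    Matching is a simple case-insensitive substring check. Entries that
    match no keyword are labeled "needs review" instead of being guessed.
    """
    text = free_text.lower()
    matches = []
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text and category not in matches:
            matches.append(category)
    return matches or ["needs review"]
```

Returning a list allows a single 2011 entry to fall into more than one category, and the explicit "needs review" label routes ambiguous entries back to someone with medical context.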
In the case of the Chicago Marathon, context is key to understanding the data and to applying appropriate methods for data cleansing, aggregation, and analysis. The lessons described above illustrate that without proper context about how the data is collected, researchers may make incorrect assumptions or present misguided insights. Context is important in data analysis and data cleaning for any data source, but it is especially crucial when working with field data.