
Common Consults: Indicator Variables from Multiple Response Survey Questions

A common question we receive from researchers is how to create a series of indicator variables (0/1 binary variables) from a single data column in which each row holds a variable-length list of responses. For example, taking data that looks like this:

id  age  fav_colors
10  21   blue,green
11  25   red,blue
12  30   purple
13  41   pink,blue,gray
14  40   NA

And turning it into data that looks like:

id  age  fav_colors      blue  green  red  purple  pink  gray  None
10  21   blue,green      1     1      0    0       0     0     0
11  25   red,blue        1     0      1    0       0     0     0
12  30   purple          0     0      0    1       0     0     0
13  41   pink,blue,gray  1     0      0    0       1     1     0
14  40   NA              0     0      0    0       0     0     1

This situation commonly arises in data sets containing answers to survey questions, but it can also occur in other contexts. Qualtrics survey questions with checkboxes for multiple answers, for example, will result in data of the format above.

There are multiple ways to achieve this data reformatting in both R and Python, but our recommended methods are below.

Python
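
As a minimal sketch of one way to do this in Python, the example below assumes the data is in a pandas DataFrame and uses `Series.str.get_dummies` to split the comma-separated values into indicator columns. The variable names (`df`, `indicators`, `result`) are illustrative, and the missing response is assumed to be stored as a true missing value rather than the literal string "NA".

```python
import pandas as pd

# Example data from above; the missing response is stored as None (NaN in pandas).
df = pd.DataFrame(
    {
        "id": [10, 11, 12, 13, 14],
        "age": [21, 25, 30, 41, 40],
        "fav_colors": ["blue,green", "red,blue", "purple", "pink,blue,gray", None],
    }
)

# One 0/1 column per distinct value found in the comma-separated lists.
# Rows with a missing response get 0 in every indicator column.
indicators = df["fav_colors"].str.get_dummies(sep=",")

# Flag respondents who gave no answer at all.
indicators["None"] = df["fav_colors"].isna().astype(int)

# Attach the indicators to the original data, keeping fav_colors in place.
result = pd.concat([df, indicators], axis=1)
print(result)
```

`str.get_dummies` orders the new columns alphabetically, so the column order will differ from the table above. If the responses have spaces after the commas (for example "blue, green"), the values would need to be cleaned first, or the indicator names will carry the extra whitespace.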

 

R

Note that the R method does not keep the original column in the data frame; it is consumed in the process of creating the indicators.

Questions?

Have a data question of your own? Schedule a consult with us. We can answer simple questions over email; otherwise we’ll set up a time for a video chat.