Online Learning Resources: Python pandas – Research Computing and Data Services Resources

This post is part of a series of posts on online learning resources for data science and programming.

By Antonio Nanni, Data Science Research Consultant

Pandas is the most popular Python package for handling and analyzing data. It mostly deals with tabular data (such as comma-separated values, or CSV, files) and implements a wide range of functions to help users select, manipulate and visualize the loaded data. Its powerful syntax can often reduce long for-loops or plotting code to just one line. One of the best features of pandas is that it works easily with other popular data-analysis packages, such as matplotlib (for plotting) or scikit-learn (for fitting machine learning models). In the galaxy of Python data-analysis packages, pandas sits right at the center, and many other packages revolve around it. Therefore, if you are building a Python pipeline for data analysis (or planning to), it is extra helpful to know how to use this package and what it can do for you.
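To make the "one line instead of a for-loop" claim concrete, here is a minimal sketch. The CSV content, the column names and the variable names are all made up for illustration; the point is only that the hand-written loop and the single groupby line compute the same per-city average.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example is self-contained.
csv_text = """city,year,temperature
Chicago,2019,10.5
Chicago,2020,11.0
Evanston,2019,10.0
Evanston,2020,10.4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Loop version: accumulate a sum and a count per city by hand.
totals = {}
counts = {}
for _, row in df.iterrows():
    city = row["city"]
    totals[city] = totals.get(city, 0.0) + row["temperature"]
    counts[city] = counts.get(city, 0) + 1
loop_means = {city: totals[city] / counts[city] for city in totals}

# pandas version: the same per-city average in one line.
mean_by_city = df.groupby("city")["temperature"].mean()

print(loop_means)
print(mean_by_city)
```

With a real file you would pass its path to pd.read_csv instead of the inlined text.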

As with other guides in this series, we’re focusing on resources that can be accessed for free by members of the Northwestern community, and we’re focusing on resources other than full-length online courses.

Getting Started

Intro to Pandas Data Structures
Greg Reda
If you are really new to pandas, this is the tutorial I would advise you to read first. It introduces the two most important objects in pandas (Series and DataFrames) very clearly, and it is a gentle first introduction to the heart of the package. As a bonus, the tutorial has two other parts where the author presents even more features by analyzing actual data, again very clearly. The tutorial pays particular attention to using pandas alongside SQL databases, but it is a very good introduction for everyone.
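As a quick taste of those two objects, here is a small sketch (not taken from the tutorial; the names and values are invented): a Series is a one-dimensional labeled array, and a DataFrame is a table whose columns are Series sharing the same index.

```python
import pandas as pd

# A Series: one-dimensional values with a label (index) for each entry.
ages = pd.Series([34, 29, 41], index=["ana", "ben", "cleo"], name="age")

# A DataFrame: a table whose columns share the same row labels.
people = pd.DataFrame(
    {"age": [34, 29, 41], "city": ["Chicago", "Evanston", "Chicago"]},
    index=["ana", "ben", "cleo"],
)

# Selecting a single column of a DataFrame gives back a Series.
age_column = people["age"]

print(type(age_column).__name__)
print(ages["ben"])
```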

10 Minutes to Pandas
When I approached pandas for the first time, everyone suggested I take a look at this super-popular tutorial. The tutorial covers a lot of ground and does a great job of showcasing the power of pandas. I should probably warn you that my completion time was actually longer than 10 minutes (but still short enough!). I personally felt the tutorial requires some minimal background knowledge of pandas (or NumPy). If you are completely new to pandas, I would suggest looking at it after some other introduction.

Getting Better

My number one piece of advice for getting better at pandas is to learn by doing. As mentioned, pandas is supported by many popular libraries, so it will naturally come your way as you write your code. However, there are some resources I would still like to point out.

Indexing and selecting data
Selection is where the powerful syntax of pandas really shines. Chances are you can select the rows and columns you need with just one line of code. For those of you experienced with R, the pandas selection syntax is close to the selection of rows and columns in R data frames. However, the syntax also has its subtleties. For example, is my selection returning a view or a copy? The official guide is really good at showing how to index data in pandas. I suggest you take the time to go through it once and then use it as a reference every time you have doubts; at least, this is how I use it. Because it is the official guide, this page is also updated regularly.
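A minimal sketch of what one-line selection looks like, using a made-up table (all names and values here are illustrative). It also shows one common way to sidestep the view-or-copy question: take an explicit copy when you want an independent subset, and assign through a single .loc call when you want to change the original.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["ana", "ben", "cleo", "dev"],
        "score": [88, 92, 75, 60],
        "passed": [True, True, True, False],
    }
)

# Label-based selection: filter rows with a boolean mask and pick columns,
# all in one .loc call.
high = df.loc[df["score"] > 80, ["name", "score"]]

# Position-based selection: first two rows of the first column.
first_two = df.iloc[:2, 0]

# An explicit copy is independent of df, so editing it never touches df.
subset = df.loc[df["passed"]].copy()
subset["score"] = subset["score"] + 5

# A single .loc assignment modifies the original DataFrame itself.
df.loc[df["name"] == "dev", "score"] = 65

print(high)
print(df)
```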

Working with Text Data
If you work with a lot of natural language data, you will certainly want to have a look at how to handle it in pandas. The way pandas treats strings changed as recently as January 29, 2020, when pandas 1.0.0 was released. That version introduced a brand new data type: StringDtype. As the name suggests, this data type was created to handle strings, which did not have a dedicated data type in previous versions of the package (they were stored under the generic object dtype). In my own experience, the lack of a dedicated dtype for strings created all sorts of strange behavior, warnings and awkward code. This is a very welcome addition, and I really recommend you catch up with the new data type by looking at the official user guide linked above.
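Here is a small sketch of opting in to the dedicated string type, assuming pandas 1.0 or later. The sample values are invented. Passing dtype="string" requests StringDtype, the .str accessor works as usual, and missing values show up as pd.NA.

```python
import pandas as pd

# Request the dedicated string dtype explicitly (requires pandas >= 1.0).
text = pd.Series(["Pandas ", "numpy", None], dtype="string")

# The usual .str methods apply; the missing entry stays missing.
cleaned = text.str.strip().str.lower()

# Methods that return True/False give a nullable boolean result,
# so the missing entry is reported as <NA> rather than False.
has_pandas = cleaned.str.contains("pandas")

print(text.dtype)
print(cleaned)
print(has_pandas)
```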

Quickstart tutorial to NumPy
The reason why pandas was not great at handling natural language data is that it relies heavily on NumPy. Let me be clear: NumPy is a great package, but it was not originally designed for strings. As the name suggests, this package is really excellent at handling numerical data: it is the backbone of pandas and many other data-analysis packages in Python. For this reason, it is very easy to use NumPy with data loaded with pandas. Knowing some NumPy will certainly enhance your understanding of pandas and, most likely, many other packages you are currently using. It will allow you to easily (and efficiently!) implement mathematical functions above and beyond what pandas already offers. This tutorial is a very good entry point to NumPy. It provides a good overview of the way NumPy handles data and how you can manipulate NumPy data in a vectorized form, without using loops.
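To illustrate how NumPy and pandas fit together, here is a short sketch with invented column names and values. NumPy functions can be applied directly to pandas columns, and to_numpy() hands you the underlying array when you want to work in NumPy itself; in both cases the computation is vectorized, with no explicit loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [3.0, 6.0, 8.0], "y": [4.0, 8.0, 15.0]})

# NumPy functions accept pandas columns and return a pandas Series,
# computed for every row at once.
df["distance"] = np.sqrt(df["x"] ** 2 + df["y"] ** 2)

# Drop down to a plain NumPy array when you need NumPy-only routines.
values = df[["x", "y"]].to_numpy()
row_norms = np.linalg.norm(values, axis=1)

print(df)
print(row_norms)
```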

Python for Data Analysis (2nd edition)
Wes McKinney
This book wraps it all up. It is really a great resource for data analysis in Python in general. It is especially good in the chapters about pandas since it was written by the original developer of pandas himself. The chapters are mostly stand-alone, so you don’t have to go through the entire book to really learn about pandas. The exposition is detailed and extensive (a lot of ground covered!), but still very accessible. This is highly recommended as a reference book.

Stuck?

If you have a question about pandas, don’t know what resource to start with, or need to learn something not covered above, remember you can always request a free consultation with our data science consultants. We’re more than happy to answer questions and point you in the right direction.