
Online Learning Resources: Parallel Processing with Python

This post is part of a series of posts on online learning resources for data science and programming.

By Raymond Wang, High Performance Computing Consultant
and
Alper Kinaci, Sr. Computational Specialist

Without question, Python is one of the most heavily used coding languages in many sectors, from academia to industry. Python is flexible and portable and offers an amazing number of packages to accomplish your tasks in a short time. By default, Python uses a single processor core with one thread to execute your task, which doesn’t take advantage of modern multi-processor and multithreading-enabled computers. In this post, we go beyond the basics and present resources that will help you accelerate your code using parallelization methods for Python. (If you’re just starting out with Python, I recommend checking out our blog post about online resources for learning Python.)

Parallel programming refers to developing code that can break a large computation into smaller pieces and execute those pieces simultaneously, allowing the parallelized code to utilize the full power of modern computers and shorten run time. Today, there are several parallel programming techniques that differ primarily in how hardware resources (such as memory, CPU cores, interconnects, etc.) are employed when parallel tasks are distributed and executed. Different as they are, these techniques share many similarities in how they divide the problem, execute the tasks, and finally gather the parallel pieces.

The resources below will guide you through several ways to effectively utilize your laptop, workstation, or even computer clusters for parallel Python tasks.

Getting Started

When learning parallel programming, you will frequently encounter terms such as multithreading and multiprocessing. For background, you may wish to read about the differences between multiprogramming, multitasking, multithreading, and multiprocessing, with examples drawn from the day-to-day operation of your computer.

In the next sections, we will look at applications of these parallel programming approaches in Python.

Multithreading

A thread is the smallest unit of parallel execution in a process. Threads can share the memory of their parent process, which accelerates thread-to-thread and thread-to-process communication. When you begin learning about parallel programming in any language, we recommend exploring whether multithreading is an option: multithreading methods offer acceleration in run time with few modifications to your code.

One caveat in Python is that, because of the Global Interpreter Lock (GIL), threads do not really execute Python code at the same time. However, you can still improve the speed of your code if you use threads for tasks that spend a significant portion of their run time waiting for data or other information. For instance, reading, writing, and communicating over the network can delay the normal data processing within a thread (i.e. operations that are I/O-bound). The Multithreading in Python article briefly explains threading, starting with the relevant terminology, illustrations, and code examples. To see threading in action, you can also watch the video tutorial by Corey Schafer.
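As a quick illustration, the toy script below (our own sketch, not taken from the linked article) uses the standard-library threading module to run five simulated I/O-bound tasks concurrently. Because each thread spends its time waiting rather than computing, the five tasks finish in roughly the time of one:

```python
import threading
import time

def download(name, results):
    """Simulate an I/O-bound task (e.g. a network request) with a sleep."""
    time.sleep(0.2)                      # the thread waits here, releasing the GIL
    results[name] = f"{name} done"

results = {}
start = time.perf_counter()

# Launch the "downloads" concurrently; while one thread waits on I/O,
# the others make progress.
threads = [threading.Thread(target=download, args=(f"task{i}", results))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.perf_counter() - start
print(f"finished {len(results)} tasks in {elapsed:.2f}s")  # ~0.2s, not ~1.0s
```

Run sequentially, the same five 0.2-second waits would take about a second; threaded, they overlap.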

Multiprocessing

A process is an instance of a program running on your computer. A process has its own memory space and is isolated from other processes. By distributing our Python tasks over processes, we can truly utilize our multi-core computer systems in parallel. While taking advantage of the parallelization, a programmer should also watch out for the overhead of slower inter-process communication and memory duplication in multiprocessing.

In contrast to threading, you will benefit from multiprocessing if your tasks involve heavy operations that keep the CPU busy (i.e. CPU-bound). Matrix multiplications, for instance, need a lot of CPU power. The Multiprocessing in Python article further explains the topic with basic examples. You can also watch the multiprocessing video tutorial by Corey Schafer for parallel image analysis.
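To make the contrast concrete, here is a minimal sketch of ours (not from the linked tutorial) that splits a CPU-bound sum of squares across four worker processes using the standard-library multiprocessing module:

```python
import multiprocessing as mp

def partial_sum(lo, hi, out):
    """CPU-bound work: sum of squares over a sub-range."""
    out.put(sum(i * i for i in range(lo, hi)))

# Split the range 0..1_000_000 into 4 chunks, one per process.
# (On Windows and macOS, wrap this section in `if __name__ == "__main__":`.)
n, nproc = 1_000_000, 4
chunk = n // nproc
queue = mp.Queue()
procs = [mp.Process(target=partial_sum, args=(i * chunk, (i + 1) * chunk, queue))
         for i in range(nproc)]
for p in procs:
    p.start()

# Drain the queue before joining so a full pipe cannot block a child.
total = sum(queue.get() for _ in range(nproc))
for p in procs:
    p.join()

print(total)
```

Each process computes its partial sum independently on its own core; the Queue is the inter-process communication channel that carries the results back.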

Getting Better

The resources introduced so far in this article cover basic multiprocessing syntax. The next step is a deep dive into the Python reference for the native multiprocessing package. For beginners, it is highly recommended to start with the Pool class for data parallelism. The Python Multiprocessing Module with Examples blog post introduces more complicated control algorithms that use the metadata of your parallel processes and also explains how to synchronize processes using locks.
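As a minimal illustration of the data-parallel pattern the Pool class supports, the sketch below maps a function over a list of inputs across four worker processes; Pool.map handles the splitting, scheduling, and gathering for you:

```python
from multiprocessing import Pool

def cube(x):
    """Per-item work to be distributed across the pool."""
    return x ** 3

# Pool.map chops the iterable into chunks, hands them to worker
# processes, and returns the results in the original order.
# (On Windows and macOS, guard this with `if __name__ == "__main__":`.)
with Pool(processes=4) as pool:
    results = pool.map(cube, range(10))

print(results)  # [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
```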

If you are a Pandas user, the Pandas Functions in Parallel blog post will be interesting for you. The author shows how to split your dataframe, operate on the parts in parallel, and reassemble the data by using the Pandas and multiprocessing packages together.
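The split/apply/reassemble pattern described above can be sketched roughly as follows; the dataframe, its columns, and the summarize helper are our own illustrative inventions, not the blog post's code:

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd

def summarize(part):
    """Work applied independently to one chunk of the dataframe."""
    part = part.copy()
    part["ratio"] = part["a"] / part["b"]
    return part

df = pd.DataFrame({"a": np.arange(1, 101), "b": np.arange(101, 201)})

# Split the dataframe into 4 contiguous pieces, process each piece in
# its own process, then reassemble the results with concat.
nproc, chunk = 4, len(df) // 4
parts = [df.iloc[i * chunk:(i + 1) * chunk] for i in range(nproc)]
with Pool(nproc) as pool:
    df_out = pd.concat(pool.map(summarize, parts))

print(df_out.shape)  # (100, 3)
```

Note that each chunk is pickled and copied into its worker process, so this pays off only when the per-chunk work outweighs that copying overhead.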

MPI for Python

Message Passing Interface (MPI) is the powerhouse standard when it comes to developing codes that can use thousands of computers simultaneously. The majority of resources available at high performance computing centers around the world are utilized by MPI-based software. Compared to other methods, MPI offers great flexibility to the programmer because you can directly control the process scheduling and data communication. However, this explicit parallelization may cause the programmer to spend more time on coding.

If you are already familiar with parallel programming using compiled languages (e.g. C++, Fortran, etc.) or need to scale your code up significantly, the MPI for Python (mpi4py) package will provide you with most of the capabilities of MPI. To practice some of the basic MPI concepts with Python, I recommend Parallel Programming with MPI for Python tutorial from Columbia University.

Scale-up Data Analytics Using Apache Spark

Python is the go-to language for big data processing (e.g. Pandas). However, the packages for data analytics are typically not designed to scale beyond a single machine. Distributed data-processing approaches can address this issue and allow working with large data sets. Apache Spark is one of the solutions to this problem. Spark allows you to make some minor changes to your existing Python workflow while significantly speeding up the processing time. Under the hood, the algorithms are similar to MPI, but Spark developers intelligently hide most of the details from the programmer. Thus, we can focus on the high-level tasks rather than coding explicit parallelization.

To explore Spark for Python, DataCamp presents a beginner’s guide. If you are interested in more hands-on experience and coding samples for distributed data analysis, I recommend the Apache Spark with Python tutorial.

Dask

Dask is a lightweight Python package for parallel computing. Your Python program may have its own ecosystem, using packages such as NumPy, Pandas, or Scikit-Learn. Dask uses the existing APIs and data structures of those packages to provide an easy transition to parallel computing. If you are planning to modify your code to take advantage of parallel computing but cannot decide between Dask and Spark, we recommend checking out this guidance from the Dask developers.
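As a small taste of what that transition looks like, the sketch below (our own example) uses dask.delayed to build a task graph lazily and then executes it in parallel with a single call to .compute():

```python
import dask

@dask.delayed
def square(x):
    """Deferred per-item work; nothing runs until .compute() is called."""
    return x ** 2

@dask.delayed
def total(xs):
    """Deferred reduction over the squared values."""
    return sum(xs)

# Calling the decorated functions only records tasks in a graph;
# .compute() hands the graph to a scheduler that runs independent
# tasks (here, the ten squares) in parallel.
result = total([square(i) for i in range(10)]).compute()
print(result)  # 285
```

Swapping the decorator off again recovers plain serial Python, which is exactly the low-friction adaptation Dask is designed for.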

An introductory tutorial by Steven Gong summarizes some of the technical details you will need before parallelizing your Python code with Dask.

If you are already familiar with high performance computing clusters (such as Quest at Northwestern), Dask on HPC tutorial by Matthew Rocklin covers deploying Dask jobs on a supercomputer.

Stuck?

If you have a parallel Python question, don’t know what resource to start with, or need to learn something not covered above, you can always request a free consultation with our consultants. We’re more than happy to answer questions and point you in the right direction.