Resource Guide: Clustering | Research Computing and Data Services Resources

Last Updated: January 2024

Clustering, or Cluster Analysis, is a machine learning method in which data points are grouped together (or clustered, as the name suggests) according to some notion of similarity. Similarity is often based on underlying geometric, topological and/or statistical assumptions. For example, spatial observations may use Euclidean distance, document clustering may use cosine similarity, and networks may use path distance.
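To make the three notions of similarity above concrete, here is a minimal sketch using only the Python standard library. The function names and the toy graph are illustrative assumptions, not taken from any of the resources listed in this guide.

```python
import math
from collections import deque


def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def path_distance(graph, start, goal):
    """Number of edges on a shortest path in an unweighted graph (breadth-first search).

    `graph` maps each node to a list of its neighbors. Returns math.inf if
    `goal` cannot be reached from `start`.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return math.inf


# Hypothetical toy inputs, for illustration only.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
print(euclidean_distance((0, 0), (3, 4)))
print(cosine_similarity((1, 2), (2, 4)))
print(path_distance(toy_graph, "a", "c"))
```

Note that Euclidean and path distance are dissimilarities (smaller means more alike), while cosine similarity runs the other way (larger means more alike), so a clustering routine needs to know which convention it is given.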

The main assumption is that the data are in fact structured in clusters, even if you do not know what those clusters are. Once the clusters are learned, we expect data points to be, on average, significantly more similar within a cluster than between clusters. Because cluster membership is generally not known in advance, clustering belongs to the family of unsupervised learning algorithms. The number of clusters is usually unknown as well, although there are heuristics for choosing it.
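One common heuristic for choosing the number of clusters is the silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below assumes scikit-learn is installed and uses synthetic data with three hand-picked, well-separated centers. The data, the range of candidate values, and the variable names are all illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data drawn around three well-separated centers (toy example).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for several candidate cluster counts and record the mean
# silhouette score of each fit (values lie in [-1, 1]; higher is better).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the candidate with the highest score.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("chosen k:", best_k)
```

On toy data this clean, the score should peak at three clusters. On real data the peak is often less clear, so treat the result as one piece of evidence rather than a definitive answer.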

The best-known clustering algorithm, and possibly the most widely used, is k-means clustering. It can serve as your starting point. Below are some general resources on clustering, along with some specific to k-means.
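For readers who want to see what k-means does before turning to the resources, here is a minimal sketch of the standard iterative procedure (often called Lloyd's algorithm) in plain Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function name, the random initialization, and the toy data are illustrative assumptions. Library implementations such as scikit-learn's add better initialization and multiple restarts.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into `k` groups.

    Returns (labels, centroids), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Two obvious groups of four points each (toy data, for illustration only).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so compare cluster memberships rather than the label values themselves.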

Learning the Theory and Method

A review of clustering techniques and developments
Saxena et al.
This paper provides a well-organized review of different clustering algorithms and their relationship to optimization.

Cluster Analysis
Brian S. Everitt, Sabine Landau, Morven Leese, and Daniel Stahl
This book provides a practical text on cluster analysis and its potential applications in a wide range of disciplines (e.g., medicine, psychology, market research, and bioinformatics).

The Elements of Statistical Learning: Data Mining, Inference and Prediction
Trevor Hastie, Robert Tibshirani, and Jerome Friedman
The chapter entitled “Unsupervised Learning” provides a more advanced overview of unsupervised machine learning, including clustering. The chapter also concludes with a series of exercises to test your understanding of unsupervised learning.

Implementing in Python

Google’s mini-course on clustering
This mini-course requires some time commitment (about 4 hours), but it is a good option if you will be using clustering extensively. It requires some knowledge of linear algebra and the basics of Python programming.

scikit-learn’s clustering guide
A high-level overview of a variety of clustering methods and how to implement them using the scikit-learn machine learning library for Python. If you want to learn about different methods for clustering and can’t commit much time, this is a good starting point, even for non-Python users.

Real Python’s k-means clustering practical guide
A very practical and quick guide to k-means clustering in Python. A great place to get started if you are a Python user.

Implementing in R

K-Means Clustering in R (Julia Silge)
This tutorial explains how to implement k-means clustering in R. A great place to get started if you are an R user. For additional practice using k-means in R, Silge has written another tutorial as well.

DataCamp’s kmeans tutorial
Teaches the basics of using R’s kmeans() function.