Students learn about unsupervised machine learning and K-means clustering through a paper-based activity in which they cluster different types of colorful monster characters. The activity also supports math education objectives: reading and representing data on a graph (the Cartesian plane).
Key Vocabulary
- supervised learning: In supervised learning, the computer is given both input data and correct output data. It learns to make predictions or decisions based on this labeled data.
- unsupervised learning: Unsupervised learning is when a computer learns from data without being told what to look for. It discovers patterns and relationships all on its own.
- k-means clustering: K-means clustering is a method computers use to sort data into K groups of similar items. It’s like sorting marbles into different boxes based on their similarities.
- hierarchical clustering: Hierarchical clustering organizes data into a tree-like structure based on how similar the data points are to each other. It’s like organizing animals into groups and subgroups based on shared characteristics.
Activity
(K-means clustering)
Pretend you are a computer that needs to learn how to sort monsters into groups! Take a look at your monster bank. Choose two features that you will focus on. Some possibilities are: number of eyes, number of teeth, number of feet, color, shape, etc.
Taking a sheet of paper, draw a graph. Make sure to label the X and Y axes with the features you chose. For example, if you chose eyes, label with the number of eyes, or if you chose color, label from warm to cool, etc. There is no independent or dependent variable.
Place the stars on random spots on your graph. The number of stars you place is equivalent to your K value. The recommended starting number is 3 stars, or K = 3. For more of a challenge, increase the value of K!
Place each monster from the monster bank on the graph you drew, using your best judgment about where it belongs. If a monster lands on top of a star, move the star so that it sits on top of the monster.
Now, taking a look at one monster at a time, figure out which star each monster is closest to. Using a ruler, measure the distance between each monster and each star on the graph, taking note of which star is closest for each monster.
For each star, form a group by looping a string closely around the star and the monsters that are closest to it.
Check! You should have the same number of loops as you do stars, or K.
Now pick up and place the stars in the middle of each loop.
Observe! Take a look at each cluster and the monsters inside of it. Take a picture if it will help you remember better.
Pick the strings up off the graph.
Now, check each monster again. Which star is each monster closest to?
Once you have recalculated the closest star to each monster, loop the string around the closest monsters to each star to form the clusters again. Note: the clusters could be the same as last round, or different!
Again, shift the star to the center of the cluster.
Now keep repeating the process until the stars no longer move and no monster changes clusters.
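For readers who want to see the same loop as a program, here is a minimal Python sketch of the activity, not a definitive implementation. It assumes each monster is described by two numbers (the monster list below is made up, with eyes on the X axis and teeth on the Y axis) and that distance is the straight-line distance you would measure with a ruler. The function names `closest_star` and `k_means` are illustrative only.

```python
import math
import random


def closest_star(monster, stars):
    """Return the index of the star nearest to one monster (the ruler step)."""
    distances = [math.dist(monster, star) for star in stars]
    return distances.index(min(distances))


def k_means(monsters, k, seed=0, max_rounds=100):
    """Cluster (x, y) monsters around k stars, following the paper activity."""
    rng = random.Random(seed)
    xs = [m[0] for m in monsters]
    ys = [m[1] for m in monsters]
    # Place K stars on random spots within the area covered by the monsters.
    stars = [
        (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        for _ in range(k)
    ]
    for _ in range(max_rounds):
        # Loop a string around the monsters that are closest to each star.
        labels = [closest_star(m, stars) for m in monsters]
        # Move each star to the middle (average position) of its loop.
        new_stars = []
        for i in range(k):
            group = [m for m, label in zip(monsters, labels) if label == i]
            if group:
                mean_x = sum(x for x, _ in group) / len(group)
                mean_y = sum(y for _, y in group) / len(group)
                new_stars.append((mean_x, mean_y))
            else:
                # Assumption: a star with no monsters simply stays where it is.
                new_stars.append(stars[i])
        if new_stars == stars:
            break  # The stars did not move, so the clusters are final.
        stars = new_stars
    # Final check of which star each monster is closest to.
    labels = [closest_star(m, stars) for m in monsters]
    return stars, labels


# Made-up monster bank: (number of eyes, number of teeth).
monsters = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 9), (2, 9), (1, 8)]
stars, labels = k_means(monsters, k=3)
for monster, label in zip(monsters, labels):
    print(monster, "belongs to star", label)
```

Because the stars start on random spots, a different `seed` may end with different clusters, just as two students who start with their stars in different places may.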
Reflection Questions
- Imagine you are sorting monsters by number of eyes and number of teeth, with one feature on each axis. How many clusters would you have if the monster dataset were made up of only monsters with two eyes and two teeth?
- Building on the question above, if a unique monster with three eyes and four teeth were added to the dataset, and you could not re-cluster, which cluster would the new monster be put into?
- Now, think about facial recognition technology. What might happen if it were trained on only one kind of face, and therefore had only one cluster?
- Why is it important that facial recognition technology can recognize a diverse array of faces?
- What are some real-world examples where you think clustering AI might be helpful?
- What could happen if the AI makes a mistake while clustering? How might this affect its decisions?
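One way to explore the question about adding a new monster without re-clustering is to treat it as a "nearest star" lookup. The short Python sketch below assumes the clusters are already fixed and are each summarized by their star's position; the function name `assign_to_existing_cluster` and the data are hypothetical.

```python
import math


def assign_to_existing_cluster(new_monster, stars):
    """Return the index of the existing star closest to the new monster."""
    distances = [math.dist(new_monster, star) for star in stars]
    return distances.index(min(distances))


# Hypothetical setup: every monster had two eyes and two teeth,
# so there is a single star sitting at (2, 2).
stars = [(2.0, 2.0)]
new_monster = (3, 4)  # three eyes, four teeth

cluster = assign_to_existing_cluster(new_monster, stars)
print("New monster joins cluster", cluster)  # prints: New monster joins cluster 0
```

With only one star, every new monster lands in the same cluster no matter how different it looks, which is the situation the facial recognition questions ask you to think about.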
Real World Implications
- To keep junk email out of your main inbox, email providers use algorithms. The purpose of these algorithms is to correctly flag each email as spam or not spam.
- K-means clustering has proven to be a good way of identifying spam. The algorithm looks at the different sections of each email (header, sender, and content), and the emails are then grouped together based on that data.
- These groups can then be classified to identify which ones are spam. Including clustering in the classification process has been reported to improve the accuracy of the filter to 97%.