NER Wars: Evaluating Named Entity Recognition Methods on the Star Wars Original Trilogy

Macon McLean ’16

Like all people with good taste[1], I have long been a fan of the Star Wars film series.  Ever since witnessing the theatrical re-release of the holy triptych when I was a kid, I have marveled at the way George Lucas invented a living, breathing universe for viewers to enjoy.

Part of making such a universe feel fully-realized is developing a unique vocabulary for its characters to use, and the Star Wars movies pass this test with flying colors.  The evidence is that nearly everybody who’s witnessed one of these adventures from a galaxy far, far away remembers the names of “Luke Skywalker” and “Darth Vader”, and can describe a “Jedi” or “the Force” with ease.  These four examples are some of the Star Wars universe’s named entities, words or phrases that clearly describe certain concepts in a way that differentiates them from other concepts with similar attributes.

Named entities are the raw material of named entity recognition (NER), the process of automatically identifying and classifying such entities in a given corpus.  This process can be used in creative and meaningful ways, like examining locations in nineteenth-century American fiction[2] for an analysis of named locations in literature[3].  It is often used as part of relation extraction, in which computers pore through large volumes of unstructured text information to identify possible relationships and record them in a standardized, tabular form.  NER classifiers are computational models trained to support these efforts, minimizing the need for manual tagging by humans and preparing the data for inference.  Consider the following sample sentence:

“Steve Spurrier played golf with Darius Rucker at Kiawah Island in May.”

Using just a few simple classes (PERSON, LOCATION, DATE) we can examine each word in the sentence and tag the named entities as follows (commas inserted):

“PERSON, PERSON, O, O, O, PERSON, PERSON, O, LOCATION, LOCATION, O, DATE.”

Tagged sentences like this one can then be used to train an NER classifier, helping it learn when and how to apply those three types of tags.

The downside of NER is that classifiers trained on the vocabulary of one domain do not generalize well to others.  This makes sense, as the conditional random field[4] (CRF) at the heart of the classifier is a context-based method.  The contexts surrounding named entities of chemistry differ significantly from those of named entities in economics, or Chinese literature, or Star Wars, so the CRF will train differently for each domain.

After learning all this, I began to imagine other uses for NER classifiers.  One idea was to use it as a means by which to compare the vocabulary of a specific text with similar works, as a very specific kind of lexical similarity measure.  Since these classifiers are so brittle, it would work only within a specific domain, and could only make very fine distinctions.  After an exhaustive[5] search of the top journals, I did not find any examinations of a Star Wars-specific NER classifier.  So obviously it became my mandate to train an NER classifier for the very necessary task of Star Wars named entity recognition via Stanford’s CoreNLP suite.

To train such a classifier requires a significant amount of legwork on the front end.  You must create a training corpus with each word tagged as part of a class.  For this research, I used the full scripts from Star Wars, The Empire Strikes Back, and Return of the Jedi.  My corpus was tagged with eleven classes:  other, person, location, duration, misc, date, number, organization, ordinal, time, and percent.

The procedure was fairly simple (albeit time-consuming).  I first created a Python script that augmented the out-of-the-box NER classifier in the Stanford CoreNLP suite with user-input dictionaries.  Ten of the above classes are standard, and I added the misc class to cover anything Star Wars-related that did not fit adequately in the pre-existing set of classes (e.g. vehicles, alien race names).  Then I manually inspected each script and developed a few rules to keep tagging consistent across the three scripts[6].  After some final formatting, the training files were ready to go.
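The tagging script itself isn't included in this post; a minimal sketch of the dictionary-based approach might look like the following, where the file names, dictionary contents, and tokenizer are illustrative assumptions (CoreNLP's CRFClassifier trains on exactly this kind of tab-separated word/tag file):

```python
# Sketch: convert a raw script into the tab-separated word/tag format that
# Stanford CoreNLP's CRFClassifier expects for training. The dictionaries and
# file names below are illustrative placeholders, not the author's actual code.

import re

user_dictionaries = {
    "PERSON": {"luke", "vader", "leia", "han", "chewbacca"},
    "LOCATION": {"tatooine", "hoth", "dagobah", "endor"},
    "ORGANIZATION": {"empire", "rebellion"},
    "MISC": {"wookiee", "x-wing", "lightsaber"},
}

def tag_token(token):
    """Return the class for a token, defaulting to 'O' (other)."""
    lowered = token.lower()
    for label, words in user_dictionaries.items():
        if lowered in words:
            return label
    return "O"

with open("empire_strikes_back.txt") as fin, open("esb_train.tsv", "w") as fout:
    for line in fin:
        for token in re.findall(r"[\w'-]+", line):
            fout.write(f"{token}\t{tag_token(token)}\n")
        fout.write("\n")   # blank line between lines of the script
```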

Using the CoreNLP training commands, I fitted a classifier to the text from both the original Star Wars (SW) and The Empire Strikes Back (ESB) for testing on Return of the Jedi (RotJ), then did this two more times for the other two possible configurations.  I also ran the standard classifier on each script as a baseline for comparison.  It turns out that in this case (perhaps due to small sample size), the out-of-the-box standard classifier tagged my corpus with only three classes (location, person, organization), so I have broken those results out below.  Each classifier is identified by the film it is tested on:

 

As expected, the trained classifier outperformed the standard one in each case.  I have omitted the full table because the trained classifier was also better in each subcategory in each case.  The average increase in precision was 18.84%, in recall 63.77%, and in F1 55.32%[7].
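As a quick, concrete companion to footnote 7, here is a tiny worked example with invented counts:

```python
# Toy illustration of the three metrics reported above; the counts are invented.
true_positives = 45    # entity words tagged with the correct class
false_positives = 5    # words tagged as entities (or with the wrong class) that shouldn't have been
false_negatives = 15   # entity words the classifier missed

precision = true_positives / (true_positives + false_positives)   # 0.90
recall = true_positives / (true_positives + false_negatives)      # 0.75
f1 = 2 * precision * recall / (precision + recall)                # ~0.818

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.3f}")
```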

Now let’s compare the performances of each of the three trained classifiers to one another.  First, here’s the full breakdown:

And now a graphical comparison of overall performance:

Finally, the same chart as above, but only using the four most domain-specific categories:

The model trained on Eps. IV and V and tested on Return of the Jedi leads the pack in precision, recall, and F-score both across all categories and on the subset of domain-specific terms.  In fact, recall and F1 increase as we move forward in time with regard to the movie we test on (precision is actually second-highest for the model trained on V and VI).  Precision is very good, with all but two figures higher than 80% (a decent helping of ones on the best model) and above 90% in the subset.  Recall is also fairly impressive (except in location and duration, and only in the few cases where the number of relevant words was less than twenty), and above 80% across the board in the subset.

As good as they were, the preliminary results did have me somewhat concerned, so I decided to do some further investigation.  I wanted to make sure that the model wasn’t simply regurgitating different words it had learned during training and just missing the ones that weren’t there.  I needed to make sure that this model actually learned.

So I manually inspected the corpus again.  According to my tags, Return of the Jedi has exactly thirteen instances of unique person-class words not seen in Empire or Star Wars.  If it was just tagging previously learned words and missing the new, heretofore unseen terms, we would expect the total number of false negatives (missed words) to be at least thirteen in number.  However, there are only eight, so we have good evidence that the model is really learning – at least five of the thirteen new words are being picked up by the trained classifier.

Repeating this analysis, there are thirty-one unique person-word instances in ESB and forty-one in Star Wars, and in both cases the false negatives fall below those counts, so there is no pure regurgitation there either.  However, when I looked at location and duration, the two fields that actually suffer from high precision and low recall, I saw that there is a chance the model is simply repeating learned words.  Although false negatives are less than or equal to the number of unique instances in most cases, the false negatives exceed my tabulated number of unique location-words by one for Empire, and of unique duration-words by two for Return (though both of these suffer from notably low sample sizes).
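For the curious, this spot-check is easy to script; the sketch below assumes the tagged films are stored in the same two-column word/tag format used for training, with placeholder file names:

```python
# Sketch of the regurgitation check: are the person-class words in the test
# film a subset of those seen in training? File names and format are assumptions.

def person_words(path):
    """Collect the set of words tagged PERSON in a two-column word<TAB>tag file."""
    words = set()
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2 and parts[1] == "PERSON":
                words.add(parts[0].lower())
    return words

train_words = person_words("sw_train.tsv") | person_words("esb_train.tsv")
test_words = person_words("rotj_train.tsv")

unseen = test_words - train_words
print(f"{len(unseen)} person-class words in RotJ never appear in the training films")
# If the classifier were only memorizing, its false negatives would have to
# include every one of these unseen words.
```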

The bottom line?  Star Wars stands out as the film with the named entities least recognizable by a classifier trained on its two sequels, according to F1-score.  Empire is pretty close behind, and Return’s named entities are most easily classifiable by a longshot.  This is true both across all categories and the four-class subset.

But why?  One hypothesis is that Star Wars has significantly more named entities than the other two, which results in more chances for misclassification.  However, this fails to account for the similar performance of Empire.  Another (and in my opinion the more likely) explanation is that it is an artifact of the order of the films.  When the classifier has trained on Star Wars and Empire, it can refer to any context learned in those films when studying Return; this is ostensibly far more useful for classification than when the converse model refers to temporally subsequent contexts to classify terms in Star Wars.  Of course, there could very well be a third explanation I’m missing entirely.

One way to help decide this is to expand this analysis to the prequel trilogy.  Unfortunately, I’m not sure I can spend the requisite hours tagging the script for Episode II without losing my mind[8].  Until I can, I’ll just have to try to tap into the Force to figure it out.

 

[1] Though this is the subject of some debate, let’s just treat it as fact for this exercise.

[2] A pseudo-tutorial by the textual geographies people:  https://blogs.nd.edu/wilkens-group/2013/10/15/training-the-stanford-ner-classifier-to-study-nineteenth-century-american-fiction/

[3] The greater textual geographies project:  http://txtgeo.net/

[4] For a solid introduction to conditional random fields, check out this blog post by Edwin Chen:  http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

[5] The search was admittedly less than exhaustive.  But I doubt there’s anything else out there.

[6] I found myself thinking I could write a whole paper about this task alone.  Whenever possible I tried to use common sense.  This is a nuanced task and instituting different rules might well have yielded somewhat different results.

[7] For an explanation/refresher on the differences between precision, recall, and F1 score, it’s hard to beat the Wikipedia page:  https://en.wikipedia.org/wiki/Precision_and_recall

[8] Episode II is the weakest link.  Search your feelings, you know it to be true!

Learning to Fight: Deep Learning Applied to Video Games

Dr. Ellick Chan

Artificial Intelligence (AI) is commonly used in video games to control non-human “computer” players. The AI used in video games often relies only on simple logic and heuristics, so “computer” players do not always have human-like playstyles. This is unfortunate because games that feature human-like AI could be very popular amongst both recreational and competitive gamers. AI trained to play like specific humans could be particularly appealing. Imagine the headlines to market such games: “Play Grand Theft Auto cooperatively with AI trained by celebrity Snoop Dogg” or “Fight against AI trained by World Champion Player Mew2King”.

Inspired by the work of Stanford PhD students Chen and Yi [1], MSiA students Dustin Fontaine ’17, Dylan Fontaine ’17, Annie Didier ’17, and Michael Cho ’17 set out to use Deep Learning to create a human-like video game AI. Their approach relied on image-classification instead of reinforcement learning which is more commonly used with video games. The team started by recording a player (Dustin) playing the game “Rumblah: Flash Fighting Engine”, a simple fighting game where two players move around the screen and hit each other until one gets knocked out. The team recorded approximately 1 hour of gameplay, capturing 10 screenshots per second and labelling screenshots according to which key Dustin was pressing while the screenshot was taken. The four possible labels for screenshots were “Move Left”, “Move Right”, “Punch”, and “Do Nothing” (if Dustin was not pressing any keys). Some example screenshots are shown below. Dustin was controlling the green character.
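The post does not include the capture code, but a data-collection loop along these lines would do the job; the library choices (mss, keyboard), key bindings, and file layout are assumptions rather than the team's actual setup.

```python
# Rough sketch of the data-collection loop: grab roughly 10 screenshots per
# second and label each one with whatever key is held down at that moment.
# Library choices, key bindings, and file layout are assumptions.

import csv
import os
import time

import keyboard    # pip install keyboard
import mss
import mss.tools

LABELS = [("left", "Move Left"), ("right", "Move Right"), ("z", "Punch")]

os.makedirs("frames", exist_ok=True)
with mss.mss() as sct, open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    monitor = sct.monitors[1]      # whole primary screen; crop to the game window in practice
    for i in range(36000):         # about an hour at 10 frames per second
        shot = sct.grab(monitor)
        label = next((name for key, name in LABELS if keyboard.is_pressed(key)), "Do Nothing")
        path = f"frames/frame_{i:05d}.png"
        mss.tools.to_png(shot.rgb, shot.size, output=path)
        writer.writerow([path, label])
        time.sleep(0.1)
```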

The next step towards creating the AI involved fitting a neural network to the collected screenshots. If the team could build a model that could accurately label new screenshots according to what key Dustin would press in that scenario, it could be used to control an AI that plays the game like Dustin. They started by fitting a sequential neural network to the screenshots and then fitting a more robust convolutional neural network. The convolutional model performed better than the sequential model, achieving nearly 60% accuracy and 80% top-2 class accuracy. This may seem low, but it is not bad when you consider the randomness of human behavior and the fact that Dustin does not always press the same keys in a given situation.
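The team's architecture is not given in the post; the following is a minimal Keras-style sketch of a small convolutional classifier for the four classes, with the input size, layer sizes, and training settings chosen arbitrarily.

```python
# Minimal sketch of a small convolutional classifier for the four key-press
# classes. Input size, layer sizes, and training settings are arbitrary
# assumptions, not the team's actual architecture.

from tensorflow.keras import layers, metrics, models

model = models.Sequential([
    layers.Input(shape=(120, 160, 3)),           # downscaled RGB screenshot
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),       # Move Left, Move Right, Punch, Do Nothing
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", metrics.SparseTopKCategoricalAccuracy(k=2)],  # top-2 accuracy
)
# model.fit(X_train, y_train, validation_split=0.2, epochs=20, batch_size=32)
```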

In order to understand why the neural network made the classifications that it did and what parts of each screenshot the model was focusing on, the team used a visualization technique involving heatmaps. The white pixels in each heatmap were given large weight by the model when it was making its classification and the black pixels were given little (or no) weight. See the example heatmaps below. It is interesting to see that the models focused on the same parts of the screen that a human does when playing. In the screenshot classified as “Move Right”, the focus is on the head of each player. This makes sense, because a player considers the relative positions of their character and the opponent when deciding to move left or right – if the opponent is out of attack range then the player will move towards them. In the screenshot classified as “Punch”, the focus is again on the heads of both players and also the flash from a previous punch. Since the opponent is within the green character’s attack range, the model makes the correct decision to throw a punch.  Most of the time the model focused on the heads of each player, but sometimes other information was considered. In the screenshot classified as “Do Nothing” you can see that the combo meter at the top-right of the screen was given some weight by the model.
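The post does not name the visualization technique the team used; one common option for this kind of heatmap is a gradient-based saliency map, sketched here purely for illustration.

```python
# One common way to produce this kind of heatmap is a gradient-based saliency
# map: how much each input pixel influences the predicted class score. The
# team's actual technique is not specified, so treat this as illustrative.

import numpy as np
import tensorflow as tf

def saliency_map(model, image):
    """image: float32 array of shape (H, W, 3), preprocessed like the training data."""
    x = tf.convert_to_tensor(image[np.newaxis])       # add a batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)
        top_class_score = tf.reduce_max(preds[0])     # score of the predicted class
    grads = tape.gradient(top_class_score, x)         # d(score) / d(pixel)
    heat = tf.reduce_max(tf.abs(grads), axis=-1)[0]   # collapse the color channels
    heat = heat / (tf.reduce_max(heat) + 1e-8)        # normalize to [0, 1]; white = high influence
    return heat.numpy()
```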

A human player considers more information than just the current screenshot when deciding which button to press, so shouldn’t a model trying to imitate human behavior do the same? After fitting the sequential and convolutional models to single images, the team decided to fit the same models to “filmstrips” of four consecutive images. “Filmstrips” capture information from the last few moments of gameplay instead of just the current moment. This technique was successful and the filmstrip models performed slightly better than the original models.
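One simple way to build such a filmstrip is to tile the last four frames into a single 2×2 image; the layout below, with the most recent frame in the bottom-right, is inferred from the heatmap description in the next paragraph, so treat it as an assumption.

```python
# Sketch of building a "filmstrip" input by tiling the last four frames into a
# 2x2 grid, with the most recent frame in the bottom-right.

import numpy as np

def filmstrip(frames):
    """frames: list of the last 4 frames, oldest first, each an (H, W, 3) array."""
    assert len(frames) == 4
    top = np.concatenate(frames[:2], axis=1)       # two oldest frames, side by side
    bottom = np.concatenate(frames[2:], axis=1)    # newest frame ends up bottom-right
    return np.concatenate([top, bottom], axis=0)   # shape (2H, 2W, 3)
```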

The heatmaps produced by the filmstrip models make a lot of sense. More weight was given to pixels on the screenshot representing the current moment (bottom-right) and less weight was given to pixels on the screenshots taken prior to the current moment.

Finally, after fitting various neural networks to recordings of Dustin’s gameplay, the team needed to operationalize their models to control an AI in the game. The AI would be controlled by a script that records screenshots of live gameplay, scales the screenshots to the appropriate size, plugs them into a neural network, then takes the output from the model (which key to press) and emulates keystrokes on a keyboard. The diagram below represents this process.
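A sketch of what that loop might look like follows; the library choices, input size, and key mapping are assumptions, not the team's script.

```python
# Sketch of the live-play loop: capture the screen, scale it to the model's
# input size, predict a key, and emulate that keystroke.

import numpy as np
import keyboard
import mss
from PIL import Image

KEYS = ["left", "right", "z", None]    # order must match the model's class order

def play(model, steps=600):
    with mss.mss() as sct:
        monitor = sct.monitors[1]      # crop to the game window in practice
        for _ in range(steps):
            shot = sct.grab(monitor)
            img = Image.frombytes("RGB", shot.size, shot.rgb).resize((160, 120))
            x = np.asarray(img, dtype="float32")[np.newaxis] / 255.0
            key = KEYS[int(np.argmax(model.predict(x, verbose=0)))]
            if key is not None:
                keyboard.press(key)    # emulate the chosen keystroke
                keyboard.release(key)
```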

The team was worried that a script of this sort would not be fast enough to keep up with live gameplay. However, speed turned out not to be an issue. The Python script was able to execute in real time on a standard personal laptop, processing screenshots and emulating keystrokes in a fraction of a second.

A video of the team’s AI in action is shown below. The blue player is controlled by a human and the green player is controlled by the team’s AI. You can see that the AI plays well, chasing the blue player around the screen and punching him repeatedly when he is within attack range. Although the AI does not perfectly copy Dustin’s playstyle, this demo serves as a good proof-of-concept for using image classification models to train video game AI. More training data, further tuning, and providing more information as inputs to the model (such as the time series of previous keys pressed) could make the model more accurate. The same framework could be applied to any action game that does not involve complex strategy, so it would be interesting to see how the model performs in other settings.

[1] https://arxiv.org/pdf/1702.05663.pdf

Explaining the inner workings of Deep Learning

Dr. Ellick Chan

MSiA students are helping lead the way toward building more reliable deep learning AI models by making them explainable. This Spring, 9 student groups developed ways to “peek into” models built for their projects. In previous years, students were able to do impressive work in their projects, but they were often left wondering why a model behaved the way it did. That lack of explainability or interpretability raised some deep questions about the inner workings of their models, and it poses a barrier to adoption in safety-critical industries such as automotive, medical, or legal.

 

This year, the students utilized several methods to show the “visual attention” of their models. Through this process, they were able to better understand how the model arrived at its prediction and as a result gained extra confidence in the model’s thinking process. This work parallels early studies of human attention by Yarbus et al., who showed that human eyes fixate on different areas depending on the task. The students found that machines can behave similarly when classifying images, and they leveraged this insight to better understand how the machine was perceiving images and to build more robust models that perceive the world more as humans do.

 

We believe that this work is a helpful first step in making model interpretability and understanding a core component of deep learning analytics education. Machine learning today is very much a black-box art, and in order to gain wider acceptance in society and trust in these models, we must emphasize the need to peek into them to see how they operate and what their limitations may be.

 

For more information on our efforts to understand deep learning models, please view an overview presentation here. Course slides are available here.

 

Yarbus study of human gaze in a scene – (Left) original painting, (Middle) estimate the age of people in the room, (Right) remember clothing worn by people. Note how the gaze patterns depend on the task at hand.

Machine attention maps of ballet positions. Top row shows the attention map with white denoting areas of interest. Bottom shows a masked version of image. Note how PenchPonce is characterized by one leg up in the air. The machine attention map agrees with what a human observer would look for to classify this move.

Machine attention map for shoes. Note how the heel class is characterized by the higher heel and loop.

A Demonstration of “Popout”

Dylan Fontaine

When creating a data visualization, it is often necessary to emphasize certain points so that they stand out from their surroundings. Doing so can help the author communicate their intended message to the viewer quickly and clearly. For example, I made the point for Chicago a red star on the scatterplot below to highlight Chicago’s relatively low rent prices and large population compared to other major US cities.

In my data visualization class, I learned that the emphasis placed on certain points of a graph is called “popout”. Popout is a fitting name, because emphasized points do just that… they “popout” to the viewer. A variety of visual channels can be used to make points popout from their surroundings. They can be colored more brightly, given a unique shape, increased in size, etc. Sometimes a combination of visual channels is used to make points popout.

The method that a chart designer uses to make points popout is very important. The human eye responds much more quickly to some cues than others, so a viewer may quickly see points emphasized with an effective method of popout, while taking longer to see points emphasized with a less effective method of popout. Speed is not everything when it comes to making a good data visualization, as the entire design and impression made on the viewer must also be considered, but I wanted to investigate and quantify the speed at which a viewer can react to various visual channels.

I created a web-application using D3.js to help with my investigation. The application, which you can interact with below, presents the user with a 10 by 10 grid of points. After clicking start, the points are shuffled and the user must count (as fast as they can) the number of points that are emphasized according to the criteria described below the plot window. A timer runs as the user counts the points and it is only stopped when the user inputs the correct number into the textbox and clicks the Submit button. There are three modes on the app that use different visual channels to emphasize points: “Color”, “Shape”, and “Both”. Try each mode below and see how your times compare.

See the Pen V8 by Dylan Fontaine (@dyfontaine) on CodePen.

If you would like to see the HTML, CSS, and JavaScript code I used to create this app, click on the tabs at the top of the embedded CodePen.

Does color, shape, or a combination of the two “popout” fastest?

In order to investigate which visual channel makes points popout to viewers the fastest, I had 10 friends test the app. I had them each complete 5 trials on each mode and I recorded their completion times. While my study was not completely scientific, I did take some steps to ensure that the results were not biased. I gave each user the same set of instructions and I let them all practice a few runs beforehand to get comfortable with the user interface. I made the same selection in the dropdown lists for each trial (red for color and/or circle for shape). Also, I randomly assigned each user an order to complete the 3 levels in. (For example, some users completed the “Shape” level, then the “Both” level, then the “Color” level). I compiled the results and plotted the completion times for each level.

The results show that participants generally recognized points that popout due to “Color” the quickest, followed by “Shape” and then “Both”.  It is worth noting that each user had a relatively low variance among their 3 completion times (i.e. a user who was quick on one level was typically quick on the other 2 levels).  The outliers on the high end arose from participants miscounting on their first try, then recounting the emphasized points from scratch while the clock was still running.

It makes sense that the “Both” category was the most challenging for the participants, given that it requires points to be identified using two criteria instead of one.  This illustrates an important tenet of data visualization—keep things as simple as possible.  While adding multiple visual channels to a chart can allow you to show more information at once, only do so when it is a necessary component of the story you are trying to share with your audience.

Which color pops out the fastest?

While the above result was somewhat predictable (at least according to my intuition before completing the experiment), I did not have a strong feeling about which color would popout to viewers the fastest. I completed a similar experiment to the one described above, but this time only used the color mode on the app. I had each participant complete 5 trials with each color setting and recorded their completion times. See the results below.  It is important to note that HSL colors were used on the app — each color had the same saturation and lightness value, just different hue values. This ensured that only the color (hue), and not brightness, would affect popout speed.

As you can see, red and green points were counted the fastest, and purple points were counted the slowest. However, the difference between the fastest and slowest colors was not very large. I did some research to see if my results aligned with any academic studies and was pleased with my findings. While none of the studies I found tested the exact set of colors that I did, psychologists in India [1] found that red and green stimuli prompted faster reaction times than yellow stimuli, and researchers in Japan [2] found that red stimuli prompted faster reaction times than blue stimuli.

Which shape pops out the fastest?

The goal of the last experiment I conducted was to determine which shape pops out to users the fastest. The procedure for this experiment was identical to the color experiment, except the shape mode on the app was used. Participants had to count points emphasized with a different shape rather than a different color. I was not sure before completing the experiment how times for each shape (circles, crosses, diamonds, stars, and triangles) would compare. See the results below.

Stars were counted the fastest and circles were counted the slowest. After seeing this result, I thought about why this could be. I realized that since the control points (the points not counted) were all squares, the shapes that look very different from squares should stand out the most. The 5-pointed stars look very different from squares, but the circles can easily be mistaken for squares at a quick glance. Thus, the stars popout more than circles do when amongst an array of squares.

Conclusion

When creating data visualizations, many may choose colors and shapes based on personal preference and attitudes.  In some use cases, such as advertising, speed and ease of understanding are crucial. One should consider this when designing a visualization and be aware of how the colors and shapes they use can affect interpretation speed. I hope this study has given you a new perspective on some simple but often overlooked components of great visualization.

 

 

[1] G. Balakrishnan, G. Uppinakudru, G. G. Singh, S. Bangera, A. D. Raghavendra, and D. Thangavel.  “A Comparative Study on Visual Choice Reaction Time for Different Colors in Females”.  Neurology Research International.  2014.

[2] M. Shibasaki and N. Masataka.  “The color red distorts time perception for men, but not for women”.  Scientific Reports 4, Article number: 5899.  2014.

Computers can compose music…but can they write scripture?

Jeff Parker

I was so impressed to learn from a peer that computer programs have been written to compose music. Music is widely considered an art that takes innately human abilities to craft pleasing sounds. Computers are dull and inanimate while music is alive and human. I do not have a musical background, but after investigating this “artificial” music, I realized that the programs analyze sounds and patterns to piece together new compositions.

So if computers can compose music, something so deeply personal and human, I thought, what else deeply personal can they do?  An idea sparked when I asked my wife late one night to tell me her favorite scripture verse in lieu of reading a chapter, as is our nightly custom.  My wife was too sleepy to read, and in her drowsy voice she gabbled what sounded like a random assortment of the most common biblical words, “For behold, thus sayeth the Lord, verily I say unto thee, that…”  It was hilarious, but to the untrained ear, it could have passed as a legitimate passage.

The English language, much like music, follows certain patterns – any English major would agree. There are subjects and predicates instead of tunes; nouns and verbs instead of notes. The language of the ancient Jewish, Greek and Roman authors as translated by English monks certainly follows a very distinct pattern. Many bible scholars and linguistic experts have written much on this. I thought I might try my hand at using a computer to “artificially” write scripture.

I should note that the Bible is something that is deeply personal and intimate for many people (as is music). For others, it is a very divisive subject. My humble attempts here are not to make a mockery of the Bible or to advocate it, but rather just to explore linguistic patterns using a scientific process. Wherever you fall on a religious spectrum, I think much can be learned from this type of text analysis. For instance, there are 5,975 unique words in 7,958 verses in the King James Version of the New Testament. I decided to focus solely on the New Testament, because 1) it is more recent, 2) it takes place over a shorter period of time than the Old Testament and 3) it is both shorter for computing purposes and easier to understand. All my R code can be found here. Below are the most common terms in the New Testament (“the”, “and”, and “that” are excluded):

Just using the most common words, I can manually piece together something that sounds like a scripture verse:

Approach 1 – Subsequent Terms

I decided on three approaches to craft my “artificial” scripture. The first is to pick a first word (a seed word) and calculate which word is most likely to follow that. Then the word most likely to follow that word. And so on till something makes sense.

Using the second most common starting term “FOR”, I started down a path till I got the following: “FOR THE LORD JESUS CHRIST…”  At this point, the most common word following CHRIST is JESUS, which would put the verse into an infinite loop.  So I chose the second most common word following JESUS.  I had to do this a few times.  Here is my final verse:

Not too bad. It is fun to see how the verse wanders down different paths. However, the biggest problem with this method is that the next word is solely dependent on the previous word. The subsequent words have no memory of previous words. If I was to add a bit of memory to my algorithm to account for phrases, our verse would have wandered down a different path. This memory could be added by using the preceding two words instead of just one. The word following “THAT THEY” is “SHOULD” and would have taken the verse in a different direction. It would be fun to write an algorithm to look at the two preceding words (the two nearest neighbors from K-NN techniques). I would also like to try a decaying weight on the preceding words (similar to kernel weighting regression).
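The author's code is in R, but the idea is easy to sketch in Python: build a table of next-word counts, then chain from a seed word, skipping words already used to avoid the infinite loops mentioned above. The file name is a placeholder for a plain-text King James New Testament.

```python
# Python sketch of Approach 1 (the author's actual code is in R).

import re
from collections import Counter, defaultdict

with open("kjv_new_testament.txt") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

next_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_counts[current][following] += 1

def generate(seed, length=12):
    verse, used = [seed], {seed}
    while len(verse) < length:
        # most common follower that we have not used yet (breaks infinite loops)
        follower = next((w for w, _ in next_counts[verse[-1]].most_common()
                         if w not in used), None)
        if follower is None:
            break
        verse.append(follower)
        used.add(follower)
    return " ".join(verse)

print(generate("for"))
```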

Approach 2 – Market Basket

The idea here is to use the association rules from market basket analysis to find words that commonly appear in the same verse. Again, after picking a seed word, I find the word that is most likely to appear in the same verse. Then I take that second word and find the word most likely to appear with it, and so on until I have a good collection of words. Then I manually arrange them into something that looks like a verse.

The first step to this is to convert my data into transactions and look at some rules. This graph looks at the words that appear in 10% or more of the verses, or in market basket parlance, support greater than 10%.

Market basket analysis looks at sets, so if we expand the sets to several words, we get some great conglomerates of words that often occur together in a verse. I looked at sets of words with high support; here are a few that stuck out to me:

My next step is to pick a seed topic and find words around that topic. Using the same seed word as in approach 1, I started with “FOR”. The word pair with the highest support with “FOR” is “THE”… No surprises there. I hit an infinite loop when I got to the word “THAT” so I skipped to the next highest support. In fact, {and,that,the} have very high support. Below you can see my chaining:

Now manually putting my words together:

Honestly, I was a little disappointed with this method. I tried this pattern using other seed words as well. A major drawback of the association rules is that sentences are sequenced while market baskets are not. The market basket approach finds the words most commonly used together in a verse, whether a word came before or after; it does not discriminate. So this method does not capture the natural pattern of speech.
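For readers who want to see the mechanics, here is a Python sketch of the verse-level support calculation (the author used R's arules package); it assumes one verse per line in a plain-text file.

```python
# Support of a set of words = the share of verses containing all of them.

import re

with open("kjv_new_testament.txt") as f:
    verses = [set(re.findall(r"[a-z']+", line.lower())) for line in f if line.strip()]

def support(*words):
    """Fraction of verses that contain every word in `words`."""
    return sum(all(w in verse for w in words) for verse in verses) / len(verses)

def best_partner(word, vocabulary):
    """Word with the highest support alongside `word` (brute force, but fine here)."""
    return max((w for w in vocabulary if w != word), key=lambda w: support(word, w))

vocab = set().union(*verses)
print(support("for", "the"))
print(best_partner("for", vocab))
```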

Approach 3 – Deep Learning

My last approach is to use deep learning techniques to generate the text. I have much to learn about deep learning, but fortunately, my professor, Dr. Ellick Chan, pointed me to some code on GitHub that students had used previously to generate song lyrics. I did not write any of the code, I just used it straight out of the box, but with the same New Testament as the corpus. Just for fun, I started running epochs before my wife and I went to church, and then I checked it when we got home. Here are some of the best results using the seed “not into the judgment hall, lest they s”, which was randomly generated as part of the code:

I am really excited to learn more about deep learning and to start playing with the code. The code actually generates text based on individual characters, and it is pretty remarkable that it produces actual words. However, I would like to change it so that it works with words instead.
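For the curious, a very compact character-level model in Keras looks roughly like this; it is a generic sketch, not the GitHub code the author used, and it samples greedily rather than with a temperature parameter.

```python
# Learn to predict the next character from the previous 40, then sample from a seed.

import numpy as np
from tensorflow.keras import layers, models

text = open("kjv_new_testament.txt").read().lower()[:200000]   # keep the demo small
chars = sorted(set(text))
char_idx = {c: i for i, c in enumerate(chars)}

maxlen, step = 40, 5
sequences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)   # one-hot inputs
y = np.zeros((len(sequences), len(chars)), dtype=bool)           # one-hot next character
for i, seq in enumerate(sequences):
    for t, c in enumerate(seq):
        X[i, t, char_idx[c]] = True
    y[i, char_idx[next_chars[i]]] = True

model = models.Sequential([
    layers.Input(shape=(maxlen, len(chars))),
    layers.LSTM(128),
    layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=10)

def sample(seed, n=200):
    out = seed
    for _ in range(n):
        x = np.zeros((1, maxlen, len(chars)), dtype=bool)
        for t, c in enumerate(out[-maxlen:]):
            x[0, t, char_idx[c]] = True
        out += chars[int(np.argmax(model.predict(x, verbose=0)))]   # greedy sampling
    return out
```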

Conclusion

My intuition going into this exercise was that my approaches would get progressively better. I thought Approach 3 (Deep Learning) would do better than Approach 2 (Market Basket), which would do better than Approach 1 (Subsequent Terms). As a human reader you can be the judge, but personally, I think Approach 1 did the best. This is because this method captured the sequenced patterns of sentences (a noun is followed by a connector, which is followed by an adverb, etc.). I think with some tuning the Deep Learning approach could also return some more interesting results.

Ultimately, after this exercise, I learned a lot about text analytics and a little about the New Testament as well. It was very entertaining for me to read and generate the scriptures. I am excited to see what other deeply human concepts computers can mimic in addition to writing scripture and composing music.

Special Session: US Senators on Twitter

Dustin Fontaine

Social media plays a huge role in modern day politics. Through social media, politicians can instantly share their ideas and opinions with millions of people around the world. Social media posts do not have to pass through a middleman (i.e. a journalist or news-anchor…potential sources of bias), before reaching their intended audience, so many politicians use social media as one of their main platforms for communication. We all know that President Donald Trump uses Twitter heavily. What about US Senators?

It turns out that all 100 current US Senators are active on Twitter. I wanted to investigate and compare each Senator’s tweeting behavior, so I used some analytics tools, Python and Tableau, to do so. Check out my findings in the dashboard below.


This Tableau dashboard can also be viewed here.

The sections below describe how I completed this project. It is not a step by step guide, but can serve as a basic roadmap for someone wanting to complete similar work.

Gathering Tweets

The first, most tedious step was to identify the twitter handles of all 100 senators.  This link provided a great start, but to ensure accuracy I manually inspected each twitter page.  For senators that had multiple accounts, I decided to use the account that appeared the most “official” and the least personal.

Twitter has a nifty REST API that allows programmatic reading and writing of Twitter data.  Getting an API key was simple.  All I had to do was head to apps.twitter.com, sign into my Twitter account, and click the “Create New App” button.  After filling out a form, checking a few boxes, and judiciously reading the Developer Agreement, I was granted credentials for accessing the API.

Twitter API calls are relatively simple to handle in Python.  One can piece together each GET request, use the urllib2 package to send the request, and then unpack the JSON that it returns.  However, like any lazy efficient developer, I wanted an easier way to gather tweets.  Fortunately, the Twython package provided me with an easy-to-use Python wrapper for the Twitter API.  Twython simplified each request, and returned the results I requested as native Python objects. Using a simple for loop, I was able to gather tweets from all 100 Senators and store them in a CSV file.  You can see my code here for more details.
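A condensed sketch of that gathering loop is below; the credentials, handle list, and the columns kept are placeholders rather than the code linked above.

```python
import csv
from twython import Twython

twitter = Twython("APP_KEY", "APP_SECRET", "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET")  # your credentials
senator_handles = ["SenSanders", "SenJohnMcCain"]   # ...all 100 handles go here

with open("senator_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["handle", "created_at", "retweet_count", "favorite_count", "text"])
    for handle in senator_handles:
        # Twython returns each tweet as a native Python dict
        for tweet in twitter.get_user_timeline(screen_name=handle, count=200, include_rts=False):
            writer.writerow([handle, tweet["created_at"], tweet["retweet_count"],
                             tweet["favorite_count"], tweet["text"]])
```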

Sentiment Analysis

Performing sentiment analysis on the tweets in Python was, in my opinion, the most interesting part of the analysis. The measure of sentiment I used was polarity, which measures how “positive” or “negative” a statement is. Polarity is measured on a scale of -1 to 1 with 1 being the most positive. Each word has a predefined polarity score and the polarity of a statement is an aggregate measure of each word’s polarity score. For example, the word “happy” has a positive score and the word “horrible” has a negative score. Modifier words such as “very” or “not” can multiply or negate the polarity score of the following word.

The TextBlob package contains a function for measuring polarity. With a simple for loop, I was able to measure the polarity of each tweet and store the results in a CSV file.
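The scoring loop really is only a few lines; the file names below assume the CSV produced by the gathering sketch above.

```python
import csv
from textblob import TextBlob

with open("senator_tweets.csv", encoding="utf-8") as fin, \
     open("senator_tweets_scored.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)
    writer = csv.writer(fout)
    writer.writerow(reader.fieldnames + ["polarity"])
    for row in reader:
        polarity = TextBlob(row["text"]).sentiment.polarity   # -1 (most negative) to +1 (most positive)
        writer.writerow(list(row.values()) + [round(polarity, 3)])
```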

Tableau

This was my first major project using Tableau.  Having experience with Microsoft Power BI, a competing software, I was familiar with the capabilities of Tableau but had never made more than a simple, one-off chart.  Creating the underlying data model and each individual visualization was a breeze.  The real challenge lay in putting the pieces together to form an elegant, user-friendly dashboard.  I want to extend my gratitude to the members of the Tableau Community for helping me to get all of my graphs and filters to play nicely together, thou art doing righteous work.

This section of the project took the most time, but was the most rewarding.  Going in, I had a vision of how I wanted to present the data, and I was able to do so exactly as I had imagined.  I look forward to working more with Tableau in the future.

Round Peg for a Square Hole: City Blocks vs. Hexagons

Jeff Parker

Driving to the O’Hare airport in Chicago, it takes a whopping 1 hour and 15 minutes to travel a mere 8 miles from my wife’s work. The fastest route zig-zags through the suburbs on streets of various sizes and directions. We travel west and south, inevitably hitting every traffic light at every intersection.

Chicago, despite a few cross-cut and diagonal streets, follows a loose grid system. My hometown of Salt Lake City, like most modern cities, also follows a pattern of square blocks. The simple design was originally conceived by the city founders as the “Plat of Zion.” Square blocks and grid systems have since become the standard for urban design.

 

But what if we are wrong to use square blocks? Squares and rectangles are very efficient shapes for humans. We put things in square boxes, read square books, live in square rooms and build square houses. However, nothing in nature is square. Rivers wind, hills roll and mountains come to peaks. As it relates to urban planning, human expansion is not square; therefore, cities should not be either. Cities are circular. Square boxes are great for filling up a square closet, but are square blocks really great for filling a circular city? To compare, I packed a circle with unit squares* (I found the optimal radius for this circle to be 2.12). I then packed the same circle with hexagons. The square blocks wasted ~5 units of space, while the hexagons wasted only ~3 units.

*Note: this is not the optimal way to pack a circle.

This gave me a thought: what if city blocks were not squares, but rather hexagons? From above, the city would look like the board from Settlers of Catan. Streets would follow a honeycomb pattern (bees are very efficient creatures).

To test the efficiency of hexagon blocks versus square blocks, I need to make a few assumptions. In my test, each city block will have the same amount of departure and arrival points (called nodes). There are no nodes on the corners or intersections. (There may be a store on the corner, but people depart from the edge of the store). Both shapes have the same number of nodes: blocks have three nodes on each side (3 nodes X 4 sides) while hexagons have 2 nodes on each side (2 nodes X 6 sides). Lastly, the area of both hexagons and blocks is the same, exactly 1.

Total Nodes: 108; Total Blocks: 9; Block Size: 1; Total Area: 9

Running a simulation with 10 samples on my small 9 block city, it appears that square blocks have the advantage. While this was surprising to me, it does make sense. Despite the same area, my hexagon city is both taller and wider than my block city; therefore, it takes longer to travel from point to point. I would be curious to run this on a larger scale with more city blocks packed into a circle in a different pattern. Perhaps I would get different results if I included every node within a certain radius of my city’s center.

In addition to longer distances, another obvious disadvantage to the hexagon blocks is that turning right and left all the time would make going in a straight line annoying. However, hexagon blocks lend themselves well to the addition of major thoroughfares or highways. On-ramps and off-ramps are already built in.

There is at least one other advantage to hexagon blocks to consider. Three-way intersections are quite a bit faster than four-way intersections. Vehicles turning right would not need to stop and could have the right of way. Vehicles turning left would have to yield to cars turning right. This would mean a decrease in time waiting at lights.

It looks like a few forward-thinking city planners have already been experimenting with this pattern. I am excited to see urban designs in the future.

Simple College Football Rankings System

Macon McLean

With intense rivalries and unparalleled upsets, college football is a sport like no other.  As an avowed fan, I find each Saturday is more exciting than the last, as the postseason picture becomes clearer with each passing week.  All teams are fighting for bowl games, and for the especially talented schools, a national championship is the ultimate goal.  But not every school can vie for a title:  only the schools ranked in the top four at season’s end can compete in the College Football Playoff.  This simple fact weighs on the minds of coaches, players, and fans every single Saturday.

After the season’s halfway point, while coordinators and coaches spend Mondays discussing game plans and debating tactics, there’s a wider-ranging discussion going on among another group of football geeks:  the College Football Playoff Selection Committee.  Each Tuesday in the second half of the season, the Selection Committee faces the impossible task of ranking the top teams in college football.  This is the sole charge of these arbiters of the gridiron, who number among them former university athletic directors, head ball coaches, and even a Secretary of State (Condoleezza Rice is a voting member).  After each of their weekly meetings, the Committee issues official rankings that determine placement in postseason games, including the vaunted Playoff itself.

Creating these rankings is far from simple.  Given the nature of the sport, establishing a college football hierarchy turns out to be a uniquely challenging problem.  A school’s record often belies its true ability:  an undefeated team from a weaker athletic conference might be propped up by a softer strength of schedule; a squad with two close defeats at the hands of powerhouses (think Michigan and Ohio State) might be better than some teams with fewer losses.  Turnover luck, injuries, and significant variance in scheme all make this feel like a difficult and seemingly arbitrary judgment.  The most significant challenge, however, is independent of on-field performance:  there are 128 teams to evaluate.

Apart from the logistical issues of watching every team in the country, having this many teams makes it hard to compare them directly.  Unlike professional football, top-level college football teams play only a small fraction of their possible opponents.  In the NFL’s regular season, a team will play thirteen of thirty-one candidate opponents; in college football, this number is, at most, fourteen of 127.  This makes it very hard to compare many pairs of teams, as common points of comparison are few and frequently nonexistent.  And due to schools’ membership in athletic conferences, the pool of possible opponents to fill the already small schedule is significantly restricted in size; all ten members of the Big 12 conference, for example, play 75% of their yearly regular season games against each other.

We can think of head-to-head matchups as a way of connecting teams in a graph, a mathematical and visual representation of a network.  Each school is a node and each game serves as an edge connecting two nodes.  College football graphs are often like this picture below, with dense clusters (conferences) each having a few connections to other dense clusters.  This illustrates the problem – the graph is not particularly well-connected, making human application of the transitive property unreliable, and sometimes, impossible.  Eventually all the clusters link up, but it’s not as densely connected as the NFL, and has far fewer data points for comparing teams across conferences.

https://davidabelman.wordpress.com/2015/03/02/modelling-football-team-strength-using-pagerank/


So how does the Selection Committee even approach this problem of comparing teams across conferences?  They base their rankings on the “eye test,” i.e., estimating how talented they believe a squad to be simply by watching them play, independent of any and all advanced metrics.  Though they take into consideration other factors like strength of schedule, head-to-head results, and conference championships, what they see in the “eye test” is the primary evidence.  If this all sounds a bit questionable to you, you’re not alone; the webpage on Selection Committee Protocol begins with “Ranking football teams is an art, not a science.”

Recognizing the difficulty of this task, some enthusiasts have developed more analytically-oriented approaches to ranking teams.  A commonly referenced system is the S&P+ rankings, which use play-level and drive-level data from all college football games to measure four factors:  efficiency, explosiveness, field position, and finishing drives.  However, the Selection Committee is loath to integrate systems like the S&P+ into their proceedings too formally:  the Playoff Committee is the immediate successor to the BCS rankings system, an oft-maligned combination of polls and computer rankings whose inscrutability was, in part, the author of its demise.

So, how could the Selection Committee rank teams in a way that is simple, connected, and not overly dependent on a team’s record?  One possible approach is based on margin of victory (MoV), the extent to which one team beats another as measured in points (e.g., when Northwestern beats Michigan State by a score of 54-40, Northwestern’s margin of victory is 14, while Michigan State’s is -14).  The key is that such a system should establish a “points added” number for each team, which roughly represents how many points it adds to a given matchup, or rather, how much better or worse it is than the average team.

Consider our Northwestern vs. Michigan State example.  If, going into the game, Northwestern’s “points added” number was 9, such that it is approximately 9 points better than the average team, and Michigan State’s was -2, then we would predict that the matchup would result in an 11 point Northwestern win (we subtract these “points added” numbers to calculate the prediction – we want teams with very close ranks to have very close games).

After the game occurs, we notice that the difference in the teams’ performances was larger than expected (a 14 point Northwestern victory as opposed to the predicted 11).  This data point is then used to more accurately estimate each team’s “points added” ranking the next week.

This approach to quantifying team strength is known as a Simple Rating System (SRS).  The underlying math is just a system of linear equations – recalling some algebra, if we have two distinct equations each containing two unknown variables, we can solve for both unknowns.  The idea behind the SRS is to expand these equations to each team in the country, such that each equation represents a single team’s games, and each unknown represents a team’s “points added” value.  With each passing week, the college football graph shown above gets more connected, and there are more data points with which to estimate rankings.

One of the major benefits of the SRS is that it can meaningfully differentiate between close wins and blowouts, which might have prevented the 2014 Florida State debacle.  To recap, FSU won its conference championship in 2014, going 13-0 with a total margin of victory of 153.  Ignoring wins over lower-division opponents, FSU’s average margin of victory was 10.66 points per game, with one victory coming in overtime.  While that may seem good, several wins came over struggling squads, many of which failed to make a bowl game.  While the media was quick to bestow accolades on undefeated FSU, others were concerned that so many of their games had been close, and against pretty questionable opponents at that.

FSU was chosen to play in that season’s Rose Bowl as part of the College Football Playoff semifinal against 12-1 Oregon, champion of the Pac-12 conference.  Ignoring lower-division wins, Oregon’s average margin of victory was 21.75 points, despite an early loss to Arizona.  Coincidentally, Oregon crushed FSU by 39 in the Rose Bowl.  In short, this example shows that, even without adjusting for strength of schedule, margin of victory might be better than overall record at representing team quality.

In addition, the SRS can mitigate the effects of strength of schedule on wins and losses.  Teams with a difficult matchup are rewarded for performing above expectations even if they lose.  Let’s say that the University of Houston’s “points added” value is 20, and Southern Methodist University’s is -15.  The two teams play, resulting in a Houston victory, 42-21 – a 21 point margin, much less than the predicted 35.  According to the SRS, this game doesn’t reflect favorably on the Houston Cougars – instead, SMU should receive a rankings boost for keeping the game two touchdowns closer than expected!

Conversely, teams with really easy schedules have their “points added” impacted negatively – if a team consistently plays below-average opponents, then its margins of victory in these games are consistently deflated by the negative “points added” values of its opponents.  Returning to the above example, a 21-point win is usually cause for celebration, but Houston should not be impressed with its victory – it really should have performed better.

As the SRS is based on simultaneous equations, each team’s ranking is estimated at the same time, leveraging all the connections in the graph and making the rankings truly holistic.  This system can evaluate each team’s true effect on margin of victory in an analytical manner, creating an evaluation metric not found in the “eye test.”

To demonstrate this system, we must first understand its mathematical basis.  Let’s say that Northwestern’s impact on margin of victory, its “points added,” is a number X_NW.  Each matchup is a data point for identifying what this number is.  The aforementioned Northwestern-Michigan State game is a piece of evidence in itself, in the form of an equation:

X_NW − X_MSU = 14

Taking this approach, we can then imagine a team’s games to date as a single equation, with data points from each different matchup.  We can make this equation by adding together all its games (“points added” variables of NU and its opponent) on one side and each game’s margins of victory on the other, giving us a representation of the team’s season so far.  After twelve games, Northwestern’s simplified season equation is as follows:

[Equation: Northwestern’s season equation – 12·X_NW minus the sum of its twelve opponents’ “points added” variables equals Northwestern’s total margin of victory]

But we don’t want to just know about Northwestern – we want to learn the “true ranking” of every team in big-time college football.  Expanding this representation, we can create such a “season equation” for every other team, with ranking variables showing up in multiple equations – for example, Michigan State’s season equation features X_NW, since those two teams played each other.  This connectedness allows for the rankings to be estimated simultaneously – 128 distinct equations with 128 distinct unknowns.
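To make the estimation step concrete, here is a minimal numpy sketch: one equation per team, built exactly as described above and solved all at once by least squares. The games are invented for illustration (only the Northwestern margin comes from this article), and this is a sketch of the idea, not the author's actual ranking code.

```python
import numpy as np

games = [("Northwestern", "Michigan State", 14),   # margin from the first team's point of view
         ("Michigan State", "Ohio State", -17),
         ("Ohio State", "Northwestern", 10)]

teams = sorted({t for g in games for t in g[:2]})
idx = {t: i for i, t in enumerate(teams)}

A = np.zeros((len(teams), len(teams)))   # coefficients on each "points added" variable
b = np.zeros(len(teams))                 # each team's total margin of victory
for team_a, team_b, margin in games:
    for team, opp, m in ((team_a, team_b, margin), (team_b, team_a, -margin)):
        A[idx[team], idx[team]] += 1
        A[idx[team], idx[opp]] -= 1
        b[idx[team]] += m

# Ratings are only determined up to an additive constant, so use least squares
# and then center them so that 0 represents the average team.
ratings, *_ = np.linalg.lstsq(A, b, rcond=None)
ratings -= ratings.mean()
for team in sorted(teams, key=lambda t: -ratings[idx[t]]):
    print(f"{team:>15}: {ratings[idx[t]]:+.2f}")
```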

Of course, this is merely the abstract framework.  For it to be meaningful, data is necessary.  After some web scraping and formatting, it’s easy to come up with the new and improved Top 10 through Week 13, the end of the regular season:

[Table: SRS Top 10 through Week 13]

(Northwestern clocks in at #30, in case you were wondering.)

As the SRS rankings provide a predicted margin of victory, they are perfect for evaluating against the point spread, the number of points by which the favored team is supposed to beat the underdog as determined by sportsbooks.  The point spread is set so that the market judges either side of it to be equally likely.  So if Northwestern is scheduled to play Pittsburgh and is favored by 3.5 points, then the market believes that Northwestern is equally likely to either a) win by 4 or more points, or b) win the game by 3 or fewer points, or lose the game altogether.

To validate this system, I decided to calculate rankings through the season’s Week 12 and then test them on the results from Week 13.  I evaluated each of the top ten teams’ Week 13 matchups relative to the point spread.  I used the rankings to predict a margin of victory for each matchup (Pred. MoV), then compared this to the existing point spread (also known as the “Line”) to make a prediction against the spread (Prediction ATS), i.e. picking the side of the line I believed to be correct.  Finally, I compared each prediction to the final score, which tells us the outcome of the game relative to the line (Result ATS).  A good performance against the spread would be indicative of the SRS’s capabilities – correctly predicting these propositions, which point spreads are designed to make 50-50, is a difficult task.

The results are below:

[Table: Week 13 results for the top ten teams, comparing SRS predictions to the point spread]

For example, the predicted outcome of Ohio State vs. Michigan (per SRS) is a 2.45-point Ohio State victory.  This is less than the point spread, which predicts a 3.5-point victory for OSU.  Therefore, SRS predicts that the Michigan side of the line is more likely.  The final score was that Ohio State won by less than 3.5 points, meaning that the SRS’s prediction was correct.
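The against-the-spread logic itself is a one-liner; in the sketch below, the only real numbers are the Ohio State–Michigan figures from the example above.

```python
def pick_against_spread(pred_mov, line):
    """Both numbers are from the favorite's point of view: take the favorite
    if the SRS margin beats the line, otherwise take the underdog."""
    return "favorite" if pred_mov > line else "underdog"

print(pick_against_spread(pred_mov=2.45, line=3.5))   # -> "underdog" (the Michigan side)
```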

I repeated this analysis for several other matchups, limiting it to the top 10 teams for brevity, and because distinguishing between the top teams in America is the challenge that the College Football Playoff Selection Committee really faces.  As you can see, the SRS went 6-1 against the oddsmakers, an impressive performance.  That’s like correctly guessing the flip of a coin six times out of seven!

Notably, the Wisconsin vs Minnesota matchup resulted in a “push”, which is when the outcome is the exact same as the point spread – neither side wins.  The SRS had predicted Minnesota to either win the game or lose by less than 14 points.  This outcome arguably gives more evidence of the system’s accuracy:  Minnesota led for most of the game, but gave up three touchdowns in the fourth quarter.

[Chart: predicted vs. actual margins of victory for the Week 13 matchups]

Above is a brief look at the comparison of the predicted and actual margins of victory.  All the win/loss predictions were correct, apart from a huge whiff on Louisville (which nobody really saw coming).

There are some other interesting tweaks one could make, including time discounting (assuming recent games mean more than games further in the past) or adjusting for blowouts, which have been omitted for simplicity’s sake.

I have posted code to my Github account containing all the scraping and ranking tools.

K-means Shouldn’t Be Our Only Choice

Ethen Liu

All the Python code associated with this blog post is on my GitHub account for those who wish to follow along, dive deeper or modify it for their own needs. [K-means] and [Gaussian Mixture Model]

K-means

Clustering algorithms are a typical example of unsupervised learning. Their task is to gather similar samples (we’ll need to define which similarity or distance (dissimilarity) measure we want to use) into groups based on the data attributes (features) alone. Let’s start by creating a simple, 2-dimensional, synthetic dataset.
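A minimal sketch of how such a dataset can be generated is below; the exact parameters are arbitrary, and the full version is in the linked notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 300 two-dimensional points drawn around 4 cluster centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=30)
plt.show()
```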

[Scatter plot: the synthetic dataset, with four visible groups of points]

In the scatter plot above, we can see 4 separate groups of data points and we would like to recover them using clustering – think of “discovering” the class labels that we’ve taken for granted in a classification task. Even though the groups are pretty obvious here, it becomes hard to find them when the data lives in a high-dimensional space, where we simply can’t visualize them in a single scatterplot.

Now we will use one of the simplest clustering algorithms, K-means. This is an iterative algorithm which searches for 4 cluster centroids (we need to specify the number of clusters before we run the algorithm) such that the distance from each point to its cluster centroid is minimized.

It starts by choosing an initial set of centroids μj (in our case j will count up to 4, since we want to have 4 clusters). A common practice is to choose randomly from the data points. After initialization, the K-means algorithm iterates between the following two steps:

1. Assign each data point x_i to its closest centroid, recording the assignment as z_i, using the standard Euclidean distance.

z_i = argmin_j ||x_i − μ_j||

2. Revise each centroid as the mean of the data points assigned to it:

μ_j = (1 / n_j) · Σ_{i : z_i = j} x_i

where n_j is the number of data points that belong to cluster j.

One side note: the standard implementation of K-means uses the Euclidean distance, so we need to make sure all our variables are measured on the same scale if we are working with real-world datasets. Otherwise, our distance measure will be dominated by features that have a large scale, leading to unexpected results.
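The full implementation lives in the linked GitHub repository; as a minimal sketch of my own (not the post’s code), the two-step loop might look like this, assuming X is an (N, d) NumPy array:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize by choosing k data points at random as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: revise each centroid as the mean of its assigned points
        # (this sketch does not guard against empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels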

The following scatter plot is the algorithm’s clustering result.


Assessing Convergence

One weakness of K-means is that the algorithm’s performance is closely tied to the quality of the randomly generated initial centroids. If the algorithm starts with a set of bad initial centers, it will get stuck in a local minimum.

Before we get into that, we need to take a step back and ask ourselves: how can we tell if the K-means algorithm is converging? Well, we can look at the cluster assignments and see if they stabilize over time. In fact, that’s what we did in the toy implementation above: we ran the algorithm until the cluster assignments stopped changing at all. But after running the clustering algorithm, we also want a measure that assesses the algorithm’s performance. Hence, we’ll introduce an additional measurement, the Total within Sum of Squares (also referred to as inertia or heterogeneity). Mathematically, this is defined as:

heterogeneity = Σ_{j=1}^{k} Σ_{i : z_i = j} ||x_i − μ_j||²

where k denotes the total number of clusters, x_i is the ith data point, μ_j is the jth cluster centroid, and ||x_i − μ_j||² denotes the squared L2 norm (squared Euclidean distance) between the two vectors.

The formula computes the sum of all squared distances between each data point and its centroid. Its computation can be divided into two parts. The within sum of squares for a single cluster j, Σ_{i : z_i = j} ||x_i − μ_j||², is the sum of squared distances (note that these are squared distances; do not take the square root as we usually do for the Euclidean distance) of each point in the cluster from that cluster’s centroid. The total within sum of squares is then the sum of the within sum of squares over all clusters.

It might be a bit confusing why the objective function uses ||x_i − μ_j||², the squared L2 norm between the two vectors, while earlier, in the cluster assignment step, we used ||x_i − μ_j||, the Euclidean distance without squaring. We can still use the unsquared Euclidean distance to assign data points to their closest cluster centers because squaring the distance does not change which center is closest, and skipping the extra squaring saves a little computation in the assignment step.
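Continuing the sketch from above (again my own code, not the post’s), the heterogeneity of a clustering can be computed directly from this definition:

def heterogeneity(X, centroids, labels):
    # total within sum of squares: squared distance of every point to its assigned centroid
    return sum(((X[labels == j] - centroids[j]) ** 2).sum()
               for j in range(len(centroids)))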

To see that the convergence of this algorithm is not guaranteed, we’ll run K-means multiple times.  Each time, we use a different random seed to get a different set of initial centroids.
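A sketch of that experiment, reusing the kmeans() and heterogeneity() helpers sketched earlier (the numbers reported below come from the post’s own implementation, so re-running this sketch will not reproduce them exactly):

for seed in [0, 1000, 8181, 555, 10000, 120000]:
    centroids, labels = kmeans(X, k=4, seed=seed)
    print('seed = {}, heterogeneity = {}'.format(seed, heterogeneity(X, centroids, labels)))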

seed = 0, heterogeneity = 212.00599621083518
seed = 1000, heterogeneity = 212.00599621083518
seed = 8181, heterogeneity = 523.7150139149792
seed = 555, heterogeneity = 528.0315740282444
seed = 10000, heterogeneity = 212.00599621083518
seed = 120000, heterogeneity = 528.5562600476272

Notice the variation in heterogeneity for different initializations. This shows K-means does in fact get stuck at a bad local minimum from time to time. So, the next question is, is there anything we can do about this?

K-means++

One effective way to counter this issue is to use K-means++ to provide a smart initialization. This method tries to spread out the initial set of centroids so that they are not too close together. It is known to improve the quality of local optima and to lower the average runtime (see the original paper, K-means++: The Advantages of Careful Seeding, if you’re interested in the theoretical guarantees).

The workflow of K-means++ is as follows (a minimal sketch of the procedure appears after the list):

  1. Choose a data point at random from the dataset; this serves as the first centroid
  2. Compute the squared Euclidean distance of all other data points to the randomly chosen first centroid
  3. To generate the next centroid, choose each data point with probability (weight) equal to its squared distance to the nearest already-chosen center divided by the total squared distance (to make sure the probabilities add up to 1). Roughly speaking, a new centroid should be as far from the existing centroids as possible
  4. Next, recompute the probability (weight) of each data point as the minimum of the squared distances between it and all the centers that have already been generated (e.g. for the second iteration, compare the data point’s distance to the first and second centers and keep the smaller one)
  5. Repeat steps 3 and 4 until we have K centers
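Here is that minimal sketch of the seeding procedure (my own illustration, assuming X is an (N, d) NumPy array):

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # steps 2 and 4: squared distance from each point to its nearest chosen center
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # step 3: sample the next centroid with probability proportional to that distance
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)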

Let’s now rerun K-means using the same set of seeds as before, but this time we will always use K-means++ to initialize the centroids. Then, we’ll compare the cluster heterogeneities obtained from K-means++ initialization versus random initialization.

seed = 0, heterogeneity = 212.00599621083518
seed = 1000, heterogeneity = 212.00599621083518
seed = 8181, heterogeneity = 212.00599621083518
seed = 555, heterogeneity = 212.00599621083518
seed = 10000, heterogeneity = 212.00599621083518
seed = 120000, heterogeneity = 212.00599621083518

K-means average: 369.38680610389275
K-means++ average: 212.0059962108352

From the result, we can see that random initialization results in a worse clustering heterogeneity than K-means++ on average.

Although it looks like K-means++ worked pretty well in this simple example, it is not a panacea. It can still get stuck at local minima from time to time. Thus, in practice, we should run K-means at least a few times with different initializations and then keep the best result (the one with the lowest heterogeneity). For scikit-learn’s KMeans, the default behavior is to run the algorithm 10 times (the n_init parameter) using k-means++ initialization (the init parameter).

Elbow Method for Choosing K

Another shortcoming of K-means is that we have to specify the number of clusters before running the algorithm, which we often don’t know a priori. For example, let’s have a look at what happens if we set the number of clusters to 3 on our synthetic dataset.

blog2

K-means still runs perfectly fine, but this is probably not the result we’re looking for.

There are many different heuristics for choosing a suitable K, the simplest being the elbow method. The way it works is that we run the algorithm with different values of K and plot the heterogeneity. This measure decreases as the number of clusters increases, because each cluster becomes smaller and tighter. What we’re hoping for is that the measure keeps decreasing sharply up to the optimal cluster number, and then starts to flatten out (a sketch of this procedure follows).
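With scikit-learn, the elbow plot can be sketched like this, assuming X is the synthetic dataset:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('heterogeneity (inertia)')
plt.show()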

blog3

As we can see, the plot suggests we should choose 4 as our cluster number, since starting from K = 4 the magnitude of the decrease starts dropping (the curve looks like the pit of an elbow, which is why the method is called the elbow method). This makes sense given our earlier visual inspection of the dataset. In practice, this “elbow” can sometimes be hard to see and can be quite subjective.

Now that we’ve gotten a bit familiar with how K-means works, we’ll start addressing the assumptions that K-means makes and its caveats.

K-means Caveats

Different algorithms make different assumptions, so the quality and interpretability of the results depend on whether those assumptions are valid for our goal. K-means assumes that all clusters have equal, spherical variance. To let that notion sink in, let’s look at some cases where K-means generates results that might not match our intuition.

Suppose our data looks like the following:

blog4

Here we randomly generated three clusters of data points drawn from three different normal distributions, containing 20, 100 and 500 data points respectively. Let’s try K-means on this dataset.

blog5

Ouch. In its quest to minimize the within-cluster sum of squares, the K-means algorithm gives more “weight” to larger clusters. As we can see from the plot, many data points from the smallest cluster (of size 20) end up far away from any centroid, while the largest cluster (of size 500) gets chopped in half. This happens because K-means assumes all clusters have equal, spherical variance; in effect, that each cluster contains roughly the same number of observations and tends to form a spherical shape.

Let’s fit a different kind of clustering algorithm, a Gaussian Mixture Model, and visualize its clustering result:
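The exact distributions behind these plots aren’t listed in the post, so the numbers below are purely illustrative, but a sketch of the comparison looks like this:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 8], np.eye(2), size=20),            # small cluster
    rng.multivariate_normal([6, 0], np.eye(2), size=100),           # medium cluster
    rng.multivariate_normal([-6, 0], [[6, 4], [4, 6]], size=500),   # large, elongated cluster
])
kmeans_labels = KMeans(n_clusters=3, random_state=0).fit(X).predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit(X).predict(X)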

blog6

In general, there is no guarantee that the structure found by a clustering algorithm has anything to do with what we are interested in. And there will always be edge cases where a clustering algorithm leads to unintuitive results (in higher dimensions where we can’t easily visualize the dataset, we might not even know whether the result matches our intuition or not). For a visualization comparing different clustering algorithms on toy 2D datasets, refer to the scikit-learn docs: Plot cluster comparison.

The takeaway message for this section: K-means is usually the “go-to” clustering algorithm for many people, because it is fast, easy to understand, and available in lots of statistical and machine learning toolkits. If we have an EXTREMELY LARGE dataset, K-means might be our only practical option. But we must also understand that it still makes assumptions, and sometimes those assumptions can lead to unsatisfying or misleading results.

Gaussian Mixture Model

The title of the blog post was “K-means Shouldn’t Be Our Only Choice”. So now we’ll dive into a different kind of clustering algorithm.

Clustering methods such as K-means have hard boundaries, meaning a data point either belongs to a cluster or it doesn’t. On the other hand, clustering methods such as Gaussian Mixture Models (GMM) have soft boundaries, where data points can belong to multiple clusters at the same time, with different degrees of belief; e.g. a data point can have a 60% probability of belonging to cluster 1 and a 40% probability of belonging to cluster 2.

Apart from using it in the context of clustering, one other thing GMM can be useful for is outlier detection: because we can compute the likelihood of each point under the model, points with a “relatively” low likelihood (where “relatively” is a threshold we determine ourselves) can be labeled as outliers. But here we’ll focus on the clustering application.

Gaussian Distribution

In GMM, each cluster corresponds to a probability distribution, in this case the Gaussian distribution. What we want to do is learn the parameters of these distributions, namely the Gaussian’s mean μ (mu) and variance σ² (sigma squared).

The mathematical form of the Gaussian distribution in 1-dimension (univariate Gaussian) can be written as:

N(x | μ, σ²) = (1 / √(2πσ²)) · exp( −(x − μ)² / (2σ²) )

  • This is also referred to as the probability density function (pdf).
  • Gaussian distribution is commonly referred to as the normal distribution, hence that’s where the N
    comes from.
  • x refers to the random observation over which this distribution is placed.
  • The mean μ controls the Gaussian’s “center position” and the variance σ² controls its “shape”. To be precise, it is actually the standard deviation σ, i.e. the square root of the variance, that controls the distribution’s spread.

blog7
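As a quick aside, here is a hedged sketch (my own, with illustrative parameter values) of evaluating and plotting this pdf with scipy:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 5, 200)
for sigma in [0.5, 1.0, 2.0]:
    # loc is the mean, scale is the standard deviation (square root of the variance)
    plt.plot(x, norm.pdf(x, loc=0, scale=sigma), label='sigma = {}'.format(sigma))
plt.legend()
plt.show()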

But that was in one dimension; what about two, three, four and beyond? It turns out the univariate (one-dimensional) Gaussian can be extended to the multivariate (multi-dimensional) case. The form of a d-dimensional Gaussian is:

N(x | μ, Σ) = (1 / ((2π)^(d/2) · |Σ|^(1/2))) · exp( −(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ) )

In higher dimensions, a Gaussian is fully specified by a mean vector μ and a d-by-d covariance matrix Σ (do not confuse this symbol with the summation sign, which is also written Σ and used for adding up a bunch of terms). |Σ| refers to the determinant of the covariance matrix. For example, in two dimensions the Gaussian’s parameters might look like this:

μ = [ μ_1 ]      Σ = [ σ_1²   σ_12 ]
    [ μ_2 ]          [ σ_21   σ_2² ]

The mean vector, containing elements μ_1 and μ_2, centers the distribution along every dimension. The covariance matrix, on the other hand, specifies the spread and orientation of the distribution. Along the diagonal of the covariance matrix we have the variance terms σ_1² and σ_2², representing the shape (spread) along each of the dimensions. We also have the off-diagonal terms, σ_12 and σ_21 (these two take the same value because the matrix is symmetric), which specify the correlation structure of the distribution.

Let’s look at a few examples of covariance structures that we could specify.

blog8
One way to view a Gaussian distribution in two dimensions is a contour plot. The coloring represents each region’s intensity, i.e. how high the probability density is there. So in the plot above, the dark red center area is the region of highest probability, while the blue areas correspond to low probability.

The first plot is referred to as a Spherical Gaussian, since the probability distribution has spherical (circular) symmetry. The covariance matrix is a diagonal covariance with equal elements along the diagonal. By specifying a diagonal covariance, we are saying that there is no correlation between our two random variables, because the off-diagonal terms take the value of 0. Furthermore, by having equal variances along the diagonal, we end up with a circular shape to the distribution, because the spread along each of the two dimensions is exactly the same.

In contrast, the middle plot’s covariance matrix is also diagonal, but if we specify different variances along the diagonal, then the spread in each dimension is different, and what we end up with are axis-aligned ellipses. This is referred to as a Diagonal Gaussian.

Finally, we have the Full Gaussian. A full covariance matrix allows for correlation between our two random variables (non-zero off-diagonal values), which produces non-axis-aligned ellipses. In the example shown here, the two variables are negatively correlated, meaning that if one variable is high, the other is more likely to be low.
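A sketch of the three covariance structures with scipy (the specific numbers are just illustrative, not the ones behind the plots above):

from scipy.stats import multivariate_normal

spherical = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]])        # equal variances, no correlation
diagonal  = multivariate_normal(mean=[0, 0], cov=[[2, 0], [0, 0.5]])      # unequal variances, no correlation
full      = multivariate_normal(mean=[0, 0], cov=[[2, -1.2], [-1.2, 1]])  # negatively correlated dimensions

print(full.pdf([0.5, -0.5]))  # density of the full Gaussian at a single point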

Parameter Estimation

Now that that’s covered, we’ll start introducing the algorithm using 1-dimensional data. Suppose we have a bunch of data points and we know that they come from two separate Gaussian sources. Given these data, can we infer which Gaussian sources generated them? Or, to rephrase the question: a Gaussian distribution has two parameters, the mean and the variance; can we estimate them from our data?

blog9

 

Well, since we know which data came from which Gaussian distribution, all we need to do is compute the mean and the variance for both groups, and lo and behold, we get our estimates for the two Gaussians.

blog10

That’s great! But this was all based on knowing which points came from which distribution. Now, what if we just have a bunch of data points and don’t know which one came from which source; can we still trace back the Gaussian sources? A bit trickier, isn’t it? On the other hand, if someone came along and told us the parameters of the Gaussians, we could figure out which points are more likely to have come from which Gaussian. So we have a chicken-and-egg problem: if someone told us which point came from which source, we could easily estimate the means and variances; and if someone told us the means and variances of the Gaussians, we could figure out the probability of each point coming from each Gaussian. Unfortunately, we have neither.

This is the exact situation we’re in when doing GMM. We have a bunch of data points, we suspect that they came from K different Gaussians, but we have no clue which data points came from which Gaussian. To solve this problem, we use the EM (Expectation-Maximization) algorithm. It starts by placing the Gaussians randomly (generating a random mean and variance for each Gaussian), then iterates over the following two steps until it converges.

  • E step: With the current means and variances, figure out the probability of each data point x_i coming from each Gaussian.
  • M step: Once these probability assignments have been computed, use them to re-estimate the Gaussians’ means and variances to better fit the data points.

E Step

We’ll now formalize this. Recall that GMM’s goal is to output a set of soft assignments per data point (allocating the probability of that data point belonging to each one of the clusters). To begin with, let’s assume we actually know the parameters π_k, μ_k and Σ_k (from some random initialization), and we need a formula to compute the soft assignments having fixed the values of all the other parameters.

r_ik = ( π_k · N(x_i | μ_k, Σ_k) ) / ( Σ_{j=1}^{K} π_j · N(x_i | μ_j, Σ_j) )

Let’s break this down piece by piece. The soft assignments are quantified by the responsibility vector r. For each observation i, we form a responsibility vector with elements r_i1, r_i2, all the way up to r_iK, where K is the total number of clusters, often referred to as the number of components. The cluster responsibilities for a single data point i should sum to 1.

The name Mixture of Gaussians comes from the notion that, in order to model more complex data distributions, we can use a linear combination of several Gaussians instead of just one. To form the mixture, we introduce a set of cluster weights π_k, one for each cluster k, where Σ_{k=1}^{K} π_k = 1 and 0 ≤ π_k ≤ 1 (meaning the weights must add up to one and each lies between 0 and 1). This parameter tells us the prior probability that a data point in our dataset comes from the kth cluster; we can think of it as controlling each cluster’s size.

The next part of the equation, N(x_i | μ_k, Σ_k), tells us: given that the observation comes from the kth cluster, what is the likelihood of observing our data point x_i under that cluster. After multiplying the prior and the likelihood, we normalize over all possible cluster assignments so that the responsibility vector becomes a valid probability distribution. This is essentially the computation done in the E step.
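A minimal sketch of the E step (my own code, assuming weights, means and covs hold the current π_k, μ_k and Σ_k for K clusters, and X is an (N, d) array):

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    N, K = X.shape[0], len(weights)
    resp = np.zeros((N, K))
    for k in range(K):
        # prior weight times the Gaussian likelihood of every point under cluster k
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    # normalize so that each row (the responsibilities of one data point) sums to 1
    resp /= resp.sum(axis=1, keepdims=True)
    return resp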

M Step

After computing the responsibility vector for the current iteration, we then use it to update the GMM’s parameters.

N_k = Σ_{i=1}^{N} r_ik

π̂_k = N_k / N

μ̂_k = (1 / N_k) · Σ_{i=1}^{N} r_ik · x_i

Σ̂_k = (1 / N_k) · Σ_{i=1}^{N} r_ik · (x_i − μ̂_k)(x_i − μ̂_k)ᵀ

First, the cluster weights π̂_k show us how much each cluster is represented over all data points (each cluster’s relative size). This weight is given by the ratio of the soft count N_k = Σ_i r_ik over the total number of data points N.

When updating our parameter estimates for each cluster k, we need to account for the associated weight r_ik of every one of our observations. So every time we touch a data point x_i, it gets multiplied by r_ik.

Another thing worth noticing is that, when we update the parameters μ̂_k and Σ̂_k, instead of dividing the summation by the raw count of data points in that cluster, we use the effective number of observations in that cluster (the sum of the responsibilities in that cluster), N_k, as the denominator.
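And a matching sketch of the M step, using the responsibility matrix resp produced by the E step above:

import numpy as np

def m_step(X, resp):
    N, d = X.shape
    Nk = resp.sum(axis=0)                      # effective number of observations per cluster
    weights = Nk / N                           # updated cluster weights
    means = (resp.T @ X) / Nk[:, np.newaxis]   # responsibility-weighted means
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        # responsibility-weighted covariance for cluster k
        covs.append((resp[:, k, np.newaxis] * diff).T @ diff / Nk[k])
    return weights, means, covs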

Assessing Convergence

Apart from training the model, we also want a way to monitor the convergence of the algorithm. We do so by computing the log likelihood of the data given the current estimates of our model parameters and responsibilities.

Recall that during the E step of the algorithm, we used the formula:

Σ_{k=1}^{K} π_k · N(x_i | μ_k, Σ_k)

to compute the weighted probability of our data point x_i coming from each cluster k, summing up all the weighted probabilities. If we assume the observed data points were generated independently, the likelihood of the data can be written as:

Π_{i=1}^{N} Σ_{k=1}^{K} π_k · N(x_i | μ_k, Σ_k)

This basically means that we multiply together the probabilities of every data point to obtain a single number that estimates the likelihood of the data under the model’s parameters. We can take the log of this likelihood so that the product becomes a sum, which makes the computation a bit easier:

Σ_{i=1}^{N} ln( Σ_{k=1}^{K} π_k · N(x_i | μ_k, Σ_k) )

Given this formula, we can say: if the log likelihood of the data under the current model parameters does not improve by more than a pre-specified tolerance value, then the algorithm is deemed to have converged.
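A sketch of this convergence check, with the same weights, means and covs as in the E and M step sketches above:

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    # per-point mixture density: sum over clusters of pi_k * N(x_i | mu_k, Sigma_k)
    density = sum(weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
                  for k in range(len(weights)))
    return np.log(density).sum()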

Implementing the EM algorithm

To help us develop and test our implementation, we first generate some observations from a mixture of Gaussians.

blog11

Just like with K-means, it is important to ask how we obtain an initial configuration of mixing weights and component parameters. In this simple case, we can take three random points to be the initial cluster means, use the empirical covariance of the data as the initial covariance of each cluster (a clear overestimate), and set the initial mixing weights to be uniform across clusters. A better way to initialize the GMM parameters, though, is to run K-means as a first step and use the means and covariances of the resulting clusters to initialize EM.

Note: Like K-means, EM is prone to converging to a local optimum if the initial set of parameters is sub-par. In practice, we may want to run EM multiple times with different random initializations.

With that in mind, we will run our EM algorithm to discover the mixture components and visualize its output. When working with low-dimensional data, one useful way of testing our implementation is to visualize the Gaussian components over the data at different points in the algorithm’s execution:

  • At initialization
  • After running the algorithm to completion (convergence)

blog12

blog13

From the plots at different iterations of the EM algorithm, one can see that the Gaussian mixture is incrementally updated and refined to fit the data, and that the algorithm converged before reaching the maximum number of iterations we specified.

How many Gaussians?

Just like with K-means, GMM requires the user to specify the number of components (clusters) before training the model. Here, we can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to aid us in this decision. Let L be the maximum value of the likelihood function for the model, p be the number of estimated parameters in the model, and N be the total number of data points.

Then the AIC value of the model is the following:

AIC = 2 · p − 2 · ln(L)

And the BIC value is denoted as:

BIC = −2 · ln(L) + p · ln(N)

For both evaluation criteria, the lower the value, the better.
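A sketch of the model-selection loop with scikit-learn’s GaussianMixture, assuming X is the mixture data generated earlier:

from sklearn.mixture import GaussianMixture

n_components = range(1, 7)
models = [GaussianMixture(n_components=n, random_state=0).fit(X) for n in n_components]
aics = [m.aic(X) for m in models]
bics = [m.bic(X) for m in models]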

blog14

It appears that for both the AIC and BIC, 3 components is preferred.

 

Advice taken from Notes: AIC vs. BIC.

In general, it might be best to use AIC and BIC together in model selection. Most of the time they will agree on the preferred model. The main difference between the two is that BIC penalizes model complexity more heavily. If they do disagree, e.g. if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes and see which one leads to the most suitable result.

Things to Note

One drawback of GMM is that there are a lot of parameters to fit, and it usually requires lots of data and many iterations to get good results. An unconstrained model with K mixtures (or simply K clusters) and D-dimensional data involves fitting D × D × K + D × K + K parameters (K covariance matrices each of size D × D, plus K mean vectors of length D, plus a weight vector of length K). For example, with D = 100 and K = 10, that is already 101,010 parameters. This can be a problem for datasets with a large number of dimensions (e.g. text data), because with the number of parameters growing roughly as the square of the dimension, it may quickly become impossible to find enough data to make good inferences. So it is common to impose restrictions and assumptions to simplify the problem (think of it as introducing regularization to avoid overfitting).

One common way to avoid this problem is to fix the covariance matrix of each component to be diagonal (the off-diagonal values will be 0 and will not be updated). To achieve this, we can set the covariance_type parameter of scikit-learn’s GaussianMixture to diag.

The following link includes an example of comparing GMMs with different covariance matrices on a toy dataset, scikit-learn docs: GMM classification.


Classifying images with style

Dr. Ellick Chan

Recent research has shown that popular computer vision algorithms can struggle to recognize certain types of stylized or artistic photos posted on services such as Instagram, because these styles alter key properties of the image such as color, shape, and texture, along with other informative features needed for accurate recognition. MSiA students Kapil Bhatt, Ke Huang, Mengshan Jin, Balamurali Natarajan, and Elena Smith significantly improved the recognition rate on stylized images by devising a system that removes certain kinds of style transformations before the image is recognized by a conventional system. Their work was featured at the annual Deep Learning and Data Visualization Poster Session.

MSiA students Kapil Bhatt, Ke Huang, Mengshan Jin, Balamurali Natarajan, and Elena Smith share how to classify images with style.

The team used a sophisticated analytics technology called Deep Learning to remove artificial style transformations from photographs that were distorted by artistic effects such as Instagram’s Kelvin and Nashville filters. To accomplish this task, the team developed a custom version of a denoising autoencoder that takes a distorted image and produces a clean version of it. Using this technique, the team then fed the denoised images into a standard convolutional neural network for image recognition and managed to classify many stylized images correctly.

The team used Keras, a popular deep learning library, to develop a custom convolutional denoising autoencoder based on one of the samples published on the Keras blog. They made several helpful improvements to the example code, including adding leaky ReLU activations, switching to the Adagrad optimizer, and augmenting their training data with additional samples specific to the filters used.
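The team’s code is not reproduced here; purely as an illustrative sketch (my own, with assumed input size and filter counts), a convolutional denoising autoencoder with leaky ReLU activations and the Adagrad optimizer might be wired up in Keras like this:

from tensorflow.keras import layers, models, optimizers

inputs = layers.Input(shape=(128, 128, 3))            # stylized (distorted) image
x = layers.Conv2D(32, (3, 3), padding='same')(inputs)
x = layers.LeakyReLU()(x)                             # leaky ReLU activation
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(32, (3, 3), padding='same')(x)
x = layers.LeakyReLU()(x)
encoded = layers.MaxPooling2D((2, 2), padding='same')(x)

x = layers.Conv2D(32, (3, 3), padding='same')(encoded)
x = layers.LeakyReLU()(x)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(32, (3, 3), padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)  # reconstructed clean image

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer=optimizers.Adagrad(), loss='mse')
# autoencoder.fit(stylized_images, clean_images, epochs=..., batch_size=...)  # hypothetical training arrays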

This work comes at a transformative time for NU, as new initiatives in data science are changing the way the university does research in broad areas, including the arts. The project was part of a new course on Deep Learning (MSiA 490-30) taught by Dr. Ellick Chan. Technology developed from this project could be very useful at a time when people are increasingly posting stylized photos on the Internet. Many deep learning image recognition systems are currently trained on a variety of natural images, but as this project shows, new techniques may need to be developed to handle unexpected variations in images, including the use of stylizers and other image manipulations. This effort is just one step toward developing more robust recognizers.

Please see below for the team’s poster:

robustclassification_poster