This is a blog post about the distribution of a special kind of “dislegomena,” tetragrams and longer n-grams whose “collection frequency” is 2 and whose “document frequency” is also 2. My purpose is to figure out how many swallows make a summer. If you are interested in the intertextual relationship between one play and another, how many shared n-grams do you need to make a plausible case that something is going on?
The data in this project are heavily dependent on the work of Phil Burns and Craig Berry. Burns’ MorphAdorner was the tool used to create linguistically annotated versions of the 548 plays in the SHC corpus, where every word token is mapped to a lemma, a part-of-speech tag, and a standardized spelling associated with this ‘lempos.’ Berry wrote the code that generated a complete list of repetitions for the Chicago Homer almost two decades ago (‘wrote the code’ is much more quickly written than the code itself!). A slightly revised version of this code worked well with a data input that consisted of all the words spoken by characters in the SHC plays (and only those words, barring a few processing errors), ignoring punctuation markers. Names were mapped to the placeholder NAME and are ignored in this post.
The key concept in Berry’s script is the independently recurring substring. The character tetragram ‘abcd’ includes the trigrams ‘abc’ and ‘bcd’. If the former never occurs outside the tetragram, but the latter does, ‘abc’ is not registered as a distinct trigram, but ‘bcd’ is. And so on for millions of n-grams.
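A toy illustration may help here. The sketch below is not Berry's script; it is a simplified stand-in that assumes an n-gram counts as "independently recurring" when it repeats and occurs more often than every (n+1)-gram that contains it. The function names are made up for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def independent_ngrams(tokens, n):
    """Return repeated n-grams that occur more often than any (n+1)-gram containing them.

    Assumption: this approximates the 'independently recurring substring' idea;
    it is not the logic of the original script.
    """
    short = ngram_counts(tokens, n)
    longer = ngram_counts(tokens, n + 1)

    # For each n-gram, remember the highest count of any longer n-gram
    # that contains it as a prefix or as a suffix.
    best_container = Counter()
    for gram, count in longer.items():
        for part in (gram[:-1], gram[1:]):
            best_container[part] = max(best_container[part], count)

    return {g: c for g, c in short.items() if c >= 2 and c > best_container[g]}


# 'a b c d' occurs twice; 'b c d' occurs a third time on its own after 'x'.
tokens = "a b c d x b c d y a b c d".split()
result = independent_ngrams(tokens, 3)
print(result)
```

In this toy input the trigram 'a b c' never occurs outside the tetragram 'a b c d', so it is dropped, while 'b c d' survives because of its extra occurrence.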
There are 5.9 million “repetition events” in the SHC corpus, that is to say, n-grams of three or more words that do not contain a name. Some occur in a lot of plays. “I will not” occurs 3565 times in 495 plays. Which makes you wonder how 53 plays manage to get by without this indispensable phrase. Others occur only in two plays, and these are the objects of inquiry here. There are ~660,000 of them. A majority of them are trigrams, and we focus here on the 273,000 dislegomena n-grams that consist of four words or more. There are some 150,000 pairwise combinations of plays in the SHC corpus (the formula is (548-1)*(548/2)). About 50,000 pairwise combinations do not share any such dislegomena. The average and median values hover around 2 for the ~100,000 pairwise combinations in which such dislegomena occur. For all pairwise combinations the median value is one, and the average not much above it.
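For readers who like to see the bookkeeping, here is a rough sketch of the counting described above. It is not the pipeline used for this post: it looks only at tetragrams (not longer n-grams), skips the independence filter and the NAME handling, and the three "plays" are invented toy data.

```python
import math
from collections import Counter, defaultdict

# Number of unordered play pairs: the same value as (548-1)*(548/2).
total_pairs = math.comb(548, 2)


def shared_dislegomena(plays, n=4):
    """Count, per play pair, the n-grams with collection frequency 2 and document frequency 2.

    `plays` maps a play id to its list of word tokens (hypothetical input format).
    """
    collection_freq = Counter()
    docs = defaultdict(set)
    for play_id, tokens in plays.items():
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            collection_freq[gram] += 1
            docs[gram].add(play_id)

    pair_counts = Counter()
    for gram, cf in collection_freq.items():
        # Exactly two occurrences overall, one in each of exactly two plays.
        if cf == 2 and len(docs[gram]) == 2:
            pair_counts[tuple(sorted(docs[gram]))] += 1
    return pair_counts


toy_plays = {
    "A": "i spy him now my lord".split(),
    "B": "look i spy him now".split(),
    "C": "my lord is here".split(),
}
pair_counts = shared_dislegomena(toy_plays)
print(total_pairs, dict(pair_counts))
```

Pairs that never appear in the result share no dislegomena, which is the situation of roughly a third of the real play pairs.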
In short, it is quite rare for two plays–texts that are typically between 15,000 and 25,000 words long–to share more than one or two of the dislegomena analyzed here. As for those dislegomena considered in themselves, they are thoroughly boring strings of words, such as “I am as I would” or “I spy him now.” The dislegomena in the plays with lots of them are no more striking than those with just one or two.
What about the pairwise combinations that share a lot of dislegomena? Expect some “duh moments,” but remember that even “duh” has its uses. There are 4,629 pairwise combinations that share seven or more dislegomena. Their z-scores are above 2, which means that the counts are two or more standard deviations above the average. This is a very crude measure, but it will do for the current purpose.
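The z-score mentioned above is simply the distance of a count from the average, measured in standard deviations. A minimal sketch, with invented counts rather than the SHC figures, and assuming the population standard deviation is the one wanted:

```python
from statistics import mean, pstdev


def z_scores(counts):
    """Return (count - mean) / standard deviation for each count."""
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]


# Invented shared-dislegomena counts for eight play pairs.
toy_counts = [1, 1, 2, 2, 2, 3, 3, 18]
scores = z_scores(toy_counts)

# Keep the pairs whose count lies more than two standard deviations above the mean.
flagged = [c for c, z in zip(toy_counts, scores) if z > 2]
print(flagged)
```

Whether pairs with zero shared dislegomena are included in the average changes the threshold, so the cut-off of seven reported here should not be expected to fall out of toy numbers like these.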
1018 or 22% of these pairwise combinations involve plays by the same author. Compare this with the fact that of the 273,000 dislegomena events, only 18,010 or 7% involve same author pairwise combinations. So it appears that plays by the same author are likely to share more dislegomena. Duh! If we look more closely at shared dislegomena by same-author play pairs, we discover that on average plays by the same author share five dislegomena, and the median is four. Roughly speaking, plays by the same author are likely to share twice as many dislegomena as plays by different authors. Clearly some author effect is at work, and there is some virtue in adding some precision to this intuitively plausible conclusion.
What is the use of this information? Well, it tells you something about what is a little and what is a lot, and it sets a framework of expectations within which the evidentiary value of shared dislegomena can be evaluated. There is not a whole lot to be said about play-pairs that share four or even eight dislegomena, but out of ~130,000 pairwise combinations of plays by (putatively) different authors, there are only 119 that share more than a dozen dislegomena of our type. That hardly means that they are “really” by the same author. It does, however, mean that something worth investigating may be going on in all or most of those combinations.
If you want to look at the data on which this post is based, you can find a version of them at
This is a spreadsheet derived from a MySQL database. It is possible, but awkward, to manipulate the data in that spreadsheet. If you know a little about Microsoft Access, it would be easy to import the data into Microsoft Access. Manipulating them in that environment is much faster and nimbler, but you need to know a little about SQL logic.
Pervez Rizvi prepared a much fuller and very careful survey of shared n-grams in Early Modern Drama at http://www.shakespearestext.com/can/.