What do repeated phrases or n-grams tell us about how distant from or close to each other pairs of early modern plays are?  Do n-grams provide  dependable measures of distance, and can we learn from them about the weight of various factors that differentiate between one play and another, whether by date, genre, or author?

The following is a report on various experiments with n-grams. My quantitative skills are very limited, and the techniques I use are crude and idiosyncratic, but from a proof of the pudding perspective, they work well enough. That is to say, where the results show very little or a whole lot of X or Y, they typically confirm what a knowledgeable observer knows already. The very obviousness of such results inspires confidence that the method works, at least some of the time.

The major finding of my n-gram analysis is that authors are trumps.  Plays that are known to be by the same author share on average twice as many n-grams as plays that are related by genre or are written in the same five-year span regardless of author. The “author effect” is much stronger than any other factor that establishes links between plays.

But strong as the author effect is, it is of relatively little help in determining whether play X was written by author Y.  We can say with great confidence that plays by the same author share more n-grams, just as we say with great confidence that men are on average taller than women by a percentage that can be specified with some precision. But we cannot argue that a person who is five foot nine inches tall is a man. Some women are much taller than many men, and for reasons that may or may not be worthy of further inquiry, some plays by different authors share many more n-grams than plays by the same author.  That is the old story of statistical analysis as being very helpful in general and not so helpful in determining the individual case.

How I got my n-grams

I have a table of some 730,000 distinct n-grams, extracted from the 320 plays in the EMD corpus. The n-grams range in length from three to 77 words and are distributed between one and 301 plays. Document frequency is the technical term for the number of distinct plays in which an n-gram appears. Collection frequency is the technical term for the total count of an n-gram in a collection. “I will not,” which occurs 2,320 times in 301 plays, is the most common trigram in terms of both document and collection frequency.
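The difference between the two frequencies is easy to see in a few lines of Python. This is a minimal sketch on invented toy data, not the extraction procedure used for the EMD corpus; the names `ngrams`, `frequencies`, and `plays` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequencies(plays, n=3):
    """Return (collection_freq, document_freq) for all n-grams.

    `plays` maps a play id to its list of tokens.
    """
    collection_freq = Counter()  # total occurrences in the whole collection
    document_freq = Counter()    # number of distinct plays containing the n-gram
    for tokens in plays.values():
        grams = ngrams(tokens, n)
        collection_freq.update(grams)
        document_freq.update(set(grams))  # count each play at most once
    return collection_freq, document_freq

# Toy "plays", invented for illustration only.
plays = {
    "play_a": "i will not go and i will not stay".split(),
    "play_b": "i will not yield".split(),
}
cf, df = frequencies(plays)
print(cf[("i", "will", "not")], df[("i", "will", "not")])  # 3 2
```

In the toy data the trigram occurs three times overall but in only two plays, so its collection frequency is 3 and its document frequency is 2.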

The n-grams were extracted in a tedious manner, which I will describe elsewhere, and they are based on a rather abstract representation of the underlying texts. The EMD corpus was linguistically annotated, and each “surface form” or spelling was mapped to a combination of a particular lemma and POS tag. From that combination I derived a standard spelling.  There are a few problems with that method, but they are outweighed by the advantage of leveling orthographic variance to a common standardized form of each word that has been linguistically defined in the same way. In this method, differences between texts are functions of those texts and not of some printer’s spelling habits.

Play links and how to weigh them

Every n-gram that is shared between one play and another establishes a link, however tenuous, between those two plays. The value of that link depends on four factors:

  1. The length of the n-gram. The frequency of n-grams declines very rapidly with length. There are some 460,000 distinct trigrams but only 1,360 heptagrams. You could define the value of an n-gram as 1 divided by the number of n-grams of that type. But such a way of counting assigns enormous weight to the occasional long n-gram. I used the square root of n-gram length instead. This gives a value of 1.73 to a trigram and a value of 8.66 to a 75-gram.
  2. The document frequency of the n-gram. If we think of a given link as having a force equal to 1, we can derive its value from the number of pairwise combinations among which it is shared. A link that is unique to two plays has a value of 1. A link shared among three plays generates three pairwise combinations. The number rises rapidly according to the formula n(n-1)/2. Thus the power of an n-gram shared among 16 plays is dissipated across 120 (16*15/2, or 15*8) pairwise combinations. In this inquiry I ignore n-grams that are shared among more than 16 plays because their values are too small and scattered to make a difference.
  3. The words in the n-gram. There is a difference between n-grams that string together common function words, such as “I will not” or “you and I,” and n-grams that include weighty or rare words, such as “veni, vidi, vici.” I put words in base-10 logarithmic bins by frequency and assigned them a value of 1/exponent. Thus a word that occurs five times has a value of 1/1, while a word that occurs 500 times has a value of 1/3. Then I added the values of all the words to get a weighted word value for the entire n-gram.
  4. The length of the plays. A link that is shared between two short works is a rarer thing than a link shared between two long works. I treated the aggregate word count of each play pair as a single text, counted the links of a certain type, and normalized their values to relative frequencies per 10,000 words.
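The word weighting in the third factor can be sketched as follows. The text gives only two data points (five occurrences fall in the first bin, 500 in the third), so the exact bin edges are an assumption here: the sketch uses the number of digits in the count, which puts 1-9 in bin 1, 10-99 in bin 2, and 100-999 in bin 3. The function names are hypothetical.

```python
def bin_exponent(count):
    """Power-of-ten bin for a word's frequency.

    Assumption: the bin is the number of digits in the count,
    so 1-9 -> 1, 10-99 -> 2, 100-999 -> 3, and so on.
    """
    if count < 1:
        raise ValueError("count must be a positive integer")
    return len(str(count))

def word_value(count):
    """Value of one word: 1 divided by its bin exponent."""
    return 1 / bin_exponent(count)

def weighted_word_value(counts):
    """Sum of the word values for every word in an n-gram."""
    return sum(word_value(c) for c in counts)

print(word_value(5))    # 1.0, i.e. 1/1
print(word_value(500))  # one third, i.e. 1/3
```

Under this reading a four-word n-gram made up entirely of rare words gets a weighted word value of 4, while one made up of very common function words gets well under 1.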

You put those four things together and you get a simple formula: divide the length value of an n-gram by its document frequency value, multiply it first by the weighted word value of the n-gram and then by the relative frequency of its occurrences in a given pairwise combination of plays. You can add these values for all the n-grams in a pairwise combination and arrive at an overall weighted value for each link between two plays.
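Here is a minimal sketch of that formula in Python. It is one reading of the description, not the actual procedure: in particular, treating “relative frequency” as occurrences per 10,000 words of the combined pair is an assumption, and all function and parameter names are hypothetical.

```python
import math

def pairwise_combinations(doc_freq):
    """Number of play pairs among doc_freq plays: n(n-1)/2."""
    if doc_freq < 2:
        raise ValueError("a shared n-gram must occur in at least two plays")
    return doc_freq * (doc_freq - 1) // 2

def ngram_value(length, doc_freq, word_value):
    """Length value divided by document frequency value, times word value."""
    return math.sqrt(length) / pairwise_combinations(doc_freq) * word_value

def link_contribution(length, doc_freq, word_value, occurrences, pair_word_count):
    """Value of one n-gram within one pair of plays.

    Assumption: relative frequency = occurrences per 10,000 words
    of the two plays taken together.
    """
    per_10k = occurrences * 10_000 / pair_word_count
    return ngram_value(length, doc_freq, word_value) * per_10k

def link_weight(shared_ngrams):
    """Sum the contributions of all n-grams shared by a pair of plays."""
    return sum(link_contribution(**g) for g in shared_ngrams)

# Invented example: two shared n-grams in a pair totalling 40,000 words.
shared = [
    {"length": 4, "doc_freq": 2, "word_value": 4.0,
     "occurrences": 1, "pair_word_count": 40_000},
    {"length": 3, "doc_freq": 16, "word_value": 0.6,
     "occurrences": 2, "pair_word_count": 40_000},
]
print(link_weight(shared))
```

Before normalization, a four-word n-gram found in only two plays and made up of four rare words comes to sqrt(4) / 1 * 4 = 8, while the same length spread over 16 plays is divided by 120.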

 

This is a pretty crude method of weighing n-grams, but it works reasonably well. It tells you, for instance, that “I will tell you,” which occurs in 178 plays, has a value below the measuring threshold, that “you and I have,” which occurs in 16 plays, has the barely measurable value of 0.01, and that “quem facient aliena pericula,” which occurs in two plays, has the maximum value of 8.

A few observations about repetitions within and between plays

Before turning to n-grams that occur in two or more plays, it helps to know something about the frequency of repetition within plays. An n-gram is more likely to be repeated within a given work than in some other work. It is possible to add some quantitative precision to this unsurprising pronouncement. In the EMD corpus, there are about 25,000 n-grams of three or more words that are repeated at least once in a given play but do not occur in any other play. Since there are 320 plays, you expect a play to have on average about 75 n-grams that are unique to it. By contrast, the EMD corpus has about 385,000 n-grams of three or more words that occur only in two works and are shared between 51,040 pairwise combinations of plays. That works out to an average of 7.5 shared n-grams between any pairwise combination of plays. Leaving aside the actual distribution of n-grams across these “play links,” as I will call them, it appears that a whole order of magnitude separates repetition within and across plays.
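The arithmetic behind these averages can be checked in a few lines. The counts are the rounded figures quoted above; the variable names are hypothetical.

```python
import math

plays = 320
pairs = math.comb(plays, 2)   # 320 * 319 / 2 pairwise combinations

within_only = 25_000          # n-grams repeated inside one play only (rounded)
shared_by_two = 385_000       # n-grams occurring in exactly two plays (rounded)

per_play = within_only / plays      # average per play
per_pair = shared_by_two / pairs    # average per pair of plays

print(pairs)               # 51040
print(round(per_play, 1))  # 78.1
print(round(per_pair, 1))  # 7.5
```

With the rounded 25,000 the per-play average comes out nearer 78 than 75, which does not change the conclusion: the ratio of the two averages is roughly ten to one.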

The table below gives a “seven number summary” of the difference between repeated n-grams within a play or shared across two plays. It shows pretty clearly that play-crossing n-grams are actually quite rare. If you find more than a handful of them in a given pair of plays, the odds are that something is going on.

| Value | within one play | across two plays |
|---|---|---|
| min | 5.32 | 0 |
| 10% | 20.61 | 0.56 |
| 25% | 24.16 | 1.05 |
| 50% | 36.45 | 1.67 |
| 75% | 51.02 | 2.46 |
| 90% | 68.38 | 3.32 |
| max | 214.9 | 35.98 |

There is one other unsurprising, but in its way quite striking, fact about n-grams repeated within a play: their frequency diminishes rapidly with distance. Because plays differ markedly in length, it makes sense to express distances in a normalized manner. You can express the distance between the two occurrences of an n-gram as a fraction of the total length of the work in which they occur. Repeated n-grams will thus be distributed in some fashion across deciles of distance. A chart of that distribution looks like this:

[Chart: distribution of within-play n-gram repetitions across deciles of normalized distance]

It is a reasonable inference from this chart that the occurrence of n-gram repetition within a play is strongly motivated by scenic context. If about half of all repetitions occur within the first decile and the drop from the first to the second decile is a factor of six, we can see why repetitions across plays are so much rarer than repetitions within plays.
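The normalization by play length can be sketched like this. It assumes that positions are word offsets within the play and that each decile includes its lower edge; both are assumptions, the function names are hypothetical, and the sample data are invented.

```python
from collections import Counter

def distance_decile(pos_a, pos_b, play_length):
    """Decile (1-10) of the gap between two occurrences of an n-gram,
    measured as a share of the play's total length in words."""
    gap = abs(pos_b - pos_a)
    # Integer arithmetic avoids floating-point trouble at decile edges.
    return min(gap * 10 // play_length + 1, 10)

def decile_histogram(repeats):
    """Count repetitions per decile.

    `repeats` is an iterable of (pos_a, pos_b, play_length) triples.
    """
    counts = Counter(distance_decile(a, b, n) for a, b, n in repeats)
    return [counts.get(d, 0) for d in range(1, 11)]

# Invented positions, for illustration only.
repeats = [
    (100, 1_100, 20_000),  # gap is 5% of the play: first decile
    (0, 5_000, 20_000),    # gap is 25% of the play: third decile
    (10, 300, 15_000),     # gap is about 2% of the play: first decile
]
print(decile_histogram(repeats))  # [2, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Because the gap is divided by the length of the play in which it occurs, a repetition 1,000 words apart counts as "closer" in a long play than in a short one.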

What can you learn from tabulating the results?

N-grams shared between different plays come in two flavours. Some of them are unique to a particular pairwise combination. Others are shared among several plays. For now I add their values. The “link weight” that is generated by the various multiplications, divisions, and additions is a rather abstract “rep value unit,” to be valued only in the proportions it points to. This point is made even clearer if you “standardize” the weights and create “z-scores” by subtracting the average from a given value and dividing the result by the standard deviation. The result of that operation tells you how many standard deviations a given value sits above or below the average. In a normal distribution about 95% of values are found within two standard deviations of the average. Thus a person with a height z-score of +2 is quite tall, while a person with a z-score of -2 is quite short. The weighted value for n-grams in pairwise combinations of plays ranges from 0 to 1393. The corresponding z-scores are -0.4 and 212.3, which is an astronomical number. This outlier comes from Jonson’s Neptune’s Triumph and the Fortunate Isles, two masques that recycle a lot of materials.
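Standardizing takes only a few lines. This sketch uses the population standard deviation, which is an assumption (the text does not say whether the population or the sample formula was used), and the weights are invented toy values rather than corpus data.

```python
import statistics

def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / sd for v in values]

# Toy link weights with mean 5 and standard deviation 2.
weights = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(weights)
print(scores)
```

A weight of 9 in this toy set sits two standard deviations above the average, so its z-score is +2; the unit of the original weights drops out entirely, which is the point of the exercise.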

In the following descriptive statistics the most interesting results are not found in the outliers, but in the shifting values of the interquartile range from the 25th to the 75th percentile. Here is the table that reveals differences when you group pairwise combinations of plays by author, genre, or date:

| play links by category | n | min | 10% | 25% | 50% | 75% | 90% | 99% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| All plays | 51040 | 0 | 1.63 | 2.75 | 4.16 | 5.79 | 7.66 | 13.83 | 26.54 | 1451.72 |
| same author | 2356 | 0 | 3.83 | 5.62 | 8.01 | 11.73 | 16.38 | 30.94 | 70.31 | 1451.72 |
| different authors | 48618 | 0 | 1.61 | 2.71 | 4.06 | 5.6 | 7.26 | 11.1 | 16.5 | 172.6 |
| less than five years apart | 6938 | 0 | 2.1 | 3.44 | 5.04 | 7.02 | 9.9 | 22 | 42.92 | 1451.72 |
| less than five years apart, different authors | 6047 | 0 | 2.01 | 3.27 | 4.71 | 6.35 | 8.19 | 12.68 | 20.69 | 28.88 |
| more than thirty years apart | 14845 | 0.01 | 0.31 | 2.11 | 3.28 | 4.69 | 6.19 | 9.75 | 13.99 | 172.6 |
| same genre | 15727 | 0 | 2.43 | 3.51 | 4.95 | 6.73 | 8.83 | 16.6 | 32.92 | 1451.72 |
| same genre, same author | 871 | 0 | 5.17 | 7.11 | 10.14 | 13.6 | 19.14 | 39.97 | | 1471.52 |

From this table you learn that the interquartile range for plays by the same author (5.62-11.73) is twice as high as the comparable range for plays by different authors (2.71-5.6). So there is a quite marked author effect, which shows up through the range. Plays less than five years apart share more n-grams (3.44-7.02, or 3.27-6.35 when the authors differ) than play pairs that are more than thirty years apart (2.11-4.69), but the time effect is considerably less than the author effect. The same is true of genre. The same genre or a narrow time gap will shift the interquartile ranges upward by 15% to 25%.
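For readers who want to reproduce this kind of comparison, here is a minimal sketch of grouping link weights by category and reading off the quartiles. The grouping function, its names, and the toy weights are hypothetical; only the four quartile values in the last step come from the table, and the "inclusive" quantile method is an assumption.

```python
import statistics

def quartile_range(values):
    """Return the 25th and 75th percentiles of a list of link weights."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def summarize(links):
    """Group (weight, same_author) pairs and report each group's quartiles."""
    groups = {"same author": [], "different authors": []}
    for weight, same_author in links:
        key = "same author" if same_author else "different authors"
        groups[key].append(weight)
    return {name: quartile_range(vals) for name, vals in groups.items()}

# Invented link weights, for illustration only.
links = [(4, True), (6, True), (8, True), (10, True), (12, True),
         (2, False), (3, False), (4, False), (5, False), (6, False)]
print(summarize(links))

# Quartiles taken from the table: same author vs. different authors.
same_author = (5.62, 11.73)
different_authors = (2.71, 5.6)
ratios = [round(s / d, 2) for s, d in zip(same_author, different_authors)]
print(ratios)  # [2.07, 2.09]
```

Both ends of the same-author range are a little more than twice the corresponding ends of the different-author range, which is the sense in which the author effect is about 100%.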

There is a big difference between 25% and 100%. Authors are trumps. There is more to be said about how to make sense of these crude things when looking at particular plays. But that is a story for another blog.