The EMD Corpus

The EMD corpus of Early Modern Drama consists of 320 English plays written between 1520 and 1660. It contains all of Shakespeare’s plays (but not his poems), most or all of the plays of his major contemporaries and followers such as Ford, Jonson, Kyd, Marlowe, Marston, Middleton, Shirley, Webster, and quite a few other plays by writers now largely forgotten.

The EMD corpus does not include every surviving play. There are about 150 plays not in it, including Gorboduc and the anonymous versions of Taming of the Shrew and King John, which are of considerable interest to Shakespeareans. Nor is the selection based on some clearly thought-out principle. The EMD corpus consists of the plays that happened to be available in 2007 in the EEBO-TCP corpus. But however haphazardly assembled, it provides a sufficiently robust basis for the simple quantitative analyses that are the subject of this blog.

With the exception of Shakespeare, the plays are taken from the EEBO-TCP transcriptions. Until 2015, when they will pass into the public domain, access to these texts will be limited to subscriber institutions and their members. The Shakespeare texts are based on the text of the WordHoard Shakespeare, which is an eclectic revision of the Globe Shakespeare.

The EMD corpus is a linguistically annotated corpus consisting only of words spoken by characters in the plays. Prefatory materials, cast lists, speaker labels, stage directions, and notes were excluded to achieve an “apples to apples” basis for various comparative analyses. Because the exclusions were algorithmically done, there are probably a few inconsistencies, mostly relating to the treatment of prologues and epilogues, intrinsically ambiguous textual creatures.
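
The kind of exclusion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the corpus: it assumes TEI-style element names (sp, speaker, stage, note), and the sample text is invented.

```python
import xml.etree.ElementTree as ET

# Assumed element names for material that is not spoken by a character.
EXCLUDED = {"speaker", "stage", "note"}

def spoken_text(elem):
    """Collect the text under elem, skipping excluded children
    but keeping any text that follows them (their 'tail')."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in EXCLUDED:
            parts.append(spoken_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def spoken_words(xml_string):
    """Return the whitespace-separated words found inside <sp> elements."""
    root = ET.fromstring(xml_string)
    words = []
    for sp in root.iter("sp"):
        words.extend(spoken_text(sp).split())
    return words

# Invented sample, not taken from any play in the corpus.
sample = """<div>
  <stage>Enter two speakers.</stage>
  <sp><speaker>First.</speaker><l>Good morrow, <stage>aside</stage> friend.</l></sp>
  <sp><speaker>Second.</speaker><p>And to you.</p></sp>
</div>"""

print(spoken_words(sample))
```

Anything outside a speech, such as the opening stage direction, is never visited. A prologue that is not wrapped in a speech element would be dropped by a rule like this one, which is the sort of inconsistency mentioned above.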

Linguistic annotation in the EMD corpus consists of mapping every surface spelling to the combination of a part-of-speech tag and a lemma in its modern spelling. Depending on context, the spellings “lo’d”, “lou`d”, “lou’d”, “lou’de”, “loued”, “lovd”, “lov’d”, “lov’de”, and “loved”, all of which occur in the EMD corpus, were mapped to either “love_vvd” or “love_vvn”, where “vvd” marks the past tense of “love” and “vvn” the past participle.
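
A small sketch may make the effect of this mapping concrete. The token list below is hypothetical: the tags are assigned by hand, whereas in the corpus they depend on context. The helper name split_lempos is likewise just an illustration.

```python
from collections import Counter

# Hypothetical annotated tokens: (surface spelling, lemma_pos) pairs.
# The tags here are assigned by hand for illustration only.
tokens = [
    ("lou'd", "love_vvd"),
    ("loued", "love_vvn"),
    ("lov'd", "love_vvd"),
    ("loved", "love_vvd"),
    ("lovd", "love_vvn"),
]

def split_lempos(tag):
    """Split 'love_vvd' into ('love', 'vvd') at the last underscore."""
    lemma, _, pos = tag.rpartition("_")
    return lemma, pos

# Counting by spelling keeps every variant apart.
surface_counts = Counter(spelling for spelling, _ in tokens)

# Counting by lemma and part of speech collapses the variants.
tag_counts = Counter(tag for _, tag in tokens)

print(len(surface_counts), "distinct spellings")
print(dict(tag_counts))
```

Five distinct spellings collapse into two categories, which is the whole point of the annotation.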

No human reader would want to read text so processed, but — making allowances for the inevitable errors of algorithms, however assiduously corrected by erring humans — it supports “apples to apples” comparisons of many kinds. There are times when you want to focus on the orthographical or typographical habits of authors or printers, and there are times when you want to ignore them. The analysis of the EMD corpus in the following entries assumes that the 56 authors of EMD plays (including “anonymous”) all spelled ‘took’ and ‘taken’ in the same way and thought of them as the proper spellings of the past tense and participle of ‘take’. I use the term ‘lempos’ for the abstract construct of a lemma and part of speech however spelled by the author(s) or represented by the printer(s).

It is worth dwelling on the costs and benefits of this approach from a stylometric perspective. The first systematic and computationally assisted analysis of textual difference was Mosteller and Wallace’s study of the dozen Federalist papers where there was disagreement about whether they had been written by Madison or by Hamilton. Among other things, the papers differed in their relative frequencies of ‘while’, ‘whilst’ and ‘by’. If two authors differ in their use of ‘while’ and ‘whilst’, the difference runs a little deeper than orthographic variance, but you can substitute one for the other without changing the fabric of their discourse. But a difference in the use of ‘by’ cuts deeper and points to constructions or phrases that one author prefers or avoids when compared with another.
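
Relative frequency of this kind is easy to compute. The sketch below expresses it as occurrences per thousand words, which is one common convention and not necessarily the one Mosteller and Wallace used. The two sample passages are invented and have nothing to do with the Federalist papers.

```python
from collections import Counter

def rate_per_thousand(words, markers):
    """Return the rate per 1,000 words for each marker word."""
    if not words:
        raise ValueError("cannot compute rates for an empty text")
    counts = Counter(w.lower() for w in words)
    total = len(words)
    return {m: 1000 * counts[m] / total for m in markers}

markers = ["while", "whilst", "by"]

# Invented passages, for illustration only.
text_a = "whilst the sun shines we walk by the river and by the mill".split()
text_b = "while the sun shines we walk along the river and past the mill".split()

print(rate_per_thousand(text_a, markers))
print(rate_per_thousand(text_b, markers))
```

Swapping ‘whilst’ for ‘while’ leaves the rest of the sentence intact, but removing ‘by’ forces a different construction.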

For “authentication studies”, whether in art history, literature, or espionage, accidentals are of the essence, and they typically provide much harder evidence. An accidental may determine whether X is or is not the work of Y. But if you want to know how X and Y differ from each other, accidentals may stand in the way. If the dozen disputed Federalist papers differed only in their frequency of ‘while’ and ‘whilst’, you would have to agree that their authors write about the same things in the same way, unless you could argue that the difference pointed to some significant difference in their way of relating themselves to time.

There is a limited number of things that texts in the EMD corpus may be said to “know” about themselves. Here is a list of them:

  1. Each word knows that it is spoken by a character in a play, though it is ignorant about the name, sex, or other properties of the character.
  2. Each word knows whether it is part of verse or prose (ignoring occasional errors by encoders of the texts).
  3. Each word knows that it is part of a distinct utterance by a character, although it does not know which.
  4. Each word knows what text it belongs to.
  5. The words in a text do not all know what act or scene they belong to because act and scene divisions are inconsistently observed in the early print editions and therefore inconsistently reported in the digital files.
  6. Each text knows about its genre, its date of composition, and its author.
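
One way to picture this list is as a record attached to every word. The sketch below is an assumption about how such a record might look, not a description of the corpus’s actual data format. All field names are hypothetical, and the sample values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Work:
    """What a text knows about itself (item 6)."""
    work_id: str
    author: str
    genre: str
    year: int

@dataclass(frozen=True)
class Token:
    """What a single word knows about itself (items 1 to 5)."""
    lempos: str            # lemma plus part of speech, e.g. "love_vvd"
    is_verse: bool         # verse or prose (item 2)
    speech_index: int      # which speech within the work; the speaker is not recorded (items 1, 3)
    work: Work             # the text the word belongs to (item 4)
    act: Optional[int] = None    # often unknown (item 5)
    scene: Optional[int] = None  # often unknown (item 5)

# Placeholder values, not real corpus metadata.
play = Work(work_id="example-play", author="anonymous", genre="comedy", year=1600)
word = Token(lempos="love_vvd", is_verse=True, speech_index=12, work=play)

print(word.work.author, word.is_verse, word.act)
```

Note what is absent: there is no field for the speaker’s name or sex, and the act and scene fields default to unknown.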

As a result, texts in the EMD corpus can be compared with some confidence at the level of

  1. the word occurrence, captured as a ‘lempos’ or combination of lemma and part of speech
  2. the binary distinction of verse and prose
  3. the sentence
  4. the speech or any sequence of words between one change in characters and the next
  5. the work as a whole
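
Several of these levels amount to grouping the same word records by a different key. The following sketch covers three of them and leaves out the sentence, which the toy records do not mark. The records are invented, and the tuple layout (work, speech number, verse flag, lempos) is an assumption made for the example.

```python
from collections import Counter, defaultdict

# Invented records: (work_id, speech_index, is_verse, lempos).
records = [
    ("play_a", 0, True, "love_vvd"),
    ("play_a", 0, True, "take_vvn"),
    ("play_a", 1, False, "love_vvd"),
    ("play_b", 0, False, "take_vvd"),
]

def counts_by(records, key):
    """Count lempos values within each group defined by key(record)."""
    grouped = defaultdict(Counter)
    for rec in records:
        grouped[key(rec)][rec[3]] += 1
    return dict(grouped)

by_work = counts_by(records, key=lambda r: r[0])
by_mode = counts_by(records, key=lambda r: "verse" if r[2] else "prose")
by_speech = counts_by(records, key=lambda r: (r[0], r[1]))

print(by_work)
print(by_mode)
print(by_speech)
```

A grouping by scene would need a scene number in every record, which is exactly what the corpus cannot supply consistently.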


In many of the printed sources, the division of the play into acts or scenes is not consistently observed. As a result, comparison between scenes is not possible in the EMD corpus at the moment. This is a pity because the scene is the most important unit of construction. It is also plausible that collaboration, a very common phenomenon among Early Modern playwrights, typically involved a division of labour by scene. Algorithmic comparison of differences between scenes may well be the most potent tool for teasing out details of collaboration. But it will take considerable manual editing of the current XML source texts to make this possible across the entire corpus.

For the moment, every play in the EMD corpus is assigned to one, and only one, author. That is pretty rough justice. On the other hand, initial results provide strong evidence that differences by author trump differences by genre or period. For many purposes it may be a good enough working hypothesis that a given play is predominantly the work of a single author.