In my earlier post “From Shakespeare His Contemporaries to the Book of English” I promised to release all SHC plays “later this spring.” I have now done so, and you may download all 504 of them from https://github.com/martinmueller39/shc. Most of the texts come from Phase I of the TCP project and have been in the public domain since January 2015. Proquest graciously gave me permission to add about three dozen texts from TCP Phase II, notably the 1647 edition of the plays of Beaumont and Fletcher, which contains the first printings of some of their plays. I am likely to add a few plays here and there, but the corpus in its current form gives you a pretty complete view of the work of Shakespeare His Contemporaries broadly construed.
If this were a software release it would be somewhere between late alpha and early beta. There are still many errors, and much work remains to be done. I am following a “release early and release often” strategy in the hope that users will help discover errors. Once they are pointed out they are usually easy to fix.
In what follows I describe what has been done and remains to be done to these texts, all of which are derived from TCP versions but differ from them in various ways. My goal has been to enable a corpus-wide view of Early Modern drama and turn the texts into “data” that support DATA or Digitally Assisted Text Analysis. Franco Moretti calls it “distant reading”. Matt Jockers’ term is “macro-analysis,” although what he means by it is really “macro- and micro-analysis.” I have called it Scalable Reading. Whatever you call it, it requires the transformation of texts into comparable comparanda that a machine can ‘grok’. Each text leads a double life in a corpus that should be both human-readable and machine-actionable.
Most of the SHC texts were typeset and printed just once. A few of them exist in multiple versions and raise challenging textual problems. From the corpus-wide perspective of SHC, which aims at articulating intertextual variance and resemblance, the “intra-textual” variance of such texts is of lesser importance. At this point SHC includes only one version of each title–usually but not always the first printing.
This project has been with me for four years, and many people have helped along the way. My most important debt is to the undergraduates who did and are still doing much of the work: Nayoon Ahn, Hannah Bredar, Madeline Burg, Nicole Sheriko, and Melina Yeh at Northwestern and Kate Needham and Lydia Zoells at Washington University in St. Louis. With one exception, the Northwestern students were freshmen or sophomores when they began their work. I am very grateful to Mary Finn, Associate Dean for Undergraduate Studies at Northwestern’s WCAS, for listening to my argument that the best way of getting students into research is to get them early and involve them in scholarly tasks that are simple but fundamental and where they can learn a lot while doing work that is useful to others. The students were supported on summer research grants. A cost benefit analysis looking at what they did and what they learned will conclude that this was a good investment.
This project could never have been done without Phil Burns, a brilliant programmer with a deep grasp of statistics and an equally deep and intuitive understanding of linguistic phenomena in a diachronic perspective. His NLP tool suite MorphAdorner has been the essential foundation for just about every aspect of the work done with the SHC corpus. I am deeply grateful to him and also to the Mellon Foundation for a grant that enabled him to fine-tune MorphAdorner and make it play nice with the vagaries of Early Modern English, of which there are many.
I am also grateful to the Mellon Foundation for their support of the TEI Simple project, which I expect to provide a simple but robust framework for displaying and querying SHC texts and other texts from the TCP archives.
Craig Berry designed the AnnoLex collaborative curation tool that sits on top of MorphAdorned data and allows for the dispersed annotation and curation of texts anywhere anytime. It also keeps a very precise log of the who, what, when, and where of any textual change. The Greek papyrologists have for many years kept a Berichtigungsliste or “correction list” of papyri on a global scale. Most of the ~46,000 emendations of the TCP texts in the SHC corpus have gone through AnnoLex, which can justly claim its role as a reliable keeper of the Berichtigungsliste of Shakespeare His Contemporaries. Modest but timely grants from Robert Lee Taylor, the Director of Northwestern Academic and Research Technologies, helped to get AnnoLex started and improve it over time. It doesn’t always take a lot of money to do something, and AnnoLex is my choice example of a low-budget high-impact tool. I am very grateful to both Bob and Craig for that.
In the summer of 2013, at a session of the Folger Library’s Early Modern Digital Agenda seminar, I met Joe Loewenstein, and we discovered a shared interest in the role of undergraduates as “citizen scholars,” to use a term that Greg Crane likes. Two of Joe’s students, the Kate and Lydia mentioned above, got interested in using AnnoLex in Rare Book libraries where you would have access to a printed copy as well as the digital page image in AnnoLex. By way of preparing for a trip to various libraries they compiled a census of copies of SHC titles held in US libraries and the Bodleian. Stephen Pentecost in Washington University’s humanities computing shop turned this census into a Library Finder. He and Craig Berry cleverly linked the Library Finder and AnnoLex together, enabling a “find it, fix it, log it” workflow that can start from a text or from a library. You learn from the Library Finder that the University of Texas, Austin, has copies of 271 of the SHC plays. A student there who is interested in collaborative curation could consult the Library Finder, decide whether s/he would rather tackle the 42 remaining defects in Goffe’s Orestes or the 11 defects in Ford’s The Lover’s Melancholy, with a single click from the Library Finder fire up AnnoLex, and be assured that emendations will be reviewed and find their way into a central repository. This is a cool tool for “directing the crowd,” to quote a colleague in Computer Science. It has been a delight to be part of this informal, effective, and quite cheap inter-institutional cooperation.
Richard Proudfoot and Sir Brian Vickers have been very supportive of the SHC project and given me good advice on the standard spelling versions of the SHC corpus. Richard Proudfoot in particular read through half a dozen plays with great care. Each of his corrections led to the correction of dozens and sometimes hundreds of errors elsewhere. I am grateful for their help.
Relation to the TCP source files
In the TCP archive every digital file represents one bibliographical item. Plays are usually “playbooks” (one play per book), but plays were also printed in collections. In the SHC corpus every distinct title is a file of its own. 134 of the SHC files were created by splitting 29 TCP files. The filenumber is a reliable guide to provenance: the filenumber for The Queen of Corinth is SHC-A27177_27, which means “part 27 of the TCP file A27177”, where ‘A27177’ is the TCP filenumber for the Comedies and Tragedies written by Francis Beaumont and John Fletcher, aka “Wing B1581” or “ESTC R22900”.
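The provenance rule is simple enough to put into a few lines of Python. What follows is a minimal sketch rather than a tool that ships with the corpus: the pattern is inferred from the single example above, and the assumption that an unsplit file carries no “_NN” suffix is mine. The names `ShcFileId` and `parse_shc_id` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class ShcFileId(NamedTuple):
    tcp_id: str           # TCP file number, e.g. "A27177"
    part: Optional[int]   # part number within a split TCP file, else None


# Pattern inferred from the one example "SHC-A27177_27".
# Assumption: files that were not split have no "_NN" suffix.
_SHC_ID = re.compile(r"^SHC-([A-Z]\d+)(?:_(\d+))?$")


def parse_shc_id(file_id: str) -> ShcFileId:
    """Split an SHC file number into its TCP source and part number."""
    match = _SHC_ID.match(file_id)
    if match is None:
        raise ValueError(f"unrecognized SHC file number: {file_id!r}")
    tcp_id, part = match.groups()
    return ShcFileId(tcp_id, int(part) if part is not None else None)


queen_of_corinth = parse_shc_id("SHC-A27177_27")
print(queen_of_corinth.tcp_id, queen_of_corinth.part)
```

Grouping files by `tcp_id` would then bring the parts of a split collection back together.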
Ten of the split texts are two-part plays. The others come from collections. The 21 dramatic sketches of Margaret Cavendish are an outlier in several ways. They were published in 1662, although some of them were probably written earlier. They are not full-size plays. But there are hardly any plays or play-like texts by women authors before the Restoration. So I have included them in this corpus in the hope that some enterprising undergraduate will profile this set of outliers against the larger corpus and analyze it from a perspective of genre and gender. That could make for a useful honors project.
The texts have been encoded in TEI Simple. TEI Simple is a new customization of the TEI standard. It is a close cousin of TEI Lite, but differs in associating its schema with a “processing model” or formally defined set of processing rules which permit modern web applications to easily present and analyze the encoded texts. The final version of TEI Simple will be released later this summer in the hope that it will significantly reduce the time and trouble of moving an encoded text into a user-friendly environment where it can be displayed, manipulated, and analyzed. By that time I expect to have a TEI Simple web site that will show the plays in a readable format both in original and standardized spellings.
The conversion of TCP texts from their EEBO dtd into TEI Simple has been a lossless and almost entirely automatic process.
The TCP texts have followed a controversial policy of recording typographical changes. The transcriptions recognize only two typographical states: marked and unmarked. Marked passages are enclosed in <hi> elements, but without indicating the manner of the marking. In the most common case a word or phrase enclosed in a <hi> tag will be in italics, but you would have to check the image to be sure. The distinction between ‘marked’ and ‘unmarked’ becomes problematical very quickly. For instance, stage directions are often set in italics, but names in them may be in plain type or smallcaps. Against the surrounding spoken words, the stage direction is marked. But within the stage direction, the words in italics are unmarked, while the name in plain type is marked.
In the SHC texts I have replaced <hi> elements with ‘rendition="hi"’ attributes for every word token (see below). This does not lose information but demotes it to a lower level where it can be used, ignored, or repurposed. The very large majority of <hi> elements in TCP texts enclose single words or short phrases that are names or foreign words. Thus the TCP notation <hi>Caesar</hi> turns into the SHC notation <w ana="#n1-nn" rendition="hi">Caesar</w>, which in an unannotated form of the text could turn into <name>Caesar</name>, a more expressive and flexible encoding.
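To see what “used, ignored, or repurposed” means in practice, here is a small Python sketch using only the standard library. The fragment is made up for the purpose: real SHC files carry TEI namespaces and an xml:id on every token, which I have left out, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the SHC style. Real files also have TEI namespaces
# and xml:id attributes on every <w>; both are omitted here for brevity.
SAMPLE = (
    "<l>"
    "<w>Hail</w> "
    '<w ana="#n1-nn" rendition="hi">Caesar</w>'
    "</l>"
)


def marked_words(xml_text):
    """Use the marking: list the words that were typographically marked."""
    root = ET.fromstring(xml_text)
    return [w.text for w in root.iter("w") if w.get("rendition") == "hi"]


def plain_text(xml_text):
    """Ignore the marking: return the words as a plain string."""
    root = ET.fromstring(xml_text)
    return " ".join(w.text for w in root.iter("w"))


print(marked_words(SAMPLE))
print(plain_text(SAMPLE))
```

Repurposing would be a third function that wraps marked proper nouns in <name>, but whether a marked word really is a name is a judgment the POS tag can only suggest.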
Tokenization and linguistic annotation
The SHC texts have been tokenized and linguistically annotated with MorphAdorner, the NLP tool suite designed by Philip R. Burns. Tokenization means that word and sentence boundaries are marked in a machine-readable fashion. Every word token has been given a unique xml:id, which is like a Social Security or VIN number. The word, person, or car identified by such a number can be associated in a stable fashion with an arbitrary array of data. Conceptually, explicit tokenization transforms a text into a sequence of addresses. Think of a sentence as a block and of word tokens as the addresses of houses on that block. Their inhabitants can change. The spelling ‘CVPIDS’ at the address ‘A11909_04-042570’ can be associated with (or replaced by) the standardized spelling “Cupid’s”, the lemma “Cupid”, or the POS tag ‘n1g-nn’, meaning ‘genitive of a proper noun’. The notation for this
<w lemma="Cupid" ana="#n1g-nn" xml:id="A11909_04-042570">
<choice> <orig>CVPIDS</orig> <reg>Cupid's</reg> </choice>
</w>
does nothing for the reader and in fact makes the text unreadable by humans. But this “explicitated” version, which is hidden in any readable display, enormously enhances the agility of the text when processed by a machine and provides the foundation for complex search operations. It also facilitates the task of textual curation because the keeping track of individual curatorial acts is greatly simplified by the radical divide and conquer technique of breaking the text into atomic and individually addressable objects with known locations.
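A short Python sketch may make the “address and inhabitants” idea concrete. It reads one token of the kind described above and returns everything associated with its address. This is an illustration under assumptions, not MorphAdorner code: the snippet omits the TEI namespace that real files declare, and `token_record` is a name I made up.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TOKEN = (
    '<w lemma="Cupid" ana="#n1g-nn" xml:id="A11909_04-042570">'
    "<choice><orig>CVPIDS</orig><reg>Cupid's</reg></choice>"
    "</w>"
)


def token_record(w):
    """Collect the data attached to one word token."""
    orig = w.findtext("choice/orig")
    reg = w.findtext("choice/reg")
    if orig is None:
        # Assumption: a token without <choice> has a single spelling.
        orig = w.text or ""
    if reg is None:
        reg = orig
    return {
        "id": w.get(XML_ID),
        "orig": orig,
        "reg": reg,
        "lemma": w.get("lemma"),
        "pos": (w.get("ana") or "").lstrip("#"),
    }


record = token_record(ET.fromstring(TOKEN))
print(record["id"], record["orig"], record["reg"], record["pos"])
```

A reading display would pick either `orig` or `reg` and throw the rest away; a search engine would index all of them under the same id.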
Departures from the TCP text
The SHC texts tacitly depart from the TCP source texts in the following ways:
- Long ‘s’ is replaced by plain ‘s’, though it could be restored algorithmically with some manual tweaking.
- ‘Ʋ’ (\u01b2) is replaced by ‘V’ or ‘U’ depending on context.
- TCP texts do not mark line breaks but keep soft hyphens from the source texts. They have been dropped in the SHC text unless they are true hyphens.
- Brevigraphs represented by character entities, such as ‘&abque;’ for ‘que’, have been replaced by their content. Thus ‘cum&abque;’ => ‘cumque’.
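Two of these departures are mechanical enough to sketch in Python. The sketch below is not the code that produced the SHC texts. It handles only long ‘s’ and the one brevigraph entity named above; the replacement of ‘Ʋ’ and the treatment of soft hyphens depend on context and are deliberately left out. The table `BREVIGRAPHS` and the function `normalize` are hypothetical names.

```python
import re

# Only the entity mentioned in the text; a real table would be longer.
BREVIGRAPHS = {"abque": "que"}

_ENTITY = re.compile(r"&(\w+);")


def normalize(text):
    """Replace long s and expand known brevigraph entities."""
    text = text.replace("\u017f", "s")  # U+017F LATIN SMALL LETTER LONG S

    def expand(match):
        # Unknown entities are left untouched rather than guessed at.
        return BREVIGRAPHS.get(match.group(1), match.group(0))

    return _ENTITY.sub(expand, text)


print(normalize("cum&abque;"))
print(normalize("\u017fpeake"))
```

Leaving unknown entities alone is a cautious choice: it keeps the transformation reversible for anything the table does not cover.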
Textual defects in the TCP transcriptions
Imperfections in the TCP transcriptions are largely a function of the quality of the microfilm images from which the texts were transcribed. Transcription was farmed out to vendors whose employees were mostly located in Asian countries. If transcribers could not decipher a letter, word, or longer passage they were instructed to mark the lacuna as precisely as possible. In the SGML texts these lacunae appear in such notations as <GAP DESC="illegible" RESP="pdcc" EXTENT="1 letter">. These “known unknowns”, to use Donald Rumsfeld’s phrase, are scattered across the entire archive but cluster most heavily in about ten percent of the pages.
The transcribed texts were reviewed by staff at Michigan and Oxford with professional experience in Early Modern texts. For each text a reviewer read a sample of between 3% and 5% of the text, but at least five pages, to determine whether the text met the quality goal of no more than one error per 20,000 keystrokes. The “known unknowns” did not count as errors.
Many of the TCP texts are quite short: the median length is less than 7,000 words. If a text contained two thousand words or less, it had a good chance of being fully proofread by a competent reader. Longer texts were not proofread, but the quality judgment was based on a relatively small sample. Keep in mind, though, that two dozen samples by the same vendor would give you a fairly good sense of the quality of the work.
The scholarly perception of the quality of TCP transcriptions has been disproportionately shaped by the quality of the small percentage of texts with many gaps. People have an incurable tendency to judge a barrel by its worst apples, as I have argued before. Below is a table showing the distribution of textual defects per 10,000 words in all the Phase I TCP texts, the SHC texts, and the SHC texts after some collaborative curation.
Known Defects per 10,000 words
|percentile||TCP Phase 1||SHC plays before curation||SHC plays after curation|
The table shows, unsurprisingly, that playbooks have more defects than other TCP texts. In the interquartile range between the 25th and 75th percentile the number of gaps is twice as high. Play texts were often poorly printed by the standards of their day. Earlier works will on average have more gaps than later ones because orthographic standardization increased and typographical quality improved with time. A majority of TCP texts come from the second half of the 17th century, while half of the SHC plays were created before Shakespeare’s death in 1616. The average publication date may be a silly concept, but it is telling to compare the average date of a TCP text (1653) with the average date of a play text (1616).
Collaborative curation by undergraduates
As reported earlier, undergraduates have gone about the basic clean-up of TCP texts in a very competent and energetic manner. In the summer of 2013, the Northwestern undergraduates Nayoon Ahn, Hannah Bredar, Madeline Burg, Nicole Sheriko, and Melina Yeh filled ~34,000 of ~52,000 gaps over an eight-week period. The rate of defect for the interquartile range of SHC texts dropped by a factor of three. If you assume an average page size of 400 words, at the 75th percentile you would expect five gaps every two pages before curation and two gaps every three pages after curation. That is progress.
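The conversion between “defects per 10,000 words” and “gaps per page” is plain arithmetic, sketched here in Python. The 400-word page is the assumption stated above; the function names are mine, and the rates are back-calculated from the page figures in the paragraph rather than read from the table.

```python
def defects_per_10k(gaps, pages, words_per_page=400):
    """Convert 'gaps over so many pages' into defects per 10,000 words."""
    return gaps * 10_000 / (pages * words_per_page)


def gaps_per_page(rate_per_10k, words_per_page=400):
    """Convert defects per 10,000 words into expected gaps per page."""
    return rate_per_10k * words_per_page / 10_000


before = defects_per_10k(gaps=5, pages=2)  # five gaps every two pages
after = defects_per_10k(gaps=2, pages=3)   # two gaps every three pages

print(before)            # 62.5
print(round(after, 1))
print(round(before / after, 2))
```

The page figures are themselves rounded, so the ratio that comes out is only a rough check on the reported drop.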
Progress will continue this summer. Kate Needham and Lydia Zoells, two current juniors at Washington University in St. Louis, are in the midst of a curation sprint that has taken Kate to the Bodleian and Lydia to the Newberry and the Rare Book Library of the University of Chicago. They will spend a week at the Folger Library, where they will be joined by Hannah, who has also done some work at the Newberry. The three of them have made well over 2,000 corrections in the past month and are likely to make significant inroads on the ~10,000 remaining gaps in the SHC corpus.
Here I just repeat what I wrote earlier: While fixing known unknowns, we stumbled across roughly 10,000 ‘unknown unknowns’, often by accident and sometimes by looking for spellings unlikely to be right. No more than a handful of the plays in the project were proofread from the first to the last word. So we do not know whether these unknown unknowns are all, most, or just some of the impossible spellings in the SHC corpus. But there seems to be at least one of them for every five known defects. I had thought that there was a positive correlation between known and unknown defects, on the hypothesis that transcribers faced with hard-to-read texts would make more errors in transcribing what they thought they could read. But I was wrong. There is no clear correlation between known and unknown defects. From which I conclude that impossible spellings in the TCP transcriptions are for the most part accurate renderings of what the transcribers saw: spellings like ‘sortunate’, ‘hnsband’, ‘assliction’, ‘a biectly’, ‘I and somely’, ‘lamestallion’ or ‘suriesrend’. It was certainly the right policy to ask transcribers not to emend what they saw. On the other hand, these are the kinds of things that show up in the printers’ errata and are the occasions for their effusive and whimsical apologies about the errors of their trade. A digital surrogate should correct these cases and represent the printer’s intention, about which in the overwhelming number of cases there can be no doubt. On the bright side, these textual defects are an eloquent testimony to the conscientiousness of the transcribers.
How are textual corrections recorded?
The corrections of each text are listed in an appendix in a manner that lets a reader both judge the emendation in context and refer it back to its place in the printed source. Take the following two corrections from Thomas Newton’s Eunuch, one of them an incompletely, the other an incorrectly transcribed (or printed) word:
|A13613_02-39-a-0530||per●●e||Now in that case iudiciously he wrought The||parrie||at the barre , then defensorie , To plead|
|A13613_02-40-a-1130||assliction||so go not one , And faintly bearing loues||affliction||, When y’are not sought to , when you are|
The code at the left concatenates the TCP file number with the EEBO image number and a page-based word counter. The first defect occurs on the left side (‘a’) of the double-page image 39. The word counter increments by ten, so that the defect is located at word 53 (towards the top) on the left side of image 39. In the case of incorrectly transcribed words we did not distinguish between printer’s and transcriber’s errors. The decision was based on a cost/benefit calculus of how to make the best use of available time. If you care about the distinction, the data allow you to follow it up. It is a safe assumption that such errors are typically the printer’s.
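Here is a minimal Python sketch that unpacks such a locator into its parts. It is inferred from the two examples above and is not part of AnnoLex. In particular, I am assuming that ‘b’ marks the right side of the double-page image, which the examples do not show, and `Locator` and `parse_locator` are made-up names.

```python
import re
from typing import NamedTuple


class Locator(NamedTuple):
    file_id: str   # TCP file number, including any part suffix
    image: int     # EEBO double-page image number
    side: str      # 'a' = left side; 'b' = right side (assumed)
    word: int      # position of the word on that side of the image


_LOCATOR = re.compile(
    r"^(?P<file>.+)-(?P<image>\d+)-(?P<side>[ab])-(?P<counter>\d+)$"
)


def parse_locator(code):
    """Unpack a locator such as 'A13613_02-39-a-0530'."""
    match = _LOCATOR.match(code)
    if match is None:
        raise ValueError(f"unrecognized locator: {code!r}")
    # The counter increments by ten, so 0530 is word 53.
    word = int(match["counter"]) // 10
    return Locator(match["file"], int(match["image"]), match["side"], word)


print(parse_locator("A13613_02-39-a-0530"))
print(parse_locator("A13613_02-40-a-1130"))
```

With the image number and side in hand, a reader in a rare book room knows which opening to turn to and roughly how far down the page to look.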
A good enough text
A minimal definition of a good enough digital transcription is a text that has been proofread word for word against its source, with manifest errors corrected and cruxes identified. None of the SHC texts can be completely certified to have undergone that treatment, but quite a few come close enough. It would be good to have some certifying mechanism of that kind. With most of the plays in SHC, it would take a lot less time to do what is enough to certify a play than to write an article. There may be a thousand scholars in the world who right now are thinking about writing something about Early Modern Drama, most probably Shakespeare. Would it be a good collective decision by the Early Modern Drama community to invest a small percentage of its scholarly labour into proofreading and trade the next hundred or two-hundred articles for a cleaned-up public domain corpus of Early Modern plays certified to be good enough for most uses? It might well be, but it is probably not going to happen.
Greg Crane’s students in second-year Ancient Greek not only parse sentences–as students of Ancient Greek have done for centuries–but enter those sentences into a digital tree bank where they contribute to a growing syntactic genome of Ancient Greek. That is a very effective way of developing a scholarly product as a by-product of a pedagogical activity. Are there ways of achieving similar results in the quite different domain of Early Modern texts? I have no doubt that two or three undergraduates assigned to the task of proofreading the digital transcript of a scene or act in a play will learn some very useful lessons and many of them would feel good about having done something useful. There is technology for weaving such collaborative curation into the pedagogical routines of teaching Early Modern drama. The social infrastructure is a harder problem.
Some people may find that the addition of “who attributes” to the <sp> elements is the most useful feature of the SHC corpus. Plays consist of sequences of speeches, and speech prefixes are a way of telling the human reader who is talking. The consistent spelling of speech prefixes is not among the virtues of Early Modern play books. This does not matter much to fault-tolerant and tacitly error-correcting human readers who take wildly different spellings of the same name in stride. It matters a lot to a machine that is thrown off by even slight variance.
If a play is encoded in TEI, every speech sits inside an <sp> container. It is trivial to count the number of speeches or the number of words in each speech, and from those data it is possible to construct simple visualizations that tell you quite a bit about the pace and rhythm of a play. If you associate each <sp> element with an identifier of its speaker, it is not difficult to map a play in terms of who talks to whom when and at what length. If you know that about a play you know a lot. With the help of Thomas Berger’s useful Index of Characters in Early Modern English Drama: Printed Plays, 1500-1660 I was able to map the speech prefixes in the printed sources to corpus-wide identifiers that are added as “who attributes” to the <sp> element. Thus
<sp xml:id="A00959-e102960"><speaker>Aub.</speaker> ...some speech</sp>
becomes
<sp xml:id="A00959-e102960" who="A00959-aubrey"><speaker>Aub.</speaker>.... some speech</sp>
and all speeches with the who attribute “A00959-aubrey” can now be firmly identified by the machine as spoken by the same person. It is tedious to assemble the data, but once you have them you have the conditions for doing clever things with graph databases and analyzing plays as networks at different levels of abstraction. The potential for analysis increases if you classify every speaker in terms of sex, age, social status, and basic kinship relations. Computer assisted research of this kind can build very fruitfully on the principles developed in Manfred Pfister’s Drama (1977). At a workshop on the computationally assisted analysis of drama at the Bavarian Academy of Sciences last March I learned about interesting work along those lines at Göttingen, Leipzig, Regensburg, Stuttgart, and Würzburg.
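Once the who attributes are in place, the counting really is trivial. The Python sketch below tallies speeches per speaker and, as a crude first step towards a network, counts which speaker follows which. The sample is made up: the identifiers follow the pattern shown above, but the speeches and the variant prefixes ‘Rol.’ and ‘Aubr.’ are invented, and real files use TEI namespaces that I have omitted. Treating consecutive speeches as “A talks to B” is my simplifying assumption, not a claim about how the plays should be modelled.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented sample; note the inconsistent prefixes for the same speaker.
SAMPLE = """
<div>
  <sp who="A00959-aubrey"><speaker>Aub.</speaker><l>first speech</l></sp>
  <sp who="A00959-rollo"><speaker>Rol.</speaker><l>second speech</l></sp>
  <sp who="A00959-aubrey"><speaker>Aubr.</speaker><l>third speech</l></sp>
</div>
"""


def speakers_in_order(xml_text):
    """Return the who identifier of every speech, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [sp.get("who") for sp in root.iter("sp") if sp.get("who")]


def cast_list(xml_text):
    """Speakers in descending order of their number of speeches."""
    return Counter(speakers_in_order(xml_text)).most_common()


def exchanges(xml_text):
    """Count how often one speaker's speech directly follows another's."""
    order = speakers_in_order(xml_text)
    return Counter(zip(order, order[1:]))


print(cast_list(SAMPLE))
print(exchanges(SAMPLE))
```

The pair counts from `exchanges` are weighted edges, which is all a graph database or a network library needs to get started.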
Only half the plays in the SHC corpus have cast lists. I have added a machine-generated cast list to each play that lists the characters in descending order of their number of speeches. For The Bloody Brother the top eight characters are
<item xml:id="A00959-rollo" n="138">
<item xml:id="A00959-latorch" n="94">
<item xml:id="A00959-aubrey" n="77">
<item xml:id="A00959-edith" n="48">
<item xml:id="A00959-sophia" n="43">
<item xml:id="A00959-otto" n="38">
<item xml:id="A00959-cook" n="36">
<item xml:id="A00959-hamond" n="36">
If you have not read the play the list helps shape your expectations. In Shakespeare’s Comedy of Errors a similar list tells you that Adriana speaks more words than any other character, which immediately directs your attention to the central difference from its Plautine source. An equally primitive count draws your attention to the singular ubiquity of Susanna in the Marriage of Figaro, adding a nice twist to the Countess’ anguished E Susanna non vien in the recitative preceding Dove sono.
A standard spelling edition
From the linguistically annotated SHC it is not difficult to generate an algorithmically constructed version that presents a play in the standardized orthographic form in which modern readers typically encounter Shakespeare. Archaic features like ‘thou’ or ‘loveth’ are preserved but orthography follows modern practices. If you read Shakespeare in that form and Shakespeare His Contemporaries in original spellings you are likely to think that the plays are more different than they in fact are. Standardized spelling does away with some illusions of difference. Nor should we forget that original spellings, like music played on “original instruments”, did not look or sound different to them, although they look and sound different to us.
An algorithmically produced standard spelling edition needs to be read and manually corrected. Richard Proudfoot read through half a dozen plays and for each of them pointed out several dozen errors. Nearly all of these errors generated dozens or more corrections in other plays. If a dozen individuals follow Proudfoot’s example and report corrections to a handful of plays, the number of errors also found elsewhere will decline, but there will always be a need for proofreading such a version, especially for the consistent treatment of names.
What about Shakespeare?
Where is Shakespeare in the SHC corpus? He is not yet there at all, but will soon be in two formats. The Internet Shakespeare Editions of the Folio and quartos are much superior to the TCP transcriptions, and they have additional features such as a careful keying of lines to the TLN numbers in Hinman’s composite Folio facsimile. A TEI P5 version of the Internet Shakespeare Editions texts is underway, and I hope to integrate those texts into the SHC corpus.
As for a standard spelling of Shakespeare, I have worked with Michael Poston at the Folger Library on creating a TEI Simple version of the Folger editions of Shakespeare’s plays. This version will integrate the linguistic annotation of the WordHoard Shakespeare into the Folger text. It will also integrate the Folio TLN numbers into its citation scheme so that you can move with a single click from any line in the standardized text to an image of the relevant Folio page.
Search environments for corpus-wide analysis
There are search environments that are user-friendly at the cost of restricting what you can look for. Other search environments support complex analyses, but presuppose skills that it may take weeks or months to become familiar with. Using any complex search environment may be easier than learning how to play the violin. But it is harder than learning how to ride a bicycle. WordHoard struck a nice balance between complexity and user-friendliness, but at the cost of a highly constrained environment that made it difficult to change or add to the underlying data.
The BlackLab search engine developed by programmers at the Institute of Dutch Lexicology strikes me as the most promising tool for taking advantage of the query potential of a corpus that, like SHC, has been structurally encoded and linguistically annotated. It lets you look for phrases that have the grammatical structure of “handsome, clever, and rich” or for words spoken by Ophelia in prose. It supports incremental indexing, which makes it easy to add texts, delete them, or swop one version of a text for another.
Phil Burns has created an experimental and simple interface to BlackLab at https://devadorner.northwestern.edu/corpussearch/. The TCP ECCO and Evans are searchable through it. You need to be careful about the finicky syntax of its search commands, and it helps to know a little about “regular expressions”, but the learning curve for the most common searches is measured in hours or days rather than weeks and months. And search results can be downloaded as tab separated files that you can import into Excel or similar programs for subsequent analysis. The SHC corpus is likely to be searchable via BlackLab fairly soon. It would take more than six months to build a user-friendly interface for BlackLab. But such an interface would be a wonderful tool for exploring the query potential of SHC and similar corpora.