“Fluent in Marlowe”: A decade of undergraduates as collaborative curators of Early Modern texts

In a course on Early Modern Drama that I taught in 2009 I gave my students the option of doing editorial work for some of their assignments. Two of them wrote a perceptive essay on work that they found both tedious and engrossing. They concluded by saying that they had become “fluent in Marlowe”, a charming testimony to the value of exercises from which students learn while doing work that is useful to others. In particular, the essay clearly shows how bright undergraduates move very quickly from humble editorial tasks to thinking about fundamental philological problems. The practical work has a strong reflective payoff.

The students worked with spreadsheets that were populated with a verticalized output of TCP texts–more about them below–that had been linguistically annotated with MorphAdorner, a Natural Language Processing (NLP) tool suite developed by Phil Burns in Northwestern’s IT Research division. In such a table or “dataframe” each word is the keyword of a row that includes left and right context as well as data about particular properties of the word. The reading order of the text is maintained by a numerical column, but the reading order is only one of several ways of ordering the data for this or that analytical or editorial purpose.
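A minimal sketch of what such a verticalized table looks like, here built with pandas; the column names, part-of-speech tags, and standardized spellings are illustrative stand-ins rather than the exact MorphAdorner output:

```python
import pandas as pd

# Illustrative rows of a "verticalized" text: one token per row, with
# left/right context and linguistic annotation. Column names and tag
# values are hypothetical, not the exact MorphAdorner field names.
tokens = pd.DataFrame([
    {"order": 101, "left": "was this the face that", "keyword": "launcht",
     "right": "a thousand ships", "lemma": "launch", "pos": "vvd", "standard": "launched"},
    {"order": 102, "left": "this the face that launcht", "keyword": "a",
     "right": "thousand ships and", "lemma": "a", "pos": "dt", "standard": "a"},
])

# Reading order is just one ordering; for editorial work one might sort
# by spelling instead, so that all instances of a suspect form can be
# reviewed together.
by_spelling = tokens.sort_values("keyword")
print(by_spelling[["keyword", "lemma", "pos", "left", "right"]])
```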

Shakespeare His Contemporaries

A few years later a grant from Northwestern’s IT group enabled Craig Berry to design Annolex, a Web application with a relational database backend. Annolex could comfortably hold some 50,000 records of corrupt transcriptions from some 500 plays written between thirty years before Shakespeare’s birth and thirty years after his death. Craig earned his PhD at Northwestern with a dissertation on Chaucer and Spenser; his Doktorvater was Leonard Barkan. He has lived a double life as a Spenser scholar and a programmer with responsibilities for the accounting software of a kidney dialysis clinic.

Before taking on Annolex Craig had made two significant contributions to computationally based projects in the Humanities at Northwestern. In the mid-nineties he wrote a program that identified the approximately 250,000 occurrences of some 30,000 repeated phrases in Early Greek epic. This inventory has been the basis of the Chicago Homer, which for the past twenty years has helped readers with or without Greek to get a sense of bardic memory by making visible the network of phrasal repetition that is so distinctive a feature of Homeric poetry. Craig also added the Spenser corpus to Wordhoard, an application for the close reading and scholarly analysis of deeply tagged texts, which includes Early Greek epic, Chaucer, Spenser, and Shakespeare.

Annolex was operational between 2013 and 2015 in a project we called “Shakespeare His Contemporaries.” During that period a dozen students from Amherst, Northwestern, and Washington U. in St. Louis corrected some 50,000 textual defects in some 500 plays and reduced the median rate of textual defects per 10,000 words from 14.5 to 1.4. The modal Early Modern play runs to about 20,000 words, plus or minus 4,000. In reading through a play you may not notice three defects. You will notice thirty.

Working in structured environments with light supervision, these students fixed over 90% of textual defects in 511 plays. The remaining distribution of defects looks as follows:

Remaining defects    Number of plays
0                    284
1                     63
2-4                   39
5-16                  59
17-64                 23
> 64                  19

All in all, the students did very good work, and the remaining tasks are quite manageable, but most of them require access to better images, not to speak of the 23 plays whose digital scans were missing 67 pages.

If you do NLP work you may say that the original median defect rate of 14.5 per 10,000 words (0.145%) would in most cases make no difference to any quantitatively based inquiry. Which is true, but beside the point: Early Modern scholars like their texts clean. In a survey of TCP users 88% ranked “accuracy of transcription” as their first or second criterion, and 70% put it first.

Three students from that project deserve special recognition: Hannah Bredar (BA, Northwestern 2015), Kate Needham (BA, Wash.U. 2016), and Lydia Zoells (BA, Wash.U. 2016). Between April and July of 2015 the three of them, separately or together, visited the Bodleian, Folger, Houghton, and Newberry Libraries as well as the special collections of Northwestern and the University of Chicago. They fixed about 12,000 incompletely or incorrectly transcribed words. Hannah and Kate are now PhD students in English at Michigan and Yale. Lydia, the valedictorian of her class at Wash.U., went straight into New York’s publishing world and is currently an editorial assistant at Farrar, Straus and Giroux.

The Text Creation Partnership (TCP)

The texts from Shakespeare His Contemporaries came from the Text Creation Partnership (TCP). This is a good moment to give a brief account of what has arguably been the most important infrastructure project in Anglophone Early Modern Studies over the past thirty years. The English Short Title Catalogue (ESTC), which aims at being a complete record of all imprints before 1800 from the English-speaking world, lists ~137,000 imprints before 1701. An imprint may be a single-sheet broadside, or it may contain the 3.1 million words of Du Pin’s 1694 New History of Ecclesiastical Writers, the longest text in the TCP archive. Just about all of these imprints were microfilmed between the late 1930s and the end of the 20th century. For many years these microfilms were owned by University Microfilms, a corporation with close ties to the University of Michigan. Name and ownership have changed repeatedly in the past decades. ProQuest, the current owner, is a subsidiary of the Cambridge Information Group.

In the early nineties ProQuest digitized the microfilms. The digital scans became available as EEBO or Early English Books Online. I once asked colleagues what difference digital tools made to their work. Before I even finished my question one of them answered “EEBO changed everything.” And so it did. Ranganathan’s Fourth Law of Library Science says “Save the time of the reader.” If you can get across the paywall (a non-trivial if) and barring an Internet outage, you can get to just about every book before 1700 right away and anytime–including at 2am in your pyjamas.

In the late nineties ProQuest and a consortium of universities led by Michigan and Oxford formed the Text Creation Partnership and struck an agreement to create SGML transcriptions of ~60,000 ESTC titles–approximately two billion words and, for many practical uses, a deduplicated library of Early Modern English books. The work was done in two phases on the understanding that after an initial five-year period the texts of each phase would move into the public domain. Some 25,000 Phase I texts moved into the public domain in 2015; Phase II texts will follow in 2021.

The transcriptions were for the most part done by off-shore transcription services like Apex Typing, working from EEBO scans of microfilms of copies printed before 1700 and subject to the vagaries of the intervening centuries. Lots of things could and did go wrong in the long journey from the author’s manuscript to the screen in front of the modern copyist. The contract called for an accuracy level of no more than 1 error per 20,000 keystrokes, but transcribers were not penalized for illegible characters if they marked the nature and extent of the resulting gap, whether “2 characters” or “1 paragraph”. Missing letters are typically displayed by a placeholding character. If you use a dot as the placeholder, the defective string is already a “regular expression” that can be matched against candidate completions.
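To make that concrete, here is a toy sketch in Python: a word with one illegible letter, recorded with a dot, is used as-is to search a word list. The word list is a made-up stand-in for the real corpus lexicon:

```python
import re

# A defective transcription with a dot marking an illegible letter is
# already a regular expression: the dot matches any single character.
# (Real data would also need escaping of other regex metacharacters.)
defective = "husb.nd"

# Toy stand-in for a frequency list drawn from the corpus.
lexicon = ["husband", "husbands", "husbandry", "hundred", "holband"]

pattern = re.compile(rf"^{defective}$")
candidates = [w for w in lexicon if pattern.match(w)]
print(candidates)   # ['husband']
```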

In Donald Rumsfeld’s parlance, the TCP texts include ~10 million “known unknowns”, corrupt words where the position and extent of damage are reported with high accuracy. The texts probably include a roughly equal number of “unknown unknowns” in the form of misprints or transcription errors. Some of them are reported in the errata sections found in some 6,000 texts. Some of them–especially the notorious confusions of long ‘s’ and ‘f’ or ‘u’ and ‘n’–can be flushed out by targeted searches: ‘Hnsband’ and ‘assliction’ are real examples. But most cases are hidden among 4.5 million spellings that occur fewer than five times in the corpus.
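One way such targeted searches can work is to generate the plausible letter swaps for a rare spelling and check whether any of them is a much more common word. A rough sketch, with toy frequency counts standing in for real corpus counts:

```python
# Common transcription confusions: long 's' misread as 'f' (and vice
# versa), 'u' misread as 'n' (and vice versa).
SWAPS = {"f": "s", "s": "f", "u": "n", "n": "u"}

# Toy frequencies; real counts would come from the corpus itself.
freq = {"husband": 21000, "hnsband": 1, "affliction": 9000, "assliction": 1}

def swap_variants(word):
    """Yield every spelling obtained by applying a single confusion swap."""
    for i, ch in enumerate(word):
        if ch in SWAPS:
            yield word[:i] + SWAPS[ch] + word[i + 1:]

def likely_error(word, ratio=100):
    """Flag a rare spelling whose one-swap variant is far more common."""
    best = max(swap_variants(word), key=lambda v: freq.get(v, 0), default=None)
    if best and freq.get(best, 0) > ratio * freq.get(word, 1):
        return best
    return None

print(likely_error("hnsband"))     # husband
print(likely_error("assliction"))  # None: 'assliction' needs two swaps (ss -> ff),
                                   # so a fuller version would apply swaps repeatedly
```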

I doubt whether more than 10% of Early Modern texts have ever received the attention required for meeting minimal editorial standards for scholarly work. A reasonable person could wonder whether the editorial attention lavished on Shakespeare has strayed beyond the point of diminishing returns. Could that attention be more profitably spent on the thousands of texts that would benefit greatly from basic forms of “textkeeping”? For many purposes, including simple lookups and citations, EEBO images are good enough, and their image numbers have the advantage of a global and stable citation system. But images cannot easily be searched, and for texts before 1700 OCR remains far too “dirty” to produce reliable results.

The TCP texts were originally encoded in SGML but also exist in XML versions. They go a long way towards creating searchable texts, but none of them fully qualifies as a scholarly text, and most of them have gone through only very limited proofreading. On the other hand, the coarse but consistent XML encoding across a corpus of 60,000 texts in principle lets users formulate queries that look for (or exclude) text in verse or prose, lists, tables, notes, and prefaces, dedications, or other forms of “paratext”. There is currently no interface that makes these affordances available in a user-friendly manner to “non-geeky” Early Modernists, which is to say most of them. The linguistic annotation of the texts extends their query potential into the micro-level of phrasal structure by supporting queries for patterns like “handsome, clever, and rich” or adjectives preceding ‘liberty’ or ‘freedom’.
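As an illustration of what the structural encoding makes possible, here is a sketch using lxml and XPath against a TEI-P5-style XML file of the kind distributed for TCP texts; the element and attribute names follow common TEI practice, and the file name is hypothetical:

```python
from lxml import etree

ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical local copy of a TCP text in TEI P5 XML.
tree = etree.parse("A12345.xml")

# Verse lines (<l>) versus prose paragraphs (<p>).
verse_lines = tree.xpath("//tei:l", namespaces=ns)
prose_paras = tree.xpath("//tei:p", namespaces=ns)

# Paratext: dedications encoded as typed divisions, plus marginal notes.
dedications = tree.xpath('//tei:div[@type="dedication"]', namespaces=ns)
notes = tree.xpath("//tei:note", namespaces=ns)

print(len(verse_lines), len(prose_paras), len(dedications), len(notes))
```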

The reputation of the TCP texts has suffered from the universal tendency to judge a barrel by its worst apples. Defects cluster heavily in a minority of texts: 15% of them account for 60% of all defects, and two thirds of the texts have defect rates that are low or tolerable. But there is a lot of basic editorial work that can and should be done. It may be that as many as three million defects can be fixed algorithmically with an acceptable error rate. If a million defects can be fixed at an error rate of 3%, 970,000 words would be corrected and 30,000 would be no worse off: they would just be wrong in a different way. That is not a bad bargain, especially if all algorithmically corrected words are flagged appropriately. Philological casualties are easier to bear than military ones.

Towards a cultural genome of Early Modern English

Since 2016 undergraduate work on collaborative curation has extended beyond the scope of Early Modern Drama and tackled the entire EEBO-TCP corpus. Time will tell whether this will prove to have been a wise or foolish step, but the extended project–which has involved Notre Dame and now involves Northwestern and Washington U. in St. Louis–has received significant support from the Mellon Foundation and the ACLS. Its roots go back to the multi-institutional 2007-09 MONK project (Metadata Offer New Knowledge), which took some steps towards a multi-genre, diachronic, and consistently tagged and annotated corpus in the spirit of a remark by Brian Athey, chair of Computational Medicine at Michigan, that “agile data integration is an engine that drives discovery.” MONK led me to formulate the idea of a Book of English defined as

  • a large, growing,  collaboratively curated,  and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation

The parallel with collaborative genomic annotation runs deep. Early Modern printed English (from 1473 to 1700) would be the first chapter in such a book, and the one with the most realistic chance of being completed. You could call the result a “cultural genome” or “book” of Early Modern English, just as the “book of life” metaphor is often used for the human genome.

“Agile data integration” for a Book of Early Modern English would be a good thing to have, but one must be clear about what it is or is not. It is a good enough record of what has been printed and survived. It does not include what was written by hand and never made it into print. After ‘Augustine’, ‘Luther’ is the most common word in EEBO-TCP that unambiguously refers to a historical person. The 60-volume Weimar edition of Luther’s works has an additional dozen German and Latin index volumes of names, places, subjects, and citations. No Luther scholar would want to be without it. Indexes are a very early device of print culture for making books more “agile”; witness the “diligent, and necessary Index, or Table of the most notable thynges, matters, and woordes contained in these workes of Master William Tyndall” in a 1570 edition of Tyndale’s works.

In a Book of Early Modern English each “chapter” (or separate TCP text) should contain complete, clean, and readable text, and this book should be complemented and surrounded by digital indexes that let users treat it as if it were a single and well-indexed book. Getting there will take a lot of work and involve different and mutually reinforcing tasks ranging from basic copy-editing to complex NLP routines. Not all of it needs to be done before parts of it become useful, and as long as duplication is avoided it does not matter in what order things get done.

It is a big step to go from 500 to 60,000 texts. Think of the simple “textkeeping” tasks in terms of a classic ditch-digging story problem. A dozen students working full-time in two eight-week summer internships cleaned up 90% of defective tokens in 510 texts, or 0.85% of the corpus. Roughly speaking, they did about half a percent of the work. How many students (or other contributors) would it take to complete the task in seven or ten years?
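For what it is worth, here is a back-of-the-envelope answer under the heroic assumptions that the effort scales linearly with the share of work done and that each contributor works one eight-week summer stint per year; the numbers are illustrative arithmetic, not a project plan:

```python
# Back-of-the-envelope arithmetic, assuming effort scales linearly and
# each contributor works one eight-week summer stint per year.
students, internships, weeks = 12, 2, 8
student_weeks_spent = students * internships * weeks        # 192 student-weeks
share_done = 0.005                                          # roughly half a percent

total_student_weeks = student_weeks_spent / share_done      # 38,400 student-weeks
for years in (7, 10):
    per_summer = total_student_weeks / years / weeks
    print(f"{years} years: about {per_summer:.0f} contributors per summer")
```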

It is easy to be discouraged by those numbers, but there is also a cheerful way of looking at it. A few students working together can significantly improve some cluster of Early Modern texts, whether plays, books about science, gardening, law, witchcraft or whatever, and that work cleans up some textual neighbourhood for all its future readers.

Over the past three years, fixing textual defects has taken a back seat to improving the tools and environment for doing collaborative editorial work. The EarlyPrint Library is built on an eXist XML database that adds the following features to a readable text:

  1. For each page of transcribed text it provides immediate access to the corresponding EEBO image
  2. For a growing number of texts it provides access to high-quality and public domain images on IIIF servers at the Internet Archive and elsewhere (a sketch of a IIIF request follows this list)
  3. It includes an Annotation Module that supports “curation en passant” and allows registered users to offer emendations for corrupt readings. These emendations are flagged and immediately displayed in the text, but their integration into the source texts is subject to editorial review
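To give a concrete sense of the IIIF piece, here is a minimal sketch that asks a IIIF Image API server for the metadata of one page image and builds a URL for a full-size JPEG, following the IIIF Image API 2.x URL pattern; the server and identifier are placeholders, not actual EarlyPrint or Internet Archive addresses:

```python
import requests

# Placeholder IIIF Image API identifier; a real one would point at the
# Internet Archive or another image server hosting the page scans.
base = "https://example.org/iiif/some-page-identifier"

# Every IIIF Image API resource describes itself at {id}/info.json.
info = requests.get(f"{base}/info.json", timeout=30).json()
print(info.get("width"), info.get("height"))

# Image requests follow {id}/{region}/{size}/{rotation}/{quality}.{format}
# (IIIF Image API 2.x), e.g. a full-size JPEG of the whole page:
full_jpeg_url = f"{base}/full/full/0/default.jpg"
print(full_jpeg_url)
```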

It has taken two years to make this environment reasonably stable and fast enough for most purposes. It is still a work in progress, but we have a much clearer sense of what it takes in software refinements and more powerful hardware to make it faster and more reliable.

The standard search functionalities of the eXist database were not designed to meet the requirements of editorial work in the EarlyPrint Library. The current plan is to combine the EarlyPrint environment with a BlackLab search engine, which implements a corpus query language on top of a Lucene index and also supports XML-aware searching.
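A sketch of what such a query might look like from the user’s side, assuming a BlackLab Server instance is running and the corpus is indexed with word, lemma, and part-of-speech annotations; the server URL, corpus name, and tag value are placeholders, and endpoint details may differ by BlackLab version:

```python
import requests

# Placeholder server URL and corpus name.
SERVER = "http://localhost:8080/blacklab-server"
CORPUS = "earlyprint"

# Corpus Query Language: an adjective immediately before 'liberty' or
# 'freedom'. The pos value is a placeholder; a real index (e.g. NUPOS
# tags from MorphAdorner) would use its own tagset.
cql = '[pos="j.*"] [word="liberty|freedom"]'

resp = requests.get(
    f"{SERVER}/{CORPUS}/hits",
    params={"patt": cql, "number": 20, "outputformat": "json"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("summary", {}).get("numberOfHits"))
```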


Eng² or Engineering English

Engineering and English are close alphabetical relatives, but the people in those disciplines tend not to think of each other as blood brothers. That said, whatever else a book may be, it is certainly an engineering product. Images like the Ramelli wheel testify to an Early Modern fascination with mechanical engineering. In retrospect, one may even see in that image a foreshadowing of Franco Moretti’s “distant reading”; at the least it shows a recognition of reading as fundamentally a “many books activity”. A modern book is one of many possible ways of representing a digital file, and all stages of the editorial process have been deeply affected by digital technologies. What applies to the making of books also applies to their reading and analysis. The business world, with its understandable interest in profits, has been eager to use all manner of NLP techniques to get to some bottom line as quickly as possible. Humanists are leery of bottom lines. Leaving aside self-proclaimed “digital humanists”, scholarly readers or editors remain reluctant to explore ways in which technology could help them with anything beyond the mundane tasks of typing, printing, copying, and so on. This reluctance is not very helpful, but it is very powerful. The thoughtful Andrew Piper in his recent Enumerations wistfully looks ahead to “imagin[ing] an alternative future where students are not dutifully apportioned into silos of numeracy and literacy but are placed in a setting where these worldviews mix more fluidly and interchangeably” (p. x). It will be a while before that becomes an everyday reality in humanities departments, but it is worth hoping for and working towards.

Curating and exploring the Early Modern corpus offers many opportunities for breaking down the “silos of numeracy and literacy” and joining them in the increasingly useful skill of “telling stories with numbers.” Some of those opportunities are very practical, but (remember “fluent in Marlowe”) practice and reflection can be close neighbours. The humble task of correcting corrupt readings is at some level a spellchecking problem, but the pattern matching skills that it calls on are just as important for higher-level operations.

The dismal prospect of manually fixing millions of frequently obvious typographical errors or gaps led me to ask whether a machine could help. I talked with Doug Downey in Northwestern’s Computer Science Department. One of his students took a first stab at a solution in the context of the limited drama corpus; “Dirty Words: Engineering a Literary Cleanup” is a lively report about it. Two years later, two other students of his, Larry Wang and Sangrin Lee, did a more ambitious experiment that targeted the entire corpus and used long short-term memory (LSTM) routines. The results were so promising that machine-generated corrections were imported into the EarlyPrint Library, flagged with a colour that marked their algorithmic status. A closer look has shown the need for a more granular case logic that excludes certain types of defects and clusters subsets of texts for special treatment. But there is no question in my mind that at least half of the defective tokens can have algorithmically based solutions.

This is a case where engineering students can add substantial value to a humanities project by using sophisticated and familiar techniques of pattern matching. But there are also things for them to learn.  The increasingly powerful NLP routines developed largely for the uses of business and industry make substantial and tacit assumptions about what English is like. But these routines require much tweaking of training data and algorithms to work with data from earlier centuries, and that tweaking requires deep conversations with domain experts to figure out what is or is not within algorithmic reach. Those conversations are a bridging exercise, and there is much to be learned on both sides.

How many different words are there in the Early Modern corpus? This is a question of some interest to a lexicographer. With a modern corpus you can get a pretty good answer by stripping off some suffixes and grouping the results. Not so with a corpus that spans 230 years of considerable orthographic variance and fluctuation. The EarlyPrint corpus includes about 4.4 million distinct spellings; 3.5 million occur fewer than five times, and 2.4 million occur only once. Programs that map incomplete to known words should also be able to identify very rare spellings as variants or misspellings of more common words.
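A crude sketch of such a mapping, using Python’s standard difflib to suggest a frequent headword for a rare spelling; the toy word lists stand in for the real frequency tables, and a serious version would add period-specific substitution rules on top of raw similarity:

```python
import difflib

# Toy stand-ins: frequent standardized spellings and some rare variants.
frequent = ["virtue", "murder", "husband", "affliction", "sovereign"]
rare = ["vertue", "murther", "soueraigne"]

for spelling in rare:
    # get_close_matches ranks candidates by a similarity ratio (0..1).
    match = difflib.get_close_matches(spelling, frequent, n=1, cutoff=0.7)
    print(spelling, "->", match[0] if match else "no confident match")
```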

Named Entity Recognition

From a technical perspective, Named Entity Recognition (NER) is very close to the spellchecking problems discussed above, but instead of matching a string to a standard spelling or lemma you seek to match it to an entity that exists outside the text in some real or imagined space. Is ‘John’ the apostle, the Baptist, the name of one of the gospels, the name of a letter by John, the English king, or the name of some fictional character? There are well over a million distinct character strings that are names or abbreviations of names. Not all of them are as polysemous as ‘John’, but quite a few are.
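A toy illustration of the matching problem: a tiny gazetteer of candidate entities for the string ‘John’, each with a few context cues, and a simple overlap score to pick the most likely one. The gazetteer entries and the scoring rule are placeholders for the far richer resources and models a real system would use:

```python
# A tiny gazetteer: each candidate entity for the string 'John' comes
# with a handful of context words that point towards it.
GAZETTEER = {
    "John the Baptist": {"baptist", "jordan", "locusts", "herod"},
    "John the Evangelist": {"gospel", "evangelist", "patmos", "revelation"},
    "King John of England": {"king", "england", "magna", "charta"},
}

def disambiguate(name, context_words):
    """Score each candidate entity by overlap with the surrounding words."""
    scores = {
        entity: len(cues & context_words) for entity, cues in GAZETTEER.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# 'name' is the ambiguous string; the context is a set of nearby words.
context = {"king", "of", "england", "did", "seal", "the", "charta"}
print(disambiguate("John", context))   # King John of England
```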

From an end user’s perspective, clarity about names may be the greatest navigational help that a corpus can provide. A recent Northwestern English major, who is now at the UIUC School of Information Sciences, worked with Phil Burns and did valuable NER work on Purchas His Pilgrimage, a very large early 17th-century compilation that probably contains a high percentage of the names published in texts before it. Getting names roughly right will be a high priority of the project, and it will call on a clever combination of algorithmic analysis and shoe-leather journalism to get it done. Collaboration between computer science and humanities students can do a lot of good in this field and be a very valuable experience for the students engaged in it.
