Collaborative curation of TCP texts in the EarlyPrint environment

This is a report about the collaborative curation of Early Modern Anglophone texts transcribed by the Text Creation Partnership (TCP) and maintained by the EarlyPrint project in a linguistically annotated form.  It is written for an audience for whom TCP is not a familiar acronym.  It describes work done over the past decade and looks ahead to what could or should be done in the future. For a more broadly based and reflective essay about the same  topic look at  Re-mediating the Documentary Infrastructure of Early Modern Studies in a Collaborative Fashion.

I begin with three assertions:

  1. Considered in the aggregate, the almost 70,000 TCP texts are the most important source of primary documents for the study of the English-speaking world before 1800 and more particularly before 1700.
  2. In their current form very few of these texts meet the minimal conditions of a carefully proofread text, and many fall seriously short of it.
  3. The dissonance between these two assertions can be largely resolved by involving the user community in the correction of millions of textual defects. Digital tools and routines have greatly lowered the access barriers to such work. Much of the work that needs to be done lies comfortably within the competence of careful readers from many walks of life. All you need is  some interest in old books and a little patience.  Students have been very good at it, doing work from which they learn and that is useful to others.

The scope of the TCP project

The English Short Title Catalogue (ESTC) is the most authoritative bibliographical source for materials printed in the English-speaking world before 1800. It has some 480,000 entries ranging from single-page broadsheets to multi-volume books like Hakluyt’s Voyages. A high percentage of these “books” were microfilmed over the course of the 20th century, mostly by University Microfilms, a business with historical relations to the University of Michigan. In the late 1990s ProQuest, the current owner of University Microfilms, created Early English Books Online (EEBO) by making digital scans of the microfilms available over the Web. When I asked a colleague what difference digital things made to his work, I could barely finish my question before he said: “EEBO has changed everything”. The stuff you needed had suddenly moved from the least loved technology of the 20th century to a medium where you could get at it at 2 am in your pyjamas.

ProQuest partnered with the University of Michigan and the University of Oxford to use the EEBO scans as the sources for the TCP project, which over the course of two decades created SGML-encoded transcriptions of some 60,000 texts published before 1700 (they now circulate mostly in the XML format that replaced SGML). The goal was to create something like a deduplicated library of pre-1700 Anglophone print. It is not easy to determine when or for what purpose two texts are duplicates of each other. Broadly speaking, EEBO-TCP includes about 70% of distinct pre-1700 titles in the ESTC, and much of what is in the 30% of missing titles occurs somewhere in the 70% of transcribed titles if you interpret ‘duplicate’ broadly enough. For many practical purposes you can think of EEBO-TCP as a Noah’s Ark of pre-1700 English print culture.

The TCP also partnered with Readex and Gale to create 5,000 transcriptions of American books before 1800 and 2,000 transcriptions of English 18th century texts. These smaller projects cover respectively  12%  and 0.7% of the corresponding ESTC entries.  That said, 2,000 18th century titles are still quite a few titles.

How the TCP texts were made

The transcriptions were farmed out to transcription services like Apex. Much of the work was done in Asian countries. We know next to nothing about the workers. They were held to no more than one error per 20,000 keystrokes, and they were instructed to mark as precisely as possible the number of letters or words that they could not decipher. Quality control was assigned to staff with academic backgrounds at Michigan and Oxford.

The transcriptions are not “diplomatic” editions that capture the look and feel of the source document.  They are quite Spartan and explicitly ignore many features that would be of interest  to a bibliographer or book historian. They are faithful to orthography but largely ignore information about typefaces, layout, running heads, or other detail about the physical nature of a page or book. On the other hand, the structural markup with a reasonably granular set of TEI “elements” created texts that with appropriate software are much more deeply searchable than  “plain text” transcriptions would be.

Within significant constraints of time and money, it was clearly a major goal of the project to push out as many texts as possible, restrict features to  the essentials of words in sequence, and live with a higher error rate than you would tolerate in the editing of a single book. In retrospect that was clearly the right thing to do. In a digital archive texts can comfortably coexist at different levels of (im)perfection and be subject to continuing improvement.

Defects in the TCP texts

In what follows, figures are derived from 52,800 EEBO texts that are currently part of the EarlyPrint project, about which more below.

The biggest source of error in the TCP texts has to do with the quality of the microfilms from which they were transcribed. The quality of the printed page from which the microfilm was made is a close second. About 4% of the texts are missing a total of 11,000 pages, ranging from 1 to 297 per book. The modal case is a missing double-page image. Transcription errors in the outer or inner margins of page images are higher by a factor of five than in the body of the page. Faded small print in a marginal note becomes even harder to decipher in a poor microfilm image. Undecipherable final letters in the left inner margins and initial letters in the right inner margins are the victims of microfilms that were made before deskewing devices became common.

The combined defects caused by poor microfilm images or their page source are the reason for the uneven distribution of textual defects. The most common defects–untranscribed letters in words–cluster heavily. In the screen display of the text, letters known to be missing are represented by the symbol ‘●’. There are almost eight million of them. On average, one word in 200 is a “blackdot word”. But 55% of them are found in the 15% of texts with more than 100 defects per 10,000 running words. The interquartile range of defects runs from 1.05 to 47.8 defects per 10k, with a median value of a dozen defects per 10k. People have an inveterate tendency to judge a barrel by its worst apples. But one might say of the TCP texts what Edmund Malone is supposed to have said of Shakespeare’s plays: “The texts of our author are not as bad as they are made out to be.”

Donald Rumsfeld once distinguished between “known unknowns” and “unknown unknowns”. It’s the latter that get you. You can count the blackdot words, and quite often they are easy to fix. The unknown unknowns are hidden in plain sight among 1.6 billion words. Thus the more than 10,000 occurrences of the string ‘re’ usually represent the ablative of the Latin word ‘res’, but hundreds of them are the prefix ‘re’ of a word that has been wrongly split.

Why does it matter to fix defects?

Textual defects may block your understanding of a text, but quite often you see right through them. Many people don’t care about such “noise” as long as they get the “signal”. This is a very common attitude among researchers who are engaged in machine learning or statistically driven Natural Language Processing. They are right in a context where you are trying to get some information out of a flood of textual ephemera. But if ephemera hang around long enough they become cultural heritage objects deserving attention and care. Time has transformed them into what Thucydides thought his history should be–a “possession forever”. Thomas Carlyle described the Thomason tracts–a collection of some 20,000 very ephemeral pamphlets from the English Civil War–as a “hideous mass of rubbish” that was nonetheless the most informative source about what the English of their day thought and felt. Carlyle’s phrase aptly describes many TCP texts.

There is a  “yuck” factor in all this. According to the New York Times (Nov. 16, 2009), the engineer John Kittle helped improve the Google map for his hometown of Decatur, Georgia, and said:

Seeing an error on a map is the kind of thing that gnaws at me. By being able to fix it, I feel like the world is a better place in a very small but measurable way.

It has become much easier to do something about things that gnaw at you in that way. At a conference some years ago Gregory Crane, the editor of the Perseus Project, argued that “Digital editing lowers barriers to entry and requires a more democratized and participatory intellectual culture.” In the context of the much more specialized community of Greek papyrologists, Joshua Sosin has successfully called for “increased vesting of data control in the user community.”

Early Modern printers were keenly aware of the mistakes they made. There is a subgenre of whimsically contrite apologies asking for the reader’s forgiveness and cooperation. I am very fond of the printer’s plea in Harding’s Sicily and Naples, a mid-seventeenth century play:

Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.

For Latinists, there are the lapidary and imperative final words of Samuel Garey’s Great Brittans little calendar (1618):

Amen.
Candido lectori: Humanum est errare, errata hic corrige (lector) quae penna, aut praelo lapsa fuisse vides.

If enough readers share John Kittle’s sense of gnawing and have a good sense of how much digital technologies have lowered the entry barriers to collaborative work, there is a chance that over a number of years their “many hands will make light work” and many of the texts will be substantially improved.

Collaborative curation at Northwestern

My earliest experiment with collaborative curation goes back to 2009. I gave students the option to replace an essay with fixing defects in TCP transcriptions of Marlowe’s Tamburlaine. The students worked from spreadsheets that listed defective tokens as keywords in context and added a URL to a page image. Two students wrote a charming essay about how this exercise had taught them to become “fluent in Marlowe”. Their essay remains a useful demonstration of how a simple exercise can lead to reflections on the fundamentals of philological labour.

Between 2013 and 2015 three generations of students at Northwestern, augmented by volunteers from Amherst and Washington University in St. Louis, worked on a project called Shakespeare His Contemporaries. They reduced the median defect rate in the TCP transcriptions of some 500 non-Shakespearean plays by an order of magnitude, from 14.6 to 1.4 per 10,000 words, roughly from 30 to three defects per play.

Nicole Sheriko was a member of the first team of five students supported by summer research grants. She went on to graduate school, working at the “intersection of literary criticism, cultural studies, and theatre history”, and is now a Junior Research Fellow at Christ’s College, Cambridge. Last year, she had this to say about her experience:

Having the experience of working with a professor on a project outside of the classroom–especially in digital humanities, which everyone seems to find trendy even if they have no idea what it entails–was a vital piece of my graduate school applications, I think, and other students may see a similar benefit in that.

In a less vulgarly practical sense, though, I would say that working on what was then Shakespeare His Contemporaries made a significant difference in how I approach studying the field of early modern drama. The typical college course can only focus on a handful of canonical texts but working across such an enormous digital corpus reoriented my sense of how wide and eclectic early modern drama is. It gave me a chance to work back and forth between close and distant reading, something I still do as I reconstruct the corpus of more marginal forms of performance from references scattered across many plays. A lot of those plays are mediocre at best, and I often remember a remark you once made to us about how mediocre plays are so valuable for illustrating what the majority of media looked like and casting into relief what exactly makes good plays good. The project was such a useful primer in the scope and aesthetics of early modern drama. It was also a valuable introduction to the archival challenges of preservation and digitization that face large-scale studies of that drama. Getting a glimpse under the hood of how messy surviving texts are–both in their printing and their digitization–raised all the right questions for me about how critical editions of the play get made and why search functions on databases like EEBO require a bit of imaginative misspelling to get right. That team of five brilliant women was also my first experience of the conviviality of scholarly work, which felt so different from my experience as an English major writing papers alone in my room. That solidified for me that applying to grad school was the right choice, a sentiment likely shared by my teammate Hannah Bredar, who–as you probably know–also went on to do a PhD. Once I got to grad school, the project also followed me around in my first year because I took a course in digital humanities and ended up talking a lot about the TCP and some of the little side projects I ended up doing for Fair Em, like recording the meter of each line to see where breakdowns occurred. I even learned some R and did a final project looking for regional markers of difference across the Chronicling America historical newspaper corpus. So, in big ways and small, the work I did at NU has stayed with me.

EarlyPrint and curation en passant

Beginning in 2016 the Shakespeare His Contemporaries project morphed into the more ambitious EarlyPrint. This joint project by Northwestern and Washington University in St. Louis is dedicated to the curation and analysis of all texts in the TCP archives. EarlyPrint currently contains some 58,000 texts from the TCP EEBO, Evans, and ECCO archives, about 82% of all TCP texts.

The EarlyPrint texts differ from their TCP sources in several respects:

  1. Each word in the text has a unique ID that concatenates a file ID, a page ID, and a word counter (see the sketch after this list). This does nothing for the reader, but it allows for a much more complex and accurate management of textual corrections.
  2. Each word in the text is linguistically annotated: it is assigned a lemma, a part of speech, and, where its spelling differs from the standard form, a standardized spelling. Archaic forms keep their grammar (‘louyth’ > ‘loveth’ rather than ‘loves’). Readers can switch the text display between original and standardized spellings with the click of a button.
  3. The texts are indexed with the very fast Blacklab search engine of the Dutch Language Institute, which supports searches by lexical or grammatical criteria, e.g. “adjectives preceding liberty or freedom” or “three adjectives with the last preceded by ‘and’”.
  4. A small but growing number of texts (586 as of September 2022) has matching image sets from IIIF servers at the Internet Archive, the Kislak Center at the University of Pennsylvania and the Ransom Center at Texas. These images are usually much superior to the EEBO scans, and they are free.
  5. Any reader can at the click of a button call up an “annotation module” that displays a data entry form where the reader can enter corrections or comments for any defective or suspect reading. Some 80,000 corrections have been made in this fashion since 2017.
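
Here is a minimal Python sketch of the first and third features. The ID scheme (hyphen delimiter, zero-padded counter) and the part-of-speech tag 'ADJ' in the sample query are illustrative assumptions rather than EarlyPrint's actual conventions.

```python
# A token ID concatenates a file ID, a page ID, and a running word counter.
# The delimiter and zero-padding here are illustrative assumptions.

def make_word_id(file_id, page_id, word_counter):
    """Build a token ID such as 'A12345-003-0042'."""
    return f"{file_id}-{page_id}-{word_counter:04d}"

# A correction can then point at exactly one word in one text:
correction = {
    "token_id": make_word_id("A12345", "003", 42),
    "old_reading": "lo●e",      # '●' marks an untranscribed letter
    "new_reading": "loue",
    "status": "proposed",
}

# Item 3: a lexico-grammatical query of the kind Blacklab answers, written in
# a corpus query language. The tag 'ADJ' is illustrative, not the actual tagset.
adjectives_before_liberty = '[pos="ADJ"] [lemma="liberty|freedom"]'

print(correction["token_id"])        # A12345-003-0042
print(adjectives_before_liberty)
```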

The final feature is the most relevant to this discussion.  Just about any activity follows a pattern of “set-up, action, clean-up”. Quite often the set-up and clean-up take more time than the action. The philological version of this sequence goes like “find it, fix it, log it”.  If you can reduce the time cost of finding and logging, you may persuade more people to become user-contributors and fix this or that defect as they come across it in their reading.

The Annotation Module works best in a mode that I call “curation en passant”. You come across a reading that is clearly wrong or suspect. Clicking on a button will open a window that lets you enter an emendation or question the reading with a comment. This is as simple as writing in the margin of a book page. You must be a registered user to do so (a very simple procedure).  As soon as you save your correction,  the machine creates a permanent record of the who, what, when, and where of your transaction. The record is saved into a central archive. Your proposed emendation is immediately but provisionally displayed in the text, highlighted in yellow. Once your emendation has been approved by an editor with the additional privilege of approving or rejecting emendations, it is highlighted in green. The underlying texts of the EarlyPrint site are periodically updated.  The updated texts no longer highlight approved corrections, but the project maintains a permanent record of textual changes.
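
A sketch of the kind of record such a transaction might produce, with the who, what, when, and where as explicit fields. The field names and status values ('proposed', 'approved') are assumptions for illustration, not EarlyPrint's actual schema.

```python
# The who, what, when, and where of a single emendation, recorded as one
# object. Field names and status values are assumptions, not the actual schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Emendation:
    token_id: str        # where: the unique word ID in a specific text
    old_reading: str     # what the transcription currently says
    new_reading: str     # what the curator proposes
    curator: str         # who proposed it
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))  # when
    status: str = "proposed"   # 'proposed' (yellow) -> 'approved' (green)

    def approve(self):
        """An editor with approval privileges accepts the emendation."""
        self.status = "approved"

e = Emendation("A12345-003-0042", "lo●e", "loue", curator="reader42")
e.approve()
print(e.status, e.timestamp.date())
```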

While  the record of corrections is not currently public, it could be made public–somewhat in the manner of the Berichtigungsliste or correction list that Greek papyrologists maintain. But the corrections very rarely enter into the realm of what Michael Witmore once called “the philologically exquisite”. For the most part they stay in the lower regions of Lower Criticism.  They matter a lot in the aggregate, but individually they are rarely of interest. Patterns of defects are of interest because they may suggest algorithmic solutions or semi-automatic routines for finding and fixing defects.

Engineering English

80,000 corrections is a lot of corrections, but it is a small percentage of the more than eight million “known unknowns” and the several million “unknown unknowns”, such as typographical errors, whether the printer’s or the transcriber’s, or words that are wrongly split or wrongly joined—a very common source of error, especially in notes and double column texts.

Most of the student work has been funded through summer research grants for eight-week sessions that over the years have crept up from $3,000 to $4,000. The value added by that investment consists of the corrections and of what the student learned. The cost per correction is close to a dollar. It is still fifty cents if you argue that half of the grant pays for what the student learns. If the goal remains a coarse but comprehensive clean-up of the TCP texts in an EarlyPrint environment, it is clear that current approaches do not scale. The Annotation Module is a very good tool for a readerly approach to fixing defects in a text as you encounter them. It is also the best tool for proofreading every tenth page of a text, which would be a good way of flushing out recurring defects and—in the case of negative results—building confidence in the integrity of a partially proofread text. But by itself it will not contribute enough to the goal of a coarse but comprehensive clean-up of the entire corpus.

Could machine learning help? Doug Downey is a professor of Computer Science at Northwestern. He has a special interest in utilizing human input more effectively in machine learning. In 2017 two of his undergraduate students, Sangrin Lee and Larry Wang, used “long short-term memory” algorithms (LSTM) to find correct replacements for incompletely transcribed words. Their results were so promising that we incorporated these “autocorrections” as provisional emendations into the display of the EarlyPrint texts, using a special colour to mark their algorithmic status.

Machine learning never completely replaces human labour. It is helpful to think of the three R’s of animal experiments: Replace, reduce, refine. The conditions under which algorithmic corrections can be accepted without any human review are limited. But experiments in 2020 showed that at a minimum the one-by-one review of algorithmic corrections doubles or triples the productivity of undergraduate curators.

Over the past few years the software for neural network procedures has become more sophisticated and easier to use (TensorFlow). You no longer need machines with special processors to run the programs; a laptop with Apple’s M1 chip will do. Thus it pays to re-examine the potential of algorithmic correction and to think of various divide-and-conquer approaches, not only to correct incompletely transcribed tokens but also to review the five million unique spellings and identify the many cases where a unique spelling is a misprint. This is a particularly challenging version of a spellchecker problem, because across a textual archive of two billion words spanning more than two centuries it is not easy to distinguish clearly between misprints and quirky variants.

Success will depend on starting from the assumption that the coarse but consistent clean-up of the TCP texts will remain a project with a relatively high degree of human intervention.  Computer scientists call such projects “mixed initiatives”. I am fairly confident that 80% of current defects can be corrected with such an approach. It would  require substantial conversations among domain and technical experts and a more granular and iterative  approach to specifying the contexts for evaluating the likelihood of a replacement for a corrupt or incomplete token.  There will always be many “last mile” problems where the human editors are on their own. But even in those cases you should not underestimate the power of  digital tools in reducing the time cost of “finding”  defects and “logging” the curator’s emendations.

At one point I conducted a very primitive experiment in which I treated blackdot words as regular expressions and checked whether they matched words in the same text. A “regular expression” is a pattern matching device where a dot matches any single character. Thus ‘b.t’ matches ‘bat’, ‘bet’, ‘bit’, ‘but’. If you replace the black dots with dots you have a regular expression pattern. Using this approach I found fewer hits than the LSTM algorithm, but the results were better: they were right at least nine out of ten times. The probability of success also improves with the length of the word.
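
Here is a sketch of how such a routine might look in Python; the toy vocabulary is invented for illustration.

```python
# Treat a blackdot word as a regular expression: each '●' matches any single
# character, and candidate replacements are drawn from the same text's
# vocabulary. The Counter below is a toy stand-in for a real text.

import re
from collections import Counter

def candidates(blackdot_word, vocabulary):
    """Return the vocabulary words that the defective word could be."""
    pattern = re.compile("^" + blackdot_word.replace("●", ".") + "$")
    return [w for w in vocabulary if pattern.match(w)]

text_vocab = Counter({"but": 120, "bat": 3, "betwixt": 5, "loue": 40, "lone": 2})

print(candidates("b●t", text_vocab))      # ['but', 'bat'] - still ambiguous
print(candidates("lo●e", text_vocab))     # ['loue', 'lone'] - still ambiguous
print(candidates("betwix●", text_vocab))  # ['betwixt'] - longer words do better
```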

If the replacement for the defective word is not found in the same text it probably occurs in a text that is like it. In the LSTM experiment we only used chronological closeness as a criterion. The “discovery engine” of EarlyPrint has other ways of grouping texts. It is possible and not particularly difficult to define any text as a member of a group of other texts that share all or some of such categories as author, printer, genre, or date. Instead of looking for a match in the entire corpus or some chronological subset you look for a match in a more closely circumscribed set. If you focus on mapping very rare spellings to more standardized forms, restricting the search to texts from the same printer may well be helpful.

EarlyPrint maps spellings to standard forms unless the spelling already is a standard form. A blackdot word can frequently be mapped to a standard spelling with complete certainty, while its actual spelling cannot be identified with equal certainty. The blackdot spelling “lo●e” could in principle stand for ‘lobe’, ‘lode’, ‘loke’, ‘lome’, ‘lone’, ‘lore’, ‘lose’, ‘loue’, ‘love’, ‘loze’. Most of the time it will be either ‘loue’ or ‘love’. Identifying the standard spelling ‘love’ with 100% certainty is worth doing. And if the text contains no instance of the spelling ‘haue’–a good indicator that its printer did not use ‘u’ for ‘v’–you may confidently restore the actual spelling as ‘love’ as well.
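
A sketch of that reasoning, combining the regular-expression matching above with a small spelling-to-standard lookup table; both the table and the sample spellings are invented for illustration.

```python
# Even when the actual spelling of a blackdot word cannot be recovered, its
# standard spelling often can: every plausible candidate maps to the same
# standard form. The lookup table and spellings below are invented.

import re

SPELLING_TO_STANDARD = {
    "loue": "love", "love": "love",
    "lone": "lone",
    "haue": "have", "have": "have",
}

def standard_for_blackdot(blackdot_word, text_spellings):
    """Return the standard spelling if all candidates agree on it, else None."""
    pattern = re.compile("^" + blackdot_word.replace("●", ".") + "$")
    matches = {s for s in text_spellings if pattern.match(s)}
    standards = {SPELLING_TO_STANDARD.get(s, s) for s in matches}
    return standards.pop() if len(standards) == 1 else None

print(standard_for_blackdot("lo●e", {"loue", "haue", "loving"}))  # love
print(standard_for_blackdot("lo●e", {"loue", "lone"}))            # None: candidates disagree
```

Applied across the corpus, a routine of this kind settles the standardized reading of many blackdot words without ever having to decide between ‘loue’ and ‘love’.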

The human labour involved in the review of algorithmic corrections can also be reduced and refined by sorting results in terms of the number of possible matches. There is only one possible correction of ‘ab●ue’, but ‘abo●e’ is a different story. It takes time, but not a lot of it, to read through a list of common “autocorrections” and identify the cases for which there is only one possible correction. Those can be globally replaced. In such cases you reduce and refine human labour by moving from the review of individual passages to the discovery of rules that the machine can apply without exceptions. There are many such “micro-rules”.

Taking advantage of digital tools and routines

In a print-based environment editorial work proceeds in a page-by-page and book-by-book manner. In a digital environment you are not constrained in this way. If you can identify defects that are identical or very similar you can process them in a batch mode. It is one thing to discover 92 times that ‘silth’ should be ‘filth’. It is another to see the 92 occurrences simultaneously. Sometimes you may, like the tailor in Grimm’s fairy tale, dispatch all of them at a blow. But fixing them one by one as part of a single action will still be quicker than discovering and fixing them separately. Dennis Duncan’s recent Index, A History of the is a great celebration of the power of book indexes, but the time cost of look-ups is constrained by the speed with which you can turn pages. In print, the time cost of looking up and fixing 92 occurrences of ‘silth’ is prohibitive relative to the benefit of the correction. In a semi-manual digital environment you can often identify and arrange the relevant data in ways that reduce the time cost of correction by orders of magnitude.
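
A sketch of that batch mode, assuming a simple token table: one known correction is applied everywhere it occurs, and each individual change is still logged.

```python
# One known correction ('silth' -> 'filth') applied everywhere it occurs,
# with each individual change still logged. The token list is toy data.

TOKENS = [
    {"token_id": "A10001-004-0312", "spelling": "silth"},
    {"token_id": "A10234-017-0088", "spelling": "filth"},
    {"token_id": "A10777-002-0199", "spelling": "silth"},
]

def batch_correct(tokens, wrong, right, curator):
    """Apply one correction wherever it applies; return a change log."""
    log = []
    for t in tokens:
        if t["spelling"] == wrong:
            t["spelling"] = right
            log.append({"token_id": t["token_id"], "old": wrong,
                        "new": right, "curator": curator})
    return log

changes = batch_correct(TOKENS, "silth", "filth", curator="student1")
print(f"{len(changes)} occurrences fixed in one pass")
```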

Given the nature  of most  textual problems in the TCP texts, the solution will rarely require a context of more than the seven preceding and following words. MorphAdorner, the program that EarlyPrint uses for linguistic annotation, has a tabular output consisting of a keyword in context together with the lexical and morphological data about the previous, current, and next word.
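
A sketch of what working with such a table can look like; the tab-separated layout and column names below are assumptions for illustration rather than MorphAdorner's actual output format.

```python
# Each row of the table carries the previous, current, and next word together
# with lexical data. Column names and the tab-separated layout are assumptions.

import csv
import io

SAMPLE = (
    "token_id\tprev\tword\tnext\tlemma\tpos\tstandard\n"
    "A10001-004-0311\tof\tloue\tand\tlove\tnoun\tlove\n"
    "A10001-004-0312\tgreat\tlo●e\tto\t\t\t\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# A window of seven words either side is rarely needed; the immediate
# neighbours often settle the reading.
for row in rows:
    print(f'{row["prev"]} [{row["word"]}] {row["next"]} -> lemma={row["lemma"] or "?"}')
```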

In the summer of 2022 three Northwestern Classics undergraduates used a slightly modified version of this output to curate 120 medical texts. Ace Chisholm, Grace DeAngeles, and Lauren Kelley worked from Google spreadsheets that included links to relevant pages. In addition to transcribing 50 missing pages from a dozen different books (using matching facsimiles on the Internet Archive), the three students corrected about 18,000 defective tokens. This is a pretty impressive record, given the fact that the medical texts turned out to be quite thorny. Their transcriptions have much higher defect rates than the TCP as a whole. The following table gives you the defect rates per 10,000 words for the interquartile range and at the 10th and 90th percentiles for different categories of texts:

TCP source texts                       10%      25%       50%      75%        90%
52,800 EarlyPrint texts                0        1.05      12.15    47.78      141.474
814 playtexts                          0        0         2.83     16.22      71.425
120 medical texts before curation      4.627    14.7575   53.23    156.3475   284.876
120 medical texts after curation       1.8      3.32      13.3     30.03      128.4

The first data row shows you the current defect rates for all EarlyPrint texts. Since only about 1,000 texts have been touched by any curator, these figures will not be very different from the figures for all TCP texts. The second row shows the data for 814 play texts. This set of texts has received considerable and continuing attention beyond the initial Shakespeare His Contemporaries phase. The improvements are very striking: the median defect rate has dropped by a factor of four. What you can do for plays, you can do for sermons, travel narratives, histories, etc.

The third and fourth rows show the defect rates for medical texts before and after curation. You see right away that the median defect rate is larger by almost half an order of magnitude. Curation made a big difference. The median defect rate dropped by a factor of four, roughly speaking from two defects per page to one defect every other page. I suspect that the emendation of remaining defects will in most cases require access to better images than the EEBO scans provided.

It is instructive to look in detail at the collaborative curation of Thomas Cogan’s Haven of Health, a text from the 1580s whose title and spirit show the influence of Thomas Elyot’s Castle of Health.

Finding the unknown unknowns

On average, once every 200 words you run across a “blackdot” whose appearance draws attention to the fact that it needs fixing. How many “unknown unknowns” are there–i.e. words that are corrupt but don’t draw attention to themselves?  If you assume that a corrupt token occurs at least once every thousand words, that adds up to 1.6 million words in a corpus of 1.6 billion words. It would not be an unreasonable working hypothesis to assume that there may be two million or more such tokens hidden in plain sight. In  a strict “diplomatic” edition it would be important to distinguish between printer’s and transcriber’s mistakes. For this project that distinction does not matter.

Language is an LNRE phenomenon–large numbers of rare events. Some forms of corruption are quite common, and you can discover them by routines such as substituting ‘s’ for ‘f’ or ‘n’ for ‘u’ (and the other way round). But the ‘tokens’ of most corrupt ‘types’ are counted in the low dozens or less. Powers of 4 create useful bins for grouping words by frequency. In EarlyPrint a word that occurs in 64 documents will on average occur once in every 850 texts. Words with a document frequency of 64 or less add up to just 3% of all word tokens but 97% of all distinct spellings. 3% of 1.6 billion is 48 million tokens. That is a lot of tokens, but it takes you into a scale that is manageable with quite ordinary digital tools. If you create a table of those tokens in which every row contains the corresponding MorphAdorner data, you can be confident that it will contain nearly all the textual phenomena that require some attention, whether you are looking for a particular phenomenon across all texts or working your way through one text or a group of them.
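
A sketch of that filtering step, with invented frequency figures: keep the spellings found in 64 or fewer documents and measure how much of the running text this "concentrate" accounts for.

```python
# Keep only the spellings that occur in 64 or fewer documents and measure how
# much of the running text that "concentrate" covers. The figures are toy
# numbers; in EarlyPrint the concentrate holds about 97% of distinct spellings
# but only about 3% of all word tokens.

SPELLINGS = [
    # (spelling, document frequency, total occurrences) - invented data
    ("the",    52000, 90_000_000),
    ("loue",   30000,  1_200_000),
    ("silth",      7,         92),
    ("louyth",    23,         65),
    ("sinple",     8,          9),
]

def freq_bin(doc_freq):
    """Bin a document frequency by powers of 4: 1, 4, 16, 64, 256 ..."""
    b = 1
    while b < doc_freq:
        b *= 4
    return b

concentrate = [(s, df, n) for s, df, n in SPELLINGS if df <= 64]
token_total = sum(n for _, _, n in SPELLINGS)
token_rare = sum(n for _, _, n in concentrate)

print(sorted({freq_bin(df) for _, df, _ in SPELLINGS}))   # [16, 64, 65536]
print(f"{len(concentrate)} of {len(SPELLINGS)} spellings in the concentrate, "
      f"covering {token_rare / token_total:.4%} of tokens")
```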

For my own purposes, I extracted such a table for the ~8,200 texts from the first 150 years of English print, from Caxton’s Troy book (1473) to the Shakespeare Folio (1623). It consists of not quite eight million rows, and each data row includes the document frequency of its token. I moved it into a Postgres database that I accessed via an affordable graphical user interface client that simplifies working with the underlying SQL language. The student teams last year and this year learned enough about this tool to put it to good use. It is a very efficient tool for the many cases where the nature of textual corruption is obvious from the keyword in context and does not require consulting the image.

While statistically driven inquiries prefer to focus on differences in the distribution of very common words, “hapax legomena” or “dislegomena”–words that are unique to a document or occur in only one other document–remain objects of great interest in many philological inquiries. A tool that makes it easy to identify them will have enthusiastic users. There is a scenario that, both from a narrow curatorial and a more broadly philological perspective, assembles rare words in a particularly revealing and instructive manner. You can select all the data rows for a particular text and sort them by spelling or in order of occurrence. Either way you will encounter most of the problem cases in that text as a whole. A briefe and true report of the new found land of Virginia (1590) is a text of 22,023 words. Its concentrate runs to 708 words, which is 3.24% of the word count, with half of them occurring in four or fewer documents. Some of them are quite weird, others are American Indian words or names. You can also select all word occurrences in the corpus of words that occur in A briefe and true report. That will return 5,582 data rows, which you will want to sort by spelling. You see that ‘sinple’, an obvious misprint that occurs twice in A briefe and true report, is also found in seven other texts. Not everybody will be excited by this discovery, but for people who take an interest in words for their own sake this type of query is extraordinarily efficient in bringing all occurrences of rare and odd phenomena to the attention of human eyes. A more ambitious query for all corpus occurrences of rare words used in North’s translation of Plutarch will generate 125,000 hits, with many opportunities for “data janitoring”, while you pause along the way to see whether some 500 words shared by North and Shakespeare are of critical interest.
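
The two query types just described are easy to express in SQL. Here is a sketch that uses an in-memory SQLite database as a stand-in for the project's Postgres table; the table layout and column names are assumptions for illustration.

```python
# An in-memory SQLite database stands in for the project's Postgres table.
# Table layout, column names, and the sample rows are assumptions.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tokens
               (token_id TEXT, text_id TEXT, spelling TEXT, doc_freq INTEGER)""")
con.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", [
    ("A1-001-0001", "briefe_report", "sinple", 8),
    ("A1-002-0044", "briefe_report", "loadeth", 12),
    ("A2-004-0101", "other_text",    "sinple", 8),
])

# 1. All low-frequency rows of one text, sorted by spelling.
rows = con.execute("""SELECT spelling, token_id FROM tokens
                      WHERE text_id = 'briefe_report' AND doc_freq <= 64
                      ORDER BY spelling""").fetchall()

# 2. All corpus occurrences of the rare spellings found in that text.
hits = con.execute("""SELECT spelling, text_id, token_id FROM tokens
                      WHERE spelling IN (SELECT spelling FROM tokens
                                         WHERE text_id = 'briefe_report'
                                           AND doc_freq <= 64)
                      ORDER BY spelling, text_id""").fetchall()

print(rows)   # the text's own "concentrate"
print(hits)   # 'sinple' turns up in other texts as well
```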

Complementing the Annotation Module with a “philological shopping cart”

Considered as a transaction, a web-based emendation is very similar to ordering something on the Internet. Digital “shopping carts” often consist of a Web “frontend” that talks to a “backend” SQL database via something called “object-relational mapping”. I could imagine a “philological shopping cart” that would mediate access to the “text concentrate” of the approximately 50 million tabular records of EarlyPrint tokens with a document frequency of 64 or less. Users would not need to master SQL, but the interface would present a dozen common query types in a user-friendly manner. This technique was used with great success in the Chicago Homer.

By the standards of a humanities project this would be a fairly big animal. In a business or science setting it would be at best a Ford F-150 with four-wheel drive and an off-road suspension. I use the metaphor of the shopping cart to draw attention to the fact that the parts, relationships, and scale issues of such a site are very well understood.
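
To make the metaphor a little more concrete, here is a sketch of what the object-relational layer of such a cart might look like, written with SQLAlchemy. The choice of library, the table layout, and the column names are all assumptions for illustration, not a description of an existing implementation.

```python
# A sketch of the backend such a "philological shopping cart" could talk to;
# all names are illustrative.

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Token(Base):
    __tablename__ = "tokens"
    token_id = Column(String, primary_key=True)   # file + page + word counter
    text_id = Column(String, index=True)
    spelling = Column(String, index=True)
    doc_freq = Column(Integer)                    # the concentrate keeps doc_freq <= 64
    emendations = relationship("Emendation", back_populates="token")

class Emendation(Base):
    __tablename__ = "emendations"
    id = Column(Integer, primary_key=True)
    token_id = Column(String, ForeignKey("tokens.token_id"))
    new_reading = Column(String)
    curator = Column(String)
    status = Column(String, default="proposed")
    token = relationship("Token", back_populates="emendations")

engine = create_engine("sqlite:///:memory:")      # Postgres in a real deployment
Base.metadata.create_all(engine)

with Session(engine) as session:
    t = Token(token_id="A1-001-0001", text_id="briefe_report",
              spelling="sinple", doc_freq=8)
    t.emendations.append(Emendation(new_reading="simple", curator="reader42"))
    session.add(t)
    session.commit()
```

The Web frontend would then do little more than translate a dozen canned query types into selects against these two tables.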

Icing on the Cake: Young scholar editions

A coarse but consistent clean-up of the millions of textual defects in the TCP transcriptions would certainly be welcomed by Early Modern scholars and their students. A decade ago Judith Siefring and Eric Meyer wrote a report on Sustaining the EEBO-TCP Corpus in Transition. It talks about users complaining in particular about errors that were easy to fix. As I said before, most of the work stays within the lower regions of Lower Criticism. It is also work whose social utility is inversely proportionate to its rank in the prestige hierarchy. It makes good sense to engage students in this work. The ones who want to do it are usually good at it. They learn a lot from it, and they enjoy it for a summer or two. But they should also have opportunities for more visible work.

The TCP archive includes two plays by Catharine Trotter published in the late 1690s. Last summer, three students from Classics and English got interested in her work, transcribed her three other plays (published between 1701 and 1706) from facsimiles in Eighteenth-Century Collections Online (ECCO), and proofread all five of them. All of Trotter’s plays are now on EarlyPrint, two of them with excellent digital facsimiles.

Some years ago I made an argument for Young Scholar editions in the context of Shakespeare His Contemporaries. For much of the twentieth century a critical edition was a proper dissertation subject. Those days are gone, but the three TCP collections are full of texts that would be excellent candidates for editions that fit into the time frame of a senior essay or Independent Studies project. The texts would benefit from being properly edited and framed by an introduction and explanatory notes. Sermons, pamphlets, political speeches, etc. come to mind.

EarlyPrint shares much of its technical infrastructure with TEI-Publisher, an “instant publishing toolbox” that is used by e-editiones, a small Swiss society for scholarly editions of diverse cultural archives (disclosure: I am a Board member). I could imagine a cross-disciplinary seminar in which half a dozen seniors from History, Literature, Political Science, and Religion would pursue their separate projects, discuss editorial problems and the rhetorical challenges of writing introductions and notes while sharing a technical infrastructure. It is worth repeating that the combined resources of EEBO-TCP and Evans TCP add up to a very rich archive of Early American materials.

A final reflection

There are two perspectives for looking back on a decade of this project and forward to its future. You could conclude that after ten years of intermittent work there have been substantial changes in only 1,000 of some 50,000 texts, two percent, give or take. So what? And who cares anyhow about whether a lot of digitized old books are full of mistakes as long as you get the gist of them?

Alternatively, you could argue that a dozen people in two institutions, with some support from others elsewhere, not only made substantial progress in two percent of the texts, but developed prototypes of manual, algorithmic, and mixed-mode data curation that are eminently exportable and could be used by anybody anywhere. If people at fifty other institutions picked up this ball and ran with it, that would be a very different game. And fifty institutions are a very small percentage of American higher education, not to speak of countries elsewhere.

I naturally incline to the latter view. In 2010 Clay Shirky published Cognitive Surplus, a book with the original and deeply optimistic subtitle “Creativity and Generosity in a Connected Age”. It is not easy to argue plausibly that the world as a whole has much improved since 2010. On the other hand, if you feel like committing a little of your “cognitive surplus” to the curation of Early Modern books, there is no doubt that the physical conditions for doing so have much improved. Computer screens have got bigger and better. Internet connections are faster, and the relative costs of maintaining large data sets in the “cloud” are dropping. Technologies like IIIF are greatly increasing the number of freely available high-quality images that can be used to check the accuracy of TCP transcriptions. Mary Wroth’s own copy of her Urania with her own marginalia is now freely available from Penn’s Colenda repository. A few months ago I spent a pleasant day checking the TCP transcription against it and making several hundred mostly trivial corrections. It was fun, and like John Kittle I felt that by making these corrections the world is a better place in a very small but measurable way.

 
