Collaborative curation of 126 medical texts in the EarlyPrint corpus

Three Northwestern Classics undergraduates, Ace Chisholm, Grace DeAngeles, and Lauren Kelley, had summer research grants to work on the curation of medical texts in the EarlyPrint corpus. The following is their report, very lightly edited and supplemented by a few hyperlinks. For more information about collaborative curation, Early English Books Online, and EarlyPrint, see https://sites.northwestern.edu/scalablereading/2022/09/25/collaborative-curation-of-tcp-texts-in-the-earlyprint-environment/

The three editors also wrote a detailed summary of the text, accompanied by stylistic and thematic analysis. You can find it at

 

EarlyPrint 2022: What we did and what we learned

Our goal

This summer, we worked on medical texts from the English early modern period. Our goal was to get this medical corpus in polished shape for future researchers to use. We also researched a text that was particularly interesting: Thomas Cogan’s Haven of Health, first published in 1584 (though we worked with the fourth edition, published in 1636).

Our contributions to EarlyPrint

In order to make texts in the EarlyPrint corpus more easily read, searched, and analyzed, a plethora of mistakes in the original transcriptions, both known and unknown, had to be corrected, and gaps in the text needed to be filled.  We took four different approaches to meet these needs: correcting “blackdot” words, proofreading, correcting metadata, and transcribing.  To address the “known unknowns,” we worked through nineteen spreadsheets covering 126 texts and corrected almost 20,000 blackdot words.  Blackdot words are those in which the original transcriber was unable to identify one or more letters and indicated how many there were and where they occurred.  Those missing letters appear as black dots (●) in the digital transcription.  In many cases we were able to determine the word from context clues, but where we could not, we consulted the EEBO page image for its appearance on the page.  In cases where the word was Latin, Greek, or even abbreviated Latin, our knowledge of Latin (and Grace’s knowledge of Greek) was particularly useful.  Even when a letter or two were completely obscured, we could often identify the word from our knowledge of grammar and its orthographic realization.

The “unknown unknowns” that we dealt with took the form of either incorrectly transcribed words or incorrectly tagged words.  Because incorrectly transcribed words were unmarked (hence an unknown unknown), we rooted them out by proofreading.  We proofread Haven of Health (A19070), simply correcting as we read, as well as seventeen pages of The vertuose boke of distyllacyon, the earliest English “herbal” or book about plants, recording the types of errors we discovered in that focused sample in a spreadsheet for analysis. Within the seventeen pages, we identified and corrected 450 transcription errors: hapax spellings and words with missing, erroneously added, or changed letters.  We recorded the tendencies of each type of error with the intention of informing a future parser about likely mistakes and their remedies.

We also used “known unknowns” to identify the “unknown unknowns” in word metadata.  MorphAdorner adorns the texts of the EarlyPrint corpus with appropriate metadata, such as each word’s part-of-speech tag.  While this searchable data is incredibly useful to researchers, MorphAdorner is not always accurate.  Using Aqua Data Studio, a user-friendly frontend for relational databases, we reviewed the metadata attached to words within medical texts that had been tagged as names, along with every other instance of those words.  We corrected the lemma, standardized form, and token (the word itself) of these 5,532 words and changed the part-of-speech tag to “fla” or “n1/2” or “nn1/2” as appropriate.

A number of texts in the EarlyPrint corpus have entire pages missing, a result of the corresponding EEBO images being illegible or incomplete.  After Grace led a brief training session in XML, the markup language used for TCP texts, we set to work in the Oxygen XML Editor to fill those gaps.  We transcribed 107 missing pages for fifteen texts in both English and Latin, completing eleven of the texts and moving the remaining four closer to that end.

In addition to correcting and developing texts, we also researched and produced a report for Haven of Health (A19070), containing a brief biography of the author, Thomas Cogan, summaries of the six major sections of the text and their subsections, and a collection of analyses considering the topical, linguistic, and theoretical aspects of the work.  The research that went into these analyses expanded our understanding of early modern medical beliefs, practices, and their underlying rationale.

The value of our pursuit

It has been a worthwhile endeavor to work with the medical texts of early modern England because, in a time when disease dominates global discourse, it is essential to consider the roots of modern medicine. There is no value in understanding where we are if we don’t know where we came from, as history informs the decisions that will be made in the future.

Many aspects of early-modern medicine contrast with the present day. We no longer maintain the theory of four humors; pills have replaced herbal remedies; surgical patients are likely to survive their procedures; women’s bodies are no longer considered ‘cold and moist’. Regardless of what issues we may have with modern medicine, we can rest assured that any given affliction is less painful and deadly today than it was in the past.

However, a comparison of early-modern and present-day medicine reveals an essential similarity between the two: medicine is always subjective to some degree. In the twentieth century, medical research began in earnest to strive for objectivity, using double-blind procedures, randomization, control groups, placebos, and statistical analysis, inter alia.[1] This trend has continued and accelerated into the twenty-first century, simultaneously drawing from and contributing to the growing importance that society places on scientific advancement.[2] While the research itself may be objective, objectivity is a subjectively defined concept that is shaped by social discourse. Additionally, the supremacy of objectivity, rather than subjectivity, in medicine is an arbitrary choice that society has made.

Studying early-modern medical texts can give us perspective on the state of our current social discourses surrounding medicine. Religion no longer has any part in medical science, but the two were closely intertwined in several of the texts we studied. For example, the 1636 work Lord have mercy upon us the world, a sea, a pest-house, the one full of stormes, and dangers, the other full of soares and diseases reports that “the principall cause of all the Diseases of the body, are those of the Soule, which is Sinne.”[3] Many of the authors that we read wove together objectivity and subjectivity. For example, an author might first discuss the potential procedures for setting a broken bone, then state his opinion that a particular procedure is best because he himself saw that it healed a man’s arm in three days.

Religion and one’s own observations are both subjective, for they present personal perceptions and interpretations which cannot be verified and may not necessarily be shared by others. Such subjectivity was commonplace and acceptable in early-modern medicine, as the authors’ contemporary society did not require them to be objective and methodical. Having read these early-modern medical texts, we now have a much clearer perspective on how cultural values and social discourse shape the human experience of, and opinions about, medicine.

Moreover, given our status as Classicists, we were able to read these texts with an awareness of ancient medicine. This allowed us to consider the impact of ancient medicine on early-modern and present-day medicine. The authors that we read often paraphrased and quoted directly from writers of ancient Greece and Rome, such as Hippocrates, Galen, and Cornelius Celsus. They also usually made sure to cite their sources, demonstrating that the teachings of ancient medicine were esteemed in the medical culture of early modern England. Many pieces of information about diseases, humors, injuries, and body parts (inter alia) were transmitted directly from the ancient sources into the early modern texts that we worked on. Ancient medical knowledge was generally unquestioned, which is understandable through the Humanist lens that prioritized learning from classical antiquity. On the other hand, modern-day medical science prefers research-based advancement that constantly overrides and disproves ideas that were credible in the past.

Our work this summer has demonstrated that the culture surrounding modern-day medicine is a relatively new phenomenon and marks a break from the history of medicine. Gaining perspective on how medical information has operated and currently operates in diverse societies and time periods helps us understand where the future of medicine might lead, as the world renegotiates its relationship to medicine in the wake of pandemic.

What we have learned about early-modern medicine

Through proofreading the medical text corpus, we were exposed to a wide breadth of texts written throughout the early modern period. Reading these texts brought an immediate and almost staggering realization of the monumental developments within the field of medicine in just a few centuries. Even the texts written near the end of this period propound an outdated view of health based on Humorism and the 2,000-year-old teachings of Hippocrates. In a relatively short time frame, the advent of the Scientific Revolution and subsequent developments in science have transformed the way that disease and health are understood. Although the information in these texts is outdated, they have the invaluable ability to remind us of the often-slow progression of knowledge, as well as to stand as a testament to the never-ceasing desire of humanity to learn more about itself and the world.

The medical text corpus contained texts that spanned just over a century, between the early 16th century and the mid-17th century. Although medical theory remained largely unchanged in this timeframe, there was still a notable difference between individual works from the early end of the corpus and those from the latter end. Earlier texts tended to stick primarily to English, while later texts tended to quote more frequently from Latin and Greek authors. This would suggest a pattern of increasing exclusion within medicine, since extensive schooling would have been required to understand those languages. In earlier centuries, folk medicine seems to have been more dominant; medical knowledge was passed around within communities, and a formal degree would not necessarily have been required to gain authority or to practice medicine within a community. However, later authors seem to derive their authority from their education. As discussed in the formal report for The Haven of Health, Cogan quotes from ancient authors largely to establish his own expertise in medicine. The ability to read these authors in their original language was a skill that only educated men had, and if medical authority was restricted to those who had a firsthand intimacy with the ancients, then the pool of people who could claim medical authority was limited.

Additionally, the reliance on ancient authors suggests an increased importance of having a source for medical information. Earlier texts tended to cite the sources of their recipes and information less often, while later texts cited more consistently and precisely where their information came from. While this is additional proof of the gatekeeping of medical knowledge, since being able to cite sources requires an education, it also suggests a movement towards analysis and antiquarian research. Authors of later medical texts had to have an impressive grasp of a vast body of medical knowledge and to choose selectively the information they considered most accurate and relevant. There are several examples in Haven of Health where Cogan presents conflicting information from two different sources and then reconciles them with his own opinion and thoughts. This behavior demonstrates a remarkable ability to analyze and synthesize material from a massive corpus of medical information, as well as to build upon established knowledge and tweak it if necessary. This behavior became much more prevalent as time went on, and it built the foundation for a modern culture of peer-reviewed research and collaboration in medical settings.

One aspect of the medical corpus that remained fairly consistent over time was its treatment of women. With the exception of two or three texts concerning midwifery, the vast majority of texts contained little or no information about medicine that was specific to women. As a pertinent example, John Banister’s 1578 The Historie of Man explicitly refuses to discuss the reproductive anatomy of women (at the end of a chapter dedicated to male reproductive anatomy) since he would “commit more indecencie agaynst the office of Decorum, then yeld needefull instruction to the profite of the common sort.”[4] There was a pervasive sentiment throughout the corpus that women’s issues were secondary to men’s issues, or that they were not worth mentioning. Of course, at this point in history, the formal medical profession was essentially restricted to men, and the vast majority of these texts were aimed at an educated, male audience. It is therefore no surprise that this patriarchal institution dedicated little time or effort to understanding issues that affected only women. Additionally, the strong presence of the Church in society undoubtedly rendered discussions about female anatomy and reproduction taboo and possibly offensive. The scant mention of these topics in the medical corpus is further proof of the importance of alternate medical authorities, such as midwives, in the lives of women.

The role of religion in the medical corpus demonstrates another tradition that is largely gone from the modern world of medicine. A considerable number of texts make some mention of God in the title, or begin with a preface that explains the author’s dedication to God. This can be explained with the central Christian ideology that the body is a temple to God, and that maintaining health is a way to remain faithful and preserve oneself so as to better serve the Lord. Additionally, the act of helping the poor and needy is another central teaching of Christianity that is fulfilled by the profession of physicians. This is very consistent with the prevalence and authority of the Church in early modern society, and explains the close intertwining of religion and medicine. An interesting observation that is somewhat more surprising to the modern reader is the role of sin in medicine. Since medicine was not secular, health encompassed both physical and spiritual health; therefore, if someone were out of favor with God, they were not considered truly healthy since they were in a state of deep sin. Depending on the source, spiritual sickness could even lead to physical ailment. The large role that religion used to play in diagnosis and prognosis further demonstrates the cultural impact of Christianity on all aspects of early modern life.

Becoming familiar with these medical texts has certainly reinforced our sense of the vast progress that has been made in medicine from the early modern period to the current era. In many ways, the medical landscape of this time is unrecognizable; these works were written before germ theory, when the prevailing medical philosophy had remained essentially unchanged for 2,000 years. The reliance of physicians on ancient authors instead of empirical evidence is almost unimaginable to modern readers. Perhaps of most importance to half the world’s population, modern medicine has expanded its bounds to include women and their specific health needs. However, there is some continuity in the actual hierarchical structure of medicine; in the current era, there is an increasing barrier to entry into the medical profession, with higher levels of education and experience needed to become certified. As in the 17th century, medical authority tends to rest in the hands of relatively few experts who bestow treatment upon the larger population. Therefore, while the content and understanding of medicine have changed over time, the man-made structures and hierarchies within the field remain constant.

Conclusion

Before this summer, the three of us had only a baseline knowledge of ancient medicine and no knowledge of early modern medicine. Nonetheless, we managed both to learn from and to contribute to the medical corpus of EarlyPrint. We hope that our efforts inspire others to continue to research the medical corpus, or to work on other topical corpora that spark their interest.

[1] Bhatt, Arun. “Evolution of clinical research: a history before and beyond James Lind.” Perspectives in Clinical Research vol. 1,1 (2010): 6-10.

[2] “Society” here refers to Western society—we do not have enough familiarity with the history and current practices of Eastern medicine to analyze it here.

[3] https://texts.earlyprint.org/works/A68989.xml?page=010-b.

[4] The Historie of Man, 89. https://texts.earlyprint.org/works/A03467.xml?page=098-a

A list of texts for which missing pages were transcribed from digital facsimiles of the printed originals on the Internet Archive.

 

S121369 1533 Fabyan, Robert, Fabyans cronycle newly prynted, wyth the cronycle, actes, and dedes done in the tyme .. Henry the vii.
S106976 1596 Le Sylvain, The orator: handling a hundred seuerall discourses, in forme of declamations:

 

S102357 1598 Florio, John, A vvorlde of wordes, or Most copious, and exact dictionarie in Italian and English
S106753 1599 Hakluyt, Richard, The principal nauigations, voyages, traffiques and discoueries of the English nation
S103824 1599 Harsnett, Samuel, A discouery of the fraudulent practises of Iohn Darrel Bacheler of Artes
S117760 1624 Camden, William The historie of the life and death of Mary Stuart Queene of Scotland.
R15125 1652 Selden, John, Of the dominion, or, ownership of the sea two books.
R9064 1660 Bartoli, Daniello The learned man defended and reform’d. A discourse of singular politeness, and elocution
R19153 1661 [no entry] Mathematical collections and translations… Galileus his system of the world.
R42174 1673 Milton, John, Poems, &c. upon several occasions. By Mr. John Milton: both English and Latin
R5715 1673 Ray, John, Observations topographical, moral, & physiological; made in a journey through .. the Low-Countries…
R27278 1678 Cudworth, Ralph, The true intellectual system of the universe
R16918 1699 Haudicquer de Blancourt, Jean, The art of glass. Shewing how to make all sorts of glass, crystal and enamel.
R31983 1700 Dryden, John, Fables ancient and modern; translated into verse,

Collaborative curation of TCP texts in the EarlyPrint environment

This is a report about the collaborative curation of Early Modern Anglophone texts transcribed by the Text Creation Partnership (TCP) and maintained by the EarlyPrint project in a linguistically annotated form.  It is written for an audience for whom TCP is not a familiar acronym.  It describes work done over the past decade and looks ahead to what could or should be done in the future. For a more broadly based and reflective essay about the same topic, see Re-mediating the Documentary Infrastructure of Early Modern Studies in a Collaborative Fashion.

I begin with three assertions:

  1. Considered in the aggregate, the almost 70,000 TCP texts are the most important source of primary documents for the study of the English-speaking world before 1800 and more particularly before 1700.
  2. In their current form very few of these texts meet the minimal conditions of a carefully proofread text, and many fall seriously short of it.
  3. The dissonance between these two assertions can be largely resolved by involving the user community in the correction of millions of textual defects. Digital tools and routines have greatly lowered the access barriers to such work. Much of the work that needs to be done lies comfortably within the competence of careful readers from many walks of life. All you need is  some interest in old books and a little patience.  Students have been very good at it, doing work from which they learn and that is useful to others.

The scope of the TCP project

The English Short Title Catalogue (ESTC) is the most authoritative bibliographical source for materials printed in the English-speaking world before 1800. It has some 480,000 entries ranging from single-page broadsheets to multi-volume books like Hakluyt’s Voyages. A high percentage of these “books” were microfilmed over the course of the 20th century, mostly by University Microfilms, a business with historical relations to the University of Michigan. In the late 1990s ProQuest, the current owner of University Microfilms, created Early English Books Online (EEBO) by making digital scans of the microfilms available over the Web. When I asked a colleague what difference digital things made to his work, I could barely finish my question before he said: “EEBO has changed everything”.  The stuff you needed had suddenly moved from the least loved technology of the 20th century to a medium where you can get at it at 2 am and in your pyjamas.

ProQuest partnered with the University of Michigan and the University of Oxford to use the EEBO scans as the sources for the TCP project, which over the course of two decades created SGML-encoded transcriptions of some 60,000 texts published before 1700 (they now circulate mostly in the XML format that replaced SGML).  The goal was to create something like a deduplicated library of pre-1700 Anglophone print.  It is not easy to determine when or for what purpose two texts are duplicates of each other. Broadly speaking, EEBO-TCP includes about 70% of distinct pre-1700 titles in the ESTC, and much of what is in the 30% of missing titles occurs somewhere in the 70% of transcribed titles if you interpret ‘duplicate’ broadly enough. For many practical purposes you can think of EEBO-TCP as a Noah’s Ark of pre-1700 English print culture.

The TCP also partnered with Readex and Gale to create 5,000 transcriptions of American books before 1800 and 2,000 transcriptions of English 18th-century texts. These smaller projects cover respectively 12% and 0.7% of the corresponding ESTC entries.  That said, 2,000 18th-century titles are still quite a few titles.

How the TCP texts were made

The transcriptions were farmed out to transcription services like Apex. Much of the work was done in Asian countries. We know next to nothing about the workers. They were held to no more than one error per 20,000 keystrokes, and they were instructed to mark as precisely as possible the number and position of the letters or words that they could not decipher. Quality control was assigned to staff with academic backgrounds at Michigan and Oxford.

The transcriptions are not “diplomatic” editions that capture the look and feel of the source document.  They are quite Spartan and explicitly ignore many features that would be of interest  to a bibliographer or book historian. They are faithful to orthography but largely ignore information about typefaces, layout, running heads, or other detail about the physical nature of a page or book. On the other hand, the structural markup with a reasonably granular set of TEI “elements” created texts that with appropriate software are much more deeply searchable than  “plain text” transcriptions would be.

Within significant constraints of time and money, it was clearly a major goal of the project to push out as many texts as possible, restrict features to  the essentials of words in sequence, and live with a higher error rate than you would tolerate in the editing of a single book. In retrospect that was clearly the right thing to do. In a digital archive texts can comfortably coexist at different levels of (im)perfection and be subject to continuing improvement.

Defects in the TCP texts

In what follows, figures are derived from 52,800 EEBO texts that are currently part of the EarlyPrint project, about which more below.

The biggest source of error in the TCP texts has to do with the quality of the microfilms from which they were transcribed. The quality of the printed page from which the microfilm was made is a close second. About 4% of the texts are missing a total of 11,000 pages, ranging from 1 to 297 per book. The modal case is a missing double-page image.  Transcription error rates in the outer or inner margins of page images are higher by a factor of five than elsewhere on the page.  Faded small print in a marginal note becomes even harder to decipher in a poor microfilm image.  Undecipherable final letters in the left inner margins and initial letters in the right inner margins are the victims of microfilms that were made before deskewing devices became common.

The combined defects caused by poor microfilm images or their page sources are the reason for the uneven distribution of textual defects. The most common defects–untranscribed letters in words–cluster heavily.  In the screen display of the text, letters known to be missing are represented by the symbol ‘●’. There are almost eight million of them.  On average, one word in 200 is a “blackdot word”. But 55% of them are found in the 15% of texts with more than 100 defects per 10,000 running words.  The interquartile range of defects runs from 1.05 to 47.8 defects per 10k, with a median value of a dozen defects per 10k.  People have an inveterate tendency to judge a barrel by its worst apples. But one might say of the TCP texts what Edmund Malone is supposed to have said of Shakespeare’s plays: “The texts of our author are not as bad as they are made out to be.”

Donald Rumsfeld once distinguished between “known unknowns” and “unknown unknowns”. It’s the latter that get you.  You can count the blackdot words, and quite often they are easy to fix.  The unknown unknowns are hidden in plain sight among 1.6 billion words.  Thus the more than 10,000 occurrences of the string ‘re’ usually represent the ablative of the Latin word ‘res’, but hundreds of them are the prefix ‘re’ of a word that has been wrongly split.

Why does it matter to fix defects?

Textual defects may block your understanding of a text, but quite often you see right through them. Many people don’t care about such “noise” as long as they get the “signal”. This is a very common attitude among researchers who are engaged in machine learning or statistically driven Natural Language Processing. They are right in a context where you are trying to get some information out of a flood of textual ephemera.  But if ephemera hang around long enough they become cultural heritage objects deserving attention and care. Time has transformed them into what Thucydides thought his history should be: a “possession forever”.  Thomas Carlyle described the Thomason tracts, a collection of some 20,000 very ephemeral pamphlets from the English Civil War, as a “hideous mass of rubbish” that was nonetheless the most informative source about what the English of their day thought and felt.  Carlyle’s phrase aptly describes many TCP texts.

There is a  “yuck” factor in all this. According to the New York Times (Nov. 16, 2009), the engineer John Kittle helped improve the Google map for his hometown of Decatur, Georgia, and said:

Seeing an error on a map is the kind of thing that gnaws at me. By being able to fix it, I feel like the world is a better place in a very small but measurable way.

It has become much easier to do something about things that gnaw at you in that way. At a conference some years ago Gregory Crane, the editor of the Perseus Project, argued that “Digital editing lowers barriers to entry and requires a more democratized and participatory intellectual culture.” In the context of the much more specialized community of Greek papyrologists, Joshua Sosin has successfully called for “increased vesting of data control in the user community.”

Early Modern printers were keenly aware of the mistakes they made. There is a subgenre of whimsically contrite apologies asking for the reader’s forgiveness and cooperation. I am very fond of the printer’s plea in Harding’s Sicily and Naples, a mid-seventeenth-century play:

Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.

For Latinists, there are the lapidary and imperative final words of Samuel Garey’s Great Brittans little calendar (1618):

Amen.
Candido lectori: Humanum est errare, errata hic corrige (lector) quae penna, aut praelo lapsa fuisse vides.

If enough readers share John Kittle’s sense of gnawing and have a good sense of how much digital technologies have lowered the entry barriers to collaborative work, there is a chance that over a number of years their “many hands will make light work” and many of the texts will be substantially improved.

Collaborative curation at Northwestern

My earliest experiment with collaborative curation goes back to 2009. I gave students the option of replacing an essay with fixing defects in the TCP transcriptions of Marlowe’s Tamburlaine. The students worked from spreadsheets that listed defective tokens as keywords in context and added a URL to a page image. Two students wrote a charming essay about how this exercise had taught them to become “fluent in Marlowe”. Their essay remains a useful demonstration of how a simple exercise can lead to reflections on the fundamentals of philological labour.

Between 2013 and 2015 three generations of students at Northwestern, augmented by volunteers from Amherst and Washington University in St. Louis, worked on a project called Shakespeare His Contemporaries. They reduced the median defect rate in the TCP transcriptions of some 500 non-Shakespearean plays by an order of magnitude, from 14.6 to 1.4 per 10,000 words, or roughly from 30 to three defects per play.

Nicole Sheriko was a member of the first team of five students supported by summer research grants. She went on to graduate school, working at the “intersection of literary criticism, cultural studies, and theatre history,” and is now a Junior Research Fellow at Christ’s College, Cambridge. Last year, she had this to say about her experience:

Having the experience of working with a professor on a project outside of the classroom–especially in digital humanities, which everyone seems to find trendy even if they have no idea what it entails–was a vital piece of my graduate school applications, I think, and other students may see a similar benefit in that.

In a less vulgarly practical sense, though, I would say that working on what was then Shakespeare His Contemporaries made a significant difference in how I approach studying the field of early modern drama. The typical college course can only focus on a handful of canonical texts but working across such an enormous digital corpus reoriented my sense of how wide and eclectic early modern drama is. It gave me a chance to work back and forth between close and distant reading, something I still do as I reconstruct the corpus of more marginal forms of performance from references scattered across many plays. A lot of those plays are mediocre at best, and I often remember a remark you once made to us about how mediocre plays are so valuable for illustrating what the majority of media looked like and casting into relief what exactly makes good plays good. The project was such a useful primer in the scope and aesthetics of early modern drama. It was also a valuable introduction to the archival challenges of preservation and digitization that face large-scale studies of that drama. Getting a glimpse under the hood of how messy surviving texts are–both in their printing and their digitization–raised all the right questions for me about how critical editions of the play get made and why search functions on databases like EEBO require a bit of imaginative misspelling to get right. That team of five brilliant women was also my first experience of the conviviality of scholarly work, which felt so different from my experience as an English major writing papers alone in my room. That solidified for me that applying to grad school was the right choice, a sentiment likely shared by my teammate Hannah Bredar, who–as you probably know–also went on to do a PhD. Once I got to grad school, the project also followed me around in my first year because I took a course in digital humanities and ended up talking a lot about the TCP and some of the little side projects I ended up doing for Fair Em, like recording the meter of each line to see where breakdowns occurred. I even learned some R and did a final project looking for regional markers of difference across the Chronicling America historical newspaper corpus. So, in big ways and small, the work I did at NU has stayed with me.

EarlyPrint and curation en passant

Beginning in 2016, the Shakespeare His Contemporaries project morphed into the more ambitious EarlyPrint.  This joint project of Northwestern and Washington University in St. Louis is dedicated to the curation and analysis of all texts in the TCP archives. EarlyPrint currently contains some 58,000 texts from the TCP EEBO, Evans, and ECCO archives, about 82% of all TCP texts.

The EarlyPrint texts differ from their TCP sources in several respects:

  1. Each word in the text has a unique ID that concatenates the file ID with a page ID and a word counter. This does nothing for the reader, but it allows for a much more complex and accurate management of textual corrections.
  2. Each word in the text is linguistically annotated with a lemma, a part of speech, and a standard spelling; where the spelling is archaic, the standardized form preserves the archaic morphology (‘louyth’ > ‘loveth’ rather than ‘loves’). Readers can switch the text display between original and standardized spellings with the click of a button. (A sketch of what such an annotated token might look like follows this list.)
  3. The texts are indexed with the very fast BlackLab search engine of the Dutch Language Institute, which supports searches by lexical or grammatical criteria, e.g. “adjectives preceding ‘liberty’ or ‘freedom’” or “three adjectives with the last preceded by ‘and’”.
  4. A small but growing number of texts (586 as of September 2022) has matching image sets from IIIF servers at the Internet Archive, the Kislak Center at the University of Pennsylvania and the Ransom Center at Texas. These images are usually much superior to the EEBO scans, and they are free.
  5. Any reader can at the click of a button call up an “annotation module” that displays a data entry form where the reader can enter corrections or comments for any defective or suspect reading.  Some 80,000 corrections have been made in this fashion since 2017.
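To make the first two items concrete, here is a minimal sketch in Python of the kind of record involved. The ID format, field names, and tag values are invented for illustration and may well differ from what EarlyPrint actually uses.

```python
# Hypothetical sketch only; the real EarlyPrint ID format and field names may differ.

def word_id(file_id: str, page_id: str, counter: int) -> str:
    """Concatenate file ID, page ID, and a word counter into a unique token ID."""
    return f"{file_id}-{page_id}-{counter:04d}"

# One linguistically annotated token: spelling, lemma, part of speech, standard form.
token = {
    "id": word_id("A19070", "010-b", 42),  # invented page and counter values
    "spelling": "louyth",
    "lemma": "love",
    "pos": "vvz",          # assumed NUPOS-style tag for a present-tense verb form
    "standard": "loveth",  # archaic morphology preserved, as described above
}
print(token["id"])  # A19070-010-b-0042
```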

The final feature is the most relevant to this discussion.  Just about any activity follows a pattern of “set-up, action, clean-up”. Quite often the set-up and clean-up take more time than the action. The philological version of this sequence goes like “find it, fix it, log it”.  If you can reduce the time cost of finding and logging, you may persuade more people to become user-contributors and fix this or that defect as they come across it in their reading.

The Annotation Module works best in a mode that I call “curation en passant”. You come across a reading that is clearly wrong or suspect. Clicking on a button will open a window that lets you enter an emendation or question the reading with a comment. This is as simple as writing in the margin of a book page. You must be a registered user to do so (a very simple procedure).  As soon as you save your correction,  the machine creates a permanent record of the who, what, when, and where of your transaction. The record is saved into a central archive. Your proposed emendation is immediately but provisionally displayed in the text, highlighted in yellow. Once your emendation has been approved by an editor with the additional privilege of approving or rejecting emendations, it is highlighted in green. The underlying texts of the EarlyPrint site are periodically updated.  The updated texts no longer highlight approved corrections, but the project maintains a permanent record of textual changes.

While  the record of corrections is not currently public, it could be made public–somewhat in the manner of the Berichtigungsliste or correction list that Greek papyrologists maintain. But the corrections very rarely enter into the realm of what Michael Witmore once called “the philologically exquisite”. For the most part they stay in the lower regions of Lower Criticism.  They matter a lot in the aggregate, but individually they are rarely of interest. Patterns of defects are of interest because they may suggest algorithmic solutions or semi-automatic routines for finding and fixing defects.

Engineering English

80,000 corrections is a lot of corrections, but it is a small percentage of the more than eight million “known unknowns” and the several million “unknown unknowns”, such as typographical errors, whether the printer’s or the transcriber’s, or words that are wrongly split or wrongly joined—a very common source of error, especially in notes and double column texts.

Most of the student work has been funded through summer research grants for eight-week sessions that over the years have crept up from $3,000 to $4,000. The value added by that investment consists of the corrections and of what the student learned. The cost per correction is close to a dollar. It is still fifty cents if you argue that half of the grant pays for what the student learns.  If the goal remains a coarse but comprehensive clean-up of the TCP texts in an EarlyPrint environment, it is clear that current approaches do not scale. The Annotation Module is a very good tool for a readerly approach to fixing defects in a text as you encounter them. It is also the best tool for proofreading every tenth page of a text, which would be a good way of flushing out recurring defects and—in the case of negative results—building confidence in the integrity of a partially proofread text. But by itself it will not contribute enough to the goal of a coarse but comprehensive clean-up of the entire corpus.

Could machine learning help? Doug Downey is a professor of Computer Science at Northwestern. He has a special interest in utilizing human input more effectively in machine learning. In 2017 two of his undergraduate students, Sangrin Lee and Larry Wang, used “long short-term memory” algorithms (LSTM) to find correct replacements for incompletely transcribed words. Their results were so promising that we incorporated these “autocorrections” as provisional emendations into the display of the EarlyPrint texts, using a special colour to mark their algorithmic status.

Machine learning never completely replaces human labour. It is helpful to think of the three R’s of animal experiments: Replace, reduce, refine. The conditions under which algorithmic corrections can be accepted without any human review are limited. But experiments in 2020 showed that at a minimum the one-by-one review of algorithmic corrections doubles or triples the productivity of undergraduate curators.

Over the past few years the software for neural network procedures has become more sophisticated and easier to use (TensorFlow). You no longer need machines with special processors to run the programs (Apple’s M1 chip). Thus it pays to re-examine the potential of algorithmic correction and to think of various divide-and-conquer approaches, not only to correct incompletely transcribed tokens but also to review the five million unique spellings and identify the many cases where a unique spelling is a misprint. This is a particularly challenging version of a spellchecker problem, because across a textual archive of two billion words spanning more than two centuries it is not easy to distinguish clearly between misprints and quirky variants.

Success will depend on starting from the assumption that the coarse but consistent clean-up of the TCP texts will remain a project with a relatively high degree of human intervention.  Computer scientists call such projects “mixed initiatives”. I am fairly confident that 80% of current defects can be corrected with such an approach. It would  require substantial conversations among domain and technical experts and a more granular and iterative  approach to specifying the contexts for evaluating the likelihood of a replacement for a corrupt or incomplete token.  There will always be many “last mile” problems where the human editors are on their own. But even in those cases you should not underestimate the power of  digital tools in reducing the time cost of “finding”  defects and “logging” the curator’s emendations.

At one point I conducted a very primitive experiment in which I treated blackdot words as regular expressions and checked whether they matched words in the same text. A “regular expression” is a pattern matching device where a dot matches any single character. Thus ‘b.t’ matches ‘bat’, ‘bet’, ‘bit’, ‘but’. If you replace the black dots with dots you have a regular expression pattern. Using this approach I found fewer hits than the LSTM algorithm, but the results were better: they were right at least nine out of ten times. The probability of success also improves with the length of the word.
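Here is a minimal sketch of that experiment in Python, assuming a toy set of word counts standing in for the vocabulary of a single text; the real routine would run over the actual transcription.

```python
import re
from collections import Counter

def blackdot_candidates(blackdot: str, vocabulary: Counter) -> Counter:
    """Turn a blackdot word into a regular expression (each ● becomes a wildcard)
    and return the vocabulary entries that match it, with their frequencies."""
    pattern = re.compile(
        "^" + "".join("." if ch == "●" else re.escape(ch) for ch in blackdot) + "$"
    )
    return Counter({w: n for w, n in vocabulary.items() if pattern.match(w)})

# Toy word counts standing in for the vocabulary of one text.
text_vocabulary = Counter({"love": 12, "lore": 2, "lone": 1, "haue": 3, "lived": 4})
print(blackdot_candidates("lo●e", text_vocabulary))
# Counter({'love': 12, 'lore': 2, 'lone': 1})
```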

If the replacement for the defective word is not found in the same text it probably occurs in a text that is like it. In the LSTM experiment we only used chronological closeness as a criterion. The “discovery engine” of EarlyPrint has other ways of grouping texts. It is possible and not particularly difficult to define any text as a member of a group of other texts that share all or some of such categories as author, printer, genre, or date. Instead of looking for a match in the entire corpus or some chronological subset you look for a match in a more closely circumscribed set. If you focus on mapping very rare spellings to more standardized forms, restricting the search to texts from the same printer may well be helpful.

EarlyPrint maps spellings to standard forms unless the spelling already is a standard form. A blackdot word can frequently be mapped to a standard spelling with complete certainty, even when its actual spelling cannot be identified with equal certainty. The blackdot spelling “lo●e” could in principle stand for ‘lobe’, ‘lode’, ‘loke’, ‘lome’, ‘lone’, ‘lore’, ‘lose’, ‘loue’, ‘love’, ‘loze’. Most of the time it will be either ‘loue’ or ‘love’. Identifying the standard spelling ‘love’ with 100% certainty is worth doing. And if the text contains no instance of the spelling ‘haue’, you may confidently resolve the token itself as ‘love’ rather than ‘loue’.

The human labour involved in the review of algorithmic corrections can also be reduced and refined by sorting results in terms of the number of possible matches. There is only one possible correction of ‘ab●ue’, but ‘abo●e’ is a different story. It takes time, but not a lot of it, to read through a list of common “autocorrections” and identify the cases for which there is only one possible correction. Those can be globally replaced. In such cases you reduce and refine human labour by moving from the review of individual passages to the discovery of rules that the machine can apply without exception. There are many such “micro-rules”.
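A sketch of that sorting step, under the same toy assumptions as above: blackdot patterns with exactly one possible match become global “micro-rules”, while the rest are set aside for human review. The wordlist and patterns are invented for illustration.

```python
import re
from collections import defaultdict

wordlist = {"aboue", "above", "abide", "abuse"}   # invented stand-in for corpus spellings

def matches(blackdot: str, words: set) -> set:
    rx = re.compile("^" + "".join("." if c == "●" else re.escape(c) for c in blackdot) + "$")
    return {w for w in words if rx.match(w)}

micro_rules, needs_review = {}, defaultdict(set)
for pattern in ("ab●ue", "abo●e", "ab●se"):
    found = matches(pattern, wordlist)
    if len(found) == 1:
        micro_rules[pattern] = found.pop()   # safe to replace everywhere
    else:
        needs_review[pattern] = found        # a curator must look at the context

print(micro_rules)          # {'ab●ue': 'aboue', 'ab●se': 'abuse'}
print(dict(needs_review))   # {'abo●e': {'aboue', 'above'}} (set order may vary)
```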

Taking advantage of digital tools and routines

In a print-based environment editorial work proceeds in a page-by-page and book-by-book manner. In a digital environment you are not constrained in this way. If you can identify defects that are identical or very similar you can process them in batch mode. It is one thing to discover 92 times that ‘silth’ should be ‘filth’. It is another to see the 92 occurrences simultaneously. Sometimes you may, like the tailor in Grimm’s fairy tale, dispatch all of them at a blow. But fixing them one by one as part of a single action will still be quicker than discovering and fixing them separately. Dennis Duncan’s recent Index, A History of the is a great celebration of the power of book indexes, but the time cost of look-ups is constrained by the speed with which you can turn pages. In print, the time cost of looking up and fixing 92 occurrences of ‘silth’ is prohibitive relative to the benefit of the correction. In a semi-manual digital environment you can often identify and arrange the relevant data in ways that reduce the time cost of correction by orders of magnitude.

Given the nature of most textual problems in the TCP texts, the solution will rarely require a context of more than the seven preceding and following words. MorphAdorner, the program that EarlyPrint uses for linguistic annotation, has a tabular output consisting of a keyword in context together with the lexical and morphological data about the previous, current, and next word.
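The exact column layout of the MorphAdorner output is not reproduced here, but a hypothetical sketch of the kind of row a curator works with might look like this (field names and values are invented):

```python
from dataclasses import dataclass

@dataclass
class AdornedRow:
    """Hypothetical shape of one row of keyword-in-context output."""
    word_id: str     # unique token ID
    previous: str    # preceding word
    spelling: str    # the token as transcribed
    lemma: str
    pos: str
    standard: str    # standardized spelling
    following: str   # next word

row = AdornedRow("A19070-010-b-0042", "of", "phisicke", "physic", "n1", "physic", "is")
```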

In the summer of 2022 three Northwestern Classics undergraduates used a slightly modified version of this output to curate 120 medical texts. Ace Chisholm, Grace DeAngeles, and Lauren Kelley worked from Google spreadsheets that included links to relevant pages. In addition to transcribing 50 missing pages from a dozen different books (using matching facsimiles on the Internet Archive), the three students corrected about 18,000 defective tokens. This is a pretty impressive record, given the fact that the medical texts turned out to be quite thorny.  Their transcriptions have much higher defect rates than the TCP as a whole.  The following table gives you the defect rates per 10,000 words for the interquartile range and at the 10th and 90th percentile for different categories of texts:

TCP source texts 10% 25% 50% 75% 90%
52,800 EarlyPrint texts 0 1.05 12.15 47.78 141.474
814 playtexts 0 0 2.83 16.22 71.425
120 medical texts before curation 4.627 14.7575 53.23 156.3475 284.876
120 medical texts after curation 1.8 3.32 13.3 30.03 128.4

The first data row shows you the current defect rates for all EarlyPrint texts. Since only about 1,000 texts have been touched by any curator, these figures will not be very different from the figures for all TCP texts.  The second row shows the data for 814 play texts. This set of texts has received considerable and continuing attention beyond the initial Shakespeare His Contemporaries phase. The improvements are very striking: the median defect rate has dropped by a factor of four. What you can do for plays, you can do for sermons, travel narratives, histories, etc.

The third and fourth rows show the defect rates for medical texts before and after curation. You see right away that before curation the median defect rate is larger than that of the corpus as a whole by almost half an order of magnitude.  Curation made a big difference. The median defect rate dropped by a factor of four, roughly speaking from two defects per page to one defect every other page.  I suspect that the emendation of remaining defects will in most cases require access to better images than the EEBO scans provide.

It is instructive to look in detail at the collaborative curation of Thomas Cogan’s Haven of Health, a text from the 1580s whose title and spirit show the influence of Thomas Elyot’s Castle of Health.

Finding the unknown unknowns

On average, once every 200 words you run across a “blackdot” whose appearance draws attention to the fact that it needs fixing. How many “unknown unknowns” are there–i.e. words that are corrupt but don’t draw attention to themselves?  If you assume that a corrupt token occurs at least once every thousand words, that adds up to 1.6 million words in a corpus of 1.6 billion words. It would not be an unreasonable working hypothesis to assume that there may be two million or more such tokens hidden in plain sight. In  a strict “diplomatic” edition it would be important to distinguish between printer’s and transcriber’s mistakes. For this project that distinction does not matter.

Language is an LNRE phenomenon: a large number of rare events. Some forms of corruption are quite common, and you can discover them by routines such as substituting ‘s’ for ‘f’ or ‘n’ for ‘u’ (and the other way round). But the ‘tokens’ of most corrupt ‘types’ are counted in the low dozens or less.  Powers of 4 create useful bins for grouping words by frequency. In EarlyPrint a word that occurs in 64 documents will on average occur once every 850 texts. Words with a document frequency of 64 or less add up to just 3% of all word tokens but 97% of all distinct spellings. 3% of 1.6 billion is 48 million tokens. That is a lot of tokens, but it takes you into a scale that is manageable with quite ordinary digital tools. If you create a table of those tokens in which every row contains the corresponding MorphAdorner data, you can be confident that it will contain nearly all textual phenomena that require some attention, whether you are looking for a particular phenomenon across all texts or working your way through one text or a group of them.
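A minimal sketch of how such a “concentrate” might be extracted, assuming the corpus is available as a simple mapping from document IDs to token lists; the real data would come from MorphAdorner’s tabular output rather than from this toy representation.

```python
from collections import defaultdict

def document_frequency(corpus: dict) -> dict:
    """corpus maps a document ID to its list of spellings."""
    df = defaultdict(int)
    for spellings in corpus.values():
        for spelling in set(spellings):
            df[spelling] += 1
    return df

def concentrate(corpus: dict, max_df: int = 64):
    """Yield every token whose spelling occurs in at most max_df documents."""
    df = document_frequency(corpus)
    for doc_id, spellings in corpus.items():
        for position, spelling in enumerate(spellings):
            if df[spelling] <= max_df:
                yield doc_id, position, spelling, df[spelling]
```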

For my own purposes, I extracted such a table for the ~8,200 texts from the first 150 years of English print, from Caxton’s Troy book (1473) to the Shakespeare Folio (1623). It consists of not quite eight million rows, and for each data row it includes the document frequency of its token.  I moved it into a Postgres database that I accessed via an affordable graphical user interface client that simplifies working with the underlying SQL language. The student teams last year and this year learned enough about this tool to put it to good use.  It is a very efficient tool for the many cases where the nature of textual corruption is obvious from the keyword in context and does not require consulting the image.

While statistically driven inquiries prefer to focus on differences in the distribution of very common words, “hapax legomena” or “dislegomena”–words that are unique to a document or occur in only one other document–remain objects of great interest in many philological inquiries. A tool that makes it easy to identify them will have enthusiastic users. There is a scenario that, both from a narrow curatorial and a more broadly philological perspective, assembles rare words in a particularly revealing and instructive manner. You can select all the data rows for a particular text and sort them by spelling or in order of occurrence. Either way you will encounter most of the problem cases in that text as a whole. A briefe and true report of the new found land of Virginia (1590) is a text of 22,023 words. Its concentrate runs to 708 words, which is 3.24% of the word count, with half of them occurring in four or fewer documents. Some of them are quite weird; others are American Indian words or names. You can also select all occurrences in the corpus of the words that occur in A briefe and true report of the new found land of Virginia. That will return 5,582 data rows, which you will want to sort by spelling. You see that the two occurrences of the obvious misprint ‘sinple’ in A briefe and true report are also found in seven other texts. Not everybody will be excited by this discovery, but for people who take an interest in words for their own sake this type of query is extraordinarily efficient in bringing all occurrences of rare and odd phenomena to the attention of human eyes. A more ambitious query for all corpus occurrences of rare words used in North’s translation of Plutarch will generate 125,000 hits, with many opportunities for “data janitoring“, while pausing along the way to see whether some 500 words shared by North and Shakespeare are of critical interest.
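The two query patterns just described take only a few lines against such a table. The sketch below uses pandas with invented file and column names (text_id, position, spelling); in practice the EarlyPrint concentrate sits in a Postgres database and would be queried in SQL.

```python
import pandas as pd

# Assumed export of the rare-word concentrate with columns text_id, position, spelling.
rare = pd.read_csv("concentrate.csv")

# 1. All rare-word rows for one text, sorted by spelling ("position" would give text order).
one_text = rare[rare.text_id == "A12345"].sort_values("spelling")  # hypothetical text ID

# 2. All corpus occurrences of the rare words that occur in that text.
shared = rare[rare.spelling.isin(set(one_text.spelling))].sort_values("spelling")
print(len(one_text), len(shared))
```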

Complementing the Annotation Module with a “philological shopping cart”

Considered as a transaction, a web-based emendation is very similar to ordering something on the Internet. Digital “shopping carts” often consist of a Web “frontend” that talks to a “backend” SQL database via something called “object-relational mapping”.  I could imagine a “philological shopping cart” that would mediate access to the “text concentrate” of the approximately 50 million tabular records of EarlyPrint tokens with a document frequency of 64 or less. Users would not need to master SQL, but the interface would present a dozen common query types in a user-friendly manner. This technique was used with great success in the Chicago Homer.

By the standards of a humanities project this would be a fairly big animal. In a business or science setting it would be at best a Ford F-150 with four-wheel drive and an off-road suspension. I use the metaphor of the shopping cart to draw attention to the fact that the parts, relationships, and scale issues of such a site are very well understood.

Icing on the Cake: Young scholar editions

A coarse but consistent clean-up of the millions of textual defects in the TCP transcriptions would certainly be welcomed by Early Modern scholars and their students. A decade ago Judith Siefring and Eric Meyer wrote a report on Sustaining the EEBO-TCP Corpus in Transition. It talks about users complaining in particular about errors that were easy to fix. As I said before, most of the work stays within the lower regions of Lower Criticism. It is also work whose social utility is inversely proportional to its rank in the prestige hierarchy. It makes good sense to engage students in this work. The ones who want to do it are usually good at it. They learn a lot from it, and they enjoy it for a summer or two. But they should also have opportunities for more visible work.

The TCP archive includes two plays by Catharine Trotter published in the late 1690s. Last summer, three students from Classics and English got interested in her work, transcribed her three other plays (published between 1701 and 1706) from facsimiles in Eighteenth Century Collections Online (ECCO), and proofread all five of them.  All of Trotter’s plays are now on EarlyPrint, two of them with excellent digital facsimiles.

Some years ago I made an argument for Young Scholar editions in the context of Shakespeare His Contemporaries. For much of the twentieth century a critical edition was a proper dissertation subject. Those days are gone, but the three TCP collections are full of texts that would be excellent candidates for editions that fit into the time frame of a senior essay or Independent Studies project.  The texts would benefit from being properly edited and framed by an introduction and explanatory notes. Sermons, pamphlets, political speeches, etc. come to mind.

EarlyPrint shares much of its technical infrastructure with TEI-Publisher, an “instant publishing toolbox” that is used by e-editiones, a small Swiss society for scholarly editions of diverse cultural archives (disclosure: I am a Board member). I could imagine a cross-disciplinary seminar in which half a dozen seniors from History, Literature, Political Science, and Religion would pursue their separate projects, discuss editorial problems and the rhetorical challenges of writing introductions and notes while sharing a technical infrastructure. It is worth repeating that the combined resources of EEBO-TCP and Evans TCP add up to a very rich archive of Early American materials.

A final reflection

There are two perspectives for looking back on a decade of this project and forward to its future. You could conclude that after ten years of intermittent work there have been substantial changes in only 1,000 of some 50,000 texts, two percent, give or take. So what? And who cares anyhow whether a lot of digitized old books are full of mistakes as long as you get the gist of them?

Alternatively, you could argue that a dozen people at two institutions, with some support from others elsewhere, not only made substantial progress on two percent of the texts but developed prototypes of manual, algorithmic, and mixed-mode data curation that are eminently exportable and could be used by anybody anywhere. If people at fifty other institutions were to pick up this ball and run with it, that would be a very different game. And fifty institutions are a very small percentage of American higher education, not to speak of institutions in other countries.

I naturally incline to the latter view. In 2010 Clay Shirky published Cognitive Surplus, a book with the original and deeply optimistic subtitle “Creativity and Generosity in a Connected Age”.  It is not easy to argue plausibly that the world as a whole has much improved since 2010.  On the other hand, if you feel like committing a little of your “cognitive surplus” to the curation of Early Modern books, there is no doubt that the physical conditions for doing so have much improved. Computer screens have got bigger and better. Internet connections are faster, and the relative costs of maintaining large data sets in the “cloud” are dropping.  Technologies like IIIF are greatly increasing the number of freely available high-quality images that can be used to check the accuracy of TCP transcriptions. Mary Wroth’s own copy of her Urania with her own marginalia is now freely available from Penn’s Colenda repository. A few months ago I spent a pleasant day checking the TCP transcription against it and making several hundred mostly trivial corrections. It was fun, and like John Kittle I felt that by making these corrections the world was a better place in a very small but measurable way.