This is a report about the current state of the collaborative curation of TCP texts. While I have written about this topic many times on this blog, this report is written for newcomers who have an interest in what was printed before 1800 but may or may not know anything about TCP texts.
TCP stands for Text Creation Partnership and refers to a partnership between Proquest and a consortium of university libraries that between 2001 and 2015 created approximately 65,000 SGML/XML transcriptions of printed items published before 1800 and written in English, with a sprinkling of not quite 1,000 texts in Latin, French, Welsh, and some other languages. This was a mixed commercial/academic enterprise, with an agreement to move the texts into the public domain within five years of their completion. About 30,000 texts moved into the public domain in 2015. The rest will follow in 2020.
The digital texts were created by transcription services, many of them by Apex Typing. Little is known about the people who did the work except that they nearly always worked in far away countries, were paid by the keystroke, and may or may not have known English, which may or may not have been an advantage.
The texts range from news sheets and broadside ballads to learned multi-volume and multilingual tomes, but the median length of a TCP text is less than 7,000 words, and only a little more than a quarter exceed 20,000 words. They are part of a print heritage of not quite half a million titles published before 1800 and catalogued in the English Short Title Catalogue (ESTC). More than half of those titles were published after 1750, and only 15% were published before 1650.
The EEBO, Evans, and ECCO TCP corpora
There are three different sets of TCP texts, but they were all transcribed according to the same protocols, which were observed with reasonable consistency across the fifteen years of the project. EEBO-TCP is by far the largest set. Its 60,000 titles come close to being a deduplicated library of books printed before 1700. There is probably no other period of comparable scope and significance where the printed record has been transcribed with comparable completeness and consistency. The approximately 5,000 Evans TCP texts were chosen from the Evans bibliography of American books before 1820. They represent about 15% of the titles in that bibliography. That is less coverage than offered by EEBO-TCP but it is still a substantial sample. The 2,000 ECCO TCP transcriptions are a cherry-picked collection with an emphasis on canonical high-culture texts. They probably include no more than 3% of distinct titles from that period but a library of 2,000 18th century books is still a sizable selection.
The nature and distribution of textual defects
The scribes in the digital scriptoria of Apex or similar businesses were not employed to produce scholarly, let alone ‘critical’ editions. They were employed to produce keystroke by keystroke transcriptions of what they saw in front of them–digital scans of microfilm images (of variable quality). Many things could and did go wrong in a chain of transmission in which a page, printed with too much or not enough ink to begin with, did not make it through four centuries without tears or blotches, suffered characteristic damage at the inner and sometimes outer margins during microfilming, and was not scanned with sufficient care.
The vendors and scribes were paid by the keystroke and held to an error rate of no more than one error in 20,000 keystrokes. But there was no penalty for words or letters they could not decipher. They were instructed to mark such lacunae or gaps as precisely as they could. Thus the texts are littered with millions of <gap> elements telling you that a specified number of characters, words, lines, paragraphs or pages are missing. For better and for worse the (dis)incentives in the contract that governed their work encouraged the scribes to be very conservative. There are no wild guesses, but there are many cases where the missing letter or word is perfectly obvious to a competent reader of the text. On the other hand, the scribes did not necessarily know the language of the text they were transcribing. They worked one keystroke at a time.
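As a rough illustration of how these markers can be tallied, here is a minimal Python sketch. The sample fragment and the form of the `extent` attribute values are assumptions made for the example, not an excerpt from a real TCP file, and real files are far larger and may use XML namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment, invented for illustration only.
SAMPLE = """<text>
  <p>I haue seene th<gap extent="1 letter"/>s booke and
     <gap extent="1 word"/> it well.</p>
  <p><gap extent="2 lines"/></p>
</text>"""

def count_gaps(xml_string):
    """Count <gap> elements, grouped by the unit named in their extent."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for gap in root.iter("gap"):
        extent = gap.get("extent", "unknown")
        # "1 letter" -> "letter", "2 lines" -> "line"
        unit = extent.split()[-1].rstrip("s")
        counts[unit] += 1
    return counts

print(dict(count_gaps(SAMPLE)))
```

Run over a whole corpus, a tally of this kind is what makes it possible to rank texts by how many letters, words, or pages are missing.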
According to my rough counts, there are 8.4 million incompletely transcribed words in 1.6 billion words of running text, a defect every 200 words. The defects are very unevenly distributed across texts and demonstrate a particularly striking version of the familiar 80:20 rule. The worst 10 percent of texts contain 80% of the defects. This skewed distribution holds for pages within texts, and even locations on pages: the inner and outer margins of pages have a defect rate that is five times higher than other locations on a page. The poor quality of the microfilm image is the largest source of error. Given a good enough page, the scribes did a good job.
In Donald Rumsfeld’s language, defects marked as such by scribes are “known unknowns”. Scribal errors or printers’ errors are “unknown unknowns”. You would think that the proofreading of texts with many “known unknowns” would also turn up a high proportion of scribal errors. But that is not the case: a review of quite a few pages of TCP transcriptions did not show a strong correlation of “known” and “unknown” unknowns.
People have an ingrained tendency to judge a barrel by its worst apples, and many scholars take a dim view of the TCP texts. Judith Siefring and Eric Meyer in their excellent study “Sustaining the EEBO-TCP Corpus in Transition” say more than once that in user surveys “transcription accuracy” always ranks high on the list of concerns. A defect rate of one word every 200 words is unacceptable in a corpus that lays a claim to being a major documentary source for Early Modern Studies. But it is not as bad as it is sometimes made out to be, and it is eminently fixable over time. Ten percent of the texts are in very bad shape indeed, but half of them have no or very few defects.
8.4 million defectives is a lot of defective tokens, but many hands make light work. Some form of an “Adopt a Text” campaign can break down the task into manageable bits. I became interested in the problem of collaboration in the context of a larger project that sought to transform the texts into a linguistically annotated corpus. Linguistic annotation starts with a process of “tokenization”. If characters are textual atoms, words are textual molecules. MorphAdorner, the tool chosen for this annotation, takes a very explicit approach to identifying tokens. Each word or textual molecule is wrapped in a special tag or element and gets a unique and corpus-wide ID. It turns into a discrete digital object that you can “adorn” with additional properties or attributes. A small subset of texts annotated in this manner became the data source for WordHoard, a tool that I used in the last decade of my teaching as a way of drawing students’ attention to the texts at the molecular level of particular words.
It so happens that by far the most common form of textual defects in the TCP corpora is of a kind that can be fixed by changing a single token, deleting it, or adding a new token before or after it. These operations can be performed in an atomic manner on the discrete digital word objects created by linguistic annotation. It is also the case that most of those changes, deletions, and insertions do not require a professional command of the very formidable discipline of textual criticism. They are obvious to careful readers willing to look closely at the source.
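The three atomic operations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by any of the tools discussed here: the `Token` class, the short IDs, and the way an inserted token is named are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Token:
    wid: str   # unique word ID (the short form used here is illustrative)
    text: str

def apply_edit(tokens, op, wid, text=None):
    """Return a new token list with one change/delete/insert applied."""
    idx = next((i for i, t in enumerate(tokens) if t.wid == wid), None)
    if idx is None:
        raise KeyError(wid)
    result = list(tokens)  # leave the caller's list untouched
    if op == "change":
        result[idx] = Token(wid, text)
    elif op == "delete":
        del result[idx]
    elif op == "insert_after":
        # Hypothetical naming scheme for the new token's ID.
        result.insert(idx + 1, Token(wid + "-ins", text))
    else:
        raise ValueError(f"unknown operation: {op}")
    return result

tokens = [Token("w10", "I"), Token("w20", "saw"), Token("w30", "th●s")]
fixed = apply_edit(tokens, "change", "w30", "thys")
print(" ".join(t.text for t in fixed))
```

Because every operation is addressed to one ID, each correction can be recorded, reviewed, and applied or reverted independently of all the others.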
The workflow of a textual correction follows a logic of “find it, fix it, log it.” The fixing is often the easy part. Finding and logging are harder. In the case of the TCP texts, the finding is simplified by two factors. First, the “known unknowns” draw attention to themselves by marks in the text, and secondly the Web has dramatically reduced the time cost of looking up the source text via its digital surrogate on the EEBO-Chadwyck site. Textual defects in TCP texts typically are the result of poor EEBO images, but in a majority of cases the image is still good enough to fix the defect. The image manipulation powers of modern browsers very often help clarify what the naked eye cannot see—and that despite the fact that the resolution of EEBO images is quite low.
The Web has been an even greater help in logging defects once they are fixed. It is one thing to fix something in your copy of a book, but what about getting that correction into the text that circulates, whether in print or in a digital format? If you see that “th●s” in the TCP text corresponds to “thys” in the image, how much time are you willing to spend on telling that to the relevant authorities who can integrate that correction into the source text?
I experimented with forms of “distributed annotation” in various classroom exercises. I never required them but gave students an option to substitute text curation for a paper hoping that such “hands-on” exercises would teach them something about the slippery ground of Early Modern texts and that in learning this lesson they might do work useful to others. The best proof for the utility of this approach was a charming essay that two undergraduates wrote about becoming “fluent in Marlowe”.
Web-based correction tools
In these early experiments the students looked at the EEBO images on the Web, but the textual data were kept on spreadsheets. Fast forward a few years, and Craig Berry built a single-interface Website called AnnoLex. It showed defective words in context, had a data entry field for corrections, automatically displayed the page image from the EEBO site, and had a “back office” window that allowed an editor to approve or reject emendations. While in principle annotations could be made by anybody anywhere, in practice they came from small groups of Northwestern students in summer internships. But the central point of this application was its universality. If a user logged into AnnoLex and made a correction, the ‘who’, ‘what’, ‘when’, and ‘where’ of that action were logged by the machine and associated with the unique ID of the word token affected by it. And the corrections could be automatically integrated into the source texts with a complete audit trail.
This project attracted the attention of Joe Loewenstein at Washington University in St. Louis. Two very remarkable students of his, Kate Needham and Lydia Zoells, joined the Northwestern collaboratory. In the early summer of 2015 Kate and Lydia, joined by Hannah Bredar at Northwestern, separately or together went on a curation romp that led them to the Bodleian, Folger, Houghton, and Newberry Libraries as well as the special collections of Northwestern and the University of Chicago. They fixed about 12,000 incompletely or incorrectly transcribed words. Add that to 46,000 changes the Northwestern students had made previously as well as to a sizable number of corrections made by a group of students at Amherst supervised by Peter Berek, and the median rate of textual defects per 10,000 words for some 500 non-Shakespearean plays dropped by an order of magnitude from 14.6 to 1.4.
The limitations of the single purpose tool AnnoLex defined the need for a broader environment that would support “curation en passant”, fixing defects as you encountered them in reading a text or looking up something in it. The current EarlyPrint site–developed with support from the Mellon Foundation–is an early implementation of that goal. It is slow in some ways and clunky in others, but it works, and there is a fairly clear path for improving it over time. The site, which uses the XML database eXist, currently contains some 12,000 TCP texts and by the end of the year will contain all public domain EEBO and Evans TCP texts for a total of about 30,000 titles.
In addition to the standard reading and search environment of the underlying database the site has an Annotation Module that supports structured and free-form annotations attached to the IDs of particular words. The structured annotation template lets users perform change, delete, and insert operations, while the free-form template supports comments of any kind. Textual corrections are displayed in the text immediately, with different colour codes marking their status as pending, approved, or rejected. But the permanent integration of a correction into the source text is a separate event. When a word is highlighted in amber, green, or red, it means that for the purpose of current display, the spelling of a word in the underlying source is replaced by an emendation that has not yet been reviewed, has been approved, or has been rejected by a reader with editorial privileges. Once approved corrections have been integrated into the source text, the annotations relating to it are removed from the EarlyPrint site, but there is a permanent record of them. How to display it is still under discussion. In many of the non-Shakespearean plays, the corrections are listed in a textual appendix that the site cannot currently display. It is, however, present in the source files.
High-quality public domain images
A small percentage of the current texts (376) are mapped to high-quality and public domain digital images contributed by various institutions to the Internet Archive. A sizable majority of them (268) are dramatic texts, mainly from the Boston Public Library. Right now, the EarlyPrint site is probably the most convenient site to explore a wide variety of Early Modern plays in the form of “digital combos”. The 29 plays at the Folger site of Early Modern English Drama are much more carefully curated and more elegantly presented than anything on this site. It is an advantage of the EarlyPrint site that it lets you look quickly across a lot of plays.
The non-dramatic digital combos include some very early texts (Fabyan’s Chronicles), some very long texts (Hakluyt, Holinshed), a set of Anglo-Irish texts from Notre Dame’s holdings, sermons by Calvin and a translation of Eusebius from the Princeton Theological Seminary, and Gerard’s Herbal from the Getty Institute.
Digital combos add evidentiary and aesthetic value to a TCP transcription. They are of course crucial to the task of correcting defects, but they also tell modern users much about the look and feel of a book for its original audience. There are at least another 1,000 image sets on the Internet Archive and other sites that could be used for digital combos, and their number will grow as libraries digitize their rare book holdings and make them accessible via IIIF servers.
Mapping the page numbers of TCP texts to image numbers is a labour-intensive task. We have developed a semi-automatic workflow in which a user gets a Google Sheet that has the page numbers of the TCP texts and the URL to the image set at the Internet Archive (or wherever). The user can map the image numbers to the page numbers, taking advantage of a simple spreadsheet function that automatically populates a column with numbers. This may take ten minutes or more than an hour. The important variable is not so much the length of the text as the presence of blank pages and of missing or misnumbered images in the image set. But in the context of a particular user pursuing a particular project, the time cost of mapping pages to images is a tolerable part of the set-up cost.
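The logic of the spreadsheet step can be sketched in Python. This is a simplified illustration, not the project's actual workflow: the function name, its parameters, and the sample page labels are all hypothetical, and it handles only the simplest irregularity, images that must be skipped because they are blank or do not belong to the text.

```python
def map_pages_to_images(page_labels, first_image, skip_images=()):
    """Assign consecutive image numbers to pages, skipping flagged images."""
    skip = set(skip_images)
    mapping = {}
    image = first_image
    for label in page_labels:
        # Step over any image the user has flagged as blank or extraneous.
        while image in skip:
            image += 1
        mapping[label] = image
        image += 1
    return mapping

pages = ["1", "2", "3", "4"]
print(map_pages_to_images(pages, first_image=7, skip_images=[9]))
```

Where the image set is regular, one starting number is all the user has to supply. The time goes into finding the images that break the sequence.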
For any work that depends on comparing a transcript with a facsimile, the ease of visual alignment is a critical factor. If the text was generated by optical character recognition (OCR), text and facsimile will naturally align because the line breaks of the text follow that of the image. Because the TCP transcriptions do not record line breaks the alignment of text and image is harder. Knowing something about the word IDs makes the task a little easier. The unique IDs of each word in an EarlyPrint text are constructed by adding a simple word counter to the image number of the EEBO scan from which the TCP transcription was derived. By clicking on any word in an EarlyPrint text that ID becomes visible. An ID like ‘A68218-015-a-1630’ points to a location at word 163 of the left side of the EEBO double image #15 of text A68218. If a look at the pages gives you a rough sense of the number of words on that page the word number will give you an idea where to look for it on the page.
Facsimile images on the Internet Archive (and similar sites) are usually in the JPEG 2000 (or JP2) format. They have ten times more pixels than the EEBO images, and they are much more manipulable. If you use the EarlyPrint site on a 15-inch laptop–not to speak of a desktop machine with a 27-inch screen–you have a text curation lab that would have been unimaginable thirty years ago. Quite ordinary standard tools and public domain texts and images put quite fine-grained editorial work within reach for anybody with the inclination and patience to do it. It is not everybody’s cup of tea, but if you are drawn to this kind of work you will find it in many ways rewarding, and whatever you do for yourself quickly becomes accessible to anyone.
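A short Python sketch shows how such an ID can be taken apart. Two points are assumptions inferred from the single example above rather than documented rules: that the final counter is divided by ten to give the word number, and that ‘b’ marks the right side of a double image.

```python
from typing import NamedTuple

class WordRef(NamedTuple):
    text_id: str
    image: int
    side: str
    word: int

def parse_word_id(wid):
    """Split an ID like 'A68218-015-a-1630' into its parts."""
    text_id, image, side, counter = wid.split("-")
    sides = {"a": "left", "b": "right"}  # "b" = right is an assumption
    # Dividing by ten follows the example in the text (1630 -> word 163).
    return WordRef(text_id, int(image), sides.get(side, side), int(counter) // 10)

print(parse_word_id("A68218-015-a-1630"))
```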
Edge cases of text and image alignment
Handsome, legible, and easily manipulated images make the task of textual correction more pleasant and more efficient, especially if you work with a large monitor. But in some cases and even with the best monitor and the highest quality image it is still very difficult to move from a line of transcription to its place on the image and back. Double-column folios with densely printed text are one example. If column breaks had been encoded in Holinshed’s Chronicles or Foxe’s Book of Martyrs it would be possible to align columns of texts with image columns. But they weren’t, and with books of this type the visual alignment of text and image remains a difficult and slow task.
An even harder problem is posed by tables and indexes at the end of very long books. Texts of this kind are very difficult to encode in the first place, and their alignment with images is intrinsically problematical. For instance, The auncient ecclesiasticall histories of the first six hundred yeares after Christ, a 1577 translation of Eusebius and other church historians, contains in its last sixty pages a multicolumn parallel chronology of six centuries of church history in different parts of the Mediterranean. Here you have a text that challenged the display potential of the printed page to begin with. Co-ordinating the layout of the digital encoding with the printed layout is beyond the powers of a template-based display system that does at best rough justice to the typographical details of particular pages.
Such cases are not common, but where the printed page has more than one column, the alignment of the facsimile image with the display of the digital transcription becomes a tricky business very quickly.
Machine-based correction of textual defects
The nature of “known unknowns” in the TCP texts makes it possible to generate machine-based corrections. These still need to be checked by human readers, but it takes much less time to approve such a correction than make and review a manual correction. In the screen display of TCP texts scribal markings of missing letters typically appear as black dots (● or Unicode \u25cf): ‘th●s’. If the marking is accurate (as it is about 90% of the time), a spelling like ‘th●s’ is equivalent to the “regular expression” ‘th.s’, where the dot is a wildcard character matching any alphanumerical character. Larry Wang and Sangrin Lee, two students of Doug Downey in Northwestern’s Computer Science Department, used a “long short-term memory” (LSTM) neural network algorithm to generate context-aware replacements for defective words. This is a computationally expensive operation, and it took a machine several days to work its way through some 50,000 texts. The algorithm cannot guess whole words, and it is useless with numbers, but it does an excellent job with word-medial gaps. Word-initial gaps are harder because the algorithm cannot make reliable decisions about capitalization. Word-terminal gaps pose a different problem because the scribes quite often confused final characters and punctuation marks. That said, the algorithm produced usable results for 80% of “known unknowns”, and for those its success rate on average may be close to 90%.
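The regular-expression half of this idea is easy to illustrate in Python. The sketch below is not the LSTM algorithm and makes no use of context. It only lists the spellings in a small, invented lexicon that fit the pattern of a defective word, and choosing among them is the part that needs a context-aware model or a human reader.

```python
import re

GAP = "\u25cf"  # the black dot that marks one missing letter

def gap_pattern(defective):
    """Turn a spelling like 'th●s' into a compiled regular expression."""
    parts = ["." if ch == GAP else re.escape(ch) for ch in defective]
    return re.compile("".join(parts))

def candidates(defective, lexicon):
    """Return the lexicon entries that match the whole defective word."""
    pattern = gap_pattern(defective)
    return [w for w in lexicon if pattern.fullmatch(w)]

# Hypothetical mini-lexicon, for illustration only.
lexicon = ["this", "thus", "thys", "these", "that"]
print(candidates("th●s", lexicon))
```

Even this toy case returns three plausible spellings, which is why the pattern alone cannot settle the matter.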
In the current EarlyPrint environment the suggestions of the LSTM algorithm appear in the texts in a dark orange as ‘autocorrections’. Simon Kellwayes’ 1593 Defensatiue against the plague offers a good example of the algorithm at its best. This text of not quite 30,000 words has 140 defective tokens. The defect rate of 47 per 10,000 puts it in the upper half of the worst quartile– a ‘D’ text on an academic grading scale. The LSTM algorithm offered suggestions for most defects. 17 of them needed correction. It was not quite an hour’s work ‘turning’ the pages, and a good deal of that time was spent waiting for the often balky IIIF server at the Internet Archive to load the matching page image.
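The arithmetic behind such figures is simple enough to check in Python. The word count of the Kellwaye text is rounded up to 30,000 here, which is why the result comes out slightly below the 47 quoted above. The corpus-wide figures are the rough counts given earlier in this report.

```python
def defects_per_10k(defective_tokens, total_words):
    """Defective tokens per 10,000 words of running text."""
    return defective_tokens / total_words * 10_000

# Kellwaye: 140 defective tokens in "not quite 30,000" words (rounded up).
print(round(defects_per_10k(140, 30_000), 1))

# Whole corpus: 8.4 million defective tokens in 1.6 billion words.
print(round(defects_per_10k(8_400_000, 1_600_000_000), 1))
print(round(1_600_000_000 / 8_400_000))  # words per defect
```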
Curation and the division of labour
“Divide and conquer” applies in two ways to the task of collaborative curation. Individuals can choose this or that text to work on, but they may also prefer to focus on particular types of correction. So far in this report I have focused on the correction of single words as the most common type of textual defects. But there are also missing lines, paragraphs, and entire pages. We have not yet developed a robust workflow for the transcription and review of missing text chunks or pages. I am aware of about 11,000 missing pages in 2,200 out of 53,000 EEBO-TCP texts. Roughly 4% of the texts are missing one or more pages, and roughly the same number are missing a total of ~21,000 lines or paragraphs. Individuals who are comfortable with fixing individual words may not trust themselves with transcribing whole pages, while others may find that an interesting challenge. Elisabeth Chaghafi at the University of Tübingen restored hundreds of missing paragraphs in Gerard’s Herbal, a famous text whose transcription was badly botched. It was apparently a close call to accept the transcription. The reviewers’ hope that somebody might some day fix it was rewarded. (The restored paragraphs have not yet been added to the text, but will be added shortly.)
There are many tasks ranging from the very simple to the quite complex. All of them are worth doing, and each one that is done improves a text in some fashion. In a print world it does not usually make sense to publish incomplete work. In a digital project like this one users will work with a corpus whose texts are at different stages of completion. That is inevitable and OK as long as one is explicit about what has and has not been done. It may be useful to think of the corpus as a very large ship and of Early Modern Studies as a very long voyage in the course of which many repairs are made. But there is never a dry dock.
Degrees of certification
The US Department of Agriculture has a system of grading beef that ranges from ‘Prime’ to ‘Utility’ (also known as ‘pink slime’). A similar grading scheme would be helpful for the EarlyPrint corpus. It defines standards and measures the degree of compliance with them. The TCP had quality assurance standards. The vendors’ transcriptions were partially proofread and had to pass the standard of no more than 1 error per 20,000 keystrokes. It would have been better for the reputation of the project if the texts had come with graded warnings about the number of lacunae, expressed as percentile rankings. That would have drawn attention to the fact that the number of good or not so bad apples greatly exceeds the number of bad ones.
The current EarlyPrint texts use the academic A-F scale to measure the extent to which individual texts fall short of the goal of zero “known unknowns.” An obvious benefit of that scale is its demonstration of how many texts come reasonably close. It may be helpful to add more stringent or explicit measures. It is out of scope for the project to produce documentary or critical editions of any text, although individuals may use a particular text as the point of departure for a new project. What is the highest ambition for any text in the EarlyPrint corpus as part of that corpus? Here is a hierarchy of certifying statements that users would find helpful if they were made by contributors whom they have reason to trust:
- I have corrected all known defects. I have also proofread the entire text and have explicitly corrected obvious printer’s or scribe’s errors (USDA Prime)
- I have corrected all known defects. I have proofread every tenth page, have explicitly corrected obvious printer’s or scribe’s errors, and counted them (USDA Choice)
- I have corrected all known defects (USDA Select)
The third of these statements defines a plateau that all texts in the corpus should aspire to. The second and first are useful ascents beyond the original goal, but they are still preparatory to projects that are properly termed scholarly editions.
The social and technical challenges of collaborative curation
Some years ago a colleague of mine in Computer Science observed that “directing the crowd” was a key ingredient of successful crowdsourcing. “Crowdsourcing” as a term seems to have gone out of the language. Nobody wants to be part of a crowd. Everybody (except for Groucho Marx) wants to be a member of a community. But “directing the crowd” is a useful term, especially if you don’t think of it as a way of telling people what to do, but as a way of removing obstacles and encouraging groups to form around shared interests.
Defects in the TCP texts will either be fixed incrementally and iteratively by users over time, or they will not be fixed at all. There are no EEBO Heinzelmaenchen, the gnomes of German folklore who come at night and do all the work that needs to be done, preferably by someone else.
Siefring and Meyer in their survey say that when asked whether they reported errors, 20% of the users said “yes” and 55% said “no.” But 73% said that they would report errors if there were an appropriate mechanism and only 6% said they would not. There is a big difference between what people say and what they do. But given a user-friendly environment for collaboration and mindful of the 80:20 rule, it may be that a fifth of users could be recruited into a multi-year campaign for a rough clean-up of the corpus so that most of its texts would be good enough for most scholarly purposes.
It is a social rather than technical challenge to get to a point where early modernists think of the TCP as something that they own and need to take care of themselves. Greg Crane at a conference some years ago argued that “Digital editing lowers barriers to entry and requires a more democratized and participatory intellectual culture.” In the context of the much more specialized community of Greek papyrologists, Joshua Sosin has successfully called for “increased vesting of data control in the user community.” The engineer John Kittle helped improve the Google map for his hometown of Decatur, Georgia, and was reported in the New York Times (16 November 2009) as saying:
Seeing an error on a map is the kind of thing that gnaws at me. By being able to fix it, I feel like the world is a better place in a very small but measurable way.
Compare this with the printer’s plea in the errata section of Harding’s Sicily and Naples, a mid-seventeenth century play:
Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.
Errors are typically discovered in the process of working with data. The fixing of most defects in the TCP corpus is a matter of seconds or minutes. The likelihood of correction is a function of the ease and speed with which it can be reported. There are two scenarios that follow from a particular instance of curation en passant. In the first you want to return to what you were doing with minimal interruption. In the second, you’re like John Kittle. Something “gnaws at you”. You suspect that there are more errors of this kind and want to follow up on them, whether in the current text or in other texts that are related in some way.
The Annotation Module supports the first scenario well. It does less well on the second, largely because it is still quite slow in moving from one page to the next. The longer the text, the slower the page turning. It is also and somewhat inexplicably the case that page turning slows down as you move through a long text. It may well be the case that in the aggregate the digital pages don’t turn more slowly than printed pages, and that chasing and fixing defects is still faster with this digital tool than in a paper-based environment. But users expect snappy performance. They are impatient and more attentive to the time cost of a particular move than to the time cost of an entire operation.
For users who want to fix all defects in a text or all defects of a certain type in a set of texts, the solution may be a step back to the future– an environment like AnnoLex in which the user sees rows in a dataframe, where defective tokens are shown in a keyword-in-context environment and each row contains a hyperlink to the page image. In a more sophisticated version the link would take you to the particular region of the image where the word in question is found. The ID structure of the EarlyPrint texts in principle has sufficiently precise information to target regions of an image.
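A keyword-in-context table of this kind is not hard to sketch in Python. The example is a toy under stated assumptions: the token IDs, the sample words, and the image URL are invented, and a real implementation would draw its rows from the XML database rather than from a list in memory.

```python
GAP = "\u25cf"  # the black dot that marks one missing letter

def kwic_rows(tokens, image_url, width=3):
    """Return one row per defective token, with context and an image link."""
    rows = []
    for i, (wid, text) in enumerate(tokens):
        if GAP in text:
            rows.append({
                "id": wid,
                "left": " ".join(t for _, t in tokens[max(0, i - width):i]),
                "keyword": text,
                "right": " ".join(t for _, t in tokens[i + 1:i + 1 + width]),
                "image": image_url,
            })
    return rows

# Hypothetical tokens and URL, for illustration only.
tokens = [
    ("A00000-001-a-0010", "I"),
    ("A00000-001-a-0020", "haue"),
    ("A00000-001-a-0030", "seene"),
    ("A00000-001-a-0040", "th●s"),
    ("A00000-001-a-0050", "booke"),
]
for row in kwic_rows(tokens, "https://example.org/images/1"):
    print(row["left"], "[" + row["keyword"] + "]", row["right"], row["image"])
```

Each row carries the word ID, so a correction entered against a row can be routed back to the same token that the page view displays.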
It is an open and practical question whether such dataframes can be assembled on the fly or rely on precomputed data. From the end user’s perspective, however, it should be easy and quick to switch between a dataframe and a page view of the text.
Birds of a feather
There are lone-wolf curators, but most people like company and are more likely to stay engaged if they can see their work as part of a collective effort, however loosely organized, that makes a difference to a limited corpus. Freedom of the press, witchcraft, gardening, military history are some of the topics around which groups can form, but there is a prior need to identify clusters of text that meet group interests. Think of it as the social version of Ranganathan’s second and third laws of library science: “Every reader their book” and “Every book its reader.”
At Northwestern and Washington University summer research grants for small groups of students have been a very effective model for collaborative curation. You have their undivided attention for six or eight weeks during which they can do good work that is useful to others and from which they learn a lot. It makes sense to give such grants to students at the end of their freshman or sophomore years because it may prepare them for more independent work later. Grants of this kind are an excellent investment in the human capital of the next generation of Early Modern scholars.
Classrooms are a very formal setting that in principle offer many opportunities for group work, but in practice there are many obstacles. I am confident that many teachers of Early Modern history, literature, or religion would like to integrate some hands-on document oriented work into their syllabus. There is no shortage of broadsides, ballads, pamphlets, sermons, and other texts whose size fits well into the scope of class assignments. There are plausible scenarios where students work on different parts of a long text or work on different texts and establish their relationships. The Annotation Module would support textual curation as well as free-form annotation as long as it targets individual words. But in its current form it lacks group privileges that limit a project to the group and within the group support annotations that are private, shared with the group, or shared with the group leader.
Beyond the Academy
Greg Crane’s vision of a “more democratized and participatory intellectual culture” should take us well beyond the Academy. There is a large pool of highly educated retirees who would find various forms of “textkeeping” a rewarding activity. For some of them it would be a return to intellectual interests that they had as undergraduates. For others it would be the pursuit in an “old book” environment of issues that concerned them in their professional lives.
Is there a place for collaborative curation of Early Modern books at the other end of the age scale, in the junior and senior years of High School? If you look at excellent work done by College freshmen (and I have seen much of it in my teaching), it is hard for me not to believe that the same students a year earlier would do good editorial work with old books and learn much from it. My career goals were clear to me well before I graduated from high school, and I am far from alone in this. The Germans have a proverb that says “Früh krümmt sich was ein Häkchen werden will” (the future hook bends early), and in Heywood’s collection of proverbs you find “It pryckth betimes that shalbe a sharpe thorne.”