“Revolutionizing Early Modern Studies?” was the question that governed the recent EEBO-TCP 2012 conference sponsored by the Bodleian Library. I gave a talk there about “Towards a Book of English: A linguistically annotated corpus of the EEBO-TCP texts.” In another blog I will write about the ways in which this project will keep Phil Burns and me busy for much of this academic year. Here I write about the conference as a whole and look at the ways in which it takes the pulse of Early Modern Studies with EEBO-TCP. I will also reflect on the directions the TCP could and should take in the coming decade.

My reflections start from the premiss that the TCP is a magnificent but flawed project. Fortunately, the fixing of those flaws is well within the competence of the scholarly data communities that constitute the TCP user base. Much of this blog is about the ways in which Early Modern data communities can and should take charge of “their” data and over time transform the EEBO-TCP texts into a corpus whose standards of accuracy and completeness meet the traditional standards of well-edited printed texts.

Working with digital data is a Janus-faced enterprise. You discover that some things need to be done “to” the data before you can do anything “with” them. “Curation” is a general term for doing things to data. It includes but goes well beyond traditional activities lumped under “editing.” “Exploration” is a general term for the analysis and interpretation of data. Curation and exploration are two sides of the coin of working with digital data. Curation is the servant of exploration, and in some ideal world curation (preferably done by somebody else) precedes exploration. But the real world of mass-digitized text archives is messy. It is a little like repairing a ship at sea. No dry dock, few mariners, and the passengers have to help with the repairs if the voyage is to continue. We need a social and technical space of collaborative curation, where individual texts live as curatable objects continually subject to correction, refinement, or enrichment by many hands, and coexisting at different levels of (im)perfection. Call it scholarly crowdsourcing in an environment with a commitment to two of John Heywood’s English proverbs: “Many hands make light work,” and “Rome was not built in a day.”

I have focused on those conference papers that helped me develop the major point about curation and exploration as the Siamese twins of working with digital data. My apologies to the authors whose papers I ignored. It is not a judgment about their interest or quality. In any event, there are helpful abstracts of all the papers and poster sessions at the conference site.

Exploring billions of words

The stated ambition of the EEBO-TCP corpus is to create a TEI-XML transcription of at least one version of every ‘book’ or ‘title’ that was published between 1473 and 1700 and has survived. These two words have a way of blurring the more closely you look at them. But from a distance they are not meaningless, and you can base rough counts on them. The TCP count is 70,000 or thereabouts. About 40,000 exist now and, according to Andrew Hardie at Lancaster, add up to about a billion words. A lot of words from one perspective: more by three orders of magnitude than the 800,000 +/- 50,000 words that make up the Bible or Shakespeare.

Not a whole lot of words from another perspective: Greg Crane wrote somewhere that in the decade after the Civil War American newspapers printed about two billion words each year. And barely a rounding error in the Hathi Trust Digital Library with its 3.6 billion pages that add up to at least a trillion words.

Three years from now, there will be complete coverage of Early Modern English books from the admittedly crude perspective of “every book at least once.” Moreover, a good third of those books will pass into the public domain in 2015, and the other two thirds will follow them over a five-year period. Eight years from now, anybody can get at any of these books from anywhere and at any time, subject only to access to the Internet. You can focus on the billions of people who still will have no access to the Internet or for whom access to Early Modern texts will never be high on their priority list. But it may be more productive to focus on the individuals all across the globe (counted in the tens or low hundreds of thousands) for whom some encounter with EEBO-TCP transcriptions changes the ways in which at some point in their lives they do research, teach, or pursue hobbies.

In thinking about the difference that EEBO-TCP makes to Early Modern studies, it is important to think about the scale of that discipline at different times. How many (or few) people were engaged in it when A. W. Pollard joined the British Museum in 1883 and began the bibliographical work that remains the foundation of both EEBO and TCP? Through most of the 20th century geography was the major barrier to access. Microfilm lowered that barrier, but it was never a medium associated with ease. Compared with the access conditions to Early Modern texts that prevailed into the late twentieth century, EEBO-TCP provides global access to these texts for scholarly, professional, and lay communities via the same tools they use for other data. That is a very big difference. Will it “revolutionize” Early Modern Studies?

If by “revolutionize” you mean a decisive change of direction in either the object or method of inquiry, the answer is “no.” If you think of it as the cumulative effect of a lot of little changes over time, the answer may be “yes.”  Here is Douglas Engelbart, the inventor of the computer mouse and one of the giants of early interface design. In an essay about “Augmenting Human Intellect” he said:

You’re probably waiting for something impressive. What I’m trying to prime you for, though, is the realization that the impressive new tricks all are based upon lots of changes in the little things you do. This computerized system is used over and over again to help me do little things – where my methods and ways of handling little things are changed until, lo, they’ve added up and suddenly I can do impressive new things. (http://www.bootstrap.org/augdocs/friedewald030402/augmentinghumanintellect/ahi62index.html)

The quotation came to my mind several times in listening to the papers at the conference. Whatever else EEBO-TCP may do, it certainly lowers the time cost of some steps in a project, and by doing so it may change the calculus of the possible by redefining what is quite literally “worthwhile.” A group of papers fell under the broad headings of intellectual history or Begriffsgeschichte. The key preparatory exercise in all such inquiries is the often tedious gathering of words or phrases and their organization by time, place, or genre. It is the critical step for papers with titles like “The semantics of liberty in Early Modern English” (Stephen Pumfrey), “Re-marking Revenge” (Alison Findlay and Liz Oakley-Brown), or “The emergence of ‘new philosophy’ in the discourses of seventeenth century philosophy” (Jacob Halford).

Somewhat related to such inquiries are studies in reception history, such as “Snapshots of Early Modern English Responses to French Poets” (Peter Auger), which traced references to Du Bartas in the TCP corpus, or “Reading Demonology in early modern England” (Simon Davies), a hunt for witches in books with titles where you would not expect them. Methodologically, these papers all rely on the fact that the TCP texts are mediated through search engines that create elaborate indexes. When you look for a word, you are looking for a word in an index that within seconds can retrieve hits from a billion words or more. John Lavagnino in his keynote address cited a lament from Keith Thomas, a grandmaster in the art of finding pithy quotations through an art of studied serendipity and weaving them into graceful historical narratives. The finding phase, Thomas laments, can now be done by a “moderately diligent” undergraduate in the course of a morning.

It is almost certainly the case that the old-fashioned way of finding things “by hand” made it more tempting to look around as you gathered stuff and to contextualize it as you moved along. But why measure the value of a treasure by the time cost of finding it? There may be a temptation for a careless researcher to limit “context” to the five or so words that appear before and after a “hit” in a KWIC display. But that was not the way of the young scholars at this conference. They worked on the assumption that if the machine helps you to find more stuff more quickly and across a wider swath of texts, you can spend more of your time on properly and slowly contextualizing the results.
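The mechanics behind those elaborate indexes and KWIC displays are worth making concrete. Here is a toy sketch of my own in Python: an inverted index built once, with a keyword-in-context look-up on top of it. The real EEBO-TCP search engines are of course vastly more elaborate; this only illustrates the principle.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to (doc_id, position) pairs -- built once, queried many times."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

def kwic(docs, index, term, width=5):
    """Keyword-in-context: `width` words on either side of each hit."""
    results = []
    for doc_id, pos in index.get(term.lower(), []):
        words = docs[doc_id].lower().split()
        left = " ".join(words[max(0, pos - width):pos])
        right = " ".join(words[pos + 1:pos + 1 + width])
        results.append((doc_id, left, term, right))
    return results
```

The asymmetry is the whole point: the index is built once, and every subsequent look-up across a billion words is nearly instantaneous.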

There were differences in the ways of looking for words. For some of the presenters, the look-up on the computer was not different from the look-up in a book. For Peter Auger, the inquiry into the reception of Du Bartas yielded hits in the low dozens that he could not have found by thumbing through printed books. For Matthew Steggle, the echoes of a lost play by Thomas Dekker turn on a handful of passages that would have been equally hard to find manually.

In some other papers, the look-up techniques were more complex and incorporated practices long familiar from corpus linguistics. At Lancaster there has been a lot of interesting work by the CREME group (Centre for Research in Early Modern English), which sees its mission as bridging the gap between humanities disciplines and the NLP technologies long familiar to linguists. CREME gave a set of interlocking presentations about their approaches. The historian Stephen Pumfrey and the literary scholars Alison Findlay and Liz Oakley-Brown talked about the words and concepts of ‘liberty’ and ‘revenge’ in Early Modern texts. The computational linguists Alistair Baron, Andrew Hardie, and Paul Rayson described the pre- and post-processing of data as well as the software tools that are required for NLP procedures. In this environment the power of the simple look-up is enhanced by steps that precede and follow it. CREME has an elegant program called VARD, for “variant detector.” VARD can map a high percentage of heterogeneous spellings to standard forms. But not all: does the string ‘wee doe’ refer to a very small female deer or is it just ‘we do’ with e’s stuck on at the end? Humans find it easy to disambiguate this problem, but machines struggle with it. (A presentation by Marie-Hélène Lay talked about a French version of such a program used by Les Bibliothèques Virtuelles Humanistes (BVH) at Tours.)
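To make the normalization idea concrete, here is a deliberately naive sketch: a hand-made known-variant list backed by fuzzy matching against a modern wordlist. VARD’s actual methods are considerably more sophisticated, and the data here is invented for illustration. Note, too, that a context-free approach like this one cannot settle the ‘wee doe’ problem, which is precisely why humans remain in the loop.

```python
from difflib import get_close_matches

# Toy resources; VARD's real variant lists and lexicons are far larger.
KNOWN_VARIANTS = {"wee": "we", "doe": "do", "loue": "love", "vpon": "upon"}
MODERN_WORDS = {"we", "do", "love", "upon", "liberty", "revenge", "the"}

def normalize(token):
    """Map an Early Modern spelling to a candidate modern form."""
    t = token.lower()
    if t in MODERN_WORDS:
        return t                      # already a standard form
    if t in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[t]      # a recorded variant
    # fall back on fuzzy matching against the modern wordlist
    matches = get_close_matches(t, MODERN_WORDS, n=1, cutoff=0.75)
    return matches[0] if matches else t
```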

At the other end of the processing chain, large result sets can be automatically grouped and sorted. There are more than 150,000 occurrences of ‘liberty’ in the EEBO corpus. You can slice the corpus by various criteria, establish the relative frequencies of terms, and use simple but striking visualizations to compare apples with apples, and sometimes with oranges. The CREME team did a fine job of illustrating the potential of NLP techniques for the analysis of historical or literary data. Heather Froehlich (Strathclyde) reported on the pedagogical uses of similar techniques developed at Carnegie Mellon, the University of Wisconsin, and Strathclyde University by David Kaufer, Michael Witmore, and Jonathan Hope. The basic tool is Docuscope, a very large dictionary of short phrases or grammatical patterns that are mapped to a taxonomy of about 100 micro-rhetorical acts. The distribution of these acts across sections of a text or different texts becomes the basis for comparative analysis. The results of such analysis can then be represented in a heat map that is like a mileage chart tabulating the distance between cities. Shades of color are used to express the distance between different texts. It is a quite effective tool for expressing gross properties of a corpus.
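The mileage-chart idea is easy to sketch. The toy code below compares texts by raw word frequencies rather than Docuscope’s rhetorical categories (an illustrative simplification of mine, not the real tool), but the mechanics of the distance table feeding the heat map are the same.

```python
from collections import Counter
from math import sqrt

def profile(text):
    """Relative word frequencies -- a stand-in for Docuscope's category counts."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0) - q.get(w, 0)) ** 2 for w in words))

def distance_table(texts):
    """The mileage chart: pairwise distances between named texts."""
    profiles = {name: profile(t) for name, t in texts.items()}
    names = list(texts)
    return {(a, b): distance(profiles[a], profiles[b]) for a in names for b in names}
```

Color each cell by its value and you have the heat map: identical texts sit at distance zero, and the gross resemblances and differences within a corpus become visible at a glance.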

On the plane home I read Siddhartha Mukherjee’s The Emperor of All Maladies: A Biography of Cancer and came across his trenchant description of the young Rudolf Virchow, who “perplexed by what he couldn’t see, …turned with revolutionary zeal to what he could see: cells under the microscope.” He created a “cellular theory” of biology, based on

two fundamental tenets. First, that human bodies … were made up of cells. Second, that cells only arose from other cells – omnis cellula e cellula, as he put it. (p. 15)

A generation later (1879) in Prague Walther Flemming

stained dividing salamander cells with aniline, the all-purpose chemical dye used by Paul Ehrlich. The stain highlighted a blue, threadlike substance located deep within the cell’s nucleus that condensed and brightened to a cerulean shade just before cell division. Flemming called his blue-stained structures chromosomes – “colored bodies” (p. 340)

In the past thirty years, Mukherjee goes on to argue, our understanding of how cancer works has largely come from teasing out the details of molecular operations within a cell. Mutatis mutandis, are there similar opportunities in the microscopic analysis of detail across large textual corpora? For Virchow and Flemming, things became visible through magnification and staining, a torture of sorts. Goethe’s Faust eloquently exclaims

Geheimnisvoll am lichten Tag
Lässt sich Natur des Schleiers nicht berauben,
Und was sie deinem Geist nicht offenbaren mag,
Das zwingst du ihr nicht ab mit Hebeln und mit Schrauben.

Mysterious in the light of day
Nature cannot be robbed of her veil,
And what she will not reveal to your spirit
You cannot force from her with levers or screws
(My literal translation)

True at some level and untrue at another. It would be naive to argue that emergent phenomena, such as complex texts, can be fully accounted for by an analysis of their parts. On the other hand, the insights gained from such an analysis are not negligible. There is very often an interesting path from the observation of micro-phenomena to the understanding of larger structures. The tools of corpus linguistics bear some resemblance to the slides, stains, and microscopes of the histologist. Michael Witmore has written in praise of “prosthetic reading.” A nice phrase, especially if we remember that all reading is prosthetic, that books are prostheses for our faulty memory, and that the Biblical concordance invented by medieval monks (and brilliantly described by Andrew Prescott in a recent blog) is a second-order prosthesis to help with the reading of prosthetic books. Think of the current NLP tools as early third-generation prostheses, not much subtler than Virchow’s microscope, Flemming’s stained slides, or the gas burner named after the chemist Robert Bunsen. Over time our encounters with texts are likely to be more heavily mediated, but as time goes on those encounters will become second nature, as non-prosthetic as cuddling up with a book on a sofa may feel to us now.

If stained slides and Bunsen burners are too much science for humanities research, think about ladders. After coming home, I helped my wife paint our garage and was reminded once more that house painting is largely an art of ladders. Computers can be very helpful ladders, letting you get at the critical spots more quickly and with less trouble.


So much for the conference’s concern with the analysis or exploration of textual data. Projects focusing on editing or otherwise curating data were aptly summarized by the title of a poster session by James Cummings, “Re-use, enhancement, and exploitation: An investigation of projects using EEBO-TCP materials.” It helps to remember what the TCP texts are not. They are not diplomatic transcriptions of the kind that capture typographical detail or pay much attention to what history of the book scholars call “the materiality of the text.” They are best seen as efforts to capture the sequence of words as they were spelled and to express major structural divisions (chapters, headings, tables, lists, paragraphs, lines of verse, etc.) through the conventions of the Text Encoding Initiative (TEI). To use Nelson Goodman’s distinction between ‘allographic’ and ‘autographic’ works, the TCP treats texts as fundamentally allographic objects whose typographic representation is largely irrelevant to the act of making sense of them.

Nothing prevents scholars from adding value to the TCP texts by upcoding them in various ways. The sparse but consistent articulation of the texts into TEI-XML elements provides a sturdy framework for such efforts. You can add typographical detail by consulting the EEBO page images, or better, the original pages from which those images were derived. In most cases this will be a lot cheaper than doing a new transcription from scratch. Michelle O’Callaghan and Alice Eardley (Reading) express this approach very clearly in their project for Verse Miscellanies Online.  They argue that the TCP XML-TEI transcriptions “have the potential to change EEBO from a relatively closed archive to an open resource,” and they see their project as one example of “recent digitizing projects” that “have begun to use these TCP files as the basis for digital editions and add levels of tagging and new user functionalities.”

A very ambitious example of this approach is the Hakluyt project at the National University of Ireland in Galway. It will use the TCP transcription of Hakluyt’s The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation as a raw source to be “corrected against the PDF images of the Huntington Library copy,” but will also include supplementary source materials ignored by the TCP. The outcome of this project will be a 14-volume critical edition published by Oxford University Press.

A similar approach governs The Folger Digital Folio of Renaissance Drama for the 21st Century (F21), which will engage the interests and talents of ambitious undergraduates to create digital editions of some 500 plays written or performed between 1576 and 1642.  While much of that project will focus on maximizing interoperability of a kind that supports corpus-wide algorithmic inquiries, it will certainly include quite a few individual projects that aim at reviving in a digital environment the materiality of the printed book in which a given play was first embodied.

Elizabeth Scott-Baumann (Leicester) and Ben Burton (Oxford) have used prosodic upcoding to create a “database of poetic forms.” Because of the TEI encoding of the base texts, this project starts with excellent field position (somewhere in midfield rather than on its own goal line): TCP texts mark lines of verse very accurately and do a pretty good job of identifying stanzas and similar groupings through the use of the ‘line group’ or <lg> tag. An investigation into rhyme schemes is central to this project. The existing encoding goes quite a ways in showing investigators where to look for stuff.
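Because the lines and line groups are already tagged, even a naive script can propose candidate rhyme schemes for human review. Here is a toy sketch of my own, not the project’s actual method: it labels lines by their spelled endings, so it is only a rough first pass, since real rhyme depends on pronunciation, which Early Modern spelling tracks poorly.

```python
import xml.etree.ElementTree as ET

def rhyme_scheme(lg_xml, suffix_len=2):
    """Assign rhyme letters to the lines of a TEI <lg> by their final-word
    spellings. Crude: spelling is only a proxy for sound."""
    lg = ET.fromstring(lg_xml)
    labels, scheme = {}, []
    for line in lg.findall("l"):
        ending = "".join(line.itertext()).strip().split()[-1].lower()[-suffix_len:]
        if ending not in labels:
            labels[ending] = chr(ord("a") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)
```

A human still has to check the proposals, but the machine narrows the field, which is exactly the division of labor the TEI markup makes possible.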

A session on EEBO TCP in the classroom was particularly lively. In addition to the presentation by Heather Froehlich, already mentioned in the session on analytical tools, two papers by Mark Hutchings (Reading) and Leah Knight (Brock University) demonstrated different strategies of leading students of the “Norton generation” towards a more direct encounter with the particulars of Early Modern culture. For Hutchings this takes the form of a final-year seminar in which students choose a short text from EEBO and “produce a critical edition that conforms to modern editorial conventions.” Leah Knight asks her students to pick a text from a particular year and within the constraints of a semester become experts about that year and contextualize their text within it. The student as cook rather than client in a restaurant with a fixed menu.

Fixing and improving the TCP texts through collaborative curation

“Transcribed by hand. Owned by Libraries. Made for everyone.” This is a very nice tag line if you want to think of the TCP texts as a present to “the great variety of readers, from the most able to him that can but spell.” But the tag line may suggest that once the TCP archive has been “made” the work is done, and everybody can just do “their own” work with it. Many casual users seem to think of the project in those terms.  When they complain about quality problems (as they often do), it does not occur to them that they could help fix them. To be fair to them, the TCP has not invited them to help with the clean-up of data. It is possible to point out errors, and having done so myself, I can attest that corrections are accepted gratefully and incorporated into the texts. But there is not at the moment a system for soliciting, receiving, and reviewing user contributions that over time would make the TCP a more complete and accurate representation of its source texts.

What might such a system look like? As I mentioned earlier, the Hathi Trust is bigger by three orders of magnitude than the TCP, whose holdings are a rounding error in any Hathi Trust statistics. But what if you imagined the TCP as a “special collection” inside Hathi, rather in the way in which rare book collections are inside large research libraries, with special attention paid to the nature of the holdings and the types of research they enable? Neil Fraistat and Doug Reside once coined the acronym CRIPT for “curated repository of important texts.” Imagine TCP as an inner ring of Hathi Trust and the foundation for what I have called a Book of English and defined as

  • a large, growing, collaboratively curated, and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation.

Users benefit if all books in libraries are subject to the “perpetual stewardship” that Paul Courant and Penelope Kaiserlian talked about at the Shape of Things To Come conference at Virginia in 2010 (see also my blog on The Great Digital Migration). Kaiserlian spoke with particular eloquence about valuable primary data that we “cherish and preserve.” The roughly half million pre-1800 imprints in the English Short Title Catalogue fall almost by definition in that category, if only because of their relative rarity. Within that group the 70,000 unique titles before 1700 have a particularly important claim on our attention. Scholarly users are deluded if they think that these texts will take care of themselves or that somebody else will do it in just the ways they would like to see it done.

Users will need to be recruited much more actively into the many different tasks of perpetual stewardship. Instead of users working with data over which they have no control and that are cared for by somebody else, we need scholarly data communities that take charge of “their” data. Models of that kind are common in the life sciences. In the humanities, the community of ancient papyrologists has articulated the goals very clearly. Here is Roger Bagnall, now the director of the Institute for the Study of the Ancient World at NYU, about two big changes that technology brought to the ‘visions and goals’ of papyrologists:

One is toward openness; the other is toward dynamism. These are linked. We no longer see IDP as representing at any given moment a synthesis of fixed data sources directed by a central management; rather, we see it as a constantly changing set of fully open data sources governed by the scholarly community and maintained by all active scholars who care to participate. One might go so far as to say that we see this nexus of papyrological resources as ceasing to be “projects” and turning instead into a community.
If Greek papyrology is too rarefied for your taste, consider crowdsourcing in the context of Apple’s new mapping service. A blog about its problems by an expert attracted a lot of attention. There is a useful summary in the Washington Post:

A lot of the data that Apple needs, Dobson says, must be supplied by humans. And part of that should be via crowd-sourcing techniques, he says. However, Apple doesn’t have a good source for this yet, says Dobson, who now runs a consulting service called Telemapics, also the name of his blog. For Apple to fix Maps, it will require a lot of hires, namely of experts in mapping, and a quality assurance (“QA/QC”) team experienced in spatial data, he said.


The problems with Apple’s maps are not so different from the current problems of the TCP texts. I yield to nobody in my admiration for this project, but, as I have said on various occasions, the texts in their current form are not good enough for many research purposes. There are a lot of little problems. With some texts, they are just minor irritants. With other texts, whole sections are disfigured to the point of being useless. Most damaging is the negative halo effect that spreads from error-ridden pages and undermines trust in the resource as a whole even where it is good enough.

The bad thing about the little problems is their ubiquity. The good thing is that the fixing of most of them does not require specialized knowledge but lies within the competence of careful and literate readers: high school students in AP classes, undergraduates, teachers, scholars, other professionals, amateurs from many walks of life, and (a particularly important group) educated retirees with time on their hands and a desire to do something useful.

Those are the potential crowdsourcers all over the globe, and there are many of them relative to the size of the problem. What about the ‘quality assurance (“QA/QC”) team’ with experience in the relevant data? For a decade, Michigan and Oxford have retained professionals with expertise in Early Modern Studies to review the transcribed texts as they come in from the vendors. Both in terms of relevant expertise and organizational structure, this set-up can easily be converted to a relatively small-scale operation designed to encourage and manage collaborative curation. The biggest challenge facing such a team is probably not the correction of corrections (although there will be some of that). The challenge will be to “direct the crowd,” as one of my colleagues put it. The manifest problems of Early Modern text corpora, whether OCR generated or manually transcribed, are rarely textual cruces or other instances of the “philologically exquisite,” a phrase I borrow from Michael Witmore. They are incomplete or garbled transcriptions, often the result of poor digital scans of badly microfilmed images of faint or blotchy printed pages. Think of them as the equivalents of bottles, cans, and other trash in national parks. It is not hard to pick up a bottle or fix an incomplete or incorrect spelling. But coordinating millions of such acts is not easy.

Nobody wants to fix a million errors, but many people would fix several hundred in books that mattered to them for whatever reason. Thus the first challenge is to break the global task of corpus-wide curation into many local tasks that provide immediate benefits to user contributors while making them feel good about contributing to a larger enterprise.

Secondly, you need to embed tools for data curation in a broader environment for working with texts. Fixing errors, which Housman compared to a dog hunting fleas, is an obsessive task for some. But for most contributors curation will be an intermittent activity. If you come across an error you will fix it then and there, if you can do it with minimal interruption of what you are “really” doing. A distributed curation environment needs to come as close as possible to “reading with a pencil,” a workflow in which the correction of an error, the highlighting of a passage, or a marginal comment are the work of the same hand or tool with seamless transitions from one task to another. To repeat an earlier point, curation and exploration are the Janus face of working with digital data.

Thirdly, scholarly users of a text want assurances about its quality. The United States Department of Agriculture lists eight grades of beef, from prime through choice to utility. In the world of print, the scholar’s and publisher’s name act as a kind of certificate stamped on each text. A collaboratively curated corpus of cultural heritage texts will benefit from a comparable but more dynamic system of certification that makes credible assertions about the current quality ranking of each text in a corpus by comparing a list of known or likely errors with a list of suggested or approved corrections.

The key to such a dynamic system of certification is a comprehensive curation log: a digital version of the Berichtigungsliste (correction list) that papyrologists have kept for almost a century. A user-contributed correction is an act that associates a particular passage in a text with a user with particular editorial privileges. In the two billion words of the final EEBO-TCP corpus, such a log may run into millions of entries. These entries result in fixed errors, but they also provide the basis for performance-based editorial privileges, and their analysis may help with the algorithmic identification and correction of further errors. From a proper curation log you can generate a text information package for every work (or even page of a work) and help users determine the degree to which they can trust the text. A ‘prime’ text should always be a text that has gone through the editing routines associated with a good diplomatic edition from a reputable publisher. Uncorrected OCR is ‘utility grade.’ It is easy to think of intermediate grades and the use of traffic signal colors to warn readers about likely errors or not yet approved corrections.
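What might a log entry and the grade derived from it look like? A minimal sketch, with invented field names and an arbitrary grading rule chosen purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    """One user-contributed correction: a passage, a change, a contributor."""
    text_id: str
    location: str        # e.g. page and word offset (hypothetical scheme)
    old: str
    new: str
    contributor: str
    approved: bool = False

@dataclass
class TextRecord:
    """A text plus its curation history, from which a grade can be computed."""
    text_id: str
    known_errors: int
    entries: list = field(default_factory=list)

    def grade(self):
        """Toy certification: the ratio of approved fixes to known errors."""
        fixed = sum(1 for e in self.entries if e.approved)
        if self.known_errors == 0 or fixed >= self.known_errors:
            return "prime"
        if fixed >= self.known_errors // 2:
            return "choice"
        return "utility"
```

The real system would need review workflows and richer provenance, but the essential shape is this: every correction is a logged, attributable event, and a text’s certification is a function of its log.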

Finally (and already implicit in the previous points), good curation workflows must build on more sophisticated interactions of humans and machines. The three R’s of animal experimentation (Replace, Reduce, Refine) are relevant here. In a system of entirely manual analysis and correction of error, the time cost of making a correction consists of the time it takes to make a judgment and the time it takes to find the passage, record the decision, and move on to the next case. The second will exceed the first by at least an order of magnitude, and several orders of magnitude if the human tries to create the full log entry that a machine can write out in a split second. It is a mistake to believe that you can create good enough transcriptions of Early Modern texts solely by algorithmic means. But you can perform massive cleanup operations on them by using machines to target human intervention much more accurately and effectively. That is the point of “directing the crowd.”
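Here is one small example of what directing the crowd might mean in code: a toy triage function that ranks pages by the density of suspect tokens, so that human attention goes first to the worst pages. Both signals used here are assumptions of the sketch: a placeholder character standing in for illegible letters, and a reference wordlist against which unknown spellings are flagged.

```python
def triage(pages, wordlist, placeholder="\u2022"):
    """Rank pages by density of suspect tokens. A token is suspect if it
    carries the illegibility placeholder or is absent from the wordlist."""
    ranked = []
    for page_id, text in pages.items():
        tokens = text.lower().split()
        suspects = [t for t in tokens
                    if placeholder in t or t.strip(".,;:!?") not in wordlist]
        if tokens:
            ranked.append((len(suspects) / len(tokens), page_id, suspects))
    return sorted(ranked, reverse=True)
```

The machine does the finding and the bookkeeping; the human does only the judging. That is the division of labor that shrinks the time cost of each correction toward the cost of the judgment alone.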

Undergraduates as curators

Of the various potential lay contributors to text curation undergraduates have a special role. There are many of them, they are young and flexible, we encounter them in institutional settings that make it relatively easy to organize and incentivize them (whether with money or credit), and there is the undeniable fact that in the social and natural sciences they have for generations played a significant role in maintaining labs and experiments, whether as curators or guinea pigs.

Over the past two years I have been in a lot of discussions, formal and informal, with students, colleagues, deans, and foundation officials about engaging humanities undergraduates in the scholarship of their teachers. There is widespread agreement that the humanities lag behind the social or natural sciences. The solitary habits of humanities scholars and the lack of lab environments are often cited as reasons for this lag. Discussions typically turn on finding projects and tasks that are appropriate contributions in terms of utility, pedagogical value, and giving students a sense of having a proper stake, however small, in the enterprise. Because of its collaborative potential, digitally based work is often seen as a promising avenue for meeting these goals.

Gregory Crane (Perseus Project) and Katherine Rowe (Bryn Mawr) have argued eloquently that undergraduates can contribute to the care and feeding of primary data and that such work is good for them as well. Crane’s model is the “citizen scholar” (he did write a book about Thucydides, after all). Rowe argues for the “amateur scholar.” Love and duty are powerful motivators, and it is not always easy to tell where one ends and the other begins.

Greg Crane’s poster child for undergraduate contributions to the curation of primary data is the Perseus treebank of Greek and Latin texts. Second-year students of Greek parse sentences from Aeschylus, Herodotus, etc., and these sentences contribute to a growing archive of Greek texts with complex linguistic annotation. This is an easy case: undergraduates have always parsed Greek and Latin sentences and will continue to do so as long as those languages are taught. Feeding the output of the students’ work into the Perseus treebank does not change what they are doing anyhow, but gives them the additional satisfaction of doing something useful (by the admittedly rarefied criteria of Greek scholars who love treebanks).

In the curation of TCP texts, the needed work consists first of all and for quite a while of fixing millions of incompletely or incorrectly transcribed words through some combination of algorithms and ‘humint’ work. There is a lot of menial work here, and the question arises whether such work teaches students anything they need to learn or whether this is a wasteful and exploitative use of their time. In “The Politics and Poetics of Transcription,” one of the subtlest papers at the conference, Giles Bergel was skeptical about reducing transcription to mere ‘data capture’ and argued that “transcription, far from being a mechanical or mundane activity, can be a demanding intellectual discipline.” That reminded me of a touching passage in Eduard Fraenkel’s preface to his famous edition of the Agamemnon. He talked about earlier generations of 19th-century scholars who on a once-in-a-lifetime trip to Italy would think about some way of “giving back” and would painstakingly transcribe some manuscript so that it would have a new lease on life.

Undergraduates who spend some hours or days checking a transcription against the digital page from which it is derived will probably conclude that they do not want to do this for the rest of their lives. But they will learn that it is quite difficult to copy a passage with unfamiliar orthographic or typographic habits, a point well made by Giles Bergel. They may also gain some respect and appreciation for the enormous and mostly invisible labour that makes their easy access to digital archives possible. Above all, they will learn that even for quite famous texts the ground of textual truth is rarely bedrock and quite often very thin ice. That is a very useful lesson to learn, and the reflections of bright undergraduates on this lesson are a joy to read.

The correction of the very many, very simple, and philologically very unexquisite errors is certainly the first requirement for raising the TCP texts to a quality level that one associates with respectable scholarly editions in a print world. But there are more complex problems where the intellectual payoff is more obvious. In the transcription of Ben Jonson’s Works, there are two dozen untranscribed passages (mostly Greek) marked as “foreign.” A Classics major who transcribes these passages and wonders why they are there and what they are doing has probably learned at least as much as from writing yet another paper. And s/he has done some good that can be acknowledged in the “credits” for the text.

At the other end of the scale, there are the late 17th-century Paraphrase and Annotations by H. Hammond on the New Testament and the Psalms, which contain respectively 23,858 and 17,347 untranscribed words or phrases, presumably in Greek or Hebrew. How much of the detail in this book is unintelligible without those missing words? But it would take dedicated members of a very special user community to check and transcribe 40,000 Greek or Hebrew passages by hand, even if most of them are quite short.

Some forms of curation go beyond correction or completion and involve data enrichment. The Folger F21 project envisages the transformation of Early Modern play transcriptions into a fully interoperable corpus. For this you would want consistent act and scene divisions of plays. You would also want interoperable cast lists, with characters coded by sex, age, and social status. A corpus-wide prosopography of dramatic characters over time would tell you much about changing tastes and habits. The print editions of many plays lack cast lists or act and scene divisions. Providing them in accordance with some corpus-wide rules is a task that requires discretion and cannot be done without a good reading knowledge of the play. Undergraduates who have completed this task will almost certainly have done more intellectual work than is invested in a standard paper, and they will have learned a lot about how a play works.
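To make the idea of an interoperable cast list concrete, here is a hypothetical sketch of what one entry might look like. The field names and codes are my own illustration, not the Folger F21 specification:

```python
# Hypothetical schema for an interoperable cast-list entry; the field
# names and coding values are illustrative only.
cast_entry = {
    "play": "A Chaste Maid in Cheapside",
    "character": "Moll Yellowhammer",
    "sex": "f",
    "age": "young",
    "status": "citizen",              # e.g. noble / gentry / citizen / servant
    "scenes": ["1.1", "3.2", "5.4"],  # normalized act.scene references
}

# A corpus-wide prosopography is then just a query over such records:
def characters(records, **criteria):
    """Return the entries matching all given field values."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

print(characters([cast_entry], sex="f", status="citizen"))
```

The value of the enrichment lies entirely in the corpus-wide consistency of the codes: once every play uses the same categories, questions about, say, the changing proportion of citizen characters over time become simple queries.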

There is little doubt in my mind that undergraduates can make very substantial contributions to the curation of the TCP texts. It is another question whether such work can or should be integrated into credit-bearing classroom work or whether it is better done through summer internships or similar arrangements. The tasks that need doing are not as easily integrated into traditional assignments as is the parsing of sentences in second-year Greek. Much depends on the speed and ease with which curatorial routines that make sense for the TCP can be integrated into the schedule and work flows of a course in which curation will be only a part of the total assignments. But these are practical questions.

Interfaces and work flows that have been tested by and work for undergraduates will almost certainly, and with very few changes, work for other groups of users.

Better tools for exploration

Curation was defined by Philip Lord as

The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and reuse. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and other published materials.

What about the discovery tools that help scholars analyze and interpret a well-curated corpus? Tools for the analysis of large literary text archives are by and large wretched. Imagine a sales representative who is asked by his boss to report on the last quarter and does so by plopping a bunch of invoices on her desk in no particular order. That is pretty much what happens in most text searches. It works well enough if there are only seven hits, but it does not scale beyond a dozen. The Google model improves matters by ranking the returns so that you can immediately focus on a top list. This works brilliantly in many cases, but it has two great weaknesses. First, you are never quite sure how the machine goes about its work. Secondly, and more importantly, there are many research situations where there is no top list but answers emerge from patiently working through the large data set returned by some query.

A TCP text is a digital surrogate with a query potential that in some ways exceeds the original. The speed with which you can look up something across many texts is the most obvious example of the superior query potential of the digital surrogate. But current search engines do not let users take advantage of some of the most powerful features of the TCP transcriptions. The currently available corpus of some 40,000 texts adds up to the largest, most comprehensive, and most valuable text archive encoded in TEI. On the other hand, there is no search engine that lets you “decode” or exploit the added value created by the TEI encoding. If you look up the word ‘king’ in Shakespeare’s Hamlet on the TCP search engine, the return cannot distinguish between occurrences of the word in a speech, a speaker label, a stage direction, a passage in prose or verse, even though those distinctions are very accurately encoded in the data. The PhiloLogic search engine has some element-aware features, but they are not easy to use. The CQPweb site at Lancaster strips out the XML tags. For most practical purposes, search engines treat the TCP texts as if they were plain text. TEI encoding greatly enhances the query potential of digital surrogates, but that query potential is mostly ignored.
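To illustrate what an element-aware search would buy you, here is a toy example run against a simplified TEI-like fragment. The real TCP markup is richer; the fragment and element names here are only illustrative:

```python
import xml.etree.ElementTree as ET

# A simplified TEI-like fragment; real TCP encoding is more detailed.
tei = """
<div type="scene">
  <stage>Enter the King and Polonius.</stage>
  <sp>
    <speaker>King</speaker>
    <p>And can you by no drift of circumstance</p>
  </sp>
  <sp>
    <speaker>Hamlet</speaker>
    <p>The king is a thing</p>
  </sp>
</div>
"""

root = ET.fromstring(tei)

def count_word(elem_tag, word):
    """Count occurrences of a word inside a given element type."""
    total = 0
    for el in root.iter(elem_tag):
        text = " ".join(el.itertext()).lower()
        total += text.split().count(word)
    return total

for tag in ("speaker", "stage", "sp"):
    print(tag, count_word(tag, "king"))
```

Even this crude sketch separates the king who speaks from the ‘king’ who is merely mentioned or who enters in a stage direction, which is exactly the distinction the current search engines throw away.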

In the Natural Language Processing (NLP) world there are many sophisticated techniques for preparing data, refining searches, and using statistical routines or visualizations to process and display the results. Using these techniques effectively requires more highly developed computational skills than most humanities scholars possess. The CREME group at Lancaster has taken some valiant steps towards lowering the entry barriers for ordinary humanists, but much work remains to be done.

Unfortunately, progress in this domain requires intensive work by highly skilled programmers, and progress is measured in months or years rather than days or weeks. It is very difficult to build software that scales up gracefully from simple and easy tasks to more complex operations. Here is a list of tasks that a good search engine should be able to perform quickly, across a corpus of billions of words, and with a gentle learning curve for non-technical users:

  1. Limit a search by bibliographical criteria
  2. Refine bibliographical criteria to articulate differences of text category and chronological or geographical properties associated with texts
  3. Limit a search to particular XML elements of the searched text(s)
  4. Simple and regular expression searches for the string values of words or phrases
  5. Retrieval of sentences
  6. Searches for part-of-speech (POS) tags or other positional attributes of a word location
  7. Searches that combine string values with POS tags or other positional attributes
  8. Define a search in terms of the frequency properties of the search term(s)
  9. Look for collocates of a word
  10. Identify unknown phrases shared by two or more works (sequence alignment)
  11. Compare frequencies of words (or other positional attributes) in two arbitrary sub-corpora by means of a log likelihood test
  12. Perform supervised and unsupervised forms of text classification on arbitrary subsets of a corpus
  13. Support flexible forms of text display ranging from single-line concordance output to sentences, lines of verse, paragraph-length context, and full text
  14. Support grouping and sorting of search results as well as their export to other software programs
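Some of these tasks are computationally trivial once the counts are in hand; the hard part is the interface that delivers the counts. Item 11, for instance, reduces to a few lines. Here is a sketch of Dunning’s log-likelihood (G2) statistic with toy counts:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood (G2) for a word occurring a times in a
    subcorpus of c tokens and b times in a subcorpus of d tokens."""
    # Expected counts under the assumption of equal relative frequency.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Toy counts: a word 50 times in a 100,000-token subcorpus
# versus 10 times in a 200,000-token subcorpus.
print(round(log_likelihood(50, 10, 100_000, 200_000), 2))
```

When the relative frequencies in the two subcorpora are identical the statistic is zero; the larger the value, the less likely it is that the difference in frequency is due to chance.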

Combining a TCP corpus in a state of perpetual curation with the capabilities of such a search engine would add up to a very powerful “digital carrel,” but we are a long way from reaching that goal. (For more about corpus query tools, see “Towards a digital carrel: a report about corpus query tools.”)