Re-mediating the Documentary Infrastructure of Early Modern Studies in a Collaborative Fashion

The following is a hypothetical and introductory lecture to students who have shown an interest in the Early Modern world, whether its art, history, literature, music, politics, religion, or science. I wrote it in 2016. It has been lightly edited since.

My goal in this talk is to tell you a little about the documentary infrastructure of Early Modern Studies in the Anglophone world and about the changes that digital technology is making to it. Jerome McGann, one of the most distinguished editors of his generation, observed in 2001 that “in the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination.” That “re-editing” or “re-mediation” is a big enterprise. Its tasks range from mundane chores to operations drawing on highly specialized knowledge. Undergraduates have done a lot of useful work on the simple — and sometimes not so simple — side of that spectrum, and wherever they have laid their hands on particular texts they have noticeably improved them. I would like to persuade you to become engaged in an enterprise of collaborative curation in which you, as the future users of the documentary infrastructure for Early Modern Studies, participate in its production.

The allographic journey of texts

Early Modern Studies used to be known as The Renaissance, but that name has fallen out of favour, perhaps because it smacks too much of an attitude that David Bromwich somewhere characterized as “we are so smart now because they were so dumb then.” Early Modern is less judgemental and more in keeping with Ranke’s view that “every generation is equidistant from God.” In German usage “Early Modern” marks a period that begins somewhere around 1400 and ends somewhere around 1800. In the English-speaking world, those four centuries tend to be divided into two parts, known as “Early Modern” and “The Long 18th Century.” 1660, the end of the English Civil War and the restoration of the monarchy, serves as a convenient divider. The “Long 18th Century” is roughly coterminous with what Americanists call “Early American”. From a digital perspective, the documentary infrastructure problems of Early Modern and Early American are quite comparable.

Many changes over time turned Chaucer’s ‘medieval’ into Shakespeare’s ‘early modern’ England. The invention and rapid adoption of printing are of particular significance in the context of this discussion. Nelson Goodman in his Languages of Art distinguished between autographic and allographic objects. The former, e.g. Michelangelo’s David, are uniquely embodied; the latter are not: there is no privileged system of writing down a Shakespeare sonnet or a Bach fugue.

The history of texts is an allographic journey with stages of re-mediation where texts are written down (‘graphic’) in a different manner (‘allo’). Consider the history of the Iliad. Rooted in oral poetry, it was probably first written down around 700 BCE in an alphabet the Greeks had quite recently adapted from a Semitic alphabet used by Phoenician traders. Around 400 BCE this alphabet was modified to provide a more nuanced representation of Greek vowels. Alexandrian and Byzantine scribes added breathing marks and accents to help with pronunciation and disambiguation. If your first encounter with a Greek Iliad was via an Oxford Classical Text you would have seen a typeface derived from the handwriting of Richard Porson, an 18th-century English scholar. A page of that text would have been ‘Greek’ to Plato because it looked nothing like what he was used to. A page from Venetus A, the 10th-century Byzantine manuscript that is the most important source of the text, would have been just as Greek to him. He would have found it a little easier to make sense of the opening line of the Iliad in ‘betacode’, a workaround for representing Greek letters with Roman capital letters on an IBM terminal keyboard. Here is the first line in Greek letters and in betacode:

μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
MHNINAEIDEQEAPHLHIADEWAXILHOS

I say this because on an early Greek vase painting of a pretty girl, the legend “hē pais kalē” (written vertically) looks much like H PAIS KALH.

I dwell on this in such detail because readers through the ages have had difficulty with the allographic nature of texts. Readers and writers are deep “tool conservatives”: they are apt to think of change as a loss of authenticity and to feel that the original (or most familiar) encoding is an essential part of the text. What Joel Mokyr in The Gifts of Athena celebrates as lowering the access costs to knowledge, they have considered a profanation of sacred knowledge. Thus a 16th-century Italian writer said that “the pen is a virgin, the press a whore.” In this regard digital texts are to books what books were to manuscripts, and low-status codices to high-status scrolls. In the end, however, the low upstart has always won. For you, a laptop or mobile device will usually provide the most convenient and often the only access to an Early Modern book, especially if you stray beyond the safe limits of the canonical.

The English Short Title Catalogue

Traditional London taxi drivers have to pass an examination in which they demonstrate “The Knowledge”, their command of some 25,000 streets over a 100 square mile area, including points of interest and good ways of getting from here to there. You can’t have The Knowledge unless somebody has named the streets, given numbers to the houses on them, and kept a public register of them. A street without a registered name might as well not exist. Ditto for books. The Cartesian cogito of books might read “I am catalogued, therefore I am.” No cataloguing, no scholarship. Cataloguing is itself a scholarly activity of considerable complexity and has a non-trivial effect on scholarly field boundaries.

In the early 1880’s Alfred Pollard, a young man excluded from a teaching career by a very bad stutter, found a job in the British Museum’s Department of Printed Books. Some forty years later he published, together with G. R. Redgrave, A Short-Title Catalogue of Books Printed in England, Scotland, & Ireland and of English Books Printed Abroad, 1475–1640, the first systematic census of English print before 1640. That is why every imprint before 1640 has an STC number. Twenty years later Donald Wing at Yale extended this work, and books between 1640 and 1700 got “Wing” numbers. Fast forward another generation — we are now in the early days of personal computers — and there is a new and digital project to create a census of all 18th-century texts, the Eighteenth Century Short Title Catalogue. Then the editors of that catalogue decided to combine Pollard and Redgrave, Wing, and the eighteenth-century catalogue in a new and digital animal called the English Short Title Catalogue, which aims at being the authoritative description of the roughly half million books published in the English-speaking world before 1800. About a quarter of them are Early Modern; three quarters belong to the Long 18th Century. The ESTC is built on a foundation of more than 120 years of bibliographical labour. If you have read a little Virgil in the original Latin, you might remember tantae molis erat Romanam condere gentem (“so great a labour it was to found the Roman nation”).

Early English Books Online (EEBO) and the Text Creation Partnership (TCP)

In the late 1930’s libraries began to create microfilm copies of books. For many years, University Microfilms, an offshoot of the University of Michigan, was the leader in this effort. By the sixties many early modern texts were available in that effective but unloved format. Instead of combining business with pleasure in an expensive trip to the British Museum you could now go into the basement of a provincial university library and spend hours reading some 16th-century text while operating the microfilm reader in a dusty and windowless room. Variable in quality, with pages missing here or duplicated there, microfilm was nonetheless a powerful gift of Athena and broadened access to rare books.

University Microfilms changed owners several times before ending up in the hands of the ProQuest corporation, which around 2000 digitized the microfilms of English books before 1700 and made them available over the Web, where they were “free” for the privileged members of institutions that could afford the expensive subscription. When some years ago I asked a colleague “What difference have digital texts made to your work?” I barely had time to finish my question before he shot back: “EEBO has changed everything.” The digital scans are no better, and sometimes worse, than the microfilm images, but you can get at them at 2 am in your pyjamas. Access is king and often trumps quality.

The Text Creation Partnership (TCP) is a close contemporary of EEBO. Broadly speaking, the project, completed in 2015, aimed at creating a deduplicated library of books before 1700 in transcriptions that were faithful to the orthographic practices of the printers and used protocols of the Text Encoding Initiative (TEI) to articulate structural metadata. A raw page of TCP text is full of the markup in angle brackets you are familiar with from HTML. It “containerizes” text in a way that lets a machine identify chunks of text as lines of verse, paragraphs, list items, tables, titles, signatures, epigraphs, trailers, postscripts, quotations, etc.

The TCP corpus of not quite two billion words in ~60,000 titles is not a very useful resource if you are interested in the intratextual variance of different editions of the same work, but it is an unmatched resource when it comes to pursuing the intertextual filiations of the first two centuries of print culture. Its structural markup lets you extract text chunks across thousands of documents. The tools for doing so have become more user-friendly, but it still takes a little while to acquire moderately sophisticated text processing skills.
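
Here, as a concrete illustration, is a minimal sketch of what such chunk extraction might look like in Python with lxml. It is not the TCP’s or EarlyPrint’s own tooling: the folder name is an assumption, and the example pulls out epigraphs, one of the standard TEI containers mentioned above (files released in the older P4 format may need slightly different handling).

from pathlib import Path
from lxml import etree

def extract_chunks(folder, element="epigraph"):
    # Yield (filename, text) pairs for every matching element in every XML file in the folder.
    # The "{*}" wildcard matches the element whether or not it sits in the TEI namespace.
    for path in Path(folder).glob("*.xml"):
        tree = etree.parse(str(path))
        for node in tree.iter("{*}" + element):
            words = " ".join(node.itertext()).split()
            yield path.name, " ".join(words)

# Hypothetical usage: print the start of every epigraph in a folder of transcribed texts.
for name, chunk in extract_chunks("tcp-texts", "epigraph"):
    print(name, chunk[:80])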

The Query Potential of the Digital Surrogate: The Digital Combo

The form in which you read a text is always a surrogate, an arbitrary embodiment of an intentional object. The wistful and whimsical prefaces with which early printers apologize for their errors and encourage readers to correct “slips of the pen or press” (quae praelo aut penna lapsa vidisses) in odd ways reflect their understanding of texts as objects always a little beyond our grasp.

If a text is always encountered through a surrogate that is in some respects arbitrary, what is an appropriate re-mediation of an Early Modern text in an increasingly digital world? I am not talking about the end of the printed book, which is likely to remain unchallenged as the best way of reflective engagement with a single text. But even in the print world the modal form of scholarly reading involves constant movement among many books. A lot of mechanical ingenuity over centuries has gone into cutting the time cost of moving from one book to another, from the often aggressively technological book wheels of the Early Modern world to Jefferson’s elegant Lazy Susan.

Any medium has ‘affordances’ or things that you can do with it more readily than in another medium. Ranganathan’s fourth law of library science says “Save the time of the reader.” Many new affordances are about reducing the time cost of some activity. Compared with a scroll, the codex is a random access device that allows rapid movement across the pages of a book. Finding devices, such as tables of contents or indexes, are impracticable in scrolls. The digital medium is even more agile than the codex. It supports rapid and precise alignment of text and image, and it supports rapid “search and sort” operations across millions of pages of texts as if they were pages from one very long book.

A few years ago there was a Princeton conference about “research data lifecycle management”. I read about it in a blog that had this quotation from a talk by Brian Athey, the chair of computational medicine at Michigan: “Agile data integration is an engine that drives discovery.” That’s not the way humanists talk, but we’re familiar with the idea. The Weimar edition of Luther’s works runs to sixty volumes, not counting a dozen German and Latin index volumes of people, places, things, and citations. Those index volumes have been engines of discovery for generations of scholars.

Theodor Mommsen’s Corpus Inscriptionum Latinarum (CIL) offers an even more striking example from a predigital world. Before 1850 many Latin inscriptions were known, but knowledge of them was scattered. Around the time that Mommsen wrote the first volume of his Roman History (which won him the second Nobel Prize in Literature), he started an edition of Latin inscriptions based on new ‘autoptic’ transcriptions. He directed the project for half a century and edited many of the volumes himself. By the early twentieth century, a good Latinist in a then quite provincial Midwestern university (Indiana or Northwestern) had access to its sixteen volumes on eight feet of library shelving, with the corpus of inscriptions clearly organized in a time/space continuum. That was a transformative event for the study of Roman administrative and legal history. A lot of agility was built into those heavy tomes.

“Research data life cycle management” and “agile data integration” are helpful terms in thinking about a complex digital surrogate that I call the “digital combo”. This surrogate combines three different aspects of a text. First, the digital facsimile of a page in an Early Modern book gives you access to a privileged (and often the only) witness to the words in a work. The look and feel of the page also provide a lot of information about the milieu from which the text originates or the manner in which it addresses its audience. Much can be learned from such “paratextual” features.

Secondly, a careful digital transcription is much more agile than the printed text or its digital facsimile. You can cut and paste from it, and you can search within it. Thirdly, if the transcription is part of a corpus and if the transcriptions of each text have been done in a reasonably consistent manner, the agility of each text becomes an agility of the corpus. You can read and search the text within the context of the others. You can also read the corpus as if it were one very large book. This digital Book of Early Modern English is much bigger than the Luther Book or the Book of Roman Inscriptions. It adds up to a re-mediation of the Early Modern print heritage with affordances that we have only begun to explore.

Digital combos are nothing new. On the Internet Archive and in the Hathi Trust library there are now millions of digital surrogates that combine page images with automatic transcriptions produced via “optical character recognition” (OCR). Searching uncorrected or ‘dirty’ OCR across vast corpora is a crude but often successful way of finding stuff in books printed since the 1800’s. But while OCR has made giant strides in the past two decades, for texts before 1700 it is still a pretty hopeless enterprise because the printed lines too often resemble crooked teeth.

The Text Creation Partnership offers digital combos of a much higher quality. They combine TEI-XML transcriptions with EEBO images. But these digital combos have several problems. First, the image quality is often poor and well below contemporary standards. Secondly, poor image quality has led to many errors and lacunae in the transcriptions; completion and correction of the transcriptions has been a frequent user request. Finally, the images are behind a paywall, and in North America there is not much access to them outside a set of research universities counted in the low hundreds.

Can you have digital combos that combine high-quality and public domain images with TCP transcriptions in an environment that allows for the collaborative curation and exploration of the texts? Take a look at https://texts.earlyprint.org. The site offers ~52,500 texts, among them some 800 plays and a large portion of the Thomason tracts, a famous collection of books and pamphlets from the period of the English Civil War (1640–60). There are currently 630 digital combos. Not quite half of them are plays.

Over the past few years many Rare Book Libraries have begun to make some digital surrogates of their holdings publicly available. Not all of their Early Modern holdings are mapped to ESTC numbers, and very few of them are mapped to TCP texts. But if a catalogue record lists an ESTC, STC, or Wing number, the mapping to a TCP text is trivial: it is a simple lookup, as the sketch below suggests. One could imagine a loosely coordinated enterprise in which libraries give priority to digitizing image sets that map to TCP texts and avoid overlap with already existing digital combos. Do this steadily for five years, and the results will be significant.
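
To make the point concrete, here is a minimal sketch of such a join; the catalogue numbers and TCP identifiers below are invented placeholders, and any real mapping would be driven by the actual catalogue records and the TCP’s own metadata tables.

# Hypothetical lookup table from STC/Wing/ESTC numbers to TCP text identifiers.
# All numbers and IDs here are placeholders, not real records.
CATALOGUE_TO_TCP = {
    "STC 12345": "A11111",
    "Wing A1234": "B22222",
    "ESTC R12345": "C33333",
}

def tcp_id(catalogue_number):
    # Return the TCP identifier for a catalogue number, or None if it is unmapped.
    return CATALOGUE_TO_TCP.get(catalogue_number)

print(tcp_id("Wing A1234"))   # -> "B22222"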

Special Features of Early Modern Studies

Remember “The Knowledge” of the London taxi driver, 25,000 streets across a hundred square miles. What would it mean to have “The Knowledge” of Greek tragedy or Arthurian legend? Here the equivalents of the street names and numbers belong to the past. A mapping of that knowledge depends on answers to three questions. How much has survived? How much of the surviving material has been mapped? How carefully and consistently has it been mapped?

From the perspective of those questions the systematic digital re-mediation of the Early Modern print heritage is especially promising. First, much of what was printed has survived. We know the titles of more than 1,000 Greek tragedies, but only three dozen have survived. Martin Wiggins in his magisterial census of British Drama 1533–1642 (Oxford, 2014–) counts 543 plays that have survived and were printed between 1567 and 1642. This list is not radically different from a 1656 “Exact and perfect CATALOGUE of all the PLAIES that were ever printed; together, with all the Authors names.”

One must be cautious in extrapolating from drama to other genres. One should also remember that the print heritage is a quite different and more formal animal than the written heritage. But it appears that a high percentage of what was printed has survived and a high percentage of what survived has now been digitally transcribed in a remarkably consistent format. There may not be another epoch of comparable scope and significance where the printed record has been transcribed so completely and into a digital format that offers rich opportunities for agile data integration.

What about the quality and consistency of ‘mapping’ or transcription? This is a particularly important question for digitally encoded materials. An IBM executive once observed that humans are smart but slow, while computers are fast but dumb. Human readers adjust easily and tacitly to a wide variety of textual conditions. Computers will travel at lightning speed across vast stretches of text, but they will fail as soon as they encounter a textual condition about which they have not been told in advance. Given the variety (and plain inconsistency) of early modern print practices, the TCP archive has achieved considerable success in maintaining a level of consistency that supports complex analytical operations across the corpus as a whole or subsections of it.

Finally, by current computing standards, this corpus of less than two billion words is no longer particularly large. The textual data fit comfortably on a smart phone that may cost less than a replica of Jefferson’s revolving bookstand. Computationally assisted operations of scholarly interest on a corpus of this size no longer require an expensive infrastructure but can be done on quite ordinary laptops.

Natural Language Processing (NLP) and Augmented Ways of Reading

When I reviewed the writing samples of job candidates in the eighties, candidates from Yale could be spotted right away because their writing samples were printed on Yale’s mainframe in a quite distinctive style. Those were the days of Deconstruction, and in interviews we would joke about ghosts in the margin if the machine put notes in odd places. The candidates had no special interest in computers, but the mainframe would automatically renumber their footnotes. For this they would do anything, and they acquired the non-trivial text processing skills that it took to babysit a complex document like a dissertation on a mainframe computer of the eighties. The graphical interface of Microsoft Word relieved users of the need for such knowledge. Today the text processing skills of the modal humanities student or scholar do not extend beyond very simple wild card and right-truncated searches.

This is a pity. A modest amount of Natural Language Processing (NLP) goes a long way towards increasing the speed and accuracy of basic operations involved in text-centric work. The word ‘computer’ is misleading in its suggestion that the machine is mainly about numbers. The name of the famous early computer language ‘lisp’ is an abbreviation of ‘list processor’ — a much better name for a machine that spends much of its time making, sorting, and comparing lists or extracting items from them. These are basic operations in scholarly work. There is nothing particularly digital about them, and you need not sell your humanistic soul to some technological devil to take advantage of the fact that the machine can do some simple things much faster and with fewer errors than you can.

The digital re-mediation of the Early Print heritage will benefit greatly from the systematic application of NLP technologies that are widely used in Linguistics and many social sciences. There are persistent misunderstandings about the role of such technologies in the humanities. In the second part of Shakespeare’s Henry VI the peasant rebel Jack Cade indicts the Lord Say with these words:

It will be proved to thy face that thou hast men about thee that usually talk of a noun and a verb, and such abominable words as no Christian ear can endure to hear. (2 Henry VI, 4.7.35ff.)

This is an excellent example of the deep resistance that people have to the ‘explicitation’ of the tacit knowledge that humans bring to the task of “making sense” of language. A text written in the Roman alphabet is a very sparse notation that depends heavily on the skills that the reader brings to the task of making sense of it. Computers cannot make sense of anything. They can only follow processing instructions. If you want the machine to simulate some forms of human understanding, you must introduce the rudiments of readerly knowledge in a manner that the machine can process. These rudiments are very crude, they are added through machine processes, and they have error rates of up to 4%, but they significantly increase the query potential of documents, especially when done “at scale”. Typically you give every word a unique identifier and add its ‘lemma’ or dictionary entry form as well as a “part of speech” tag. Such annotation increases the size of the text by a whole order of magnitude. It produces what a witty colleague has called a ‘Frankenfile’ that is human-readable in principle, but not in practice. On the other hand, it can be processed very fast by a machine. Here is a sample of a few words from Jack Cade’s indictment:

<w xml:id="sha-2h640703704" lemma="talk" pos="vvb">talk</w>
<w xml:id="sha-2h640703705" lemma="of" pos="acp-p">of</w>
<w xml:id="sha-2h640703706" lemma="a" pos="d">a</w>
<w xml:id="sha-2h640703707" lemma="noun" pos="n1">noun</w>
<w xml:id="sha-2h640703708" lemma="and" pos="cc">and</w>
<w xml:id="sha-2h640703709" lemma="a" pos="d">a</w>
<w xml:id="sha-2h640703710" lemma="verb" pos="n1">verb</w>

Most readers are familiar with searches where you put in a string of characters and retrieve matches that contain the string. But quite often you may be interested in unknown strings that meet some criteria. An annotated corpus supports such queries. It can retrieve a list of all nouns (or proper names), sentences that begin with a conjunction or end with a preposition, or phrases that match the pattern “handsome, clever, and rich” in the opening sentence of Jane Austen’s Emma. Run against a corpus of Early Modern drama, such a search yields results like “The Scottish king grows dull, frosty, and wayward.”
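
To give a sense of what such a pattern query can look like in practice, here is a minimal sketch in Python that scans POS-tagged word elements like those shown above for the shape “adjective, adjective, and adjective”. It is only an illustration, not EarlyPrint’s own search machinery: the file name is invented, and the assumption that adjective tags begin with “j” should be checked against the tag set of the corpus you actually use.

from lxml import etree

def pos_tokens(path):
    # Return (word, pos) pairs for every <w> element, in document order.
    tree = etree.parse(path)
    return [(w.text or "", (w.get("pos") or "").lower()) for w in tree.iter("{*}w")]

def adjective_triads(tokens):
    # Yield word triples matching ADJ, ADJ, and ADJ; punctuation does not interfere
    # because it is assumed not to be encoded in <w> elements here.
    words = [(t, p) for t, p in tokens if t and p]
    for i in range(len(words) - 3):
        (w1, p1), (w2, p2), (w3, p3), (w4, p4) = words[i:i + 4]
        if p1.startswith("j") and p2.startswith("j") and w3.lower() == "and" and p4.startswith("j"):
            yield w1, w2, w4

# Hypothetical usage on a single annotated play:
for a, b, c in adjective_triads(pos_tokens("some-play.xml")):
    print(f"{a}, {b}, and {c}")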

If you have data of this type for an individual text, there are some primitive but powerful ways of aggregating them across a corpus such as the 55,000 TCP texts. You count the total occurrences of a word in a corpus, which gives you its “collection frequency.” The number of texts in which it occurs, its “document frequency”, often provides more useful information. If you know which words in a text occur in only one or two other texts, it may be a good idea to look at them. Procedures of this type do not replace reading, but they augment it. They also provide ways of making visible aspects of a text that reading does not easily reveal. Some of these count data are best thought of as lexical metadata that are added as appendices to a corpus somewhat in the same manner in which indexes are added to books. Aristotle writes about relative frequency as an important property of words (Poetics 21).
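
The counting itself is elementary. Here is a minimal sketch, assuming each text has already been reduced to a list of lemmas; the toy corpus below is invented for illustration.

from collections import Counter

def corpus_frequencies(texts):
    # texts: a dict mapping a text id to its list of lemmas.
    collection_freq = Counter()   # total occurrences across the whole corpus
    document_freq = Counter()     # number of texts in which the lemma occurs at all
    for lemmas in texts.values():
        collection_freq.update(lemmas)
        document_freq.update(set(lemmas))
    return collection_freq, document_freq

toy_corpus = {
    "text-a": ["man", "ship", "god", "man"],
    "text-b": ["man", "sea"],
}
cf, df = corpus_frequencies(toy_corpus)
print(cf["man"], df["man"])   # 3 occurrences in the collection, found in 2 documents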

A Digression about Counting

Humanities scholars often distrust quantitative arguments. Or so they say, but in practice their discourse relies heavily on such concepts as ‘none’, ‘few’, ‘some’, ‘many’, ‘most’, ‘all’, ‘much more’, or ‘a little less’. ‘Hapax legomena’ is one of the oldest critical concepts and originally referred to words that are found only once in the Homeric corpus. Athenaeus in the Deipnosophists, a third-century compilation of literary and culinary gossip, pokes fun at a pedant who asked of every word whether it occurred (keitai) or did not occur (ou keitai) in Attic Greek of the classic period. Hence his nickname Keitoukeitos. The great classical scholar Wilamowitz turned a German children’s phrase einmal ist keinmal (once doesn’t count) into Einmal ist keinmal, zweimal ist immer (once is never, twice is forever). The English linguist J. R. Firth is famous for his dictum that “you shall know a word by the company it keeps.” My mother liked to say that “three hairs on your head are relatively few, three hairs in your soup are relatively many.”

Humans in fact are inveterate, skillful, but informal statisticians, forever calculating the odds. The French mathematician Laplace observed that probability theory is just common sense reduced to a calculus (le bon sens réduit au calcul). It makes you appreciate more exactly ce que les esprits justes sentent par une sorte d’instinct, sans qu’ils puissent souvent s’en rendre compte. It is very hard to capture the meaning and tone of esprits justes. A colloquial rendering might be “what savvy minds just ‘get’ without being able to say why.”

A sense of a word’s frequency or currency is an important component of your knowledge of it. Money and philology mix wittily in Love’s Labour’s Lost when Costard reflects on the tips that pretentious superiors give him as ‘remuneration’ and ‘guerdon’, words well outside his base vocabulary (LLL 3.1.145ff.). If you have a properly annotated corpus it is not difficult to quickly extract quite detailed information about the ‘currency’ of words in texts or groups of them. Visualizations of such data have become quite popular. Think of the Google Ngram Viewer or the word clouds that are likely to accompany a New York Times report about a State of the Union address.

Close, Distant, and Scalable Reading

August Boeckh, another philological giant from the 19th century, said that “philology is, like every other science, an unending task of approximation.” You never quite get there, but you have to start somewhere. As a first-year graduate student you are like the would-be London cabbie and have to get “the” or at least “enough” knowledge of the Early Modern world to find your way around. Your dissertation may be about the equivalent of Highgate, but you need to have some idea where that is in relation to Elephant & Castle or Ladbroke Grove. For these crude mapping operations, it is extraordinarily helpful to combine bibliographical data from the ESTC with quite primitive frequency data about the distribution of words in works of interest to you.

It is important to be very clear about what such computationally assisted routines can and cannot do. They will not tell you much about canonical authors that you cannot get more quickly from other sources. There is a simple reason for this. The works of canonical authors have been crawled over for generations by thousands of the slow but smart computers also known as human brains. That is why findings of statistical studies about famous authors often tempt you to say with Horatio “There needs no ghost, my lord, come from the grave / To tell us this” (Ham. 1.5.125).

In some cases, such inquiries generate fascinating second-order results. Take J. F. Burrows’ 1987 classic Computation into Criticism: A Study of Jane Austen’s Novels. From one perspective, Burrows does not change your view of any of the novels or characters in them. From another perspective, he tells you how much of your view of this or that character is shaped by the relative frequency of the thirty most common words in that character’s speech. You don’t learn much about the ‘what’, but you learn a lot about the ‘how’. That said, Jane Austen probably was not planning to increase the percentage of first-person pronouns when putting words in the mouth of Sir Walter Elliot.

Most authors are not canonical, and much scholarly work turns on knowing enough about texts that you would rather not read very closely or at all. Over the two decades of the English Civil War the bookseller George Thomason collected all pamphlets and books published during that period. The 22,000 texts miraculously stayed together and ended up in the British Museum, providing extraordinarily dense eyewitness coverage of a transformative period of English history. Thomas Carlyle called them

the most valuable set of documents connected with English history; greatly preferable to all the sheepskins in the Tower and other places, for informing the English what the English were in former times; I believe the whole secret of the seventeenth century is involved in that hideous mass of rubbish there.

There are valid and rewarding ways of cherry-picking your way through this “hideous mass” looking for salient detail, breathtaking in its brilliance or idiocy. But there are equally valid ways of looking for geographical or temporal differences in the distribution of quite ordinary phenomena that will not draw attention to themselves when encountered separately here or there. “Hideous mass of rubbish” could be a way of describing an average week of Twitter or Google queries. I read somewhere about a dissertation whose author was interested in what you could learn from Twitter about regional differences in attitudes towards gay people. Unsurprisingly, the author found that openly gay Twitter stuff was much more common in California than in Southern states. Also unsurprising, but striking and haunting, was the fact that in Georgia (if I remember correctly) the most frequent completion of the Google question “is my husband” was “gay.”

A finding of this kind is a good example of what Franco Moretti has called “distant reading”, a term that challenges the “close reading” introduced by the New Critics, whose ethos is epitomized in the title of Cleanth Brooks’ Well Wrought Urn. Moretti’s “distant reading” can be seen as a digitally enhanced return to an earlier mode that surveyed an entire era from a distance. In thinking about the distinctive affordances of the digital medium I prefer the term “scalable reading.” A properly annotated digital corpus lets you abstract lexical, grammatical, or rhetorical patterns from a distance, but it also allows you to “drill down” almost instantly to any particular passage. If the time cost of zooming in and out is very low you can afford to follow up many leads in the hope that not all of them are wild goose chases.

It is true that reading from a distance offers you a shallow understanding of a text or corpus. But so does reading about it in an introductory survey that keeps you at a safe distance from any detail. In scalable reading you are at least potentially close to real words in real books, and “drilling down” is a second’s work. There will never be a substitute for looking very closely at the words and context that matter to your argument. But scalable reading is much better than a survey at helping you “look for” what you should “look at” closely. It is also true that some patterns are more readily seen from a distance. I know of no better introduction to the Iliad than the list of its three most common words ordered by descending frequency: man, ship, god.

Cultural Analytics and Data Janitoring

‘Augmented’, ‘distant’, or ‘scalable’ reading are terms for techniques of analysis characteristic of an emerging discipline called ‘Cultural Analytics’. The term is a portmanteau of ‘Cultural Studies’ and of ‘Analytics’, which is computer science jargon for stuff you do with data. The tool kit of analytics has become more powerful over the years, and the size of an annotated TCP corpus (~150 GB) is no longer an obstacle to running iterative ‘analytics’ across the entire set within minutes or hours rather than days or weeks. In terms of subject matter, a high percentage of the TCP corpus belongs to History, Law, Political Science, Religion, and Sociology. Much of that corpus lends itself to forms of analysis that have been practiced in the Social Sciences. Prominent among them is the use of lexical analysis to “predict” the ideological formations and likely choices of individuals or groups. I put the word “predict” in quotation marks because it has very little to do with a real future. It is the technical term for the results the machine produces when it applies some statistical routine to a data set.
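
For readers who have never seen such a “prediction” in the flesh, here is a minimal sketch of the underlying idea, with toy documents and labels invented for illustration; it shows what “predicting” from word counts means, not how any actual study of the TCP corpus proceeds.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and labels, invented for illustration only.
docs = [
    "the king and the church must be obeyed",
    "the rights of parliament and the people",
    "obedience to the crown is commanded by god",
    "the liberty of the commons shall be defended",
]
labels = ["royalist", "parliamentarian", "royalist", "parliamentarian"]

# A bag-of-words model: word counts go in, a label comes out.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["the people and their parliament"]))   # "predicts" a label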

The application of such techniques to the Early Modern corpus is clearly promising. But computers are very fussy about the format in which they are fed their data. You typically must do a lot ‘to’ the data before you can do anything ‘with’ them. Curation (doing to) and exploration (doing with) are two sides of the coin of working with digital data. A 2014 article in the New York Times gives an eloquent account of ‘data janitoring’ and reports a data scientist as saying “Data wrangling is a huge — and surprisingly so — part of the job… It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” “Welcome to the Club of Invisible Work”, a scholarly editor might say.

The TCP texts still need a lot of “data janitoring” — remedial work in the common use of that term. Some of it has to do with their appeal to readers. “Plain Old Reading” will remain the most important form of interaction with the texts. Human readers have a low tolerance threshold for typographical errors. They are annoyed by them even or especially if the error does not obscure the meaning of the word. The texts will be our main windows to the past for a long time, and we should keep those windows as clean as possible out of a sense of respect both for the texts and for their readers.

Other forms of data curation have more to do with making it possible for the machine to process the data in a manner that will give scholars trustworthy results in response to their queries. The goal of collaborative curation in that regard is to give the data an ‘interface’ that different types of analytics can use with as little need as possible for additional data janitoring. Given the variance of Early Modern orthography, a data layer of standard spellings is of equal value to readers and machines. The benefits to modern readers are obvious. For the machine, standardized spelling is a way of both erasing and articulating orthographic variance. Most of the time a user looking for ‘lieutenant’ will not be interested in the 145 different ways in which that word is spelled in Early Modern texts before 1640. But if the variant spellings are dependably mapped to ‘lieutenant’ there are straightforward analytics that count, group, and sort the variants by time or frequency. Standardized spellings also make it significantly easier for an algorithm to find shared phrases of two or more words, a critical source of evidence for intertextual research.
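
A minimal sketch of such an analytic, assuming a mapping table from variant to standard spellings already exists; the handful of variants listed below are illustrative, not a record of the actual 145.

from collections import Counter

# Hypothetical fragment of a variant-to-standard mapping table.
STANDARD = {
    "lieutenant": "lieutenant",
    "lieftenant": "lieutenant",
    "leiutenant": "lieutenant",
    "livetenant": "lieutenant",
}

def variant_counts(tokens, standard_form):
    # Count how often each variant of one standard form occurs in a token stream.
    counts = Counter()
    for tok in tokens:
        if STANDARD.get(tok.lower()) == standard_form:
            counts[tok.lower()] += 1
    return counts

tokens = ["Lieftenant", "lieutenant", "livetenant", "lieutenant"]
print(variant_counts(tokens, "lieutenant"))
# Counter({'lieutenant': 2, 'lieftenant': 1, 'livetenant': 1})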

Algorithms can do a pretty good job of creating standardized spellings, but if you want perfection you need some human intervention, and the earlier the text the greater the need. If this happens through “dispersed annotation” of a central resource by many hands on the basis of a shared protocol, consistency is more readily achieved, and corrections offered in one location may be propagated algorithmically to other locations. “Dispersed annotation” is a technical term from the life sciences, where the curation of genomes and their subsequent placement in a shared repository is a standard practice. The Early Modern corpus as a cultural genome offers a useful metaphor for some purposes. At least it draws attention to practices in the sciences, widely approved if not always followed, of iteratively curating and sharing research data so that they can be used by others for other purposes.

Engineering English or “Only Connect”

In a 2004 paper Philip Lord defined curation as

the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and other published materials.

Like the proverbial woman’s work, curation is never done. The digital corpus of Early Modern texts is stable if you think of it as a fixed number of texts, with occasional additions or replacements. It is ‘dynamic’ if you think of the layers of metadata that are in practice associated with digital data. It is, in Boeckh’s phrase, an “unending task of approximation” to ensure that the data and metadata are “fit for contemporary purpose, and available for discovery and re-use.” Contemporary purpose changes, and so do the tools available to scholars. New tools change the calculus of the possible, but taking full advantage of a new potential requires making the data accessible to the tools. Tools and data are not always easy to tell apart. Are metadata tools or data or both?

The responsibility for keeping scholarly data “fit for purpose” ultimately rests with the scholarly communities that base their work on them. Not so long ago the tripod of cultural memory rested on the work of scholars, publishers, and librarians. Increasingly it rests on collaboration among scholars, librarians, and IT professionals. The technical changes have been dizzying. What has not changed is the special responsibility that scholars have for the ‘fitness’ of their data.

“Had we but world enough and time”: the fame of this counterfactual from Marvell’s To His Coy Mistress depends on the fact that we don’t. Time is the most precious of all commodities. Our decisions are governed by our sense of what is quite literally ‘worthwhile’. “Save the time of the reader” is a domain specific application of the art of ‘engineering’, the exercise of ingenium (the Latin root of ‘ingenuity’) to create ‘engines’ that will reduce the time cost of getting from some ‘here’ to some ‘there’.

The maintenance of a collaborative environment for curation is in a very literal sense an engineering problem whose solution can take advantage of the phenomenal decrease in the time cost of many activities. Computer scientists use the term “mixed initiative” for tasks that divide work between a human and a computer. These divisions can take many forms. Consider the remedial task of fixing simple typographical errors. Computer science majors at Northwestern have used neural network techniques to make context-sensitive decisions about the correct spelling. There is a good chance that between half and two thirds of the five million textual defects in the TCP corpus can be fixed by a machine with a high degree of confidence. That leaves a lot of remedial work to humans, but a “mixed initiative” environment makes that work less tedious. Textual correction has a “find it, fix it, log it” workflow, where the finding and logging take up much more time than the fixing, which may be the work of seconds. The EarlyPrint sites currently house about a quarter of the publicly available TCP texts in an environment that supports “curation en passant”. If you come across a defect in a text, fixing it may take no more time than writing a word in the margin of a book. The software takes over the logging. It will reduce to seconds the time cost of looking for other defects. If you work with a text in that environment, corrections you make for yourself become available to others without any additional work on your part.

A similar logic is at work if you move beyond remedial tasks to the construction of higher-order metadata that are the digital successors of the indexes of citations, people, places, and things in the Luther edition. Such indexes are created by machines running a script, but it takes a lot of iteration, data checking, and tweaking by humans to make the indexes “fit for purpose.”

Up-to-snuff and free images are the most appealing part of a re-mediated corpus of the Early Modern print heritage. As said before, side-by-side display of text and image provides visible proof of the trustworthiness of the transcribed text, while layout and typography have their own story to tell. Matching a transcribed page of text with a new image from the same edition is a non-trivial task that benefits from a mixed-initiative approach. The identifiers of the digital texts are based on the IDs of the EEBO images rather than the often unreliable or non-existent page numbers of the source texts. The identifier of a newly made image will be based on the practices of the institution that created the image and will also have nothing to do with the page number of its source. You cannot automatically align text and image, but you can create an environment in which it will rarely take more than fifteen minutes for a reader with a laptop, an Internet connection, and an interest in a particular text to align text and image and create a digital combo.

A Hortatory Conclusion

To return once more to the concept of “research data life cycle management”, Early Modern printed books have entered a phase in their life cycle in which the affordances of digital media create many opportunities, but it will be the work of many individuals to turn those opportunities into real improvements and enrichments of a textual heritage. Joshua Sosin, a papyrologist at Duke who played a major role in the Integrating Digital Papyrology project, has argued forcefully for “investing greater data control in the user community”. Clay Shirky has written eloquently about “cognitive surplus” in his book with that title. In a digital world lots of people have lots of hours that they can spend (or waste) in different ways. Many of the 55,000 texts will benefit from some attention of a housekeeping kind, and much of that attention can be given to textual problems that can be solved in minutes or hours at a time rather than days or weeks and whose solution calls on patience and attention to detail rather than highly specialized professional competence.

The Renaissance Society of America and the Shakespeare Association of America have between them several thousand members, and their students are probably counted in the tens of thousands. Together they have a lot of cognitive surplus. If a little of it is spent every year on improving the corpus of Early Modern texts, the cumulative effect of five years’ work will be considerable. If work with old books is anywhere on the horizon of your career expectations, you could think of contributing to this effort as a form of service in which useful work is done and useful lessons are learned.