The EarlyPrint site contains linguistically annotated and partly corrected TCP texts in an environment that supports collaborative curation. For many practical purposes the approximately 60,000 TCP texts from 1473 to 1700 add up to a deduplicated library of that period. EarlyPrint looks forward to a time when for (almost) every book in that corpus there exists a complex digital surrogate that includes
- A transcription that strikes a balance between fidelity to the printed source and ease of reading and use on the laptops and other mobile devices that are the “tables of memory” on which 21st-century scholars do much of their reading and writing.
- Good quality page images that provide the witnesses to check the transcriptions and offer modern readers a sense of the materiality of the text in its original embodiment.
- Bibliographical, structural, and linguistic metadata that can be used separately or in conjunction to explore particular texts and support forms of “distant” or “scalable” reading across the entire corpus or parts of it.
This blog post is about metadata that will be added to the EarlyPrint site in the months and years to come. Metadata are literally “data about data”. If you have a deconstructive turn of mind, you may say that there are no data: on closer inspection, all data turn into metadata. True enough, but for practical purposes you can say that just the words of a sonnet (or other verbal construct) are “data”, which is Latin for “givens”.
The Hellenistic poet Callimachus is famous for saying mega biblion mega kakon, or “big book, big bad”. He should have known, since in his day job he was the head of the library at Alexandria. His lost work Pinakes, a catalogue of the holdings of that library, was the first large-scale catalogue. Library catalogues are the metadata genre par excellence. But metadata come in many shapes and sizes.
List-making, from the seven sages or world wonders to the top forty, has been a universal human activity, and metadata are largely lists of one kind or another. Medieval monks invented the prototype of the modern database by creating a “concordance” or alphabetical list of Biblical words to prove that all the “places” of a word in a text “concord” and reveal the harmony of the Word of God. As soon as printed books had page numbers that supported the easy look-up of words, it became possible to add indexes to a book. Thus the 1573 edition of the works of Tyndale includes 18 pages of “A diligent, and necessary Index, or Table of the most notable thynges, matters, and woordes contayned in these workes of Master William Tyndall. The letter A. signifieth the frist columne, and B. the second columne of the same side.” Fast-forward three centuries to the beginning of the 60-volume Weimar edition of Luther’s works, which over time acquired a dozen index volumes (in Latin and German) of persons, places, things, and citations.
Ranganathan’s Fourth Law of Library Science says “Save the time of the reader”. Brian Athey, the chair of computational medicine at the University of Michigan, said at a conference about life cycle research data management that “agile data integration is an engine that drives discovery”. Collecting all of Luther’s works in one place and in one format took a lot of time, but saved the time of countless readers. The same is true of the index volumes. Ditto for the sixteen volumes of Latin inscriptions that were the life work of Theodor Mommsen and utterly transformed the documentary infrastructure for the study of Roman law and administration. Lots of “agile data integration” in those massive late 19th century projects.
In principle, digital metadata are no different from the stuff found in the indexes to Tyndale and Luther. In practice, however, the difference in scale makes for a difference in kind. The dozen index volumes of the Luther edition, slimmer than the text volumes, may add just 3% to the text (though at greater informational density). The index of Biblical citations makes it possible to find John 3.16 or Matthew 28.19 relatively quickly, but the look-up is measured in minutes rather than seconds. Spitzweg’s picture of the bookworm nicely summarizes the search environment in the world of print. The bookshelves are a metadata system. The ladder is a search tool.
In the EarlyPrint corpus, metadata exceed the data by an order of magnitude. That is by no means an uncommon ratio. But the time cost of look-ups is measured in milliseconds rather than minutes. It still takes a lot of time and effort to construct metadata, make them fully searchable, and display them in ways that make it possible for human hands and eyes to grasp them. But once that is done, the time cost of a look-up across a corpus of two billion words may be less than the time it took Spitzweg’s bookworm to do a look-up in a single book. Digital metadata shrink a large corpus to the size of a book that you can hold in your hands.
What metadata, and for what?
A scholarly project is typically driven by a research question, and the research question establishes the boundaries for data and their curation. Corpus-wide metadata have less clearly defined boundaries. Mommsen’s corpus of Latin inscriptions was not driven by a single research question but by a conviction that a systematic curation of inscriptions from the Roman Empire by time and place would create a critical documentary infrastructure and change the “calculus of the possible”. He was right. John Unsworth coined the useful acronym MONK for “Metadata Offer New Knowledge”.
The EarlyPrint corpus has a triple-decker structure where metadata exist as bibliographical data at the document level, as linguistic data at the molecular level of individual words, and as discursive data that articulate the internal structure of a document (chapter, paragraph, note, etc.). The interaction of the three levels is a critical factor.
To begin with the molecular level of words, every digital file comes with a primitive, powerful, but not very stable form of indexing. A modern text editor can traverse millions of character strings “on the fly”, remember where they begin and end (their offsets), and tell a user almost instantly how often and where a given string pattern occurs. A user who knows a little about “regular expressions” can look for strings that match a pattern.
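This kind of pattern-and-offset search can be sketched in a few lines of Python (the sample line is invented for illustration, not drawn from the corpus):

```python
import re

line = "Loue is not loue which alters when it alteration findes"

# Scan the whole string, reporting each match together with the character
# offsets where it begins and ends, much as an editor's regex search does.
for m in re.finditer(r"\balter\w*", line):
    print(m.group(), m.start(), m.end())
```

The offsets are only as durable as the file itself: insert a single character near the start and every later offset shifts.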
In an offset-based system any change in the file will shift all locations that follow the change. In a more stable environment you think of every word as a letter that is put in its own envelope with a unique, permanent, and sequential address. As long as you keep the letter in its envelope, you can always reconstruct the proper order of words, however much you scramble the envelopes for this or that purpose.
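The envelope model can be sketched as a toy in Python; the sentence and the plain sequential addressing are illustrative assumptions, not EarlyPrint’s actual ID scheme:

```python
import random

# Each word ("letter") goes into an envelope with a permanent,
# sequential address.
words = "The quality of mercy is not strained".split()
envelopes = list(enumerate(words, start=1))

# Scramble the envelopes for this or that purpose...
random.shuffle(envelopes)

# ...and the addresses still recover the original order of the words.
restored = [word for address, word in sorted(envelopes)]
assert restored == words
```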
You can also put other stuff in the envelope, in particular linguistic information. Readers bring an extraordinary amount of largely tacit knowledge to the task of making sense of words on a page. Linguistic annotation “explicitates” some rudiments of readerly knowledge and puts them into the text in a manner that the machine can process, behaving as if it were reading. What the machine lacks in subtlety it makes up for in speed. The ability to simulate very limited forms of “understanding” across very large data sets and at lightning speed is a powerful thing.
Named Entity Recognition
Whether in Tyndale’s works or a modern digital corpus, names are the most commonly indexed terms. NER is an acronym for Named Entity Recognition, an important sub-discipline of Natural Language Processing. You are rarely interested in a name for its linguistic properties, but you want to take advantage of state-of-the-art NER or NLP to find a name. And you want a good index to distinguish between John the Baptist, John the Evangelist, and King John. You also want it to distinguish between place names and personal names, not to speak of compound names whose parts are not themselves names, such as ‘United Provinces’.
In its current state EarlyPrint does a good job of identifying strings that are names. With one exception, the identification specifies that something is a name, but it does not distinguish between types of names or identify the referent of a name. Those tasks, however, are part of the project’s long-term agenda.
In standard English since the 19th century, sentence-internal capitalization has been an excellent indicator of names. It is not very helpful before 1700, when almost any noun might be capitalized for any reason, and it is even less helpful for the early 1500s, when names are not consistently capitalized. Modern NER tools (e.g. the Stanford Named Entity Recognizer) do not do a good enough job on Early Modern texts.
The exception is Purchas His Pilgrimage, an early 17th-century ethnographic compilation of over a million words. In this work all names have been mapped to the categories of person, place, organization, and literary work through a “mixed initiative” of a fancy computer program and a very bright and conscientious undergraduate. The book probably includes a high percentage of the names in use before its publication, and it will make for good training data. There are also a variety of contemporary gazetteers and dictionaries from which one can build lists that can then be mapped to modern authority files.
The EarlyPrint corpus includes about 19,000 abbreviations with some 15 million occurrences. A large majority of them refer to books and their authors. Here too, work so far has focused on establishing that a string is an abbreviation, a very simple task if you look at any one, but a quite complicated task if they run into the millions. Once you are confident that you have identified most abbreviations with tolerable accuracy, it is a simpler task to map them to their expanded forms. Abbreviations often occur in marginal notes. Because marginal notes are usually set in very small type, and because both inner and outer margins are error-prone regions for microfilming and transcription, the error rate for notes in the TCP corpus is at least five times as high as for ordinary text. That complicates the task of building the citation network that provides key evidence about the “struggles and wishes” of the age and would lend itself to interesting forms of visualization. Modern visualization tools are amazingly powerful even in the hands of non-professional programmers. But however clever and striking, they cannot deliver good information if the underlying data are fragmentary or too “noisy”.
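The two steps, identification and expansion, can be sketched in miniature. The pattern and the expansion table below are hypothetical stand-ins; the project’s actual lists are far larger and hand-checked:

```python
import re

# Hypothetical expansion table; a real mapping would be built from
# contemporary lists and checked against modern authority files.
EXPANSIONS = {"Aug.": "Augustine", "cap.": "capitulum"}

note = "Aug. de Civ. Dei cap. 12"
tokens = note.split()

# Step 1: flag abbreviation-like tokens (here, letters ending in a period).
abbrevs = [t for t in tokens if re.fullmatch(r"[A-Za-z]+\.", t)]

# Step 2: expand the ones the table recognizes; leave the rest alone.
expanded = [EXPANSIONS.get(t, t) for t in tokens]
print(abbrevs)
print(expanded)
```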
Most readers know the Bible as a set of “books” divided into chapters and verses. Neither God nor its human authors wrote it that way. The division was the work of the French printer Robert Estienne, who in the 1550s applied to the New Testament a “verse-and-chapterification” that had been developed in the mid-fifteenth century for the Hebrew Bible. This citation system spread very rapidly, and since the late 16th century Biblical verses have been cited in a very consistent manner.
Biblical citations overwhelm all other citations in Early Modern books. It is possible to identify them in a largely automatic and quite accurate manner. A cynical observer might say that Biblical verses became rhetorical hand grenades to be lobbed at your opponents in the theological quarrels of that age. There is almost certainly much to be learned from the distribution of Biblical citations across time, denominations, and authors.
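The consistency of the book-chapter-verse convention is what makes automatic identification feasible. A minimal sketch of the pattern matching involved, assuming a hand-built list of book names (the list here is a tiny hypothetical fragment; a real one would cover every book and its Early Modern spellings and abbreviations):

```python
import re

# Hypothetical fragment of a book-name list, including one
# Early Modern spelling ("Iohn") and one abbreviation ("Matt.").
BOOKS = r"(?:Iohn|John|Matt\.|Matthew|Gen\.|Genesis)"
CITATION = re.compile(rf"\b{BOOKS}\s+\d+[.:]\d+\b")

note = "See Iohn 3.16 and Matt. 28.19 on this point."
print(CITATION.findall(note))  # ['Iohn 3.16', 'Matt. 28.19']
```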
Aristotle in the Poetics distinguished between common and rare words, recognizing that frequency is an important property of words. J. R. Firth famously said that “you shall know a word by the company it keeps”. Word clouds and other lexical visualizations have become regular features of reporting about the President’s State of the Union address. It is important to keep in mind that any visualization, however pretty, is just numbers dressed up in colourful clothes. Sometimes there is very little underneath the clothes. The lexical data in the EarlyPrint corpus support a wide variety of quantitatively oriented inquiries. They exist in a format that can be easily ingested by many different tools.
It is in principle possible to take any text in the corpus and profile it against the lexical data from a thirty-year window in whose middle that text sits. It is equally possible to profile it against the entire corpus. The differences between the two profiles might be interesting in themselves. Two very simple numbers provide the basis for many complex operations: the “document frequency”, or number of texts in which a word occurs, and the “collection frequency”, or the total number of its occurrences in a given corpus. If you have those two numbers as background data for word counts in a particular text, a lexical profile might show
- common words that are disproportionately frequent
- words that are disproportionately rare
- words that occur in only very few other texts
- words that are unique to this text
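These computations are simple enough to sketch with a toy corpus (the documents below are invented for illustration):

```python
from collections import Counter

docs = [
    "the law of god is the word of god".split(),
    "the law and the gospel".split(),
    "a song of the sea".split(),
]

# Collection frequency: total occurrences across the whole corpus.
collection_freq = Counter(w for d in docs for w in d)
# Document frequency: number of texts in which the word occurs.
document_freq = Counter(w for d in docs for w in set(d))

# Profile the first text: observed count versus what the corpus-wide
# rate would predict for a text of its length.
target = Counter(docs[0])
n_target, n_corpus = len(docs[0]), sum(len(d) for d in docs)
for w, observed in target.items():
    expected = collection_freq[w] * n_target / n_corpus
    print(w, observed, document_freq[w], round(observed / expected, 2))
```

A ratio well above 1 flags disproportionate frequency; well below 1, “telling underuse”; a document frequency of 1 marks a word unique to this text.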
Readers who know a particular genre are likely to have a pretty good implicit sense of word distributions in their texts, and in most cases such profiles will tell them nothing new. But “telling underuse” of a common word is one feature where the mindless counting of the machine can draw a reader’s attention to interesting phenomena.
The Early Modern Print site about Text Mining Early Printed English takes important steps in the direction of making frequency visible. It lets you track the history of a word with very striking visualizations of changing frequencies over time in the corpus as a whole. But given a properly annotated corpus, you want to move towards defining subcorpora that let you profile a particular text against its genre in its generation. It takes a fair amount of “agile data integration” to make this possible for non-geeky humanities scholars. And most humanities scholars will be non-geeky for many years to come.
It is worth pointing out in this context that in just about any digital project the most difficult and time-consuming task consists of taking something that is easy for the geek and making it easy for the non-geek. For every unit of effort that it takes to make X accessible to the geek, it takes ten units to make it accessible to the non-geek. ‘UI’ or user interface design is a huge time sink.
Cataloguing from below
A book written in the last hundred years and given a Library of Congress or Dewey number will show up on a library shelf in the company of very close neighbours. Not so with books before 1700. Most of them do not have Dewey or LC numbers to begin with, and the discursive universe of the pre-industrial world does not map very well to the categories of more modern (in practice late 19th century) taxonomies. TCP texts have MARC records. Some of them have subject headings, often not more helpful than “texts before 1800.”
60,000 books is not that large a number. A big Barnes and Noble store might hold that many titles on its shelves. So it is tempting to think of constructing a digital late 17th-century bookstore with shelves for various birds of a feather. But some books have very different neighbours, and a digital text can be in more than one place at once. So it may be more productive to think of a system in which each text can generate its own neighbourhood from various and sometimes contradictory signals. There is no cataloguer who after some deliberation assigns a text to this rather than that category. Instead, there are features in the text that reach out to other texts, either because those other texts are the only ones sharing that feature or because the other texts are like it in having disproportionately much or little of some common feature.
The features can be quite primitive, atomic, or in themselves not very meaningful. The XML encoding of the TCP source texts uses some three dozen tags. Texts that are similar in the distribution of those tags are likely to be similar at higher and more interesting levels of organization. The expressive power of these very primitive features is quite striking. A text with many instances of ‘℥’ or ‘℞’ belongs to cookery, pharmacy, or medicine, a territory with overlapping boundaries then as now. XML tags like <sp> point to plays or play-like texts. Relatively rare words in a text may tie it to other texts in which those words occur. The notes in a text may tie it strongly to one or more citation networks.
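Tag-distribution similarity can be sketched with a standard measure such as cosine similarity over tag counts; the profiles below are invented stand-ins for counts harvested from the TEI-XML sources:

```python
import math
from collections import Counter

# Invented tag profiles: counts of XML tags per text.
profiles = {
    "playA":  Counter({"sp": 900, "l": 4000, "stage": 150}),
    "playB":  Counter({"sp": 700, "l": 3500, "stage": 90}),
    "sermon": Counter({"p": 300, "note": 250, "q": 80}),
}

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine(profiles["playA"], profiles["playB"]))   # close to 1
print(cosine(profiles["playA"], profiles["sermon"]))  # 0.0, no shared tags
```

Texts whose tag vectors point in nearly the same direction become candidates for the same neighbourhood, without any cataloguer assigning a category.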
Some text clusters may be so strong that it is worth identifying them as quasi-formal subject categories. But in most cases it may be enough to draw attention to textual features whose presence in other texts is likely to be a tell-tale sign of similarity. For the end user the question is “What features in this text, however humble, are likely to lead me to similar texts?” For the developers who worry about metadata and design, the question is “How do I draw attention to the features that will help the end user find similar texts?”