This is a report on a “mixed initiative”–a term of art in computer science–that combines old-fashioned philological elbow grease with new-fangled long short-term memory neural network processing (LSTM). The goal is to fix as many as possible of the approximately five million incompletely transcribed words in the 1.7 billion word TCP corpus of English printed books before 1700. I call them “blackdot” words because in the Web presentation of those texts missing letters are represented by the Unicode character U+25CF or “Black Circle.”
Some basic facts about TCP transcriptions
Between 2000 and 2015 the Text Creation Partnership created what for many practical purposes amounts to a deduplicated library of the English print heritage between 1473 and 1700. The TCP archive contains about 60,000 distinct titles, ranging from the frontispiece of a book from 1487 whose only words are the letters I.N.R.I below a picture of Jesus to the 3.2 million words of Du Pin’s New History of Ecclesiastical Writers (1693). The median length of a text is just shy of 7,000 words. If you think of a book as something that contains 25,000 words or more, only 7,200 titles (12%) are “books” in our sense of the word. Pamphlets, broadsheets, and other essay-like genres make up the bulk of the archive.
The TCP texts were originally transcribed into SGML, the ancestor of HTML and XML. They were transcribed by the staff of off-shore vendors–many of them in India–who transcribed letter for letter rather than word for word and may or may not have been fluent in English. The transcribers were paid by the keystroke, and quality control rules for accepting a transcription said that on a proofreading of 5,000 words or 5% of the pages (whichever is smaller) there should be no more than 1 error per 20,000 keystrokes.
What the workers in these digital scriptoria of the early 21st century saw were digital scans of double page microfilm images of printed books, made at some point between 1938 and the nineties, with the British Library and the Huntington Library as the largest contributors to the project. The printed pages were of variable quality, and so were the microfilm images. It is not easy to produce a clean transcription from a poor digital scan of a poor microfilm of an even poorer printed page. None (or very few) of the microfilms were produced with the help of “deskewing” technologies that counteract the problems of projecting an angled double page on the flat surface of a 35 mm image. These conditions affected the results: what with the skewing effect of the ‘gutter’ and the faint and small print of notes in the outer margins, line-terminal words, line-initial words, and words in the margins have an error rate that is higher by half an order of magnitude.
From the vendor’s perspective, some rules of survival offered themselves. You want to get it right, you can’t spend forever on each keystroke, and at the margins it is less risky to say “I can’t read this” than venture a guess. Which accounts for the fact that the TCP texts are littered with instances of ‘bu●’, ‘a●’, ‘i●’ where the basic rules of English syntax make ‘but’, ‘as’ or ‘in’ the only possible solutions. From some comments in the excellent user survey of TCP texts by Judith Siefring and Eric Meyer, I gather that some readers find these easily fixable mistakes even more irritating than genuinely hard cases. But if those readers put themselves in the shoes of a manager or worker in those far-away scriptoria they would do exactly the same thing.
All things considered, the most striking thing about the TCP texts is not how bad they are but how good they are, given the many constraints that went into their making.
What to do about the problem
Fixing the millions of transcription errors in the TCP texts is a good example of a problem that should be fixed by somebody else. It is not easy to find that somebody else. Who wants to spend their time fixing dumb typos in old books, and who is going to reward you for that effort? There is a non-trivial number of passages that are hard to read and that attract philological sleuths, but the odds of persuading people to work on them will increase greatly if you can find a way of solving the many simple problems in a completely or largely automatic fashion.
Enter Larry Wang and Sangrin Lee, two students of Doug Downey’s in Northwestern’s Computer Science Department. They have no particular interest in what they charmingly call “old books”, but they have a keen interest in long short-term memory neural networks. I can use that term correctly in a sentence, but that is the extent of my understanding of it. I also understand that this is a cool technology because it takes advantage of the properties of “GPU” processors (the chips that govern the display of video). It is even cooler because as of now it only works on NVIDIA processors (my Mac doesn’t have those).
From a computational perspective, a “blackdot word” is a “regular expression” if you replace the blackdot with a plain dot. ‘i.’ is a pattern in which ‘i’ is fixed and ‘.’ can stand for any alphanumerical character. In English, it matches the common words ‘if’, ‘in’, ‘is’, ‘it’. But which is the proper choice in any particular occurrence? In most cases it is not difficult for a human reader to decide whether ‘if’, ‘in’, ‘is’ or ‘it’ is the right reading. Stephen Greenblatt in an essay on gender confusion in Twelfth Night famously wrote that “the palace of the normal is built on the shifting sands of the aberrant.” And Montaigne (?) ruefully observes that language flows daily through our hands like sand. Nonetheless ‘if’, ‘in’, ‘is’ or ‘it’ are in most cases easy to tell apart. Nor do you need to resort to a page image to figure out which it is. You need to be careful, though. It is tempting to think that the regular expression ‘spe.ial’ is matched by ‘special’. But it is also matched by ‘spetial’.
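The pattern-matching step can be sketched in a few lines of Python. This is only an illustration: the function names and the toy word list are hypothetical, and the sketch assumes that each blackdot stands for exactly one lowercase letter. A real lexicon would be drawn from the corpus itself, early modern spellings included.

```python
import re

BLACKDOT = "\u25cf"  # U+25CF BLACK CIRCLE, the placeholder for one unread letter

def blackdot_to_regex(word):
    """Turn a blackdot word into a compiled regex; each blackdot matches one letter."""
    parts = ["[a-z]" if ch == BLACKDOT else re.escape(ch) for ch in word]
    return re.compile("".join(parts))

def candidates(word, lexicon):
    """Return the lexicon entries that fit the blackdot pattern, in sorted order."""
    pattern = blackdot_to_regex(word.lower())
    return sorted(w for w in lexicon if pattern.fullmatch(w))

# Hypothetical toy lexicon, for illustration only.
LEXICON = {"if", "in", "is", "it", "but", "bud", "as", "at", "special", "spetial"}

print(candidates("i\u25cf", LEXICON))       # ['if', 'in', 'is', 'it']
print(candidates("spe\u25cfial", LEXICON))  # ['special', 'spetial']
```

The second call shows why care is needed: the lexicon has to include period spellings such as ‘spetial’, or the pattern will appear to have only one solution.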
An LSTM algorithm will troll through a corpus of texts, look for occurrences of a pattern (e.g. ‘i.’) and then use probability estimates to determine which choice is the likeliest in the context of a particular occurrence. The other day Sangrin sent me the results of an LSTM run through some 15,000 TCP texts from the 1600’s. It took his machine five days to “read” through all the texts, examine all the alternatives, and come up with suggestions. That’s an eternity in a world of Google look-ups, but it’s nothing compared with the time it would take humans to read through those 15,000 texts and religiously keep track of their findings.
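The idea of choosing among candidates by context can be illustrated with something much simpler than an LSTM. The sketch below is not the method Larry and Sangrin used: it swaps in plain bigram counts with add-one smoothing as a stand-in, and the training sentences, function names, and scoring rule are all assumptions made for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent word pairs in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def rank(candidates, left, right, bigrams):
    """Rank candidates by how well they fit between `left` and `right`.

    Each candidate is scored by (count of left+candidate + 1) times
    (count of candidate+right + 1); the scores are then normalized so
    that they sum to 1 across the candidates.
    """
    raw = {c: (bigrams[(left, c)] + 1) * (bigrams[(c, right)] + 1) for c in candidates}
    total = sum(raw.values())
    return sorted(((c, s / total) for c, s in raw.items()), key=lambda p: -p[1])

# Hypothetical toy training text; a real model would learn from the corpus.
TRAINING = [
    "it is in the book".split(),
    "he is in the house".split(),
    "if it is in the field".split(),
]
bigrams = train_bigrams(TRAINING)

# Which reading of 'i.' fits best in "... is i. the ..."?
best, prob = rank(["if", "in", "is", "it"], "is", "the", bigrams)[0]
print(best, round(prob, 2))  # in 0.84
```

An LSTM replaces the two neighbouring words with a much longer and richer view of the context, but the output has the same shape: a ranked list of candidates with a probability attached to each.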
Via some intermediate steps that five-day run generated a table with 1.5 million blackdot corrections with entries like
|A09443-082-a-3170||he**t||I the Lord search the||heart||and trie the reynes And||0.99755|
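A row in this format is easy to read programmatically. The sketch below assumes that fields are separated by double pipes and that the six fields are, in order, a word identifier, the incomplete word, the preceding context, the suggested reading, the following context, and a probability estimate. The field names are labels chosen for this illustration, not names taken from the table itself.

```python
from typing import NamedTuple

class Correction(NamedTuple):
    word_id: str     # assumed: identifies the word's location in a text
    pattern: str     # the incompletely transcribed word
    left: str        # words preceding it
    suggestion: str  # the proposed reading
    right: str       # words following it
    score: float     # assumed: the model's probability estimate

def parse_row(line):
    """Split one pipe-delimited table row into a Correction."""
    fields = line.strip().strip("|").split("||")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    *text_fields, score = fields
    return Correction(*text_fields, float(score))

SAMPLE = "|A09443-082-a-3170||he**t||I the Lord search the||heart||and trie the reynes And||0.99755|"
row = parse_row(SAMPLE)
print(row.pattern, "->", row.suggestion, row.score)  # he**t -> heart 0.99755
```

Once rows are parsed this way, they can be sorted or filtered by score, for example to review the least confident suggestions first.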
In 20% of the cases the machine did not offer a correction or tried to match patterns that contained numbers or symbols that are in principle beyond correction. I read quickly through a 1% random sample of 12,000 entries. The error rate seemed very low, and I looked more carefully at a sample of 1,200 entries. The error rate was about 10%. Which means that the LSTM process produces accurate results for 90% of 80% of blackdots or 72% of all blackdot words.
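The arithmetic behind the 72% figure is spelled out below. The last line is an extrapolation rather than a measured result: it assumes that the rates observed in the samples hold across the whole table of 1.5 million entries.

```python
total_entries = 1_500_000  # rows in the table produced by the five-day run
usable_share = 0.80        # 20% had no correction or were beyond correction
accuracy = 0.90            # about 10% of the usable suggestions were wrong

overall = usable_share * accuracy
print(f"{overall:.0%}")               # 72%
print(round(total_entries * overall)) # 1080000
```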
In the EarlyPrint site at http://texts.earlyprint.org enriched and partly corrected versions of TCP texts are maintained in an environment that supports collaborative curation. By the end of the summer we expect to populate that site with all 25,000 texts that are currently in the public domain. If further tests bear out the initial finding of a 90% success rate, it may make sense to move all corrections into the source texts, but flag them as machine-corrected items, leaving it to users over time to endorse or challenge particular emendations.
I looked in detail at the 1624 Relation or iournall of the beginning and proceedings of the English plantation setled at Plimoth in New England (A089810, ESTC S110454). This is a text with 150 blackdot words–a ‘D’ text with 57 transcription problems per 10,000 words. The LSTM process produced 120 usable results, of which 88 were correct and 32 were not, for a 73% success rate.
Blackdot words cluster heavily in a quarter of the TCP texts. An overall success rate of 90% does not mean success will be evenly distributed. You may expect a lower success rate in texts where blackdot words cluster heavily. But compare the time cost of making 120 corrections by hand with the time cost of reviewing 120 corrections and emending 32.
“Many hands make light work” is an old proverb. It will take many hands and quite a few years to bring all the TCP transcriptions up to the standards you associate with scholarly work. Machines can help a lot. The workflow of textual corrections follows the order of “find it, fix it, log it.” The LSTM work of Larry and Sangrin will simplify and speed up that workflow. Future collaborators will face an environment in which the errors have been found and fixed. What remains to be done is checking the fixing and emending the small percentage that hasn’t been fixed properly. Still a lot of work, but a lot less than if it had to be done with pencil and paper.