Engineering and English are alphabetical neighbours in a university list of disciplines, but members of each tend to think of the other as sitting at the far end of the disciplinary spectrum. Yet work in English departments has for centuries depended on the engineering that created and refined printing. Future work will depend on the software engineers who have created and will refine the environment in which the digital “re-mediation” of our print heritage is taking place, not to speak of whatever is “born digital.”
Ranganathan’s fourth law of library science says “Save the time of the reader.” The formidably engineered book wheels on Early Modern engravings are very much in the service of that law. So are the find operations of modern computers that reduce the time cost of typical look-ups by orders of magnitude. What about saving the time of the editor?
The 60,000 titles of English books before 1700 and American books before 1800, transcribed by the Text Creation Partnership (TCP), add up to a magnificent but flawed enterprise. There are about six million incompletely transcribed words. The transcribers worked from digital scans of microfilms that were often not very good, made from printed books whose condition was sometimes even worse. Given the conditions in which they did their work, it speaks to their ingenuity and patience that they did as well as they did. From another perspective, six million defective words in a corpus of two billion words add up to just 0.3%. For most search operations, “noise” of that size is unlikely to affect any “signal.”
True, but scholarly readers are fussy about their texts, and the defects cluster heavily: 80 percent are found in a quarter of the texts, and almost half cluster in just 10 percent. So the defect rate across those worst 6,000 texts is about 3%. At that level, “noise” drowns out at least some “signal,” and that level of error will irritate even non-fussy readers.
It would take a large number of volunteers with a high pain threshold for boredom to fix those defects manually in a collaborative enterprise. Can some of that work be “downsourced” to a machine with a tolerable margin of error? The answer is “yes,” and in the rest of this blog I report on a successful pilot project undertaken by Larry Wang and Sangrin Lee, undergraduate students of Doug Downey in the Computer Science department of Northwestern’s McCormick School of Engineering.
When transcribers could not read one or more letters in a word they were instructed to describe as precisely as they could the extent of the gap. On web sites with TCP texts you see the gaps as a sequence of one or more black dots, e.g. ‘s●ditiously’. You can think of the defective word as a regular expression, e.g. ‘s.ditiously’, where the dot is a placeholder for any alphanumerical character. Because the transcribers were very accurate in defining defects, machine-learning techniques can fix between half and two thirds of defective words at a tolerable level of error.
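The regular-expression idea can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the tiny lexicon and its frequencies are invented for the example.

```python
import re

def defect_to_pattern(word):
    """Turn a TCP defect like 's●ditiously' into a whole-word regex,
    where each black dot stands for exactly one letter."""
    return re.compile("^" + word.replace("●", "[a-z]") + "$")

# Hypothetical mini-lexicon standing in for corpus word frequencies.
lexicon = {"seditiously": 41, "sedulously": 7, "studiously": 12}

pattern = defect_to_pattern("s●ditiously")
candidates = [w for w in lexicon if pattern.match(w)]
# Only 'seditiously' fits the shape of the defect.
```

Because the transcribers recorded the exact number of missing letters, the pattern is tightly constrained, which is what makes automatic replacement feasible at all.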
The current experiment has been done with about 15,000 texts from the 17th century. As a first step, the program gathers frequency data about each word. It then looks at the context of the defective word, identifies other occurrences that match the regular expression, and makes probability judgements about the replacement word that most closely matches the context of the defective word. The program does not run on the normal CPU of a computer but on the GPU, or Graphics Processing Unit, that is responsible for image display. It turns out that these chips are particularly good at the kind of neural network analysis that thrives on fuzzy boundaries.
You don’t have to be a very learned person to figure out that ‘seditiously’ is the only possible replacement for ‘s●ditiously’, but getting a machine to do this is computationally expensive. A high-end desktop with the appropriate Nvidia GPU can process about a dozen texts per hour, so working through the entire TCP corpus of 60,000 texts would take on the order of 5,000 hours of computer time.
How well does it work? Look at the Google spreadsheet that reports the results for Englands complaint to Iesus Christ, against the bishops canons of the late sinfull synod, a seditious conuenticle, a packe of hypocrites, a sworne confederacy, a traiterous conspiracy, a 50-page pamphlet from 1640 (TCP A00011, ESTC S101178). 237 of its ~21,000 words are defective, which puts the text in the worst decile of the TCP archive. The program offers fixes for 143, or 60%, of the defective words, and it is clearly right in all but 13 cases.
In four of the 13 cases the machine finds the right match, but the match is itself a printer’s or transcriber’s error. The writer almost certainly did not intend the spellings ‘discipine’ (8), ‘ecclesiastcall’ (5), ‘seruple’ (~40), or ‘sludgates’ (1), but they occur in the corpus with the frequencies noted in parentheses.
In other cases, the program operates correctly, but the transcriber misdescribed the gap. Thus in the sequence
courage zeale magnanimity undaunted constant res●●tion to stand out in
a human reader will conclude quickly that the replacement must be ‘resolution’, but ‘resection’ occurs several times, and the machine is not smart enough to figure out that the replacement does not make sense semantically even though it meets the formal requirement. Something similar happens with
posture or placing therof as altee● both the nature and use
Here the correct reading is ‘alters’, but the double ‘ee’ is a transcriber’s error in a quite illegible word.
What about the defective words for which the program did not offer any solutions? There is probably room for tweaks that would catch what look like simple and common cases: ‘Apostl●s’, ‘he●nous’, or ‘Cathedra●’. Perhaps the machine can be taught to be more context-aware. The word ‘backsliding’ occurs several times in the text, and there are three missed instances. But in each of them it appears that the transcriber misdescribed the defect: ‘back●●tiding’, ‘backe●●ding’, ‘backes●ding’. Can a machine be made smart enough to ‘see’ that?
If the results from Englands Complaint hold true for the corpus at large, it appears likely that at least half of all defective words can be corrected algorithmically without human review. The machine will make some decisions that leave the defective word no worse off, and it will make some mistakes. In the source texts for this operation every word is wrapped in a <w> element with a lot of metadata. A corrected word carries a @type attribute with the value “machine-corrected”, creating opportunities for later review and for procedures that identify machine mistakes.
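The markup might look roughly like this. The @type value is the one named above; the other attributes and their values are purely illustrative, since the actual metadata on each <w> element is richer than shown here.

```xml
<!-- Before correction: the transcriber's defect marker is preserved -->
<w>s●ditiously</w>

<!-- After correction: the fix is flagged so a reviewer can find it later -->
<w type="machine-corrected">seditiously</w>
```

Keeping the flag in the text itself means the corrections never silently disappear into the corpus: a later pass can pull out every machine-corrected word for sampling or review.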
The procedures for fixing “known defects” can, with some modifications, be used to identify and fix the unknown number of unknown defects. Misprints will in most cases produce spellings that are unique or rare. So you can treat all rare or unique spellings as suspect and look for spellings that fit the context and are one or two “edit distances” away from the suspect spelling. The review of Englands complaint led to the identification of 58 occurrences of four highly suspect spellings. Extrapolate from that one case to 60,000 texts, and you see a lot of typographical cruft that would be amenable to machine correction with no or light human review.
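The “rare spelling near a common spelling” test is easy to sketch. The frequencies and thresholds below are invented for the example; the real decision would also weigh context, as described above.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance:
    minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

# Invented corpus frequencies, echoing the suspect spellings named earlier.
freq = {"discipline": 4200, "discipine": 8, "scruple": 900, "seruple": 40}

SUSPECT_MAX, COMMON_MIN = 50, 500   # thresholds are arbitrary for this sketch
suspects = {}
for word, n in freq.items():
    if n <= SUSPECT_MAX:
        suspects[word] = [w for w, m in freq.items()
                          if m >= COMMON_MIN and edit_distance(word, w) <= 2]
# 'discipine' pairs with 'discipline', 'seruple' with 'scruple'
```

The cheap part is generating the suspect list; the expensive part, as with the known defects, is deciding which nearby common spelling the context actually supports.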
Is this worth doing? You probably wouldn’t do it for last month’s Twitter. And many of the TCP texts don’t leave you with a high esteem of their writers or a desire to return to them for re-reading. The Thomason tracts, pamphlets from the English Civil War, are an important part of the TCP corpus. Thomas Carlyle called them a “hideous mass of rubbish,” but he added that they told you more than any other document about what Marx called the “struggles and wishes” of that age. For decades or centuries they will be our window to the past, and it is worthwhile keeping that window as clean as possible.