There are somewhere in the neighbourhood of five million incompletely transcribed words in the roughly two billion words of English books before 1700 transcribed by the Text Creation Partnership. Depending on how you look at it, that is either a lot or not very much at all. Less than half a percent of words are incompletely transcribed. From an NLP perspective, half a percent of defective words is hardly worth worrying about. From a readerly perspective, a book with a defective word every 200 words is pretty bad. Moreover, the defective words cluster heavily in about ten percent of the texts, and there are some texts where they seriously interfere with what readers or a machine can make of or do with the text. Scholars who complain about the TCP texts invariably mention the many gaps and demand that somebody fill them.

In the TCP archives, missing letters are meticulously marked in the transcriptions with “gap” elements, as in

lo<gap reason="illegible" extent="1 letter"/>e

When displayed over the web, the gap element is typically replaced by one or more black dots, depending on the value of its extent attribute. For that reason I call those words “blackdot” words. What percentage of blackdot words can be identified by a machine and conclusively mapped to the right solution? And what percentage of blackdot words can be conclusively mapped to the right solution by human readers who see the word in the center of a line of print with 35 characters on either side but no other information? If those percentages are sufficiently high, it is worthwhile building a collaborative curation platform around such an approach: the time cost of a typical editorial decision will be measured in seconds rather than minutes.
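Because both the place and the extent of each gap are recorded, a transcribed fragment can be turned mechanically into a search pattern. Here is a minimal sketch of that conversion, not the project's actual code; the element and attribute names come from the TCP texts, everything else is invented for illustration:

import re

def gap_to_pattern(fragment):
    # Turn 'lo<gap reason="illegible" extent="1 letter"/>e' into 'lo.e',
    # with one wildcard character per missing letter.
    def fill(match):
        letters = int(match.group(1).split()[0])   # '1 letter' -> 1
        return '.' * letters
    return re.sub(r'<gap[^>]*extent="([^"]*)"[^>]*/>', fill, fragment)

print(gap_to_pattern('lo<gap reason="illegible" extent="1 letter"/>e'))   # prints 'lo.e'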

Nine undergraduates and one graduate student from Northwestern’s Weinberg College of Arts and Sciences, its McCormick School of Engineering, and its School of Communication are currently seeking answers to those two questions in a summer project jointly funded by the three schools. Initial results are very promising: 80% of blackdot words can be completed through a combination of machine-learning techniques and machine-assisted editing routines that drastically reduce the human time cost of editorial decisions. The machine also relieves the human editors of logging the who, what, when, and where of editorial changes, chores that people are notoriously bad at. It is worth saying that missing words, paragraphs, and pages are not part of this project and are in principle beyond the powers of the machine. But blackdot words are by far the most common nuisance in the TCP texts, and fixing much of that nuisance is an entirely possible task.

Machine learning approaches are made possible by the precision with which the place and extent of lacunae or gaps are marked in the TCP source texts. What spellings begin with ‘lo’ and end with ‘e’ with no more than one letter in between? In books before 1700, be prepared to encounter ‘lobe’, ‘lode’, ‘loke’, ‘lome’, ‘lone’, ‘lore’, ‘lose’, ‘loue’, and ‘lowe’. The machine is given a Language Model, a program that makes probabilistic judgements. It knows about the words that precede and follow the blackdot word, and it ranks possible replacements according to frequency and context. If a replacement word occurs in the same work as part of the same trigram, it obviously has a stronger claim than a replacement word that occurs five times in different texts but never in the context of the blackdot word. But a spelling that occurs five times has a greater chance of being right than a spelling that occurs once in some other text. The particulars of the math are well beyond me, but the principles are quite intelligible to traditional philologists or judges who want to rule on a case as narrowly as possible.
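To make the principle concrete, here is a toy sketch of the ranking step under two simplifying assumptions that are mine, not the project's: the only evidence counted is a trigram match in the same work and a raw corpus frequency, and the example counts are invented.

import re
from collections import Counter

def rank_candidates(pattern, left, right, work_trigrams, corpus_counts):
    # pattern: wildcard form of the defective word, e.g. 'lo.e'
    # left, right: the words immediately before and after it
    # work_trigrams: Counter of (left, word, right) trigrams in the same work
    # corpus_counts: Counter of spelling frequencies in the training corpus
    rx = re.compile('^' + pattern + '$')
    candidates = [w for w in corpus_counts if rx.match(w)]
    # A trigram match in the same work trumps any raw frequency count;
    # frequency breaks ties among spellings never seen in this context.
    return sorted(candidates,
                  key=lambda w: (work_trigrams[(left, w, right)], corpus_counts[w]),
                  reverse=True)

corpus = Counter({'loue': 120, 'lose': 40, 'lore': 15, 'lone': 5, 'lobe': 1})
trigrams = Counter({('thy', 'loue', 'and'): 2})
print(rank_candidates('lo.e', 'thy', 'and', trigrams, corpus))   # 'loue' comes first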

How well does this work in practice? What follows is an interim report. We are working with a sample of 359 texts adding up to ~20 million word tokens, about 1 percent of the total EEBO-TCP word count. We used the texts with low defect rates as initial training data and created a data set in which each of the approximately 80,000 blackdot tokens becomes a data row like the following (slightly shortened and spread over two lines to accommodate WordPress):

 IDvalue  none other thing is here |sign●●●ed| by the altar then that
  signified |signifyed | n/a (MG-Y) (MG-Y)

The code MG-Y means that the machine found the spellings ‘signified’ and ‘signifyed’ in the middle of a trigram that begins with ‘here’ and ends with ‘by’ and occurs in the same work as the blackdot spelling. That is pretty compelling evidence.  The example also introduces a characteristic form of ambiguity. It is often easier to map the blackdot spelling to a standardized spelling than to a particular orthographic variant–about which more below.

The students are working through spreadsheets, judging the success or failure of the machine. This report is based on a sample of  6204 blackdot words found in 29 texts–an arbitrary sample of not quite ten percent and probably random enough for our purpose. 1752 of the tokens were in a foreign language, overwhelmingly Latin. I asked the students to ignore those tokens because none of them is a Latinist.  That leaves 4452 tokens, of which 3684 occur in the main text and 766 in notes.

The distinction between ‘main text’ and ‘paratext’ (notes) is important. Word tokens in <note> elements make up just 3.5% of all word tokens but account for 17% of all blackdot words. Thus the defect rate for notes is larger by a factor of five. There are several reasons for this. In the printed sources most of the notes are marginal notes in small print. The microfilms from which the texts were transcribed (via digital scans) often do a poor job on text in the margins. And leaving aside those accidents of physical transmission, notes are a genre in their own right, with many arcane abbreviations. While notes have a high error rate, their content is quite formulaic, with stable conventions across time, especially from the 1580s on. So there may be a way of “special casing” notes and getting a better sense of their quite limited lexical variance. If this were done, one would probably start from a corpus-wide list of citations and abbreviations, which would be a very useful project in its own right.
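As a crude illustration of what such special casing might look like, a first cut could simply flag citation-shaped strings in notes against a list of book abbreviations; the handful of abbreviations below is only a sample I have made up for the example, and a real list would have to be built from the corpus itself.

import re

# A tiny, illustrative sample of Biblical book abbreviations as they
# appear in early modern marginal citations; a real list would be much longer.
BOOKS = r'(Gen|Exod|Deut|Psal|Prou|Esa|Ier|Mat|Mar|Luk|Ioh|Rom|Cor|Gal|Heb|Reu)'
CITATION = re.compile(r'^' + BOOKS + r'\.?\s*\d+(\.\d+)?\.?$')

for token in ['Mat. 5.3', 'Rom. 8.28.', 'loue']:
    print(token, bool(CITATION.match(token)))   # True, True, False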

Performance on main text

Like Gaul (and much else), the output of the machine is usefully divided into three parts. The machine offers either just one choice, more than one choice, or no choice.

For 900/3684 main-text tokens (24.4%) the machine offers just one choice, and it is right in 858 cases (95.3%). Those 858 cases account for 23% of all corrections. Reviewing them is quite easy because there is only one thing to look at.

For 2424/3684 tokens (65.8%) the machine offers more than one choice, but we ignored everything beyond the third choice. One of the three choices is the only correct choice in 1434/2424 cases (59%), and those cases account for 39% of all corrections. This is a little harder because you have to look at two or three things and figure out which is right.

For 360/3684 tokens (9.8%) the machine offers no choice, but in 184 cases (51%) the human reviewers supplied the correct readings, which account for five percent of all corrections. The solutions in many cases are obvious, and their analysis leads to the question of how to get the machine to do what people find easy. But things that people find easy are often very hard for machines.

Where the machine offers more than one choice, it fails in 990/2424 cases (40.8%) to provide a single correct choice, but human reviewers found the right solution in 470 of those cases (47.5%), accounting for 12.8% of all corrections.

In summary, 2946/3684 blackdot words were corrected, for a success rate of 80%. There is considerable variance among the 30 texts. The median value is 80%, but 12 texts have a success rate of 92% or better, while six texts have a success rate of 50% or less, and three of them have success rates of 33%, 0%, and 0%.
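To make the arithmetic behind that 80% explicit, the four groups of successful corrections add up as follows; the numbers are simply those reported above.

main_tokens        = 3684   # English blackdot tokens in main text
single_choice_hits = 858    # machine offers one choice and it is right
multi_choice_hits  = 1434   # one of the machine's top three choices is right
no_choice_human    = 184    # machine offers nothing, reviewer supplies the reading
wrong_choice_human = 470    # machine's choices all wrong, reviewer supplies the reading

corrected = single_choice_hits + multi_choice_hits + no_choice_human + wrong_choice_human
print(corrected, round(100 * corrected / main_tokens, 1))   # 2946 80.0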

Where the machine fails

The machine may be said to fail where it either produces no correct choice at all or provides multiple choices that cannot be disambiguated within a context of seven words on either side. The second possibility divides into cases where the machine offers different words that might fit the context and cases where it offers allographs of the same word (‘signified’, ‘signifyed’). We have not kept a good enough count of the latter case, but it is a non-trivial problem, with a solution about which more below.

80 tokens (2.2%) consist of just one or two black dots. There is not very much the machine can do with a single dot token, which may be a letter, a number, a punctuation mark, or a stray mark on the page that the transcriber mistook for a letter.   Students who have collated transcriptions with their originals have reported many cases where the single dot does not represent a word token at all. And there is not much more the machine can do with a double dot token. These are cases where somebody has to look at the original or a good facsimile of it.

93 tokens (2.5%) consist of a black dot preceded or followed by a number. These tokens are nearly always part of bibliographical citations, and in the overwhelming majority of cases they are part of a Biblical citation. Unsurprisingly, the frequency of such cases is much higher in notes, where they account for 15% of blackdot tokens. These cannot be resolved by a machine. Somebody needs to look at the printed source, preferably somebody who knows the Bible by chapter and verse and has a good sense of what is possible or likely in a hard-to-read marginal note.

You would expect the performance of the machine to improve with the length of the blackdot word, and that is indeed the case. With very short words, the machine is sometimes misled in ways that can be corrected by tweaking the Language Model and its training data. Consider the three two-letter function words below and the weird replacements offered by the Language Model:

the pleasure of the almighties | ●f | it be ordained of God, that thou| kf|nf|wf
nswere for the soules of other men.|I●| then thou shalt bee so busied,| I0|I2|Iö
●●at beuty of thine from the body,|●r| from the soule● Not from the bo●y.| 3r|kr|tr

The Language Model here has picked up tokens from specialized vocabulary or symbols in some texts and wrongly applied them. There are probably ways of making the Language Model smarter about possible two-letter words in English. It does not help that the English of EEBO-TCP texts is riddled with words and passages in Latin and other languages. MorphAdorner identifies the language of a word token with an accuracy rate of around 95%. If that information were integrated into the Language Model, so that the machine knew when it is dealing with English and when it is not, it could be told that ‘or’ is the only English two-letter word that ends with ‘r’ (I’m not sure whether ‘er’ occurs as a spelling of ‘ere’, but if it does, it would be exceedingly rare).
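One cheap guard, sketched here on the assumption that MorphAdorner's language tag for each token can be passed along (that integration has not been done, and the list of short spellings is mine and merely illustrative), would be to whitelist plausible English one- and two-letter words before the Language Model ranks anything.

# Illustrative, not exhaustive: short English spellings one might allow.
SHORT_ENGLISH = {
    'a', 'i', 'o',
    'an', 'as', 'at', 'be', 'by', 'do', 'he', 'if', 'in', 'is', 'it',
    'me', 'my', 'no', 'of', 'on', 'or', 'so', 'to', 'vp', 'vs', 'we', 'ye',
}

def filter_short_candidates(candidates, language='eng'):
    # Drop implausible one- and two-letter replacements for English tokens;
    # leave tokens tagged as other languages alone.
    if language != 'eng':
        return candidates
    return [c for c in candidates if len(c) > 2 or c.lower() in SHORT_ENGLISH]

print(filter_short_candidates(['kf', 'nf', 'wf', 'of', 'if']))   # ['of', 'if']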

More about notes

I skip a more detailed discussion of notes, but will say that the machine does only half as well on paratext as on main text. Where the machine offers only one choice, it does as well on paratext as on main text, but in all other categories it does much worse, and the overall success rate is just 48%. Improvements in the machine’s performance here would depend largely on providing a lot of domain expertise.

Quite a few texts have no or very few notes. Of the 359 texts used in this project, a third have no notes at all. The interquartile range runs from 0 to 3% of word tokens, with 0.2% as the median. Two dozen texts account for more than half of the ~700,000 word tokens in notes.

What about Latin?

Latin accounts for 956/4640 or 20.6% of all blackdot tokens in the 359 texts, a figure that also includes very scattered and rare sprinklings of other languages. That strikes me as an outlier, but even if the percentage across the entire EEBO-TCP corpus were as low as 10%, it would pose a non-trivial challenge. Any attempt to deal with the curation of EEBO-TCP must come to terms with the ubiquity of Latin, and the many short interspersed passages may be harder to handle than longer passages, which are easier to tag and exclude.

I asked the students to ignore Latin  blackdot words and did not review them myself, except for the single choice returns, where the success rate is the same as for English main text and notes.  Reviewers with a reasonable command of Latin would probably do at least as well as reviewers of English blackdot words. There is very little orthographic variance in Early Modern Latin. The same word is mostly spelled in the same way, just as in modern languages, which makes life easier, both for the machine and the human reviewer.

How to improve performance

There are several ways of improving the performance of the Language Model. The current version, of whose shortcomings we are acutely aware, was thrown together quite quickly, with relatively little domain-specific modification. It has not yet gone through the half dozen iterations that incrementally improve performance in just about any project. But all things considered, it is already working well.

The current language model is semantically stupid and derives all its information from the length of the blackdot word, the distribution of black dots within the word, and the words that precede or follow. A neural network based language model–currently under development–would add some semantic data to the analysis. Given ‘●innow’ with ‘fish’ in the vicinity, it would pick ‘minnow’ rather than ‘winnow’. If you combine that with divide-and-conquer strategies for segmenting texts by period or genre, you can anticipate non-trivial improvements.
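A caricature of what the semantic component adds, with a hand-made co-occurrence table standing in for anything the neural network would actually learn (the names and numbers are invented):

import re

# Invented co-occurrence scores standing in for learned semantic associations.
COOCCURRENCE = {
    ('minnow', 'fish'): 8.0,
    ('winnow', 'fish'): 0.5,
    ('winnow', 'chaff'): 7.0,
}

def semantic_pick(pattern, context_words, candidates):
    # Choose the candidate whose association with the context words is strongest.
    rx = re.compile('^' + pattern + '$')
    scored = [(sum(COOCCURRENCE.get((c, w), 0.0) for w in context_words), c)
              for c in candidates if rx.match(c)]
    return max(scored)[1] if scored else None

print(semantic_pick('.innow', ['the', 'fish', 'swam'], ['minnow', 'winnow']))   # minnow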

The most significant improvement would come from moving the goal posts. Think of mapping words not to their spelling in the printed source but to standardized spellings–rather like the King James Bible in a format that is orthographically standardized but not syntactically or lexically modernized: ‘vniuersitie’ becomes ‘university’, ‘coniure’ ‘conjure’, ‘bedde’ ‘bed’, etc. Those transformations are done in a preprocessing stage. In such an environment you need not decide whether ‘imploy●●’ should be mapped to ‘imployde’, ‘imploy’d’, or ‘imployed’: the machine encounters ‘employ●●’ and maps it unambiguously to ‘employed’. Phil Burns’ MorphAdorner, with some corrections currently under way, can perform this preprocessing with considerable accuracy.
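Here is the gist of that move in a few lines, assuming a spelling standardizer along the lines of MorphAdorner is available as a black box; the little lookup table is invented for the example. Match the blackdot pattern against attested spellings, but collapse the matches to their standardized forms before counting choices.

import re

# Invented miniature spelling table; MorphAdorner would supply the real mapping.
STANDARD = {
    'imployde': 'employed', "imploy'd": 'employed', 'imployed': 'employed',
    'payntes': 'paints', 'paintes': 'paints',
    'signified': 'signified', 'signifyed': 'signified',
}

def standardized_choices(pattern, attested_spellings):
    # Return the distinct standardized words that the matching spellings resolve to.
    rx = re.compile('^' + pattern + '$')
    return {STANDARD.get(s, s) for s in attested_spellings if rx.match(s)}

# Ambiguous as a spelling, unambiguous as a word:
print(standardized_choices('imploy..', STANDARD))   # {'employed'}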

Consider the case of ‘pay●tes’ in its context:

d science: to be briefe, euerye one pay●tes it out in his colours, as it please

The current language model offers no choice because it did not find the spelling ‘payntes’ in its training data and has no way of knowing that it is a form of ‘paint’. Human readers get this right when they see the token in the context of ‘colours’–and so might a neural network language model. But if you work with standardized spellings, and the language model “knows” about ‘y’ and ‘i’ and the dropped ‘e’ in the plural suffix, it should also recognize that the proper solution is ‘paints’.

I know colleagues who would shudder at this suggestion, and I know others who would cheer. I think more would cheer than shudder. But it is not really an either/or proposition: it is a matter of creating a version of the corpus in which uncertainty is removed by reducing orthographic variance, of doing better by claiming less. And you can always backtrack from the standardized ‘paints’ and restore the original ‘payntes’. In the meantime the standardized text is easier to read for many readers, and it is certainly more algorithmically amenable.

There are problems with such an approach: what about ‘deere’, ‘hart’, ‘loose’, ‘ile’, ‘bore’, ‘wee’, ‘bee’, and several hundred other trick spellings? There are also ways of dealing with them, and MorphAdorner has been pretty good at them.

Can you turn this into a service that supports collaborative curation?

By the end of this project in mid-August I hope that we will have fixed all or most of the fixable blackdots in the sample corpus, with particular emphasis on getting the main text right. Corrections will be automatically reintroduced into the texts, which will then form a larger and more accurate set of training data that can be used for fixing blackdot words in other texts.

The computer science students in the project are currently working on a prototype of a Django site that will go up on pythonanywhere (which runs on Amazon web servers). The site borrows a good deal from the collaborative curation tool AnnoLex. It will mediate the output of the Language Model and allow users anywhere and at any time to fix those blackdot words that can be fixed by just looking at the few words around them.
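To give a sense of what such a site has to keep track of, a single Django model along the following lines would be nearly enough; this is a hypothetical sketch, not the students' actual code.

from django.conf import settings
from django.db import models

class Correction(models.Model):
    # One reviewed blackdot token and the who, what, when, and where of the decision.
    text_id = models.CharField(max_length=32)          # TCP identifier of the text
    token_id = models.CharField(max_length=64)         # address of the token within the text
    defective_form = models.CharField(max_length=64)   # e.g. 'sign●●●ed'
    machine_choices = models.TextField(blank=True)     # choices offered by the Language Model
    accepted_form = models.CharField(max_length=64)    # the reviewer's decision
    reviewer = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT)
    reviewed_at = models.DateTimeField(auto_now_add=True)
    approved = models.BooleanField(default=False)      # set by a user with approval privileges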

So I could imagine the following scenario: you are an undergraduate working on an honors project, a graduate student working on a dissertation, or a faculty member writing a book or article. A small subset of EEBO-TCP texts, half a dozen and hardly ever more than 100, forms the data for your project. You want to clean them up as you work with them. You don’t really want to quote passages with blackdot words in them, and you would probably fix them anyhow when you quote them. But if you do that, you might as well make the emendation available to others.

There is a service that lets you do just that. You give it the references to the EEBO-TCP texts you want to work with. The service runs those texts through the Language Model and puts the results on the site on pythonanywhere (or wherever it will be hosted). You fix all the blackdot words that you can. The service then returns to you a version of the text with the corrections incorporated (and carefully logged as yours). If you are a user with the privilege of approving the corrections of others, the service incorporates your changes into a standard EEBO-TCP repository. If you are lucky, you can fix all the blackdot words that way. But in any event you will incur the much larger time cost of collating the transcription with the printed source only for those blackdot words (or longer lacunae) that actually require it. If you are working on a critical edition, you will of course want to compare every transcribed word with every printed word. But that is a very special use case, and even it can benefit from a service of this kind.

In this scenario the corpus is improved one text at a time and over a period of years. User-driven curation has the great advantage that the texts that get fixed first are the texts that somebody actually wants to work with.

It would take some work to turn this workflow into a user-friendly and bullet-proof sequence of steps, but all of the steps have in fact been successfully tested and used in the collaborative curation of the Shakespeare His Contemporaries (SHC) project.

The team

David Demeter, a graduate student of Douglas Downey’s in the Computer Science Department, has built the Language Model, adapting software used to process the Wall Street Journal to the vagaries of Early Modern English. Austin Bertram Chambers, Vickie Li, and Yue Hu (Hayley), freshmen in Northwestern’s McCormick School of Engineering, are building the collaboration tool. Kimani Emanuel, Sally Moore Hausken, Anelia Kudin, and Katherine Elizabeth Poland are freshmen or sophomores in the Weinberg College of Arts and Sciences. Irina Ruth Huang is a sophomore in the School of Communication. The six of them are checking and correcting the output of David Demeter’s Language Model while reflecting on the ways in which it goes wrong and on how to make it work better.