Hannah Bredar, Madeline Burg, Melina Yeh, and Nayoon Ahn have been at work for four weeks in their clean-up operation of the Early Modern plays in the TCP archive. Nicole Sheriko helped them in the first week and has since then focused on preparing a Young Scholar Edition of Fair Em.
The clean-up operation proceeds on the assumption that a high percentage of errors in the plays can be identified and fixed by submitting each text to two routines:
- Look for places where a word or letters in a word are marked as missing
- Review all spellings that occur only in that text
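The two routines can be sketched in a few lines of Python. This is an illustration only, not how AnnoLex or MorphAdorner do it: the gap marker `●`, the function names, and the toy two-play corpus are all assumptions made for the example.

```python
from collections import Counter

# Assumption: a missing letter is marked with this character inside the token.
GAP_MARK = "●"

def tokens_with_gaps(tokens):
    """Routine 1: return the tokens that contain the gap marker."""
    return [t for t in tokens if GAP_MARK in t]

def spellings_unique_to(text_id, corpus):
    """Routine 2: return spellings found in text_id and in no other text."""
    # Count, for each spelling, the number of distinct texts it appears in.
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))
    return sorted(s for s in set(corpus[text_id]) if doc_freq[s] == 1)

# Hypothetical toy corpus: text id -> list of spellings.
corpus = {
    "play_a": ["the", "kng", "and", "the", "qu●ene"],
    "play_b": ["the", "king", "and", "queene"],
}
gaps = tokens_with_gaps(corpus["play_a"])
unique = spellings_unique_to("play_a", corpus)
```

A spelling that occurs in only one text is not necessarily wrong (it may be a proper name or a rare word), which is why the second routine is a review list for a human curator rather than an automatic fix.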
The broom or rake for this clean-up is AnnoLex, a tool that was built by Craig Berry and takes as its input a tabular version of the TCP texts linguistically annotated with MorphAdorner. AnnoLex logs every correction as a discrete event, noting who did what, at what location in the text, and when. Corrections are subject to review (at the moment I am the only reviewer). MorphAdorner has scripts that allow for the subsequent integration of approved corrections into the source texts. The amended source file is associated with a change log that keeps a record of all curatorial events.
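A correction logged as a discrete event might look roughly like the record below. The field names, the token identifier format, and the status values are hypothetical and chosen for illustration; they are not AnnoLex's actual schema.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone

@dataclass(frozen=True)
class Correction:
    """One curatorial event: who changed what, where, and when."""
    curator: str      # who
    text_id: str      # where: which text
    token_id: str     # where: which token in that text (hypothetical id format)
    old: str          # what: the reading before the correction
    new: str          # what: the proposed reading
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "pending"   # assumed values: pending, approved, rejected, held

def approved_changes(log):
    """Return (token_id, new reading) pairs that are ready to be merged."""
    return [(c.token_id, c.new) for c in log if c.status == "approved"]

log = [Correction("curator1", "play_a", "play_a-0005", "qu●ene", "queene")]
# Review produces a new record rather than overwriting the original event.
reviewed = [replace(log[0], status="approved")]
merged = approved_changes(reviewed)
```

Keeping each event immutable and recording the old reading alongside the new one is what makes a change log possible: any approved correction can be traced back or undone.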
There are 631 plays to be cleaned up, including
- 31 plays before 1576
- 488 plays from 1576-1642
- 112 plays from 1643-1660
The dates refer to the likely date of creation rather than the publication date.
The first priority has been to clean up the 488 plays from 1576-1642. As of this morning the gang of four curators have worked their way through 336, or roughly two thirds, of those 488 plays. They have also done 88 of the 143 plays in the early and late categories. This is quite impressive, but alas, it does not mean that they have done two thirds of the work. They have gone through the plays in ascending order of error count. Thus the 336 cleaned-up plays from 1576-1642 had ~11,500 errors, while the remaining 152 plays have ~32,800 known errors, and the twenty “dirtiest” plays alone have ~11,800 known errors.
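The gap between plays done and work done is easy to make concrete. The figures below are the ones given above; the only assumption is that the twenty dirtiest plays are all among the 152 remaining ones, which follows from working in ascending order of errors.

```python
plays_total = 488
plays_done = 336
errors_done = 11_500     # approximate errors in the 336 finished plays
errors_left = 32_800     # approximate known errors in the 152 remaining plays
errors_top20 = 11_800    # approximate known errors in the twenty dirtiest plays

errors_total = errors_done + errors_left

share_plays = plays_done / plays_total            # share of plays finished
share_errors = errors_done / errors_total         # share of known errors handled
share_top20_of_left = errors_top20 / errors_left  # dirtiest plays' share of what remains

print(f"plays finished:           {share_plays:.0%}")
print(f"known errors handled:     {share_errors:.0%}")
print(f"top 20 share of the rest: {share_top20_of_left:.0%}")
```

In other words, about 69% of the plays account for only about 26% of the known errors, and the twenty dirtiest plays hold about 36% of what is left.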
So a lot of work remains to be done, but the curators have gained a lot of experience that will help them tackle the harder plays. They probably will not get through all of them, and for some of them (e.g. Massinger’s Old Law), the page images are so poor that it may be simpler to transcribe the play afresh than to correct what is there. But consider the likely outcome of their work. There will be cleaned-up versions of at least
- 450 plays from 1576-1642
- Two dozen plays from before 1576
- 80 plays from 1643-1660
A cleaned-up version of a TCP text is not the same as a thoroughly proof-read version, and in the cases where the TCP text had few errors to begin with, the benefits may be minor. But in the aggregate the effect is substantial: 90% of the plays will be available at a quality level that is good enough for many purposes and will not differ much across the plays.
Nicole has just completed a thorough proofreading of the play she is editing. Fair Em had an initial error rate of 18 missing words or letters per 10,000 words. It is one of forty plays in the middle of the pack with error rates between 15 and 20 per 10,000 words. After going through the two clean-up routines, she found very few additional errors. The four curators will each do a thorough proofreading of at least one play, so that we will have a somewhat clearer sense of the degree of improvement added by proofreading. My hunch is that it will be relatively modest, but it will be good to have more evidence.
What about the “broom” used in the clean-up? AnnoLex is a very simple tool and was built on a shoestring budget, but it works very well for simple “update” operations where you change something in a single token. You can also “insert” or “delete” tokens, which is necessary when you split wrongly joined words or join wrongly split ones. From the curator’s perspective, these are single operations, but for the machine they are multi-step operations, and it is easy to make mistakes. So the next version of AnnoLex should have a better way of supporting split and join operations.
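One way to make split and join less error-prone is to wrap the underlying update, insert, and delete steps in a single operation each. The sketch below works on a plain list of spellings; the function names are hypothetical, and a real implementation would also have to carry token identifiers and annotations along.

```python
def split_token(tokens, index, at):
    """Split tokens[index] into two tokens at character offset `at`."""
    word = tokens[index]
    if not 0 < at < len(word):
        raise ValueError("split offset must fall inside the word")
    # Under the hood: update one token and insert another, done as one step.
    return tokens[:index] + [word[:at], word[at:]] + tokens[index + 1:]

def join_tokens(tokens, index):
    """Join tokens[index] with the token that follows it."""
    if not 0 <= index < len(tokens) - 1:
        raise ValueError("no following token to join with")
    # Under the hood: update one token and delete its neighbour, done as one step.
    return tokens[:index] + [tokens[index] + tokens[index + 1]] + tokens[index + 2:]

split_result = split_token(["ofthe", "king"], 0, 2)
join_result = join_tokens(["to", "gether", "now"], 0)
```

Both functions return a new list and leave the input untouched, so a failed or rejected operation never leaves the token sequence in a half-finished state.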
If you look at the “Review” interface of AnnoLex, you will find a rather cluttered back-office interface with clear limitations. A reviewer can approve, reject, or hold a correction for further review. Ideally the reviewer should also be able to correct a faulty correction, and improvements in the display would cut down on the time cost of reviewing corrections. But AnnoLex was built with very little money, and most of the time went into making the interface for the initial act of correction as efficient as possible.
When the four curators are gone and I have reviewed all their corrections, there are likely to remain several thousand cruxes: words that are missing or so garbled that we were not able to make sense of them. But there will be people all over the world who have just the right kind of knowledge to fix this or that particular error. It will be easy to create an AnnoLex site that is populated with just the residual cruxes, and I hope that folks with the right kind of knowledge will whittle away at that list, so that a year from now the remaining cruxes are counted in the dozens or low hundreds.