This is a report about four books by Thomas Hobbes that were transcribed by EEBO-TCP, have been linguistically annotated with MorphAdorner, are part of the EarlyPrint project, and will serve as the basis for an edition by a research group at the University of Lyon. The texts were published in 1651 and 1652, and they add up to ~450,000 words. I decided to review the linguistic annotation partly because these are high-value texts and partly to get a sense of how well MorphAdorner works for texts from a period when orthographic rules had reached approximately their modern form.
The four texts are Philosophicall rudiments concerning government and society, Leviathan, and two parts of a work that goes by the names of De corpore politico and Human nature. As EEBO-TCP texts go, the transcriptions are somewhat below average. The median rate of transcriptional defects for texts after 1640 is about one defect per 1,000 words. Across the four texts the defect rate is about 3 per 1,000 words. A defect rate on this scale is too low to make much difference to the accuracy of linguistic annotation, where you would be happy to keep the error rate below 3 per 100 words of running text.
How I did it
I filtered out “grammatical” words that can be defined unambiguously without reference to context. That reduced the number of tokens to be reviewed from 450,000 to 166,000. It tells you something about the world of Hobbes that the ten most common words are ‘men’ (1770), ‘power’ (1364), ‘law’ (1326), ‘nature’ (1005), ‘every’ (835), ‘own’ (700), ‘things’ (691), ‘laws’ (651), ‘reason’ (646), ‘another’ (638).
I based my review on a sample of ~4,500 tokens (3%) whose type occurred at least five times, and I looked at each of the 19,000 tokens whose type occurred fewer than five times. For the purpose of this review a “type” is a lexical item in a grammatical state, defined as the combination of a lemma with a part-of-speech tag and a spelling.
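This two-tier review scheme can be sketched in Python. The sketch assumes tokens are dicts with ‘lemma’, ‘pos’, and ‘spelling’ keys; the field names and function name are my own invention, not part of the MorphAdorner or EarlyPrint tooling.

```python
import random

def partition_for_review(tokens, min_freq=5, sample_rate=0.03, seed=42):
    """Split tokens into (a) every token of a rare type and (b) a random
    sample of tokens of common types, mirroring the review scheme above.
    A "type" is a (lemma, pos, spelling) triple."""
    counts = {}
    for tok in tokens:
        key = (tok["lemma"], tok["pos"], tok["spelling"])
        counts[key] = counts.get(key, 0) + 1
    rng = random.Random(seed)
    review_all, sampled = [], []
    for tok in tokens:
        key = (tok["lemma"], tok["pos"], tok["spelling"])
        if counts[key] < min_freq:
            review_all.append(tok)        # inspect every token of a rare type
        elif rng.random() < sample_rate:  # ~3% sample of the rest
            sampled.append(tok)
    return review_all, sampled
```

On the Hobbes numbers, the first list would hold the 19,000 rare-type tokens and the second the ~4,500-token sample.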
I did the review in an environment that displayed each token as a keyword in context together with its linguistic metadata and the frequency of each type. The data were kept in a Postgres database mediated through a GUI environment (Aqua Data Studio) with data entry fields for annotation. You may think of this as a philological version of a “shopping cart” on the Web. In nearly all cases this environment provides enough data to check the accuracy of the metadata for each spelling in the text.
The combination of quite elementary SQL commands, regular expressions, and the excellent string functions of Postgres allows for grouping and sorting tokens in various ways. Spellings are much more highly structured at their ends than at their beginnings. The Postgres “reverse()” function turns a word list into a rhyming dictionary. Unlike the Penn Treebank tag set, the tag set for EarlyPrint texts distinguishes between the infinitive (‘vvi’) and present tense (‘vvb’) of a verb. If you select tokens with the pos tags ‘vvb’ or ‘vvi’ and sort the results by “reverse(kwicl)” you are sorting the results by the previous word. That order will quickly draw your attention to errors like ‘they’ or ‘that’ followed by a verb tagged as ‘vvi’, or a verb tagged as ‘vvb’ preceded by ‘to’, ‘did’ or ‘may’. (MorphAdorner has a more sophisticated output in which the KWIC display is combined with a tag trigram that makes it easy to retrieve clearly wrong sequences like “personal pronoun – infinitive” or “modal verb – present tense”.)
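The two tricks in that paragraph — the reverse() sort and the scan for impossible tag sequences — can be sketched in a few lines of Python. The token fields and the exact sets of trigger words are illustrative assumptions, not a reproduction of the actual SQL used:

```python
def rhyme_sort(spellings):
    """Sort a word list by reversed spelling -- the Postgres reverse()
    trick, which clusters words by their endings."""
    return sorted(spellings, key=lambda w: w[::-1])

def suspect_verb_tags(tokens):
    """Flag sequences a tagger should not produce: a subject pronoun or
    'that' directly before an infinitive ('vvi'), or 'to'/'did'/'may'
    directly before a present-tense form ('vvb')."""
    flags = []
    for prev, cur in zip(tokens, tokens[1:]):
        if cur["pos"] == "vvi" and prev["spelling"].lower() in {"they", "that"}:
            flags.append((prev["spelling"], cur["spelling"]))
        if cur["pos"] == "vvb" and prev["spelling"].lower() in {"to", "did", "may"}:
            flags.append((prev["spelling"], cur["spelling"]))
    return flags
```

The reverse sort is also what makes ‘-es’ endings, participial suffixes, and the like jump out when scrolling a long word list.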
The error rate in the sample of higher-frequency tokens: 116 per 10,000 words
In the 3% sample of types with a frequency of 5 or more the error rate was 116 per 10,000 words, a little more than 1%. Tagging errors come in different shapes, some more serious than others. Mistaking a noun for a verb (or the other way round) is serious, but in English it is an easy mistake to make because many nouns and verbs are morphologically identical. There are six cases (‘care’, ‘desire’, ‘need’, ‘study’, ‘walk’, ‘want’) where a noun is misclassified as a verb. At the other end of the scale there are morphologically identical participial forms (‘-ing’, ‘-ed’) that function as adjectives, nouns, or verbs, with edge cases that are not easy to tell apart. In nine cases (~10% of all participial forms) MorphAdorner gets the morphology right but does not assign it to the proper syntactic function. As errors go these are not hard to live with.
The past tense (‘vvd’) and the past participle (‘vvn’, ‘loved’) are also morphologically identical except for the limited number of “strong” verbs where an older ‘-n’ form survives. In 8 out of 323 tokens classified as the ‘-ed’ form of a verb, the assignment to the past tense or past participle is incorrect.
The spelling ‘ought’ in Hobbes can be either an indefinite pronoun or a modal verb. In 4 of 8 occurrences in the random sample ‘ought’ is wrongly tagged as an indefinite pronoun (‘pi’). If you look at the text as a whole there are 214 occurrences, of which 57 are tagged as ‘pi’ and 157 are tagged as a modal verb (‘vmd’). All of the ‘vmd’ assignments, but only 40% of the ‘pi’ assignments, are correct. ‘ought’ and ‘aught’ are difficult words to disambiguate, but in this case MorphAdorner often misses the tell-tale signal of ‘to’ as the next word. This analysis of a small corpus suggests the utility of checking the entire corpus for cases where ‘ought’ tagged as ‘pi’ is followed by ‘to’. If in all those cases you change the assignment to ‘vmd’ you will fix many errors at a very low cost of making new ones. This is one of quite a few heuristics that let you catch and correct tagging errors by means of semi-automatic forms of post-processing. Where do you hit the point of diminishing returns in the EarlyPrint corpus of 1.6 billion words? “1,000 errors” sounds like a lot, but a word that occurs 1,000 times in EarlyPrint has a relative frequency about half that of a word that occurs once in Shakespeare or the Bible. Those are humbling figures.
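The ‘ought to’ heuristic is simple enough to express as a one-pass batch correction. This is a minimal Python sketch under the same token-dict assumption as before; the ‘pc-acp’ tag for infinitival ‘to’ is illustrative:

```python
def fix_ought(tokens):
    """Retag 'ought' from indefinite pronoun ('pi') to modal verb ('vmd')
    whenever the next word is 'to' -- the tell-tale signal noted above.
    Returns the number of corrections made."""
    fixed = 0
    for cur, nxt in zip(tokens, tokens[1:]):
        if (cur["spelling"].lower() == "ought" and cur["pos"] == "pi"
                and nxt["spelling"].lower() == "to"):
            cur["pos"] = "vmd"
            fixed += 1
    return fixed
```

Note that the rule deliberately leaves ‘ought’ alone when ‘to’ does not follow, so genuine pronoun uses survive untouched.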
The error rate in low-frequency tokens: 500 per 10,000 words
The distribution of linguistic phenomena obeys a power law. If you think of words as landowners and of occurrences as pieces of property, ‘the’ is the richest person in the House of English. In the Hobbes corpus 12,000 combinations of a lemma, pos-tag, and spelling with frequencies of less than five account for ~75% of distinct types but share only ~4% of the “token wealth”. Broadly speaking, linguistic annotation is quite easy for the first 90% of word occurrences, which are made up largely of common words, many of which can be tagged without reference to context (‘city’, ‘determines’). It becomes progressively more difficult for the final 10% of word occurrences, which are made up of increasingly rare words, hapax legomena, homographs, and typographical errors. The game of annotation is won or lost on the field of the last 5%.
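The type-share versus token-wealth split is a two-line computation once you have a frequency list. A minimal sketch (the function name is mine):

```python
def token_wealth(freqs, cutoff=5):
    """Given a list of type frequencies, report what share of distinct
    types the rare types (freq < cutoff) make up, and what share of all
    tokens ("token wealth") those rare types hold."""
    rare = [f for f in freqs if f < cutoff]
    type_share = len(rare) / len(freqs)
    token_share = sum(rare) / sum(freqs)
    return type_share, token_share
```

Run over the Hobbes frequency table, this is the calculation behind the ~75% of types versus ~4% of tokens figure above.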
The 19,000 word occurrences I reviewed add up to roughly 5% of the Hobbes corpus. After excluding spellings that were variously garbled by either the printers or the transcribers, there remain about 18,000 word occurrences that one should be able to associate with a lexical item in some grammatical state. In 850 tokens the part-of-speech tag was wrong, for an error rate of 4.7%. So the error rate for the last 5% is higher by half an order of magnitude than the error rate for the first 95%.
Errors are not evenly distributed. Where their rate is conspicuously higher than the average, you can look for specific causes and aim at reducing if not eliminating errors. Below is a table that summarizes the error rates for the twelve most common tags. It shows you “false positives”, i.e. words tagged as a singular noun when they are something else, and “false negatives”, i.e. words that should have been tagged as singular nouns, but were tagged as something else. In philology as in medicine, false negatives are usually harder to deal with than false positives.
tag | tag count | false positives | false negatives
There is a simple explanation for the striking outlier of false positives for names (‘nn1’). Almost half of them are capitalized singular nouns, followed by capitalized adjectives, Latin words, and plural names ending in ‘-es’, a very tricky suffix that often ends a name (‘Artamenes’, ‘Xerxes’). Note that for ‘n1’ false negatives exceed false positives by a factor of three. Most of them are nouns falsely classified as names.
For the other outlier, present participles (‘vvg’), false positives are also very high, but false negatives are very low. The tagger misses very few cases where the participle is a verb, but it misclassifies many nominal or adjectival cases. This type of error is very common across the entire corpus. If you care about it enough, you can easily gather syntactic trigrams or pentagrams with the participial form in the middle and perform batch corrections on impossible or suspect sequences. This will cause some new errors along the way but fix many more old ones.
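Gathering those participle-centered trigrams is the same pattern as the earlier heuristics. A minimal sketch, again assuming tokens are dicts with a ‘pos’ key (the ‘dt’ determiner tag in the test is illustrative):

```python
from collections import Counter

def vvg_trigrams(tokens):
    """Collect pos-tag trigrams with a present participle ('vvg') in the
    middle, so that suspect contexts -- e.g. 'vvg' directly after a
    determiner, where a nominal or adjectival reading is likelier --
    can be reviewed and batch-corrected in bulk."""
    grams = Counter()
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if cur["pos"] == "vvg":
            grams[(prev["pos"], cur["pos"], nxt["pos"])] += 1
    return grams
```

Sorting the resulting counter by frequency surfaces the most profitable sequences to review first, which is where the batch-correction payoff lies.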
The error rate for the third person singular is very low. This may be a function of the fact that Hobbes uses the ‘-th’ suffix more often than the ‘-s’ suffix, which limits the opportunities for confusing the plural ‘-s’ with the third person singular ‘-s’.
The table of false positives and negatives did not surprise me, except for the high percentage of wrongly classified nouns. But this is a shortcoming that is relatively easy to fix. I make no claim for this set of texts as being representative of mid-17th-century texts, and I would expect that the error rates for other texts will vary somewhat, but from my general sense of working with the data for a long time I would be surprised by texts that vary a lot. I corrected the tagging errors as I came across them. It was less than a day’s work, and these texts should now be in pretty good shape. If you care enough about a particular text to get the error rate below 1%, it is not hard to do.
I performed my analysis and corrections on a Mac mini with 64 GB of memory, a six-core processor, and a 2 TB hard drive. But you could do it just as fast on a laptop with 16 GB of memory and a 500 GB drive. A wide screen is a great timesaver for this kind of work. It would be nice to replicate this environment on the Web, but a multi-user environment would require more robust hardware than humanities projects can usually afford.
I have not yet done an analysis of the entire EarlyPrint corpus, but I will. It will be based on a sample of around five million tokens, and I will try to analyze error rates by period. I did a cursory analysis of texts after 1630 some time ago. The overall results did not differ much from the Hobbes data. As you move back in time results become more problematic, and for texts before 1550 you would need a lot of post-processing and attention to individual texts if you wanted to drive error rates down to the level of the Hobbes texts.