Here is a link to a spreadsheet of the SHC corpus. The plays are ordered by known error rate, from low to high. For each play, the spreadsheet shows:
- its filename, derived from the filename in the TCP collection
- the author
- the title
- the date, which is the best estimate of creation rather than publication
- whether the play has a cast list
- the token count, including punctuation marks (which make up about 15% of the tokens)
- an estimate of missing or incomplete words
- the error rate per 10,000 words
As said before, we will pick the low-hanging fruit first and work our way up from plays with few errors to plays with many.
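For readers who want to reproduce the ordering, here is a minimal sketch of how an error rate per 10,000 words could be computed and used to sort the plays. It is an illustration, not the method actually used to build the spreadsheet. The `Play` class, the field names, and the sample figures are all hypothetical, and the sketch assumes that the word count is the token count minus a flat 15% for punctuation.

```python
from dataclasses import dataclass

# Assumption: punctuation is a flat 15% of all tokens ("about 15%" in the text).
PUNCTUATION_SHARE = 0.15


@dataclass
class Play:
    """Hypothetical record holding three of the spreadsheet columns."""
    filename: str
    tokens: int           # token count, punctuation included
    defective_words: int  # estimate of missing or incomplete words


def error_rate_per_10k(play: Play) -> float:
    """Defective words per 10,000 words, with punctuation tokens excluded."""
    if play.tokens <= 0:
        raise ValueError("token count must be positive")
    words = play.tokens * (1 - PUNCTUATION_SHARE)
    return 10_000 * play.defective_words / words


# Invented sample data, not taken from the SHC spreadsheet.
plays = [
    Play("play_a", tokens=20_000, defective_words=170),
    Play("play_b", tokens=25_000, defective_words=17),
    Play("play_c", tokens=10_000, defective_words=340),
]

# Lowest error rate first: the order in which plays would be curated.
ordered = sorted(plays, key=error_rate_per_10k)

for play in ordered:
    print(f"{play.filename}: {error_rate_per_10k(play):.1f} per 10,000 words")
```

If the spreadsheet's rate is in fact based on the raw token count, setting `PUNCTUATION_SHARE` to 0 gives that variant.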
This is as good a place as any to acknowledge the work of two friends and colleagues without whom this project would never have got off the ground or anywhere: Craig Berry and Phil Burns. Phil wrote the MorphAdorner tool suite that created the data derivatives on which this project is based. Craig designed the AnnoLex web site that serves as the curation tool for SHC and will, I hope, have a bright future in the curation of other texts. Phil’s and Craig’s work over the past year has been supported by the Andrew W. Mellon Foundation, the Center for Library Initiatives at the CIC, the Ford Motor Company Center for Global Citizenship at Northwestern’s Kellogg School of Management, as well as ProQuest and Northwestern’s University Library and its Academic Research Technologies group. I am deeply grateful to all of them.