Have you ever thought about the mdash, the long dash, \u2014 in Unicode parlance or &mdash; in the parlance of character entities? The odds are that you have not. I certainly have not thought much about it, but it tripped me up this morning in the EEBO-MorphAdorner project that Phil Burns and I are engaged in. In modern print culture, the mdash is so firmly established as a marker of word boundaries that you can dispense with spaces when using it. But what happens to tokenization when you come across such things as the following phrase in a work by Defoe:

my L&mdash;d M&mdash;&s;s of <HI>H&mdash;n</HI>’s

Here the mdash does not mark a word boundary but designates letters deliberately left out.  How do you tell the stupid machine that this phrase consists of five rather than eleven tokens? It is trickier than you think. You can start by telling the machine that an mdash marks a word boundary unless it is

a) preceded by a capital that is preceded by a word boundary marker and
b) followed by one or at most two lower-case characters before the next word boundary marker.

Then you keep your fingers crossed that this will cover enough of the cases and that the residue will fall within an acceptable margin of error for machine-based tokenization.
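
The rule in (a) and (b) can be sketched in a few lines of Python. This is only an illustration, not the MorphAdorner implementation: it assumes that tags have been stripped and entities resolved to Unicode (so &mdash; is \u2014 and &s; is the long s, \u017f), that anything other than a letter counts as a word boundary marker, and that a trailing apostrophe plus letters stays attached to its word. The names tokenize and keep_elisions are made up for the example.

```python
import re

MDASH = "\u2014"
LETTERS = "A-Za-z\u017f"  # ASCII letters plus long s (an assumption)
LOWER = "a-z\u017f"

# An ordinary word, with an optional possessive such as 's attached.
WORD = rf"[{LETTERS}]+(?:['\u2019][a-z]+)?"

# Rule (a): a single capital that follows a word boundary, then the mdash.
# Rule (b): one or two lower-case letters, then the next word boundary.
ELIDED = (
    rf"(?<![{LETTERS}])[A-Z]{MDASH}[{LOWER}]{{1,2}}(?![{LETTERS}])"
    rf"(?:['\u2019][a-z]+)?"
)

# Any other single non-space character (punctuation, digits).
OTHER = rf"[^\s{LETTERS}{MDASH}]"

RULE_BASED = re.compile(f"{ELIDED}|{WORD}|{MDASH}|{OTHER}")
NAIVE = re.compile(f"{WORD}|{MDASH}|{OTHER}")


def tokenize(text, keep_elisions=True):
    """Split text into tokens; the mdash is a boundary unless the rule applies."""
    pattern = RULE_BASED if keep_elisions else NAIVE
    return pattern.findall(text)


# The Defoe phrase, with entities resolved and the <HI> tags removed.
phrase = "my L\u2014d M\u2014\u017fs of H\u2014n\u2019s"
print(len(tokenize(phrase)))                       # 5
print(len(tokenize(phrase, keep_elisions=False)))  # 11
```

Under these assumptions the sketch also shows where the residue comes from: a string like I\u2014am satisfies both conditions and would be kept together as if it were an elided name.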

That’s what I did, and I am still keeping my fingers crossed. But along the way I discovered some interesting fact(oid)s about mdashes. To be precise, about mdashes as they appear in the TCP transcriptions. I take it on faith that mdashes in a TCP text represent printed mdashes sufficiently often to base plausible arguments on their distribution.

There are very few mdashes before 1590: they appear in just 11, or 0.3%, of the 3,342 texts published before that date. That makes you wonder whether those rare mdashes really are mdashes in our sense. And a look at some page images suggests that they are not, or only dubiously so. There are horizontal lines (and in one case a short diagonal slash) of uncertain function. Mdashes turn up in 1% of texts of the 1590’s. Around 1600 you see uses that are recognizably mdash-like. For instance, in Jonson’s Every Man Out of His Humor incomplete utterances trail off in an mdash. Between 1600 and 1630 mdashes occur in ~7% of texts. Then you have a steady rise from 13% in the 1630’s to 21% in the 1690’s, except for the 1640’s, where mdashes are found in only 5% of the texts.

Which makes you wonder whether the mdash is a marker of conversation and politesse  with its concomitant ellipses, evasions, and little or not so little hypocrisies. Not much use for them in times of Civil War. At a very practical level, mdashes are common in plays, and plays were not a big genre of the 1640’s.

This little exercise in scalable reading is based on 44,000 TCP texts and an hour’s work with simple Python scripts. It is a much more time-consuming task to look at actual EEBO page images and figure out whether an mdash is an mdash in our sense. You can carry the exercise forward into the 18th century. There are only some 2,000 TCP texts from that century, a much more limited selection, probably biased towards the canonical and literary. The data are not random or large enough to make much sense of a breakdown by decade. But it is certainly telling that mdashes are found in 1,520, or more than two thirds, of all the 18th-century TCP texts.
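
For readers who want to try the decade counts themselves, here is a minimal sketch of the kind of script involved. It is not the script I actually ran: it assumes the texts are already available as (year, text) pairs with entities resolved to Unicode, and the function name mdash_share_by_decade and the toy sample are invented for the illustration.

```python
from collections import defaultdict

MDASH = "\u2014"


def mdash_share_by_decade(texts):
    """Map each decade to (texts with an mdash, all texts, percentage).

    `texts` is an iterable of (year, text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in texts:
        decade = year // 10 * 10
        totals[decade] += 1
        if MDASH in text:  # count a text once, however many mdashes it has
            hits[decade] += 1
    return {
        decade: (hits[decade], totals[decade], 100 * hits[decade] / totals[decade])
        for decade in sorted(totals)
    }


# Invented toy data, only to show the shape of the input and output.
sample = [
    (1641, "a sermon without any dash"),
    (1645, "a pamphlet with one\u2014here"),
    (1692, "a play that trails off\u2014"),
    (1699, "my L\u2014d"),
]
for decade, (with_mdash, total, percent) in mdash_share_by_decade(sample).items():
    print(decade, with_mdash, total, round(percent, 1))
```

The same arithmetic gives the figures quoted above: 11 of 3,342 texts is about 0.3%, and 1,520 of some 2,000 is roughly 76%, comfortably more than two thirds.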

It may be stretching the point to think of the mdash as an indicator of politesse in a broad cultural sense. At the minimum, however, its relative frequency is a surprisingly accurate indicator of the kinds of words, phrases, and syntactic constructions likely to be found or not found in a text. I am tempted to spend a little more time following the fortunes of the punctuation mark for which the Germans have the charming word Gedankenstrich or ‘thought-dash’. But I’d better stick to the tedious but necessary task of checking whether the dumb machine gets good enough instructions to get tokenization right often enough.