A fine line
As you probably know, there are four sorts of horizontal line separating symbols in text: the minus sign, −; the hyphen, -; the en-dash, –; and the em-dash, —. Some people use just hyphens — or perhaps paired hyphens to indicate an em-dash — because that’s what a standard English keyboard provides.
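Since these four marks can be hard to tell apart on screen, here is a minimal Python sketch that prints the Unicode code point and official name of each. The `marks` dictionary and the `describe` helper are names made up for this illustration; only the standard-library `unicodedata` module is assumed. One thing it shows is that the keyboard "hyphen" is formally called HYPHEN-MINUS in Unicode, which has a separate, rarely typed HYPHEN at U+2010.

```python
import unicodedata

# The four characters named above, keyed by the label the post uses.
# (Illustrative mapping; the labels are the post's, not Unicode's.)
marks = {
    "minus sign": "\u2212",
    "hyphen": "\u002d",   # the character a standard keyboard produces
    "en-dash": "\u2013",
    "em-dash": "\u2014",
}


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for label, ch in marks.items():
    print(f"{label:10}  {describe(ch)}")
```

The same `describe` helper can be pointed at any character pasted from a document when it is unclear which dash it actually is.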
Recently, the em-dash has been touted as an outward and visible sign of ChatGPT output. This annoys people who deliberately use the full variety of English punctuation marks. The em-dash is in LLM output, they retort, because LLMs are trained on English writing and so will extrude em-dashes and semicolons, just as they will extrude metaphor and metonymy, zeugma and syllepsis.
On the other hand, there do seem to be a lot of em-dashes in ChatGPT output nowadays.
An analysis by Maria Sukhareva suggests a compromise explanation. Yes, the stretch hyphens come from the training data, and yes, they are somewhat new, but there are also too many of them. We’re seeing a combination of two factors: addition of older books — with more em-dashes — to the training materials, and the fact that an em-dash is fewer tokens than other ways of setting off parenthetical comments.
Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.
I suggest we keep ahead of the (LLM) pack by migrating to the “horizontal bar” (U+2015), or if we really want to up the ante, the two-em dash (U+2E3A) or three-em dash (U+2E3B).
Thank you, Unicode!