Digital Fossils

Apr 29, 2025

There are three primary dash-style punctuation marks in the written English language.

The hyphen is included on most modern keyboards and is the most common of the three. It’s used to link compound words (double-click), is often used to show ranges (pages 123-276), can be used to connect words that span multiple lines (if the word ‘question’ is too large to fit on one line, it might be broken up into ‘quest-’ and ‘ion’), and it can be used to spell things out (especially useful in dialogue, as with “The word of the day is cat, C-A-T”).

The en dash is somewhat wider than a hyphen, approximately (though not always) the width of an uppercase letter N. It is arguably the more proper mark for ranges, though hyphens are commonly subbed in for this purpose, since en dashes aren’t typically options on modern keyboards.

The em dash is a little wider than the en dash, about the width of an uppercase M, and while it’s also not included on most keyboards, it’s sometimes accessible at the software level by typing two hyphens. It’s also (arguably) the most versatile of the three marks.

Em dashes can be used in lieu of commas to lash multiple sub-sentences into a single sentence, and they can also be used to emphasize text when used in this way, often by connecting a clause to the end of a sentence (this utility also allows the em dash to sub in for phrases like “for instance” and “that is,” when stringing thoughts together).

They can also be used instead of parentheses, inserting a sentence or partial sentence in the middle of another one, and they can be used before a person’s name or other source info when using a quote.

There’s been speculation that some AI tools use more em dashes than human writers do, and this speculation eventually became a truism in some corners of the internet, the theory being you can tell AI text from human text because the AI-generated stuff uses a lot of em dashes.

This was pretty quickly disproven, and writers of all stripes came out in droves to support their favorite punctuation (again, this is a truly useful mark that serves all sorts of purposes).

But that claim, mostly made by non-writers, some of whom were encountering the em dash for the first time (or seeing it more than they would usually see it, perhaps without understanding its utility), is premised on the idea that large language model-based tools (like many of the most popular, customer-facing AI systems) might latch onto bits of linguistic ephemera and use them more than a human would.

These so-called “Digital Fossils” might then help us identify cases in which people have used these tools, and that in turn may allow us to figure out when students have cheated on school assignments, or to ascertain when professionals have outsourced their labors to ChatGPT or some comparable chatbot.

In early 2025, scientists began to notice a strange phrase, “vegetative electron microscopy,” showing up in all sorts of journals. The papers containing it made it past professional reviewers, but some were ultimately retracted, in at least one case because the paper contained bad references and nonstandard phrases, and may have even gone through a faulty peer-review process.

That phrase has since been used to find other papers that, through this new lens, seem sketchy. Some of those papers are now being reassessed because it’s assumed, due to that odd-phrase fingerprint, that AI tools were used to produce them. The peer-review industry has been inundated with fake and otherwise scammy papers in recent years, the volume of bad papers having increased substantially as AI tools capable of churning them out became available to the public.

Researchers have looked back as far as 1959 to find the seeming origin of the phrase “vegetative electron microscopy,” and one theory is that AI models trained on those old papers picked up on it, and it was then used a few more times in AI-generated works, which were then fed back into these systems—which perhaps understandably came to believe it was a good, smart-sounding, not-nonsensical phrase to be used in research papers, more broadly.

Over time, even bizarre, often meaningless-in-context phrases can become common in work produced by said tools, and if these aberrations aren’t noticed, the problem can compound, especially in scientific fields, because such papers are in high demand as source materials for future AI models.

All of which is interesting because it points at the difficulty of avoiding such cascading errors at a moment in which they can be worked into the very bones of tools that produce more and more written work each year.

But it also gestures at the possibility that we might use these Digital Fossils to flag such outputs, making it easier to identify possible cheats and scams, and maybe even to weed out some of these bad-faith but difficult-to-identify research papers.
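For the technically curious, here is a rough sketch of what that kind of flagging might look like in Python. It is only an illustration, not how any journal or researcher actually screens papers: the function names and the watch list are made up for this example, and the only phrase on the list is the one discussed above. A match would be a reason to take a closer look at a paper, not proof that AI wrote it.

```python
import re

# Hypothetical watch list for illustration; only this one phrase comes from the essay.
FINGERPRINT_PHRASES = ["vegetative electron microscopy"]


def find_fingerprints(text, phrases=FINGERPRINT_PHRASES):
    """Return the watch-list phrases that appear in the given text."""
    # Collapse runs of whitespace (including line breaks) and lowercase the text,
    # so a phrase split across lines or capitalized differently still matches.
    normalized = re.sub(r"\s+", " ", text).lower()
    return [phrase for phrase in phrases if phrase.lower() in normalized]


def flag_papers(papers, phrases=FINGERPRINT_PHRASES):
    """Map each paper id to its matched phrases, skipping papers with no match."""
    flagged = {}
    for paper_id, text in papers.items():
        hits = find_fingerprints(text, phrases)
        if hits:
            flagged[paper_id] = hits
    return flagged
```

Calling `flag_papers` with a dictionary of paper IDs and their full text returns only the papers that contain at least one listed phrase, which a human reviewer could then examine more carefully.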

Brain Lenses
