November 24, 2023

Detecting ChatGPT

Many news stories and some StatsChat posts have talked about detecting the output of Large Language Models. At the moment, tools to do this are very inaccurate, so denouncing a student paper, for example, on the basis of one of these detectors wouldn’t be supportable. Even worse, the error rate is higher for people who aren’t native English speakers, a group already at risk of being accused unfairly.

We might hope for better detectors in the future. If people using ChatGPT have access to the detector, though, there’s a pretty reliable way of getting around it: take a ChatGPT-produced document and make small changes to it until it no longer triggers the detector. Here we’re assuming that you can make small changes and still get a good-quality document; if that’s not true, if there’s only one good answer to the question, then no ChatGPT detector could hope to work anyway. We’re also assuming that you can tell which random changes still produce a good answer. If you can’t, you might still be able to ask ChatGPT whether the answer is good.
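As a concrete illustration, here’s a minimal Python sketch of that loop. Everything in it is a hypothetical stand-in invented for the example: the toy ‘detector’, the quality check, and the synonym table are not any real detection API.

```python
import random

# Toy stand-ins, invented for this sketch (not a real detector API).
SYNONYMS = {"utilize": "use", "demonstrate": "show",
            "furthermore": "also", "numerous": "many"}

def toy_detector(text):
    # Pretend detector: scores text by how much 'LLM-ish' vocabulary it has.
    hits = sum(word in text.split() for word in SYNONYMS)
    return hits / len(SYNONYMS)       # score in [0, 1]

def quality_ok(text):
    # Placeholder quality check; in practice a human (or, as suggested
    # above, another call to ChatGPT) would judge the edited answer.
    return len(text.split()) > 5

def perturb(text):
    # One small random change: maybe swap one word for a synonym.
    words = text.split()
    i = random.randrange(len(words))
    words[i] = SYNONYMS.get(words[i], words[i])
    return " ".join(words)

def evade(text, threshold=0.25, max_tries=1000):
    current = text
    for _ in range(max_tries):
        if toy_detector(current) < threshold:
            return current            # detector no longer fires
        candidate = perturb(current)
        if quality_ok(candidate):     # keep only quality-preserving edits
            current = candidate
    return None                       # gave up

print(evade("We utilize numerous methods to demonstrate the result clearly"))
```

The point is just the structure: because the attacker can query the detector, the loop stops as soon as a quality-preserving edit slips under the threshold.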

A related question is whether Large Language Model outputs can be ‘watermarked’ invisibly so as to be easier to detect. ChatGPT might encode a signature in the first letters of each sentence, or it might have subtle patterns in word frequencies or sentence lengths. Regrettably, any such watermark falls to the same attack: just make random changes until the detector doesn’t detect anything.
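To make that concrete, here’s a toy sketch in the same spirit. The ‘watermark’ here (sentences whose first letters spell out a signature) is invented for the example; real schemes hide the signal statistically, but the attack doesn’t care what the signal is.

```python
import random

def has_watermark(text, signature="AI"):
    # Toy detector: do the sentences' first letters spell the signature?
    firsts = "".join(s.strip()[0] for s in text.split(". ") if s.strip())
    return firsts.upper().startswith(signature)

def shuffle_two_sentences(text):
    # One small random change: swap two sentences. A list-like answer
    # still reads fine, but the acrostic doesn't survive.
    sentences = [s for s in text.split(". ") if s.strip()]
    i, j = random.sample(range(len(sentences)), 2)
    sentences[i], sentences[j] = sentences[j], sentences[i]
    return ". ".join(sentences)

text = "Apples are red. Iron is a metal. Water is wet"
assert has_watermark(text)              # the 'watermarked' output is flagged
while has_watermark(text):
    text = shuffle_two_sentences(text)  # random change...
print(text)                             # ...and the detector no longer fires
```

The attacker never needs to know how the signature is encoded, only to be able to query the detector.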

A recent computer science article on the preprint server arXiv argues that even non-public detectors can be attacked in a similar way. Simply take the Large Language Model output and try random changes to it, keeping the changes that don’t mess up the quality. This produces a random sample from a cloud of similar answers. (If there aren’t any similar answers reachable by small changes, it will be hard for the AI to insert a watermark in the first place, so we can assume there are.) ChatGPT didn’t actually produce these similar answers, so a reasonable fraction of them should fail to trigger the ChatGPT detector. Skeptics might be reassured that the researchers tried this approach on some real watermarking schemes, and it seems to work.
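Here’s a rough sketch of that idea as I understand it; the details (the word-swap table, the quality check) are illustrative stand-ins, not the paper’s actual procedure. The key difference from the earlier loop is that the detector is never consulted during the edits: the attacker just wanders through quality-preserving variants.

```python
import random

# Illustrative reversible word swaps; a real attack would paraphrase.
SWAPS = {"big": "large", "large": "big", "quick": "fast", "fast": "quick",
         "show": "demonstrate", "demonstrate": "show"}

def quality_ok(text):
    # Placeholder: a human, or another LLM, would be the judge in practice.
    return True

def random_walk(text, steps=50):
    # Take random quality-preserving edits without ever querying the
    # detector, yielding one sample from the 'cloud' of similar answers.
    words = text.split()
    for _ in range(steps):
        i = random.randrange(len(words))
        candidate = words.copy()
        candidate[i] = SWAPS.get(candidate[i], candidate[i])
        if quality_ok(" ".join(candidate)):
            words = candidate         # keep edits that preserve quality
    return " ".join(words)

original = "the quick fix will show a big improvement"
samples = {random_walk(original) for _ in range(20)}
print(len(samples), "distinct nearby answers the model never wrote")
```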

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.