Stop Tokens

concept 3 connections

Special tokens a program running an LLM watches for to halt generation. Chat is implemented entirely via stop tokens: the system prompt establishes roles ('assistant', 'user'), the runner appends user input, lets the model generate, and stops the moment the model emits 'user' (signalling the turn is over). 'user' is actually a poor stop token because it appears in ordinary text and is often two tokens; better choices are dedicated tokens like end-of-text (older models sometimes used literal newlines which is terrible). If a stop token is missed, the model carries on and talks to itself — notoriously, OpenAI's voice model will clone the speaker's voice and ask itself a new question; ChatGPT occasionally exhibits self-questioning before a censor model catches it. Triggering a stop token too early gives a truncated/broken response. Tool calling, reasoning roles, and agentic workflows all work by defining extra roles with their own stop tokens (e.g. 'commit' to end a JSON tool invocation). llama.cpp / llamafile exposes templates, stop-token configuration, a minimum-token parameter, and a token callback that can rewind one token and regenerate with different parameters.

Provenance

Created in: Next Token! — Chris Hasiński on LLM falsehoods 2026-04-18 07:42
Read by: 1 extraction