Fault-tolerant token-by-token parsing for structured output

question 3 connections

Audience question: since LLMs generate one token at a time with full visibility of the context, could you hook a fault-tolerant parser into the stream so that, the moment a token strays from a valid grammar (in the style of TypeScript's "next valid word" prediction), you rewind and retry just that token? Hasiński confirms that llama.cpp already does something similar for structured output, and he is confident proprietary providers do too. You can also build it yourself on low-level APIs, since there you control the loop that turns an array of input tokens into the next output token number. The feature moved from client-side to server-side specifically because checking and interrupting fast enough requires local latency: a remote round-trip per token would be too slow for things like wiring Ruby LSP to a remote LLM. According to Hasiński, llama.cpp exposes a minimum-token parameter and a token callback that can remove a token and regenerate it with different parameters; he recommends downloading it to experiment.
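The rewind-and-retry loop described in the question can be sketched in a few lines. This is a toy illustration, not llama.cpp's implementation: `is_valid_prefix`, `toy_model`, and `generate` are hypothetical names, the "grammar" accepts exactly one token sequence, and the "model" is a stand-in that offers its next-best candidate on each retry.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy grammar: the only valid output is  { "key" : 1 }
TARGET = ["{", '"key"', ":", "1", "}"]


def is_valid_prefix(tokens: Sequence[str]) -> bool:
    """Fault-tolerant check: is the output so far still a valid prefix?"""
    return list(tokens) == TARGET[: len(tokens)]


def toy_model(prefix: Sequence[str], attempt: int) -> str:
    """Stand-in for a real model call (assumption, not a real API).

    Its first choice is always the invalid token "oops"; each retry
    moves on to the next candidate in a fixed preference order.
    """
    ranked = ["oops", "{", '"key"', ":", "1", "}"]
    return ranked[attempt % len(ranked)]


def generate(
    sample_next: Callable[[Sequence[str], int], str],
    max_tokens: int = 5,
    max_retries: int = 10,
) -> Tuple[List[str], int]:
    """Generate token by token, retrying only the token that broke the grammar."""
    out: List[str] = []
    rejected = 0
    while len(out) < max_tokens:
        for attempt in range(max_retries):
            token = sample_next(out, attempt)
            if is_valid_prefix(out + [token]):
                out.append(token)  # accept and move to the next position
                break
            rejected += 1  # discard just this one token and resample
        else:
            raise RuntimeError("no valid token found within the retry budget")
    return out, rejected


tokens, rejected = generate(toy_model)
print(" ".join(tokens), "| rejected:", rejected)
```

Every rejected token costs one extra model call and one parser check, which is why the check has to sit next to the generation loop rather than behind a network round-trip.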

answer_summary

Yes. llama.cpp already does something like this for structured output, and proprietary providers likely do too. The check had to move server-side because it needs low latency; local llama.cpp setups expose a minimum-token parameter and a token callback to experiment with.

question Fault-tolerant token-by-token parsing for structured output

about

Structured LLM Output concept

The question is about enforcing structured output one token at a time.

question Fault-tolerant token-by-token parsing for structured output

about

llama.cpp tool

Hasiński points to llama.cpp as already implementing something similar.
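A related way to get the same guarantee is to filter the candidates before a token is picked, so nothing has to be rewound. The sketch below assumes a toy grammar for lists such as `[1,1]` and a fake `toy_logits` function. All names here are hypothetical, and this is not llama.cpp's actual code or API.

```python
import math
from typing import Callable, Dict, List, Set


def allowed_next(prefix: List[str]) -> Set[str]:
    """Toy grammar for non-empty lists of 1s: which tokens may come next?"""
    if not prefix:
        return {"["}
    last = prefix[-1]
    if last in ("[", ","):
        return {"1"}
    if last == "1":
        return {",", "]"}
    return set()  # after "]" the list is complete


def mask_logits(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Set the score of every grammar-invalid token to negative infinity."""
    return {t: (s if t in allowed else -math.inf) for t, s in logits.items()}


def toy_logits(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a model forward pass (assumption, not a real API)."""
    scores = {"oops": 3.0, ",": 2.0, "1": 1.5, "]": 1.0, "[": 0.5}
    if len(prefix) >= 4:
        scores["]"] = 2.5  # the toy model starts to prefer closing the list
    return scores


def greedy_decode(
    logits_fn: Callable[[List[str]], Dict[str, float]], max_tokens: int = 16
) -> List[str]:
    """Pick the highest-scoring token among those the grammar allows."""
    out: List[str] = []
    for _ in range(max_tokens):
        allowed = allowed_next(out)
        if not allowed:
            break  # grammar says the output is complete
        masked = mask_logits(logits_fn(out), allowed)
        out.append(max(masked, key=masked.get))
    return out


print("".join(greedy_decode(toy_logits)))
```

The unconstrained toy model would always pick "oops"; with the mask applied, it can only ever emit tokens the grammar accepts at that position.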

question Fault-tolerant token-by-token parsing for structured output

asked_at

Next Token! talk

Audience Q&A following the talk.

Provenance

Created in: Next Token! — Chris Hasiński on LLM falsehoods 2026-04-18 07:42
Read by: 1 extraction