
Fault-tolerant token-by-token parsing for structured output

question 3 connections

Audience question: since LLMs generate one token at a time with the full context visible, could you hook a fault-tolerant parser into the stream so that, the moment a token strays from a valid grammar (in the style of TypeScript's 'next valid word' prediction), you rewind and retry just that token? Hasiński confirms that llama.cpp already does something similar for structured output, and he is confident that proprietary providers do too. You can also build it yourself on low-level APIs, because there you control the loop that maps an input array of tokens to an output number (the next token). The feature moved from the client side to the server side because checking and interrupting generation fast enough requires local latency; a remote round-trip would be too slow for things like wiring Ruby LSP to a remote LLM. llama.cpp exposes a minimum-token parameter and a token callback that can remove a token and regenerate it with different parameters, and he recommends downloading it to experiment.
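The rewind-and-retry idea in the question can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not llama.cpp's code or API: `advance`, `toy_model`, and `constrained_generate` are hypothetical names, the "model" is a scripted stand-in that returns candidate tokens in order of preference, and the grammar is a tiny state machine that accepts only arrays of single digits such as `[1,2]`.

```python
from typing import Callable, List, Optional

DIGITS = set("0123456789")

def advance(state: str, tok: str) -> Optional[str]:
    """One step of the toy grammar; returns the next state, or None if tok is invalid here."""
    if state == "open":
        return "first" if tok == "[" else None
    if state == "first":  # right after '[': a digit or an immediate ']'
        if tok == "]":
            return "done"
        return "after_digit" if tok in DIGITS else None
    if state == "digit":  # right after ',': only a digit is allowed
        return "after_digit" if tok in DIGITS else None
    if state == "after_digit":
        if tok == ",":
            return "digit"
        return "done" if tok == "]" else None
    return None  # state "done": nothing may follow

def toy_model(context: List[str]) -> List[str]:
    """Hypothetical stand-in for one LLM step: candidate tokens, most preferred first."""
    script = [
        ["[", "{"],
        ["one", "1"],  # the first choice strays from the grammar
        [",", "]"],
        [" ", "2"],    # so does this one
        ["]", ","],
    ]
    return script[len(context)] if len(context) < len(script) else ["]"]

def constrained_generate(model: Callable[[List[str]], List[str]],
                         max_tokens: int = 16) -> tuple:
    """Generate token by token, rejecting any token the grammar cannot accept."""
    out: List[str] = []
    state = "open"
    retries = 0
    while state != "done" and len(out) < max_tokens:
        for tok in model(out):
            nxt = advance(state, tok)
            if nxt is not None:
                out.append(tok)  # token fits the grammar: keep it and move on
                state = nxt
                break
            retries += 1  # token strays: discard it and retry this position only
        else:
            raise ValueError("no candidate token fits the grammar")
    return "".join(out), retries

text, retries = constrained_generate(toy_model)
print(text, retries)
```

The check runs once per generated token, which is why it matters where the loop lives: with a remote model, every rejected token would cost a network round-trip before the next candidate could be tried.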

answer_summary
Yes: llama.cpp already does something like this for structured output, and proprietary providers likely do too. The check had to move server-side because it needs low latency; local llama.cpp setups expose a minimum-token parameter and a token callback to experiment with.
question Fault-tolerant token-by-token parsing for structured output
about
The question is about enforcing structured output one token at a time.
question Fault-tolerant token-by-token parsing for structured output
about
llama.cpp tool
Hasiński points to llama.cpp as already implementing something similar.
question Fault-tolerant token-by-token parsing for structured output
asked_at
Audience Q&A following the talk.
