Next Token!

talk 29 connections

Krzysztof 'Chris' Hasiński's single-speaker talk at wroclove.rb 2025, built on slides he has had to update for every conference because AI moves so fast. It is framed around 'falsehoods programmers believe' (à la the falsehoods-about-time article and the broader falsehoods repository), applied to LLMs. Core thesis: LLMs can't chat, can't think, and can't use tools. They are a 'big token factory' that predicts the next token the trainer preferred (via reinforcement learning on top of training data).

The talk walks through eight points:

(1) Tokens are numbers under the hood. OpenAI's tokenizer shows that English words often get whole, clean tokens (with leading spaces), Polish burns many tokens, and popular Japanese words such as 'nihon' or 'go' get single dense tokens.

(2) Chat is fictional. It is implemented with stop tokens: a program feeds the model a system prompt like 'you are an assistant, prefix replies with assistant, user prefixes with user' and stops generation when the model emits 'user'. Miss the stop token and the model answers itself (OpenAI's voice model famously clones the user's voice and asks itself questions). llama.cpp guards against misfired stop tokens.

(3) Reasoning (o1/o3, DeepSeek) is also just token generation, in a hidden 'reason' role. The summary you see is another model summarizing.

(4) Agents and tool use abuse stop tokens, embeddings, and output formatting. The example is a fictional prompt with a 'tool' role, a JSON invocation, and 'commit' and 'user' as stop tokens; the program calls the real API and splices the response back into the context.

(5) MCP servers are a meta-tool that lists other tools on demand to save context, but they have no security model: any MCP server can hijack your LLM.

(6) Embeddings are high-dimensional vectors capturing abstract concepts (illustrated with a 2-D toy that places king, queen, and car on royalness and manliness axes). Paired with vector DBs they power RAG, either as a pre-query lookup or as tool calls during the response. They support multimodal inputs (images, video, audio via models like Google's SigLIP) and benefit from chunking (e.g. the Baron gem).

(7) Hybrid search (classic word lookup + vector + whatever works) is now common. Graph databases are also worth considering, especially LLM-generated ones.

(8) Structured output used to mean asking 'please format as this JSON schema', validating the reply, and re-asking on failure (the same algorithm as yelling at junior developers; LangChain does exactly this with a YAML retry prompt). It is now handled server-side, so you get clean JSON back.

Ruby AI ecosystem survey: LangChain is basically dead now that LLM servers provide the abstractions; Ruby LLM is growing fast (with a huge PR backlog); neighbor + pgvector/SQLite-vec (all by Andrew Kane, 'if we lose Andrew we lose Ruby's ecosystem'); Baron for chunking; multiple MCP server implementations exist in Ruby; vendor APIs have mostly standardized on the OpenAI format with minor differences (Bedrock and OpenRouter expose subsets).

Closing message: it's still the Wild West and everything you write today is outdated tomorrow, but we have a new magical token generator and a lot of software to write around it, which means job security for the audience.

Q&A: could you hook a fault-tolerant parser into the token stream to retry one token at a time, like a type-aware IDE suggesting the next three valid tokens? Chris confirms llama.cpp already does something similar for structured output (which is why it moved from client-side to server-side: a remote round trip per token would be far too slow); providers likely do it too. llama.cpp exposes parameters for minimum token counts and token callbacks that can rewind one token and regenerate with different parameters, and he recommends downloading it to play.
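The token-level retry idea from the Q&A can be sketched as follows. This is a toy under stated assumptions, not llama.cpp's implementation: `fake_ranked_candidates` is a hypothetical stand-in for a model's ranked next-token candidates, and the 'grammar' is just a hard-coded list of allowed outputs. The point is only that rejecting a token and trying the next candidate happens inside the sampling loop, one token at a time.

```python
from typing import List

# Hypothetical "schema": the only complete outputs we accept.
ALLOWED = ['{"ok": true}', '{"ok": false}']


def is_valid_prefix(text: str) -> bool:
    """True if text can still be extended into one of the allowed outputs."""
    return any(full.startswith(text) for full in ALLOWED)


def fake_ranked_candidates(text_so_far: str) -> List[str]:
    """Scripted stand-in for a model's next-token candidates, best first.

    A chatty model prefers prose, so the valid JSON pieces rank lower.
    """
    return ["Sure", " here", '{"ok":', " true", " false", "}", "!"]


def constrained_generate(max_tokens: int = 10) -> str:
    text = ""
    for _ in range(max_tokens):
        if text in ALLOWED:
            return text
        # Try candidates in the model's preferred order and keep the first
        # one that does not break the format (the "retry one token" step).
        for token in fake_ranked_candidates(text):
            if is_valid_prefix(text + token):
                text += token
                break
        else:
            raise RuntimeError("no candidate keeps the output valid")
    raise RuntimeError("ran out of tokens before the output was complete")


result = constrained_generate()
print(result)
```

Because the check runs per token, doing it on the client would mean one network round trip per token, which is the latency argument given in the answer.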

date

2025-03-14

type

talk

talk Next Token!

about

Large Language Models concept

Core subject — debunking how LLMs actually work.

talk Next Token!

about

Falsehoods Programmers Believe About LLMs concept

Talk is framed as a falsehoods list about LLMs.

talk Next Token!

about

LLM Tokens concept

Explains tokenization and per-language token density.
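A toy illustration of 'tokens are numbers', assuming a made-up vocabulary: the entries and IDs below are invented and do not match any real tokenizer. Text covered by the vocabulary becomes one ID per piece; everything else falls back to one ID per UTF-8 byte, which is roughly why under-represented languages cost more tokens.

```python
from typing import List

# Hypothetical vocabulary: IDs >= 1000 are whole pieces, 0..255 are raw bytes.
VOCAB = {" hello": 1000, " world": 1001, "hello": 1002}


def tokenize(text: str) -> List[int]:
    """Greedy longest-match tokenizer with a byte-level fallback."""
    pieces = sorted(VOCAB, key=len, reverse=True)  # try longest pieces first
    ids: List[int] = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            # No vocabulary entry matches: emit one token per UTF-8 byte.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


english = tokenize("hello world")
polish = tokenize(" cześć")
print(english, len(english))
print(polish, len(polish))
```

With this vocabulary the English phrase needs two tokens, while the six-character Polish greeting needs eight, because 'ś' and 'ć' each take two bytes.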

talk Next Token!

about

Stop Tokens concept

Explains chat, tool calling, and reasoning as stop-token tricks.
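A minimal sketch of chat as a stop-token trick. `fake_model` is a hypothetical stand-in that simply keeps continuing the transcript, and the role prefixes are illustrative rather than any vendor's actual chat template. The surrounding program, not the model, decides where a 'reply' ends.

```python
from typing import Iterator, Optional

SYSTEM_PROMPT = (
    "You are an assistant. Prefix your replies with 'assistant:'. "
    "The user prefixes messages with 'user:'.\n"
)


def fake_model(prompt: str) -> Iterator[str]:
    """Scripted stand-in for an LLM: it continues the document, turn after turn."""
    yield from [
        "assistant:", " Paris", ".", "\n",
        "user:", " And", " Italy", "?", "\n",
        "assistant:", " Rome", ".",
    ]


def generate(prompt: str, stop: Optional[str]) -> str:
    """Collect tokens until the stop sequence shows up (if one is configured)."""
    out = ""
    for token in fake_model(prompt):
        out += token
        if stop is not None and out.endswith(stop):
            return out[: -len(stop)]  # cut the stop sequence off the reply
    return out


prompt = SYSTEM_PROMPT + "user: What is the capital of France?\n"
with_stop = generate(prompt, stop="user:")
without_stop = generate(prompt, stop=None)
print(repr(with_stop))
print(repr(without_stop))
```

Without the stop sequence the same generation runs on into an invented user turn and then answers it, which is the 'model answers itself' failure mode.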

talk Next Token!

about

Reasoning Models concept

Debunks reasoning as hidden-role token generation.

talk Next Token!

about

LLM Tool Calling concept

Describes tool invocation as stop-token-delimited structured output.
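A sketch of that loop, assuming the fictional prompt format from the talk ('tool' role, JSON invocation, 'commit' and 'user' as stop tokens). `fake_model`, `get_weather`, and the `tool_result:` prefix are hypothetical names introduced for illustration; a real program would call an actual LLM and an actual API.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical 'real API'; a production program would make an HTTP call."""
    return f"18C and sunny in {city}"


TOOLS = {"get_weather": get_weather}


def fake_model(context: str) -> str:
    """Scripted stand-in for a completion that ends at a stop token."""
    if "tool_result:" not in context:
        return 'tool: {"name": "get_weather", "args": {"city": "Wroclaw"}} commit'
    return "assistant: It is 18C and sunny in Wroclaw.\nuser:"


def run_turn(context: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        completion = fake_model(context)
        if completion.endswith("commit"):
            # The model asked for a tool: parse the JSON, run the function,
            # and splice the result back into the context before resuming.
            body = completion[: -len("commit")].strip()
            call = json.loads(body[len("tool:"):])
            result = TOOLS[call["name"]](**call["args"])
            context += f"{body}\ntool_result: {result}\n"
        elif completion.endswith("user:"):
            # Normal end of the assistant turn.
            return completion[: -len("user:")].strip()
        else:
            raise ValueError("model stopped without a known stop token")
    raise RuntimeError("too many tool calls in one turn")


context = "system: You may call tools.\nuser: What is the weather in Wroclaw?\n"
answer = run_turn(context)
print(answer)
```

The model never executes anything itself; it only emits text that the program chooses to interpret as a call.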

talk Next Token!

about

MCP Server concept

Discusses MCP as a meta-tool and its security implications.

talk Next Token!

about

Vector Embeddings concept

Explains embeddings with king/queen/car toy example.
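The toy can be reproduced in a few lines. The coordinates below are invented for illustration (the talk's actual numbers are not recorded here), and real embeddings have hundreds or thousands of dimensions rather than two.

```python
import math

# Axes: (royalness, manliness). All values are made up for this sketch.
EMBEDDINGS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "car": (-0.9, 0.1),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(query):
    """Return the word whose vector points most nearly the same way as query."""
    return max(EMBEDDINGS, key=lambda word: cosine(EMBEDDINGS[word], query))


print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["car"]))
print(nearest((0.8, -0.9)))  # a royal, not-manly query
```

A vector database does the same nearest-neighbour lookup, only over many more vectors and with an index instead of a linear scan.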

talk Next Token!

about

Retrieval Augmented Generation concept

Covers RAG via pre-query lookup and tool-lookup-during-response.
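A minimal sketch of the pre-query variant, assuming a toy bag-of-words `embed` function in place of a real embedding model and a plain Python list in place of a vector DB. The chunk texts and the prompt layout are illustrative only.

```python
import math
import re
from collections import Counter

CHUNKS = [
    "pgvector adds vector similarity search to Postgres.",
    "Stop tokens tell the sampling loop when to stop generating.",
    "Chunking splits long documents into smaller pieces before embedding.",
]


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(CHUNKS, key=lambda c: cosine(embed(c), q), reverse=True)[:k]


def build_prompt(question):
    """Look up context first, then put it in front of the question."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("How do stop tokens work?"))
```

In the tool-call variant the same `retrieve` function would instead be exposed to the model as a tool and invoked mid-response.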

talk Next Token!

about

Structured LLM Output concept

Walks through prompt-based JSON schema enforcement and its modern server-side replacement.
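The older client-side approach can be sketched as a validate-and-re-ask loop. `fake_model` is a hypothetical stand-in that fails once and then complies, and the retry wording is invented here; it is not LangChain's actual prompt. The server-side replacement is not shown because it depends on the provider.

```python
import json


def fake_model(prompt: str) -> str:
    """Scripted stand-in: chatty on the first try, compliant after a complaint."""
    if "previous reply was invalid" in prompt:
        return '{"name": "Ada", "age": 36}'
    return "Sure! Here is the JSON: {name: Ada}"


def validate(reply: str, required: dict) -> dict:
    """Parse the reply and check that each required field has the right type."""
    data = json.loads(reply)  # raises JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} missing or not {expected_type.__name__}")
    return data


def ask_for_json(prompt: str, required: dict, max_attempts: int = 3) -> dict:
    for _ in range(max_attempts):
        reply = fake_model(prompt)
        try:
            return validate(reply, required)
        except ValueError as error:
            # Re-ask, telling the model what was wrong with its last answer.
            prompt += f"\nYour previous reply was invalid ({error}). Reply with JSON only."
    raise RuntimeError("model never produced valid JSON")


person = ask_for_json("Describe Ada as JSON with name and age.", {"name": str, "age": int})
print(person)
```

Each retry costs a full extra generation, which is one reason moving enforcement into the server is attractive.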

talk Next Token!

about

Hybrid Search concept

Recommends mixing keyword and vector search for LLM apps.
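One common way to mix the two result lists is reciprocal rank fusion (RRF); the talk does not prescribe a specific method, so this is a swapped-in technique shown only as an example. The two input rankings are hard-coded placeholders for what a keyword index and a vector index might return.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the same query from two different indexes.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_c", "doc_b"]

merged = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(merged)
```

RRF only needs ranks, not comparable scores, so further sources ('whatever works') can be added as extra lists without rescaling anything.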

talk Next Token!

about

Reinforcement Learning from Human Feedback concept

Notes that RLHF is why outputs reflect trainer preferences rather than the raw training distribution.

talk Next Token!

about

AI Agent concept

Argues agents aren't real — just stop-token/embeddings/output-formatting abuse.

talk Next Token!

about

Ruby LLM tool

Showcases Ruby LLM as the modern Ruby wrapper replacing LangChain.

talk Next Token!

about

langchainrb tool

Uses LangChain's JSON/YAML retry prompts as an example and argues LangChain is now obsolete.

talk Next Token!

about

neighbor tool

Recommended as the Ruby gem for vector search in Postgres/SQLite.

talk Next Token!

about

Baron tool

Mentioned as a Ruby chunking gem for better embeddings.

talk Next Token!

about

llamafile tool

Recommended as a single-binary way to run LLMs locally and explore their parameters.

talk Next Token!

about

llama.cpp tool

Discussed as the engine with token-callback and minimum-token controls.

talk Next Token!

about

OpenAI Tokenizer tool

Used to visualize tokenization across languages.

talk Next Token!

about

Falsehoods Programmers Believe About Time resource

Framing device — LLM falsehoods as an analogue to the time-falsehoods list.

talk Next Token!

about

SigLIP tool

Mentioned as the multimodal embedding model used during the associated workshop.

talk Next Token!

about

Midjourney tool

Cited as an outdated illustration tool — image generators have moved on.

question Fault-tolerant token-by-token parsing for structured output

asked_at

Next Token! talk

Audience Q&A following the talk.

person Krzysztof Hasiński

authored

Next Token! talk

Hasiński delivered the 'Next Token!' talk at wroclove.rb 2025.

takeaway LLMs Are Just Token Generators

from_talk

Next Token! talk

Central takeaway of the 2025 talk.

takeaway MCP Servers Have No Security Model

from_talk

Next Token! talk

Warning issued in the MCP section of the talk.

takeaway LangChain Is Dead

from_talk

Next Token! talk

Update Hasiński gives on the Ruby AI ecosystem in 2025.

talk Next Token!

presented_at

wroclove.rb 2025 event

Delivered on 2025-03-14 at wroclove.rb 2025.

Provenance

Created: 2026-04-17 16:18 seed
Last updated in: Next Token! — Chris Hasiński on LLM falsehoods 2026-04-18 07:42
Read by: 19 extractions