Krzysztof 'Chris' Hasiński's single-speaker talk at wroclove.rb 2025, reusing slides he has had to update for every conference because AI moves so fast. It is framed around 'falsehoods programmers believe' (à la the falsehoods-about-time article and the broader falsehoods repository), applied to LLMs. Core thesis: LLMs can't chat, can't think, and can't use tools. They are a 'big token factory' that predicts the next token the trainer preferred (via reinforcement learning on top of training data).

The talk walks through:

(1) Tokens are numbers under the hood. OpenAI's tokenizer shows that English words often get whole, clean tokens (with leading spaces), Polish burns many tokens, and common Japanese words like 'nihon' or 'go' get single dense tokens.

(2) Chat is fictional. It is implemented with stop tokens: a program feeds a system prompt like 'you are an assistant, prefix replies with assistant, user prefixes with user' and stops generation when the model emits 'user'. Miss the stop token and the model answers itself (OpenAI's voice model famously clones the user's voice and asks itself questions); llama.cpp guards against misfired stop tokens.

(3) Reasoning (o1/o3, DeepSeek) is also just token generation, in a hidden 'reason' role; the summary you see is another model summarizing.

(4) Agents and tool use abuse stop tokens, embeddings, and output formatting. The example is a fictional prompt with a 'tool' role, a JSON invocation, and 'commit' and 'user' as stop tokens; the program calls the real API and splices the response back into the context.

(5) MCP servers are a meta-tool that lists other tools on demand to save context, but they have no security model: any MCP server can hijack your LLM.

(6) Embeddings are high-dimensional vectors capturing abstract concepts (illustrated with a 2-D toy that places king, queen, and car on royalness and manliness axes). Paired with vector DBs they power RAG, either as a pre-query lookup or as tool calls during the response. They support multimodal inputs (images, video, audio via models like Google's SigLIP) and benefit from chunking (e.g. the baran gem).

(7) Hybrid search (classic word lookup + vector + whatever works) is now common; also consider graph databases, especially LLM-generated ones.

(8) Structured output used to be 'please format as this JSON schema, validate, re-ask on failure' (the same algorithm as yelling at junior developers; LangChain does exactly this with a YAML retry prompt). It is now handled server-side, so you get clean JSON back.

The talk then surveys the Ruby AI ecosystem: LangChain is basically dead now that LLM servers do the abstractions; RubyLLM is growing fast (with a huge PR backlog); neighbor + pgvector/sqlite-vec (all by Andrew Kane, 'if we lose Andrew we lose Ruby's ecosystem'); baran for chunking; multiple MCP server implementations in Ruby. Vendor APIs have mostly standardized on the OpenAI format with minor differences (Bedrock and OpenRouter expose subsets).

Closing message: it's still the wild west and everything you write today is outdated tomorrow, but we have a new magical token generator and a lot of software to write around it, which means job security for the audience.

Q&A: could you hook a fault-tolerant parser into the token stream to retry one token at a time, like a type-aware IDE suggesting the next three valid tokens? Chris confirms that llama.cpp already does something similar for structured output, which is why it moved from client-side to server-side: latency would kill a remote roundtrip. Providers likely do it too. llama.cpp exposes parameters for minimum token counts, plus token callbacks that can rewind one token and regenerate with different parameters, and he recommends downloading it to play.
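
The stop-token mechanism from point (2) can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular provider or llama.cpp implements it: `fake_model` is a hypothetical stand-in that replays a scripted continuation instead of running a real LLM, and the names `generate_until_stop` and `chat_turn` are invented for illustration. The point is only that the 'chat' lives in the wrapper program, which cuts generation off when the model starts writing the user's turn.

```python
from typing import Callable, Iterator, List, Set, Tuple

SYSTEM_PROMPT = (
    "You are an assistant. Prefix your replies with 'assistant:'. "
    "The user prefixes their messages with 'user:'.\n"
)


def fake_model(context: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM: ignores the context and replays a
    scripted continuation that keeps going past the assistant's own turn."""
    scripted = [" Hello", "!", "\n", "user:", " Thanks", "!", "\n",
                "assistant:", " You're welcome", "."]
    return iter(scripted)


def generate_until_stop(tokens: Iterator[str], stop_tokens: Set[str],
                        max_tokens: int = 50) -> Tuple[List[str], bool]:
    """Collect tokens until a stop token shows up or the budget runs out.
    Returns the collected tokens and whether a stop token was hit."""
    out: List[str] = []
    for _, token in zip(range(max_tokens), tokens):
        if token in stop_tokens:
            return out, True
        out.append(token)
    return out, False


def chat_turn(transcript: str, user_message: str,
              model: Callable[[str], Iterator[str]],
              stop_tokens: Set[str]) -> Tuple[str, str]:
    """Append the user's message, let the model continue the text, and cut
    the continuation at the first stop token. Returns (transcript, reply)."""
    context = transcript + f"user: {user_message}\nassistant:"
    tokens, _ = generate_until_stop(model(context), stop_tokens)
    reply = "".join(tokens).strip()
    return context + " " + reply + "\n", reply


if __name__ == "__main__":
    # With 'user:' as a stop token the wrapper returns one assistant reply.
    _, reply = chat_turn(SYSTEM_PROMPT, "Hi", fake_model, {"user:"})
    print(repr(reply))
    # Without it, the model carries on and answers itself.
    _, runaway = chat_turn(SYSTEM_PROMPT, "Hi", fake_model, set())
    print(repr(runaway))
```

Running the script shows both cases side by side: the first call yields a single assistant turn, while the second, with no stop tokens, returns text in which the model has also written the next 'user:' line and replied to it. The tool-use flow in point (4) follows the same pattern, with extra stop tokens and the wrapper splicing an API response into the context before resuming generation.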