LLM Tokens

concept 1 connections

The underlying output unit of an LLM. Text is split into tokens, and each token is really just an integer ID; OpenAI's tokenizer tool visualizes how words map to tokens. English is over-represented in training sets, so common English words become clean single tokens (often including a leading space), while less-common languages like Polish burn many tokens per word. Japanese and popular domain words (e.g. nihon, go) get dense single tokens. An LLM works by predicting the next token given all prior tokens; chat, thinking, and tool use are not the underlying mechanism. Reinforcement learning layered on top of the initial training means outputs reflect what trainers preferred, not the raw source distribution. With no input tokens at all, early ChatGPT would emit random garbage; modern systems pre-process prompts to catch and refuse that case.
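
A minimal sketch of the two ideas above: text becomes integer IDs, and generation is repeated next-token prediction. Everything here is a toy assumption for illustration: the seven-entry VOCAB, the greedy longest-match encode function, and the bigram counter standing in for a model are all hypothetical. In particular, the bigram counter looks only at the last token, whereas a real LLM conditions on all prior tokens, and a real tokenizer uses a learned vocabulary of many thousands of entries. The empty-prompt check is likewise only a stand-in for the kind of pre-processing the note mentions.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary: token string -> integer ID.
# Common words carry their leading space, as described above.
VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5, ".": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Turn text into integer IDs by greedy longest-match (toy scheme)."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = (t for t in VOCAB if text.startswith(t, pos))
        match = max(candidates, key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids):
    """Map integer IDs back to text by concatenating token strings."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def train_bigram(ids):
    """Count which token follows which: a stand-in for a trained model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt_ids, max_new_tokens):
    """Append the most likely next token, one token at a time."""
    if not prompt_ids:
        # Toy guard for the empty-prompt case: refuse rather than guess.
        raise ValueError("empty prompt: nothing to condition on")
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        followers = counts.get(ids[-1])
        if not followers:
            break  # the toy model has never seen anything after this token
        ids.append(followers.most_common(1)[0][0])
    return ids


corpus_ids = encode("The cat sat on the mat.")
model = train_bigram(corpus_ids)
completion = decode(generate(model, encode("The cat"), 3))
print(corpus_ids)   # [0, 1, 2, 3, 4, 5, 6]
print(completion)   # The cat sat on the
```

Note that the model never sees words, only the IDs 0 through 6; the text reappears only when decode maps the IDs back to strings.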

Provenance

Created in: Next Token! — Chris Hasiński on LLM falsehoods 2026-04-18 07:42
Read by: 4 extractions