Reinforcement Learning from Human Feedback

concept 1 connections

A training step layered on top of base LLM training, in which the model is rewarded for outputs that human trainers actually preferred. As a result, an RLHF-tuned model's output distribution no longer mirrors its original source set; it is skewed toward a human-preferred subset of it. One implication Hasiński highlights: giving a modern ChatGPT an empty or nonsense prompt no longer yields random output. Pre-processing detects the nonsense and the system refuses, because the RLHF-tuned model has been trained, and the wrapper code around it written, to work around the original failure mode.
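
The shift toward a human-preferred subset can be illustrated with a toy sketch. It is not how RLHF is actually implemented, and nothing in it comes from Hasiński's talk. It assumes the textbook result that maximizing reward under a KL penalty tilts the base distribution by exp(reward / beta). The output labels, probabilities, reward values, and the reweight helper are all hypothetical.

```python
import math


def reweight(base_probs, rewards, beta=1.0):
    """Tilt a base distribution toward higher-reward outputs.

    Computes p_new(x) proportional to p_base(x) * exp(reward(x) / beta).
    A smaller beta means a stronger pull toward preferred outputs.
    """
    weights = {
        x: p * math.exp(rewards[x] / beta) for x, p in base_probs.items()
    }
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Hypothetical base-model distribution over three kinds of output.
base = {"helpful answer": 0.2, "rambling text": 0.5, "nonsense": 0.3}

# Hypothetical scores standing in for human preference.
rewards = {"helpful answer": 2.0, "rambling text": 0.0, "nonsense": -2.0}

tuned = reweight(base, rewards, beta=1.0)

for output, p in tuned.items():
    print(f"{output}: base={base[output]:.2f} tuned={p:.2f}")
```

In this toy setup the preferred output gains probability mass and the dispreferred one loses it, which is the sense in which the tuned distribution stops representing the source set.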

Provenance

Created in: Next Token! — Chris Hasiński on LLM falsehoods 2026-04-18 07:42
Read by: 1 extraction