A training step layered on top of base LLM training, in which the model is rewarded for outputs that human trainers actually preferred. As a result, an LLM's output distribution no longer represents its original source set but a human-preferred subset of it. One implication Hasiński highlights: giving a modern ChatGPT an empty or nonsense prompt no longer yields random output. Instead, pre-processing detects the nonsense and the system refuses, because the RLHF-tuned model and the wrapper code around it have been built to work around the original failure mode.
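The shift from the source distribution to a human-preferred subset can be illustrated with a toy sketch. This is not how RLHF is implemented in practice (real systems train a reward model and optimize the LLM against it); it only applies the standard closed-form result for a KL-regularized reward objective, where each output's base probability is multiplied by `exp(reward / beta)` and renormalized. The candidate outputs, reward values, and the function name `tilt_toward_preferences` are all hypothetical and chosen for illustration.

```python
import math


def tilt_toward_preferences(base_probs, rewards, beta=1.0):
    """Reweight base_probs by exp(reward / beta) and renormalize.

    Smaller beta means a stronger pull toward high-reward outputs.
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical base model: all four kinds of output are equally likely.
base_probs = {
    "helpful answer": 0.25,
    "rambling text": 0.25,
    "random characters": 0.25,
    "polite refusal": 0.25,
}

# Hypothetical trainer preferences, expressed as scalar rewards.
rewards = {
    "helpful answer": 2.0,
    "rambling text": -1.0,
    "random characters": -3.0,
    "polite refusal": 1.0,
}

tuned_probs = tilt_toward_preferences(base_probs, rewards, beta=1.0)

# Print outputs from most to least likely after the reweighting.
for output, p in sorted(tuned_probs.items(), key=lambda kv: -kv[1]):
    print(f"{output:18s} base={base_probs[output]:.2f} tuned={p:.3f}")
```

In this toy setup the probability mass moves away from "random characters" and toward the outputs the hypothetical trainers rewarded, which mirrors the point that the tuned distribution is a preferred subset rather than a faithful picture of the source set.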