Reinforcement Learning from Human Feedback

concept 1 connections

A training step layered on top of base LLM training, in which the model is rewarded for outputs that human trainers preferred. As a result, an LLM's output distribution no longer represents its original source set but a human-preferred subset of it. One implication Hasiński highlights: giving a modern ChatGPT an empty or nonsense prompt no longer yields random output. Pre-processing detects the nonsense and the system refuses, because RLHF-tuned models, together with their wrapper code, have been shaped around the original failure mode.
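The shift from "source distribution" to "human-preferred subset" can be illustrated with a toy reweighting. This is only a sketch under stated assumptions: the candidate outputs, their base probabilities, the reward scores, and the function name `reweight` are all hypothetical, and the rule p'(y) ∝ p(y)·exp(r(y)/β) is the idealized result of reward maximization with a KL penalty toward the base model, not a description of how any production system is trained.

```python
import math


def reweight(base_probs, rewards, beta=1.0):
    """Return p'(y) proportional to p(y) * exp(r(y) / beta).

    base_probs: hypothetical base-model probabilities per output.
    rewards:    hypothetical trainer-preference scores per output.
    beta:       strength of the pull back toward the base model
                (larger beta means a smaller shift).
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical distribution of a base model over three kinds of output.
base = {"helpful answer": 0.2, "rambling text": 0.5, "random tokens": 0.3}

# Hypothetical preference scores: higher means trainers liked it more.
reward = {"helpful answer": 2.0, "rambling text": 0.0, "random tokens": -2.0}

tuned = reweight(base, reward, beta=1.0)

for output, prob in tuned.items():
    print(f"{output}: base={base[output]:.2f} tuned={prob:.2f}")
```

In this toy setup most of the probability mass moves to the preferred output, and the dispreferred "random tokens" output becomes rare even though it was common in the base distribution.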

category
methodology
about
Reinforcement Learning from Human Feedback concept
Notes that RLHF is why outputs reflect trainer preferences rather than the raw training distribution.

Provenance

Read by
1 extraction