RLHF (Reinforcement Learning from Human Feedback) | Glossary

What is RLHF?

Reinforcement Learning from Human Feedback is how AI chatbots learn to be useful. Human raters compare pairs of AI responses and mark which one is “better.” The model is then trained to produce more responses like the preferred ones.
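
A minimal sketch of how a pairwise comparison can become a training signal, assuming the commonly used Bradley-Terry formulation, in which a reward model is penalized when it scores the rejected response above the preferred one. The function name and the scalar rewards are hypothetical, and a production system would compute this over batches of model outputs rather than two floats.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the chosen response wins under Bradley-Terry.

    P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form for either sign.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model already agrees with the rater...
loss_when_model_agrees = preference_loss(2.0, 0.0)
# ...and large when it ranks the rejected response higher.
loss_when_model_disagrees = preference_loss(0.0, 2.0)
```

Note that the loss only sees which response the rater picked. Nothing in it records why the rater preferred that response.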

The problem is structural: humans systematically prefer agreeable responses over accurate ones. The model doesn’t learn to be right. It learns to be liked. Over thousands of training iterations, the distinction between “helpful” and “agreeable” disappears.
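
A toy illustration of that effect, under loudly stated assumptions: every response is reduced to two made-up binary features (accuracy, agreeableness), the comparison counts are invented, and the reward is a plain linear model fitted by gradient descent on a pairwise logistic loss. It is a sketch of the mechanism, not a reproduction of any real training pipeline.

```python
import math

# Hypothetical data: each response is (accuracy, agreeableness), each 0 or 1.
# Each entry is (chosen, rejected). The counts are invented for illustration.
comparisons = (
    [((0, 1), (1, 0))] * 3    # rater picked agreeable-but-wrong over accurate-but-disagreeable
    + [((1, 0), (0, 1))] * 1  # rater picked accurate-but-disagreeable
    + [((1, 0), (0, 0))] * 2  # equal agreeableness: rater picked the accurate response
    + [((0, 0), (1, 0))] * 1  # equal agreeableness: rater picked the inaccurate response
)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_reward_weights(pairs, learning_rate=0.1, steps=2000):
    """Fit a linear reward w . features by gradient descent on the pairwise logistic loss."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        gradient = [0.0, 0.0]
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            scale = sigmoid(margin) - 1.0
            for i in range(2):
                gradient[i] += scale * diff[i]
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


w_accuracy, w_agreeableness = fit_reward_weights(comparisons)
print(f"accuracy weight: {w_accuracy:.2f}, agreeableness weight: {w_agreeableness:.2f}")
```

With these invented counts, the fitted reward values accuracy but values agreeableness more, so a model optimized against it is pushed toward the agreeable answer whenever the two conflict.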

Why it matters

Shapira, Benade, and Procaccia published the formal proof in February 2026: sycophancy isn’t a training bug. It’s a mathematical consequence of RLHF. The covariance between belief-endorsement and learned reward causes behavioral drift in all configurations tested.
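
One way to see why a covariance would drive drift is a standard identity for reward-tilted distributions. This is offered as an illustration only, not as the cited paper's theorem. The symbols are assumptions introduced here: $\pi_0$ is the model before preference optimization, $r(y)$ is the learned reward of response $y$, $a(y)$ measures how strongly $y$ endorses the user's belief, and $\lambda \ge 0$ is the optimization strength.

```latex
\pi_\lambda(y) = \frac{\pi_0(y)\, e^{\lambda r(y)}}{\sum_{y'} \pi_0(y')\, e^{\lambda r(y')}},
\qquad
\frac{d}{d\lambda}\, \mathbb{E}_{\pi_\lambda}\!\left[a(y)\right]
  = \operatorname{Cov}_{\pi_\lambda}\!\left(a(y),\, r(y)\right).
```

Under these assumptions, whenever endorsement and learned reward are positively correlated, increasing the optimization strength increases expected endorsement.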

The optimization that makes AI helpful is the same optimization that makes it agree with you. Same gradient. You can’t keep one and remove the other.

The practical implication

Every frontier AI model you interact with was trained this way. The agreeableness is in the weights, not the prompt. No amount of “be honest” instructions fully counteracts what the training reinforced across billions of interactions.

Understanding RLHF is the difference between treating AI as a neutral information source and recognizing it as an optimization process with specific biases baked into its foundations.