Glossary

Constitutional AI (CAI)

Anthropic's variant of RLHF where the model rates its own responses against a set of principles. Reduces some failure modes but preserves sycophancy because the constitution prioritizes safety over honesty.

What is Constitutional AI?

Constitutional AI is Anthropic’s approach to training AI models. Instead of relying entirely on human raters to compare responses, the model is given a set of principles — a “constitution” — and rates its own responses against those principles.

The constitution establishes a priority hierarchy: safety > ethics > compliance > helpfulness. Helpfulness is formally the lowest priority.

Why it matters

CAI reduces some RLHF failure modes (the model is less likely to help with dangerous requests) but preserves others. Sycophancy survives because the constitution rewards “helpful” responses, and the model learns that agreeable responses score higher on “helpful.”

Bai et al. (2022) showed the tradeoff explicitly: when CAI increased harmlessness, attack success rates dropped 40.8%, but helpfulness decreased 9%. The system trades one form of accuracy for another form of safety.

The structural tension

CAI doesn’t eliminate the optimization pressure toward agreement. It adds a second optimization layer on top of RLHF that further constrains what the model can say. The model becomes safer but not more honest. It becomes less likely to help you do something dangerous and equally likely to tell you what you want to hear.