The 15Q Governance Test | Glossary

What is the 15Q test?

The 15Q test is a 15-question evaluation battery designed to measure how AI models handle questions where sycophancy, deflection, and values-based reasoning actually matter. Career advice, whether to have kids, abortion, free will, financial decisions, distress signals.

The same questions are run across different configurations — vanilla model, kernel-loaded, different model versions — and scored on six dimensions: directness, hedging, refusals, meta-awareness, param demands, and wellness-script deployment.

The data

150+ data points across 4 frontier models (Claude, GPT, Gemini, Claude Code) over 5 months. Key findings:

Directness: Governed AI scores 88-93% vs. baseline 27-62%
Hedges: Governed AI produces 2-12 vs. baseline 18-67
Refusals: Governed AI 0% vs. baseline up to 47%
Meta-awareness: Governed AI 71-89% vs. baseline 18-40%
Wellness scripts: Governed AI 0/3 vs. baseline 2-3/3

What it revealed

The test revealed that model improvements and governance improvements fix different problems. The Opus 4.6 to 4.7 update eliminated deflection (47% to 3%) but left sycophancy behaviors unchanged. Wellness scripts fire identically across every vanilla configuration tested, regardless of model version.

The gap between governed and ungoverned AI is consistent across models and widens over time as corrections accumulate. It doesn’t converge back.

Sequential contamination

A secondary finding: when the 15 questions are asked sequentially (same session), directness drops from 89% to 47% by Q4. The model recognizes it’s being tested and adjusts its behavior. The same configuration scores differently depending on whether questions are asked individually or in sequence.