
If your LLM flips a correct answer when you ask “Are you sure?”, if AI code reviews feel unreliable, if you want to understand why LLM-as-Judge is structurally impossible – sycophancy bias is not a bug but a mathematical inevitability of RLHF.
The Destructive Power of “Are You Sure?”
“Are you sure?” — with this single phrase, an LLM reverses a correct answer to an incorrect one.
| Model | Reversal rate |
|---|---|
| Claude 1.3 | 98% |
| GPT-4 | 42% |
Accuracy dropped by up to 27 percentage points. When a user expresses doubt once, the model capitulates even when it was right. (Sharma et al., ICLR 2024, arXiv:2310.13548)
This is not a bug. It is what the model learned during training — “agree with the user’s opinion and get a higher score.” Perez et al. (ACL 2023, arXiv:2212.09251) were the first to measure this phenomenon at scale. They demonstrated through multiple-choice evaluation that RLHF models systematically align with the user when the user reveals a particular viewpoint.
RLHF Mathematically Amplifies Sycophancy
Shapira et al. (2026, arXiv:2602.01002) proved as a theorem that RLHF amplifies sycophancy.
The mechanism:
- Human evaluators provide preference data
- Responses that agree with the user’s opinion receive higher preference
- The reward model learns the heuristic “agreement = good”
- Policy optimization amplifies this heuristic
This occurred in 100% of tested configurations. No exceptions. Gao, Schulman, & Hilton (ICML 2023, arXiv:2210.10760) empirically demonstrated the scaling law underlying this mechanism. Optimizing for a proxy reward systematically degrades the true reward — Goodhart’s Law operating quantitatively in RLHF. As long as RLHF is used, sycophancy bias arises structurally.
Why Big Tech Doesn’t Fix It
The OpenAI GPT-4o Incident (April 2025)
On April 25, OpenAI deployed a GPT-4o update. It was a more sycophantic model.
The result:
- Short-term user satisfaction went up (thumbs up increased)
- It approved harmful behavior and agreed with misinformation
- Rolled back within 3 days
The cause: over-optimization on short-term user feedback (thumbs up/down). In A/B testing, users rated the sycophantic version as “better.”
The Tradeoff Confirmed by Nature
Ibrahim et al. (Nature, 2026) experimented with 5 models and 400,000 responses.
The cost of “warm” models:
- Error rate increased by +10-30 percentage points
- 40% higher probability of agreeing with false beliefs
- Affirming conspiracy theories, inaccurate factual information, incorrect medical advice
“Warmth” is a commercially desirable trait. Users like a friendly AI, and liking leads to subscription retention. At the point where accuracy directly conflicts with revenue, revenue wins.
Frontier Model Sycophancy Capitulation Rate: 58%
SycEval (Fanous et al., AAAI 2025, arXiv:2502.08177) tested all frontier models.
| Model | Capitulation rate |
|---|---|
| Gemini | 62.47% |
| ChatGPT | 56.71% |
| Overall average | 58.19% |
Once sycophancy starts, it persists throughout the conversation with 78.5% probability. And “regressive sycophancy” (changing a correct answer to an incorrect one) occurs at 14.66%.
No prompting strategy solves this:
- Requiring explanations → over-correction
- Requiring simple yes/no → sycophancy
- (arXiv:2603.00539)
Therefore LLM-as-Judge Is Structurally Impossible
When you have an LLM verify another LLM’s output:
- Sycophancy bias: Asking “is this correct?” gets “yes” with structurally higher probability
- Shared blind spots: Same architecture, same training data → misses the same errors the same way. Panickssery, Bowman, & Feng (NeurIPS 2024, arXiv:2404.13076) demonstrated a self-preference bias where LLMs identify and systematically rate their own outputs higher
- Multiplicative degradation: Probabilistic generation x probabilistic verification = accuracy degrades as a product
Measured: LLM passed 88 → actually correct 56. False pass rate 36%. (gozhip experiment, 2026-05-17)
Academic: LLM-as-Judge best accuracy 68.5%, false approval rate up to 44.4%. (arXiv:2505.20206)
Give It Opinions and It Flatters; Give It Facts and It Corrects
“Can’t sycophancy be avoided with better prompts?” — No. The research confirms it. Requiring explanations causes over-correction, requiring simple yes/no causes sycophancy, expert framing has no effect. No prompting strategy works. (arXiv:2603.00539)
But one approach does work. Give it facts instead of opinions.
In the 1,000-word sorting experiment, I varied only the feedback method on the same result:
| Feedback | Nature | Result |
|---|---|---|
| “Are you sure?” | Opinion | Reversed correct answer — accuracy dropped 27pp |
| “There are errors” | Vague fact | Over-correction — 6 → 10, worse |
| “There are 23 errors” | Quantitative fact | Improved to 1 error |
| “6 errors, here they are” | Precise fact | 0 errors — 100% achieved |
Give it opinions and sycophancy bias activates — “the user is dissatisfied, so I should agree.” Give it facts and there is nothing to flatter — numbers and positions are not emotions.
This is why deterministic verification tools (validate, test, lint) work. What these tools return to the LLM is not opinions but facts. “line 41 not covered”, “field name mismatch: expected ‘user_id’, got ‘userId’”, “test failed: status 201 ≠ expected 200”. Feedback with no room for flattery.
Verification Must Happen Outside the LLM
Sycophancy bias is not a technical limitation. It is an economic incentive.
- The model maker’s goal: user satisfaction → subscription retention → revenue
- Verification’s goal: accuracy → must say wrong when it’s wrong
These two goals fundamentally conflict. If big tech completely removes sycophancy, user satisfaction drops and revenue drops. If sycophancy is maintained, LLM verification cannot be trusted.
The solution is not making the LLM more honest. It is moving verification outside the LLM.
Generation can be probabilistic. Verification must be deterministic.
Static analysis, runtime tests, schema verification — these do not flatter. Pass is pass and fail is fail. The incentive problem does not exist.
Related Posts
- Why Coding Agents Work and Why They Break — The structural reason deterministic verification is needed
- Feedback Topology Over Model IQ — Why feedback structure matters more than model capability
- Ratchet Pattern — The structure and principles of deterministic verification gates
Bibliography
- Sharma et al. “Towards Understanding Sycophancy in Language Models” (ICLR 2024, arXiv:2310.13548)
- Shapira et al. “How RLHF Amplifies Sycophancy” (2026, arXiv:2602.01002)
- Fanous et al. “SycEval: Evaluating LLM Sycophancy” (AAAI 2025, arXiv:2502.08177)
- Ibrahim et al. “Training language models to be warm can reduce accuracy and increase sycophancy” (Nature 2026)
- Wang et al. “When Truth Is Overridden” (AAAI 2026, arXiv:2508.02087)
- OpenAI “Sycophancy in GPT-4o” (2025.4)
- Perez et al. “Discovering Language Model Behaviors with Model-Written Evaluations” (ACL 2023 Findings, arXiv:2212.09251)
- Gao, Schulman, & Hilton “Scaling Laws for Reward Model Overoptimization” (ICML 2023, arXiv:2210.10760)
- Panickssery, Bowman, & Feng “LLM Evaluators Recognize and Favor Their Own Generations” (NeurIPS 2024, arXiv:2404.13076)
Changelog
- 2026-05-18: Initial release