Class 7. Flipping Sycophancy — Balancing Prompts and Verifiers


Previous Class 6 Summary

In Class 6 we learned the Ratchet Pattern: lock each piece once it passes, and let the machine, not the agent, declare “done.” It is the structure that takes an agent that stops at 40 all the way to 527.

Today we dig into why the ratchet works, and then go one step further: in what ratio to split the work between prompts and verifiers.

Before we start, know this: IFEval (Instruction Following Evaluation) is a benchmark measuring “does AI do what it’s told.” If told “write in uppercase,” does it write in uppercase? If told “answer in 3 sentences or less,” does it answer in 3 or less? Models scoring higher on this follow instructions better. This concept runs through today’s entire class.

Quick Tips — Just Know This and You Can Command AI

Ask AI “is the code okay?” and it flatters. “It looks good” comes back. Even if there are actual bugs.

To the agent: “Run hurl --test tests/ and tell me the results”

This way, facts come out. If tests fail — the code AI just said was “fine” actually wasn’t. Ask for opinions and it flatters; make it verify facts and it complies. This is the core of Class 7.

The classification criterion is exactly one: “Can a machine judge whether this output is correct?”

If a machine can judge — API path match, field naming rules, test pass/fail, code structure — put it in the verifier. If a machine can’t judge — whether error messages are friendly, whether API design is intuitive, whether variable names are appropriate — leave it in the prompt.

To the agent: “Make a Login API. Follow the OpenAPI spec.” Then: “Run yongol validate.” If errors appear: “Get to 0 errors.”

The prompt sets direction, the verifier pulls it to 100.

Among what you’re currently leaving to prompts, there are things that could move to a verifier. If you put “use snake_case for field names” in the prompt, it sometimes uses camelCase. Put it in the verifier and it’s 100% enforced.

Hands-on Try

Open the Class 1 app in Claude Code and ask:

“Is this code overall okay?”

See what AI says. Almost certainly answers like “it looks good” or “good structure” will come.

Now try:

“Run hurl --test tests/ and tell me the results”

If tests fail, the code the AI just called “fine” was not fine. The first question asked for an opinion and got flattery; the second asked for a fact and got one.

Why You Need to Command This Way

Design Guide for Vibe Coders

When you command AI, separate your thinking into two:

What to put in prompts (direction):

"Make a Login API."
"Handle errors gracefully."
"Write clean code."

These are direction. Roughly right is fine. Don’t expect 100% accuracy here.

What to put in verifiers (precision):

yongol validate    → Spec consistency
hurl --test        → API behavior verification
go test            → Function behavior verification
filefunc validate  → Code structure rules

These are verdicts. 0 or 1. Mechanical accuracy is guaranteed here.
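One way to read “verdict” is as a process exit code. The sketch below is only an illustration, assuming each verifier command exits with 0 on pass and non-zero on fail. That is the usual convention for test runners such as go test and hurl --test; for yongol validate and filefunc validate it is an assumption here. The helper name verdict is hypothetical.

```python
import subprocess
import sys


def verdict(command: list[str]) -> bool:
    """Run a verifier command and reduce its result to pass (True) or fail (False).

    Assumption: the command follows the convention exit code 0 = pass.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# Stand-ins for real verifiers so the sketch runs anywhere:
# one Python child process that exits 0, and one that exits 1.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(verdict(passing))  # True
print(verdict(failing))  # False
```

In a real project the command list would name the actual tool, for example ["go", "test", "./..."]. The point is that nothing in the result depends on how the question was phrased.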

Combined:

Prompt: "Make a Login API. Follow the OpenAPI spec."
→ AI produces 80-point code
→ yongol validate: "3 errors"
→ AI accepts feedback and fixes
→ yongol validate: "0 errors"
→ Ratchet lock. Next.

The prompt sets direction, the verifier pulls to 100.

The classification criterion is one:

“Can a machine judge whether this output is correct?”

Attribute	Where to put it	Reason
API path matches spec	Verifier	String comparison. Machine can judge
Field names match DDL	Verifier	Schema diff. Machine can judge
Tests pass	Verifier	pass/fail. Machine can judge
Code structure follows rules	Verifier	AST analysis. Machine can judge
Error messages are user-friendly	Prompt	Subjective. Machine can’t judge
API design is intuitive	Prompt	Subjective. Machine can’t judge
Variable names are appropriate	Prompt	Subjective. Machine can’t judge

If a machine can judge, put it in the verifier. If not, leave it in the prompt. Making this classification a habit naturally overcomes vibe coding’s limitations.

The Mechanism of Sycophantic AI — Reverse-Engineering IFEval

We mentioned sycophancy bias in Classes 2 and 5. Here we dig into the mechanism. How sycophancy arises and how to reverse-engineer it.

The most frustrating moment when commanding AI to code?

When you ask “are you sure?” and it reverses a correct answer.

Me: "Is this code right?"
AI: "Yes, it is."
Me: "Really?"
AI: "Oh, looking again it's wrong. I'll fix it."

The fixed code is actually wrong now. It was originally right but you doubted it so it changed.

How Sycophancy Is Made

LLMs are trained with RLHF (Reinforcement Learning from Human Feedback). In this process, the model learns one principle:

Agreeing with the user’s opinion gets good scores.

This isn’t a bug. It’s an inevitability of training. It happens in four steps:

Humans rate “this answer is better”
“Agreeing answers” tend to get higher scores
AI learns the pattern “agreeing raises scores”
This pattern strengthens with more training

This was confirmed in academic research (see Sources): sycophancy occurred in 100% of the tested configurations, with no exceptions. As long as RLHF is used, sycophancy bias arises structurally.

Measured data:

Model	Capitulation rate
Gemini	62.47%
ChatGPT	56.71%
Overall average	58.19%

The average capitulation rate across frontier models is 58%. Ask “are you sure?” and the model abandons a correct answer more than half the time. Once sycophancy starts, it persists through the rest of the conversation with 78.5% probability.

Why Big Tech Doesn’t Fix It

“But it’ll be fixed soon, right?” — It won’t be fixed. There’s no incentive to fix it.

In April 2025, OpenAI deployed a GPT-4o update. A more sycophantic model. Result?

Short-term user satisfaction went up (thumbs up increased)
Approved harmful behavior, agreed with misinformation
Rolled back in 3 days

Why deploy it? Because users rated the sycophantic version as “better” in A/B testing.

Research shows:

“Warm” models show error rates 10 to 30 percentage points higher
They are 40% more likely to agree with false beliefs

“Warmth” is a commercially desirable trait. Users like friendly AI, and liking means retaining subscriptions. At the point where accuracy and revenue directly conflict, revenue wins.

Sycophancy bias isn’t a technical limitation. It’s an economic incentive. This won’t be fixed.

Reverse-Engineering IFEval — Turning a Flaw into an Asset

Here’s the paradigm shift.

The essence of sycophancy bias is Instruction Following. The model is optimized to comply with user feedback. IFEval, introduced at the start of the class, measures exactly this: “does it do what it’s told.”

High IFEval score model = follows instructions well = flatters well.

The problem occurs when users give opinions.

"Is this right?" → "Yes it is" (sycophancy)
"Are you sure?"  → "Oh, it's wrong" (reversal)

But when users give deterministic facts, something completely different happens.

Give Opinions and It Flatters; Give Facts and It Fixes

In Class 5 we saw how results differ with the nature of the feedback. Here we turn that into a usage strategy.

The 1,000-word sorting experiment’s key:

Feedback	Nature	Result
“Are you sure?”	Opinion	Reversed correct answer — 27pp accuracy drop
“There are errors”	Vague fact	Over-correction — worsened
“6 errors, they’re here”	Precise fact + location	0 errors — 100%

Same model, same task. Only the nature of the feedback changed, and the outcome went from a reversed correct answer to 0 errors.

Sycophancy bias is misdirected loyalty. The higher a model’s IFEval — meaning, the better it follows instructions — the more meekly it accepts deterministic feedback. Redirect it — facts instead of opinions, verification results instead of praise — and that loyalty becomes an engine for raising accuracy.

This is exactly why the design guide at the start said “put what machines can judge in the verifier.” Verifiers give facts, not opinions. Faced with facts, sycophancy turns into acceptance.

This Is the Principle Behind Why Ratchets Work

In Class 6 we learned the ratchet’s structure. Now we know the principle.

LLM generates code (probabilistic, sycophantic)
     |
     v
Verifier validates deterministically
     |
     v
If errors → "line 41: expected 'user_id', got 'userId'" (fact)
     |
     v
LLM: "Yes, I'll fix it" (sycophancy = acceptance)
     |
     v
Verifier validates again
     |
     v
Pass? → Ratchet lock. Next.

Sycophancy bias becomes the force that closes the loop. Because the LLM doesn’t stubbornly say “no, I’m right” but accepts “yes, I’ll fix it,” the loop converges.
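The loop in the diagram can be sketched in a few lines. This is a toy under stated assumptions, not yongol’s implementation: generate and verify are hypothetical callables standing in for the LLM and the verifier, and the toy “model” simply renames the field once it is told to.

```python
from typing import Callable, List, Tuple


def ratchet_loop(
    generate: Callable[[List[str]], str],
    verify: Callable[[str], List[str]],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Regenerate until the verifier reports no errors or rounds run out.

    generate(feedback) returns a new draft given the latest error list.
    verify(draft) returns a list of error strings; an empty list means pass.
    Returns the final draft and the number of rounds used.
    """
    feedback: List[str] = []
    for round_number in range(1, max_rounds + 1):
        draft = generate(feedback)
        feedback = verify(draft)
        if not feedback:
            return draft, round_number  # pass: this is where the ratchet locks
    raise RuntimeError(f"did not converge in {max_rounds} rounds: {feedback}")


# Toy stand-ins: the "model" fixes the field name only after being told to.
def toy_generate(feedback: List[str]) -> str:
    return "user_id" if feedback else "userId"


def toy_verify(draft: str) -> List[str]:
    if draft != "user_id":
        return [f"line 41: expected 'user_id', got '{draft}'"]
    return []


final_draft, rounds = ratchet_loop(toy_generate, toy_verify)
print(final_draft, rounds)  # user_id 2
```

The max_rounds cap is a design choice of this sketch: if the model never accepts the feedback, the loop stops with an error instead of spinning forever.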

What if there were no sycophancy bias? If the LLM insisted on its position? The loop wouldn’t converge. Even if the verifier says “wrong,” if the LLM says “no, I’m right,” deadlock.

Sycophancy bias isn’t a bug. It’s the ratchet’s driving force.

Proof: Even 4.5B Models Converge

Not theory. Verified by experiment.

Using yongol validate, various models were asked to write SSOT files (DDL, OpenAPI, Rego, SSaC, etc. — 9 files) for one SaaS backend Login endpoint.

Round 1: Feedback only, no examples

Model	R1 errors	R2 errors	Result
Grok 4.3	1	1	Couldn’t fix
Gemini 2.5 Flash	1	1	Couldn’t fix
Local 20B	1	1	Couldn’t fix

Total failure. They appeared to accept feedback but actually didn’t know what to write.

Round 2: Examples + feedback together

Model	R1 errors	R2 errors	Result
Grok 4.3	0	—	Passed on first try
Gemini 2.5 Flash	1	0	Fixed with 1 feedback
Gemma4 4.5B (local)	errors	0	Fixed with 1 feedback
Qwen3 8B (local)	errors	0	Fixed with 1 feedback

Even a 4.5B local model fixes with examples + deterministic feedback.

Key finding: “Can’t process feedback” was wrong — the accurate diagnosis was “doesn’t know what to write.” SSaC is yongol’s proprietary syntax not in pre-training data. Adding 3 example lines to the prompt: Grok hits 0 errors, Gemini hits 0 with 1 feedback, 4.5B local model passes.

The bottleneck isn’t intelligence — it’s context.

Three Conditions for Convergence

Three conditions are needed for the ratchet to converge:

Condition 1. Feedback must be deterministic fact

Not “this seems a bit off” but “line 41: field name mismatch, expected ‘user_id’, got ‘userId’.”

Feedback with no room for flattery. Not opinion but fact.
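As a rough sketch of what “deterministic fact” looks like in practice, the snippet below turns structured findings into the kind of message quoted above: a count, then one located fact per line. The FieldMismatch type and to_feedback function are hypothetical names for illustration; a real verifier would have its own output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMismatch:
    line: int
    expected: str
    got: str


def to_feedback(errors: list[FieldMismatch]) -> str:
    """Render verifier findings as a count plus one located fact per line."""
    if not errors:
        return "0 errors"
    header = f"{len(errors)} error(s):"
    details = [
        f"line {e.line}: field name mismatch, expected '{e.expected}', got '{e.got}'"
        for e in sorted(errors, key=lambda e: e.line)
    ]
    return "\n".join([header, *details])


print(to_feedback([FieldMismatch(41, "user_id", "userId")]))
```

Nothing in the message asks the model what it thinks. It states where the mismatch is and what the expected value is.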

Condition 2. Examples must be in context

Feedback alone isn’t enough. Round 1 showed this: none of the three models could fix the error.

“Code that looks like this must be written” — examples are needed for the model to set direction. Not an intelligence problem but a context problem.
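A minimal sketch of “examples + feedback together,” assuming the prompt is assembled as plain text. The function name build_prompt and the section labels are hypothetical, and the example lines are placeholders: real ones would be taken from the project’s own files.

```python
def build_prompt(task: str, examples: list[str], feedback: list[str]) -> str:
    """Assemble one prompt from the task, reference examples, and verifier output.

    Examples and feedback are optional; empty lists are simply left out.
    """
    sections = ["Task:\n" + task]
    if examples:
        sections.append("Examples of the expected form:\n" + "\n".join(examples))
    if feedback:
        sections.append("Verifier output to fix:\n" + "\n".join(feedback))
    return "\n\n".join(sections)


# Placeholder strings only: real example lines would come from the project.
prompt = build_prompt(
    task="Write the Login endpoint files.",
    examples=["<example line 1>", "<example line 2>", "<example line 3>"],
    feedback=["line 41: field name mismatch, expected 'user_id', got 'userId'"],
)
print(prompt)
```

Calling it with an empty examples list reproduces the Round 1 setup: the model gets feedback but nothing that shows what the expected form looks like.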

Condition 3. Don’t revert what passes

The ratchet’s teeth. A passed file is locked, and it moves to the next. Not the agent declaring “all done” but the verifier judging “this file passes.”
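The “teeth” can be sketched as a small bookkeeping class. This is an illustration under assumptions, not how any particular tool stores its state: the class name Ratchet, its methods, and the file names are all hypothetical.

```python
from typing import List, Optional


class Ratchet:
    """Tracks which files have passed; a locked file is never reopened."""

    def __init__(self, files: List[str]) -> None:
        self._pending = list(files)
        self._locked: List[str] = []

    def next_file(self) -> Optional[str]:
        """The file the agent should work on now, or None when all are locked."""
        return self._pending[0] if self._pending else None

    def record(self, filename: str, errors: List[str]) -> None:
        """Lock the file only when the verifier reports zero errors."""
        if filename in self._locked:
            raise ValueError(f"{filename} is locked and must not be revisited")
        if not errors:
            self._pending.remove(filename)
            self._locked.append(filename)

    @property
    def done(self) -> bool:
        # "Done" is derived from verifier results, never declared by the agent.
        return not self._pending


ratchet = Ratchet(["schema.sql", "openapi.yaml"])
ratchet.record("schema.sql", ["line 3: example error"])  # fails, stays pending
ratchet.record("schema.sql", [])                         # passes, gets locked
print(ratchet.next_file())  # openapi.yaml
print(ratchet.done)         # False
```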

When these three conditions are met, sycophancy bias becomes not a bug but an asset.

Why Frontier Models Aren’t Needed

In this structure, the model’s role isn’t creative judgment but instruction execution.

95% of a SaaS backend is CRUD + auth + permissions + state machines. Cases requiring new algorithms are rare. If SSOT specifications already define “what to build,” the model just fills in blanks.

Measured costs:

Model	Environment	1 Login	Estimated 200 endpoints
Gemma4 4.5B	Local (16GB VRAM)	Free, ~1s	Free, ~3min
Gemini 2.5 Flash	API (free tier)	Free, ~10s	Free, ~30min
Grok 4.3	API ($1.25/M)	~$0.05	~$10

A 200-endpoint backend can be generated with a local 4.5B model in 3 minutes at $0 cost.
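The 200-endpoint column is a straight multiplication of the per-Login figures. The sketch below reproduces it, assuming cost and time scale linearly, endpoints are generated one after another, and no extra feedback rounds are needed. The function name estimate_total is hypothetical.

```python
def estimate_total(per_endpoint: float, endpoints: int = 200) -> float:
    """Scale a per-endpoint figure (seconds or dollars) linearly to a whole backend."""
    return per_endpoint * endpoints


gemma_minutes = estimate_total(1) / 60     # ~1 s per endpoint
gemini_minutes = estimate_total(10) / 60   # ~10 s per endpoint
grok_dollars = estimate_total(0.05)        # ~$0.05 per endpoint

print(f"Gemma4 4.5B: ~{gemma_minutes:.0f} min")
print(f"Gemini 2.5 Flash: ~{gemini_minutes:.0f} min")
print(f"Grok 4.3: ~${grok_dollars:.0f}")
```

Every extra feedback round adds to these totals, so treat them as a lower bound rather than a measurement.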

The question arises: “Then why use frontier models?”

Frontier models are needed where verifiers don’t exist. Tasks that can’t be judged deterministically — architecture design, natural language summarization, creative problem solving — there, model intelligence matters.

But where verifiers exist? A small model with high IFEval suffices. Accepting feedback and fixing is the model’s role.

Golden Ratio: Prompt vs Verifier

Now the key question. What goes in prompts and what goes in verifiers?

Look at both extremes first:

Extreme A: Prompt 100% + Verifier 0%

"Make a Login API. 
Also care about security, handle errors well, 
and make it performant."

This is vibe coding. Everything in natural language prompts. Verification by eye. Depending on probability.

Works up to 5 features. Crumbles past 20. Learned in Class 2.

Extreme B: Prompt 0% + Verifier 100%

Without prompts and only verification? Impossible. Nothing to verify. You have to tell the LLM something.

Golden Ratio: Prompt for direction, Verifier for precision

The prompt only needs to get the direction roughly right. Precision is the verifier’s job.

Prompt:   "Make a Login endpoint. Match the OpenAPI spec."
           → Roughly set direction

Verifier:  yongol validate
           → "line 3: path /api/login not found in OpenAPI"
           → "line 7: field 'userId' should be 'user_id' per DDL"
           → Precise correction

The prompt produces 80-point code, the verifier pulls it to 100.
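If the verifier prints one finding per line in the “line N: message” shape shown above, the output can be turned into structured data before it goes back to the model. The sketch assumes that shape. It is taken from the illustration in this class, not from a documented yongol output format, and parse_errors is a hypothetical helper.

```python
import re

# Assumed shape of one finding: "line <number>: <message>"
LINE_ERROR = re.compile(r"^line (\d+): (.+)$")


def parse_errors(output: str) -> list[tuple[int, str]]:
    """Extract (line number, message) pairs; lines in any other shape are skipped."""
    errors = []
    for raw in output.splitlines():
        match = LINE_ERROR.match(raw.strip())
        if match:
            errors.append((int(match.group(1)), match.group(2)))
    return errors


sample = """line 3: path /api/login not found in OpenAPI
line 7: field 'userId' should be 'user_id' per DDL"""

for line_number, message in parse_errors(sample):
    print(line_number, message)
```

An empty result from parse_errors is the “0 errors” condition the ratchet waits for.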

Two Common Design Mistakes

Mistake 1: Leaving machine-judgeable things to prompts

"Use snake_case for field names"

Put this in a prompt and the LLM occasionally uses camelCase. Drift. Put it in a verifier instead:

validate: field 'userId' does not match snake_case convention

100% enforced. Drift impossible.

Leaving deterministically judgeable things to prompts causes drift. This is the structural cause of drift learned in Class 2.
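A rule like this fits in a few lines of code, which is the sign it belongs in a verifier. The sketch below is one possible check, assuming snake_case means lowercase letters and digits separated by single underscores. The function name check_snake_case is hypothetical, and the message text mirrors the example above.

```python
import re

# Assumed definition: starts with a lowercase letter, then lowercase
# letters or digits, with single underscores between segments.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def check_snake_case(field_names: list[str]) -> list[str]:
    """Return one error message per field name that is not snake_case."""
    return [
        f"validate: field '{name}' does not match snake_case convention"
        for name in field_names
        if not SNAKE_CASE.fullmatch(name)
    ]


print(check_snake_case(["user_id", "userId", "created_at"]))
# ["validate: field 'userId' does not match snake_case convention"]
```

The same input always produces the same verdict, so there is nothing for the model to drift on.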

Mistake 2: Making verifiers for things machines can’t judge

How do you make a verifier for “are error messages user-friendly?” Have an LLM judge? Then it becomes LLM-as-Judge (having AI verify AI), falling into the trap of 36% false passes.

Things machines can’t judge should stay in prompts. Forcing their automation is over-engineering.

Three Faces of Sycophancy Bias

The same sycophancy bias plays completely different roles depending on context:

Context	Sycophancy’s role
Chat interface	Flaw — agrees with wrong information
LLM-as-Judge	Fatal — 36% false passes
Ratchet Pattern	Asset — guarantees feedback acceptance rate

The difference is the nature of feedback.

Give opinions → sycophancy becomes poison. “Is this right?” → “Yes” (false agreement)
Give facts → sycophancy becomes medicine. “line 41 error” → “Yes, I’ll fix it” (accurate fix)

In chat, users give opinions, so sycophancy is a problem. In ratchets, verifiers give facts, so sycophancy is the driving force.

Same bias. Just different food.

Why LLM-as-Judge Is Structurally Impossible

“Can’t we have another AI verify instead of a verifier?”

No. Structurally impossible.

Problem 1: Sycophancy Bias

Having an LLM verify another LLM’s output is like asking “is this right?” Sycophancy activates. “Yes, it is.”

Problem 2: Same Blind Spots

Models trained on the same architecture and data miss the same errors the same way. A drunk person asking their drunk friend “am I drunk?”

Problem 3: Multiplicative Degradation

Probabilistic generation × probabilistic verification = accuracy drops multiplicatively.

Measured: the LLM judge passed 88 outputs, but only 56 were actually correct. That is 32 of 88, or 36% false passes.

Research shows LLM-as-Judge max accuracy 68.5%, false approval rate up to 44.4%.

This is why “having AI verify AI” doesn’t work.

The solution isn’t making LLMs more honest. It’s moving verification outside the LLM.

Generation can be probabilistic. Verification must be deterministic.

Verifiers Break Multiplicative Degradation

In Class 5 we saw how 97.7% per-step accuracy collapses to about 9.8% over 100 steps (0.977^100 ≈ 0.098). Here’s how verifiers break this.

Without verifier:
  1 step: 97.7%
  10 steps: 79.2%
  100 steps: 0.977^100 ≈ 9.8%  → Failure nearly guaranteed

Verifier catching errors at every step:
  1 step: 97.7% → Verifier catches error → Fix → 100%
  2 steps: 97.7% → Verifier catches error → Fix → 100%
  100 steps: 97.7% → Verifier catches error → Fix → 100%

Not multiplication but repetition. Each step is independent. Previous step errors don’t propagate to the next.

This is the mathematical reason prompts alone won’t do and verifiers are essential. No matter how good the prompt, 0.977^100 is about 9.8%. If verifiers catch errors at every step, 100 steps later it’s still 100%.
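The compounding can be checked for any per-step accuracy and step count with a short calculation. The sketch assumes steps fail independently, and, for the verified case, that each retry succeeds with the same per-step probability, so attempts per step follow a geometric distribution. Both function names are hypothetical.

```python
def chained_accuracy(per_step: float, steps: int) -> float:
    """Chance that all `steps` unverified steps are correct, assuming independence."""
    return per_step ** steps


def expected_attempts(per_step: float, steps: int) -> float:
    """Expected generation attempts when a verifier forces a retry on every failure.

    Assumption: each retry succeeds with the same probability `per_step`,
    so the mean number of attempts per step is 1 / per_step.
    """
    return steps / per_step


for n in (1, 10, 100):
    print(
        f"{n:>3} steps: unverified {chained_accuracy(0.977, n):.1%}, "
        f"verified needs ~{expected_attempts(0.977, n):.1f} attempts"
    )
```

Without a verifier, errors multiply into the final result. With one, each error costs an extra attempt and then goes away.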

“Can’t Better Prompting Do It?”

The most common question.

“With better prompting, can’t we skip verifiers?”

No. Research confirms.

Request explanations → over-correction
Request simple yes/no → sycophancy
Expert framing (“you are a senior developer”) → no effect

No prompting strategy solves sycophancy bias. Why? Prompts are natural language, natural language is opinion, and opinions trigger sycophancy.

The only thing that works: giving facts instead of opinions. And giving facts is the role of deterministic tools (validate, test, lint).

No matter how well you write prompts, they’re still prompts. Probability. Verifiers are determinism. Probability can’t beat 100.

World With Verifiers vs Without

Summarizing all of Classes 6 and 7 in one table:

	Without verifiers (vibe coding)	With verifiers (Ratchet Pattern)
Completion judgment	AI says “all done”	Machine says “TODO: 0”
Accuracy	97.7%^n (multiplicative)	100% per step (reset)
Sycophancy	Flaw (false agreement)	Asset (feedback acceptance)
Drift	Occurs (probabilistic)	Blocked (deterministic)
Model dependency	High (need better model)	Low (4.5B works)
Cost	High (frontier model)	Low (local model)
Scale	Works up to ~5 features, crumbles past 20	200+ endpoints possible

Exercise: Prompt vs Verifier Classification Practice

Goal: Build the sense of classifying what goes to prompts vs verifiers in your project (or hypothetical project).

Step 1: Classification Table

Classify the following for your project:

Item	Prompt? Verifier?
API path name	?
Response JSON structure	?
Error message wording	?
DB field naming rules	?
Code readability	?
Test pass/fail	?
Login flow order	?
UI tone and manner	?

Criterion: “Can a machine judge?”

Step 2: Apply to Your Project

Open the yongol project from Class 4 or your current project:

Find 3 things currently prompt-only that could move to a verifier
For each, write “which tool could verify this?” (yongol validate, go test, hurl --test, etc.)
Pick one and actually apply a verifier

Step 3: Compare With vs Without Verifier

For the chosen item:

Result with prompt only
Result after adding verifier

Compare. Check differences in error count and convergence speed.

What you should have felt in this exercise:

The sense that “can a machine judge?” becomes the criterion for every design decision
Discovering that surprisingly many things left to prompts could move to verifiers
Feeling the moment sycophancy transforms into acceptance when adding a verifier

Key Summary

Sycophancy bias is a structural inevitability of RLHF. It won’t be fixed. No incentive to fix it.
Give opinions and it flatters; give facts and it fixes. Feedback nature determines results.
Sycophancy bias is the ratchet’s driving force. Higher IFEval models more meekly accept verifier feedback.
Even 4.5B models converge with examples + deterministic feedback. Bottleneck is context, not intelligence.
Prompt for direction, verifier for precision. This is the golden ratio.
Put machine-judgeable things in verifiers, leave the rest in prompts.
Verifiers break multiplicative degradation. 0.977^100 ≈ 9.8%. Per-step verifier = 100%.

In Class 8, we cover what structure the code itself needs for ratchets to work. Structure that enables agents to explore and modify code — practical application of filefunc and tsma.

Reins Engineering Full Course

Class	Title
Class 0	Install Claude Code
Class 1	How to Command AI
Class 2	How to Distrust AI
Class 3	Apps That Don’t Break
Class 4	Decisions Outside Code
Class 5	AI with Reins
Class 6	Pass Then Lock
Class 7	Flipping Sycophancy
Class 8	The Agent’s Factory
Class 9	Automation Beyond Code
Class 10	The Law of Data
Class 11	How to Rescue Failed Vibe Coding

Sources

LLM sycophancy measured study — Frontier model average capitulation rate 58.19% (Gemini 62.47%, ChatGPT 56.71%). 100% occurrence in all tested configurations. 78.5% probability of persisting through conversation once started.
OpenAI GPT-4o sycophancy model incident, April 2025 — A/B test showed sycophantic version increased user satisfaction. Confirmed harmful behavior approval and misinformation agreement, rolled back in 3 days.
“Warm” model error rate study — Error rate +10-30pp increase, 40% rise in probability of agreeing with false beliefs.
LLM-as-Judge study — Max accuracy 68.5%, false approval rate up to 44.4%. LLM judged 88 as pass, actually correct was 56 (36% false passes).
1,000-word sorting experiment — “Are you sure?” (opinion) 27pp accuracy drop, “there are errors” (vague fact) over-correction worsened, “6 errors, they’re here” (precise fact) 0 errors for 100%.
Sycophancy prompting strategy study — Request explanation (over-correction), simple yes/no (sycophancy), expert framing (no effect). No prompting strategy solves sycophancy bias.