Why Coding Agents Work and Why They Break


Same model. The one that hallucinated in web chat ships a 200-line feature in Claude Code in one shot. Codex’s /goal resolves an entire issue. The model didn’t suddenly get smarter. What changed is the structure.

Why They Work

The loop in conversational AI looks like this:

LLM → human → LLM → human

All feedback is natural language: probabilistic generation followed by probabilistic evaluation. When every step in the chain is only probably right, the per-step accuracies multiply, so overall accuracy falls with each turn.

The loop in coding agents is different:

LLM → code generation → file save → test run → pass/fail → LLM
LLM → code edit → build → success/failure → LLM
LLM → type check → error message → LLM

Deterministic gates sit inside the loop. The filesystem saves exactly what was written. A test is pass or fail. The compiler says wrong when it’s wrong. These inadvertently serve as ratchets.
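To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ratchet_loop`, `compile_gate`, and `toy_generate` are hypothetical names, and the toy generator merely stands in for a model. The point is only the shape: a probabilistic producer, a deterministic gate, and nothing getting out until the gate passes.

```python
from typing import Callable, Optional


def ratchet_loop(
    generate: Callable[[Optional[str]], str],
    gate: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Call generate until gate accepts its output or attempts run out.

    generate: hypothetical stand-in for the LLM; receives the last error or None.
    gate: deterministic verifier; returns None on pass, an error message on fail.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(error)
        error = gate(candidate)
        if error is None:
            return candidate  # the ratchet clicks: only verified output gets out
    return None


def compile_gate(source: str) -> Optional[str]:
    """Deterministic gate: does the text parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def toy_generate(error: Optional[str]) -> str:
    """Toy generator (not a real model): wrong first, right once it sees an error."""
    if error is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"


result = ratchet_loop(toy_generate, compile_gate)
print(result)
```

The gate here only checks syntax. Swapping in a test runner or type checker changes what is verified, not the structure of the loop.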

An LLM is an unreliable component. But building a reliable protocol on top of unreliable components is a fundamental of engineering. Von Neumann showed mathematically in 1956 that redundancy combined with majority voting can make noisy parts compute reliably, provided each part's error rate is low enough (Von Neumann, 1956). TCP builds reliable delivery on an unreliable network. RAID builds reliable storage on unreliable disks (Patterson et al., 1988). ECC builds reliable memory on unreliable bits (Hamming, 1950).
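The majority-voting idea can be checked with a few lines of arithmetic. The sketch below is a simplified model, not Von Neumann's actual construction: it assumes voters fail independently and that the vote itself is error-free. The function name `majority_correct` is made up for this example.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p. n must be odd (no ties)."""
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 3, 5, 11):
    print(n, round(majority_correct(0.9, n), 4))
```

With p = 0.9, three voters already reach 0.972. The same formula shows the catch: when p is below 0.5, adding voters makes the result worse, which is why the per-part error rate has to be low enough to begin with.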

The reason coding agents work is the same. An unreliable LLM is wrapped in deterministic verifiers (tests, builds, linters, type checkers). The SWE-agent study demonstrated that even the same model shows dramatically different performance depending on Agent-Computer Interface design (Yang et al., NeurIPS 2024). It is the topology, not model capability, that drives success.

But Why Do They Break?

They work, I said. But they break sometimes. Why?

Because ratchets that are incidentally present and ratchets that are consciously designed are different things.

Ratchet-free zones exist

What happens when an agent edits code that has no tests? The build passes, lint passes, but the functionality is broken. In zones without deterministic gates, the LLM judges probabilistically, and chained probabilistic judgments multiply their error rates.

Suppose that out of 200 endpoints, 180 have tests and 20 do not. The agent handles the 180 perfectly and silently plants bugs in the 20. That is why you get “almost done, but something’s off.”
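Finding those 20 does not need an LLM. A minimal sketch, assuming a hypothetical layout where each endpoint handler lives in `handlers/<name>.py` and its tests in `tests/test_<name>.py` (the layout, the file names, and the function `untested_handlers` are all invented for this example):

```python
import tempfile
from pathlib import Path


def untested_handlers(src_dir: Path, test_dir: Path) -> list[str]:
    """Handler names in src_dir with no matching test_<name>.py in test_dir."""
    handlers = {p.stem for p in src_dir.glob("*.py") if p.stem != "__init__"}
    tested = {p.stem.removeprefix("test_") for p in test_dir.glob("test_*.py")}
    return sorted(handlers - tested)


# Demo on a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "handlers").mkdir()
    (root / "tests").mkdir()
    for name in ("orders.py", "users.py", "billing.py"):
        (root / "handlers" / name).touch()
    for name in ("test_orders.py", "test_users.py"):
        (root / "tests" / name).touch()
    gaps = untested_handlers(root / "handlers", root / "tests")

print(gaps)
```

A file-name match is a crude proxy for real coverage, but it is deterministic, and it turns "something's off somewhere" into a concrete list of places to look.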

Feedback information is insufficient

I ran a sorting experiment with 1,000 words. A CPU sort took 0.08 ms with 100% accuracy. The LLM took 438 seconds with 97.7% accuracy. That alone is remarkable: 97.7% through pure cognition. But the real discovery was elsewhere.

I varied only the feedback level on the same result:

Feedback given	Errors remaining (accuracy)
None	6 errors (99.4%)
“There are errors”	10 errors (99.0%), worse than no feedback
“There are 23 errors”	1 error (99.9%)
“6 errors, here they are”	0 errors (100%)

Telling it only “you’re wrong” causes over-correction and makes things worse. Giving it a count creates a target to hunt toward. Giving it locations achieves perfection.
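The four feedback levels are cheap to produce once a deterministic checker exists. The sketch below is an illustration, not the harness used in the experiment: it assumes an "error" means a word that sorts before its predecessor, and the names `misplaced_positions` and `feedback` are hypothetical.

```python
def misplaced_positions(words: list[str]) -> list[int]:
    """Indices i where words[i] sorts before words[i - 1] (an adjacent inversion)."""
    return [i for i in range(1, len(words)) if words[i] < words[i - 1]]


def feedback(words: list[str], level: int) -> str:
    """Build feedback at one of four levels: 0 none, 1 existence, 2 count, 3 locations."""
    bad = misplaced_positions(words)
    if level == 0 or not bad:
        return ""
    if level == 1:
        return "There are errors."
    if level == 2:
        return f"Error count: {len(bad)}."
    details = "; ".join(
        f"index {i}: '{words[i]}' follows '{words[i - 1]}'" for i in bad
    )
    return f"Error count: {len(bad)}. Locations: {details}"


attempt = ["apple", "cherry", "banana", "date"]
for level in range(4):
    print(level, repr(feedback(attempt, level)))
```

The checker does the same work at every level. The only thing that changes is how much of what it already knows gets passed back.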

Most agents today operate at the second level. When a test fails, they know “something’s wrong,” but they don’t convey the structural reason why. Error messages exist, but they are symptoms, not causes.
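For test failures, moving past the second level means reporting where the failure is and how the actual value differs from the expected one, in a form the agent can parse. A rough sketch, assuming a hypothetical `Failure` record and a tiny case runner `run_cases` (neither belongs to any real test framework):

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass
class Failure:
    where: str     # which case failed
    expected: Any  # value the case required
    actual: Any    # value the function produced (None if it raised)
    why: str       # how actual differs from expected


def run_cases(
    fn: Callable[..., Any], cases: dict[str, tuple[tuple, Any]]
) -> list[Failure]:
    """Run fn on each named case and collect structured failures."""
    failures = []
    for name, (args, expected) in cases.items():
        try:
            actual = fn(*args)
        except Exception as exc:
            failures.append(
                Failure(name, expected, None, f"raised {type(exc).__name__}: {exc}")
            )
            continue
        if actual != expected:
            failures.append(
                Failure(name, expected, actual, "return value differs from expected")
            )
    return failures


def buggy_abs(x: int) -> int:
    return x  # deliberately wrong for negative inputs


report = run_cases(buggy_abs, {"positive": ((3,), 3), "negative": ((-3,), 3)})
print(json.dumps([asdict(f) for f in report], indent=2))
```

The JSON output names the failing case and shows expected 3 against actual -3, which is closer to a cause than a bare "1 test failed."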

In the sorting experiment, the LLM left 6 errors in round R2. In R3, asked to review its own output, it reported “no errors.” In R4b, it again reported “no errors.” It missed the same 6 in the same way.

Without hints, no matter how many repetitions, it converged at 99.4%. Only when told “6 remain” did it finally reach 100%.

The same happens with coding agents. The agent creates a bug, self-reviews with “looks fine,” and when told to fix it again, misses the same spot. Huang et al. showed that when LLMs try to correct their own reasoning errors without external feedback, performance actually degrades (Huang et al., ICLR 2024). This is why retry is not the answer. Blind spots are a structural limitation of the model’s probabilistic nature, not a lack of effort.

Errors multiply at scale

97.7% accuracy chained twice: 0.977² ≈ 95.5%. Three times: 93.3%. Ten times: 79.2%.

An agent editing a single file does well. But ask it to refactor across 100 files? Even at 97% per step, 100 steps leave only a 0.97¹⁰⁰ ≈ 4.8% chance that every step is right. Failure is virtually guaranteed.

This is the mathematical explanation for “vibe coding breaks down at 200 endpoints.” In small projects, the chain count is low enough that probability holds. In large projects, multiplication becomes catastrophic.
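The arithmetic is easy to reproduce. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the helper names `chain_success` and `max_steps` are made up for this example.

```python
def chain_success(p: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed, each with probability p."""
    return p**steps


def max_steps(p: float, target: float) -> int:
    """Largest chain length whose overall success probability stays >= target."""
    if not (0 < p < 1 and 0 < target <= 1):
        raise ValueError("need 0 < p < 1 and 0 < target <= 1")
    steps, prob = 0, 1.0
    while prob * p >= target:
        prob *= p
        steps += 1
    return steps


for p, n in ((0.977, 2), (0.977, 3), (0.977, 10), (0.97, 100)):
    print(f"{p}^{n} = {chain_success(p, n):.3f}")

print("steps before a 97% chain drops below 50%:", max_steps(0.97, 0.5))
```

Under these assumptions a 97% step can be chained only about two dozen times before the whole run is more likely to fail than succeed.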

What Is Needed

The reasons for working and the reasons for breaking point to the same place: the presence or absence of deterministic verification gates.

Current agents rely on ratchets that are incidentally present (tests, builds, linters). Designing them consciously makes them stronger.

What it means to design ratchets consciously:

First, identify ratchet-free zones. Code without tests, APIs without schemas, data without types. Every place where the agent judges probabilistically is a vulnerability.

Second, increase the information content of feedback. Returning only pass/fail induces over-correction. “Where, why, and how the actual differs from expected” must be delivered in structured form.

Third, insert deterministic gates between chaining steps. Running 10 steps at once makes multiplication catastrophic, but locking with a ratchet at each step resets degradation.
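The effect of a gate at every step can be estimated with the same kind of arithmetic. This is a sketch under optimistic assumptions: the gate is perfect, it feeds back enough information that each retry is an independent attempt, and every step has the same success rate. The names `step_success` and `gated_chain_success` are hypothetical.

```python
def step_success(p: float, attempts: int) -> float:
    """Chance a step passes its gate within `attempts` independent tries."""
    return 1 - (1 - p) ** attempts


def gated_chain_success(p: float, steps: int, attempts: int = 1) -> float:
    """Chance that every step in the chain eventually passes its gate."""
    return step_success(p, attempts) ** steps


for attempts in (1, 2, 3):
    print(attempts, f"{gated_chain_success(0.97, 100, attempts):.3f}")
```

With one attempt per step this is the ungated 0.97¹⁰⁰ case. Allowing a couple of gated retries per step lifts the 100-step chain from under 5% to over 99% in this model. Real retries are not fully independent, as the blind-spot results above suggest, so treat these numbers as an upper bound.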

LLMs are remarkable generators. They sort 1,000 words at 97.7% accuracy through pure cognition. Humans cannot do this. But anything less than 100% collapses under repetition. 0.977 squared is already about 0.955.

Coding agents work not because the model is smart. They work because deterministic gates sit inside the loop. They break because those gates are missing.

Generation can be probabilistic. Verification must be deterministic.


References

Von Neumann, J. (1956). “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components.” In Shannon, C.E. & McCarthy, J. (Eds.), Automata Studies, Annals of Mathematics Studies, No. 34, Princeton University Press, pp. 43-98.
Saltzer, J.H., Reed, D.P., & Clark, D.D. (1984). “End-to-End Arguments in System Design.” ACM Transactions on Computer Systems, 2(4), 277-288.
Patterson, D.A., Gibson, G., & Katz, R.H. (1988). “A Case for Redundant Arrays of Inexpensive Disks (RAID).” Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pp. 109-116.
Hamming, R.W. (1950). “Error Detecting and Error Correcting Codes.” The Bell System Technical Journal, 29(2), 147-160.
Yao, S. et al. (2023). “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
Shinn, N. et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
Jimenez, C.E. et al. (2024). “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” ICLR 2024.
Yang, J. et al. (2024). “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” NeurIPS 2024.
Huang, J. et al. (2024). “Large Language Models Cannot Self-Correct Reasoning Yet.” ICLR 2024.
Kamoi, R. et al. (2024). “When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs.” TACL, 12, 1298-1318.
Cemri, M. et al. (2025). “Why Do Multi-Agent LLM Systems Fail?” arXiv:2503.13657.
Arbuzov, M.L., Shvets, A.A., & Beir, S. (2025). “Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models.” arXiv:2505.24187.