
One question
Open the longest file in your project. How many functions are in it?
Now tell an AI agent to modify one function in that file. The agent reads the entire file. If the file holds 20 functions, the agent opened it for one, but 19 unnecessary functions came along for the ride.
This is where the problem begins.
Code humans read vs. code agents work with
Until now, code was written for humans to read. Naming variables well, writing comments, producing documentation — all of it was about reducing human cognitive load.
In the age of agents, the question changes. Is code that’s easy for humans to read the same as code that’s easy for agents to work with?
It isn’t.
| | Human | AI Agent |
|---|---|---|
| Navigation | Scans directory trees visually | Searches with grep |
| Opening files | Scrolls in IDE | read file — loads everything |
| Judging context | Intuition + experience | Only knows what’s in the context |
| Irrelevant code | Ignores it | Consumes context budget |
| 2,000-line file | Looks at only what’s needed | Processes all of it |
A human scrolling through a 2,000-line file has an intuition that says “don’t touch this part.” An agent has no such intuition. When it reads 2,000 lines, 1,950 of them are context pollution.
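To make the ratio concrete, here is a minimal sketch that measures how many lines an agent loads versus how many it needs when it reads a whole file for one function. It uses Python's standard `ast` module; the function name `context_waste` and the `SAMPLE` source are made up for illustration, and line counts are only a rough proxy for tokens.

```python
import ast


def context_waste(source: str, target: str) -> tuple[int, int]:
    """Return (lines_needed, lines_loaded) for reading a whole file to get one function."""
    tree = ast.parse(source)
    loaded = len(source.splitlines())
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == target:
            # end_lineno is inclusive, so add 1 to get the span length.
            needed = node.end_lineno - node.lineno + 1
            return needed, loaded
    raise KeyError(f"function {target!r} not found")


# Hypothetical three-function file standing in for a large utils module.
SAMPLE = '''\
def a():
    return 1

def b():
    return 2

def c():
    return 3
'''

needed, loaded = context_waste(SAMPLE, "b")
irrelevant = 100 * (loaded - needed) / loaded
print(f"needed {needed} of {loaded} lines ({irrelevant:.0f}% irrelevant)")
```

On a real 2,000-line file the same calculation gives the kind of ratio described above: a 50-line function leaves 1,950 lines the agent never needed.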
Research supports this. The studies listed in the references report performance drops of 30% or more when relevant information is buried in the middle of a long context, and drops of 13.9–85% even when the unnecessary tokens are only whitespace. That shorter context is better isn’t intuition; it’s experimental evidence.
Don’t put a robot in a human office. Build a factory where robots can work.
Three things agents need
For agents to work reliably in a codebase, three things must be in place.
1. It must be readable — without noise
One concept per file. The filename is the concept name.
before: read utils.go → 20 functions, 19 unnecessary
after: read check_one_file_one_func.go → 1 function, exactly what's needed
filefunc solves this problem. It separates code into semantic units with 22 structural rules. Applied to the Hono framework (23k+ stars), it split 186 files into 626. All 4,419 tests passed. Files increased 3.4x, but not a single line of logic changed.
“Won’t there be too many files?” — Agents don’t browse directories. They search. Whether there are 500 or 1,000 files, one grep is all it takes. Not opening 295 unnecessary files matters more than picking the 5 you need.
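The idea of one concept per file can be sketched in a few lines. This is not filefunc's algorithm (the article does not spell out its 22 rules); it is an illustrative splitter, with a hypothetical `split_by_function` helper, that maps each top-level function of a Python module to its own proposed filename.

```python
import ast


def split_by_function(source: str) -> dict[str, str]:
    """Map a proposed filename to the source text of each top-level function."""
    tree = ast.parse(source)
    files: dict[str, str] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The filename is the concept name, as the convention suggests.
            # Note: the segment starts at the `def` line, so decorators are not included.
            segment = ast.get_source_segment(source, node)
            files[f"{node.name}.py"] = segment + "\n"
    return files


# Hypothetical two-function utils module.
UTILS = '''\
import os

def slugify(text):
    return text.lower().replace(" ", "-")

def basename(path):
    return os.path.basename(path)
'''

files = split_by_function(UTILS)
for name in sorted(files):
    print(name)
```

A real split also has to carry over imports, shared constants, and types, which this sketch deliberately ignores. That extra work is why the existing test suite passing unchanged is the meaningful check.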
2. It must be verifiable — mechanically
When you modify a function with no tests, nobody knows what breaks. The agent doesn’t know either. It falls into a doom loop.
before: 0 tests, no way to know what breaks on modification
after: 527 functions with tests, behavior changes detected immediately
tsma solves this problem. It indexes every function in the project, detects test presence, measures coverage, and feeds back uncovered branches with line numbers.
Without feedback, asking an LLM to write tests plateaus at 60–70% coverage. Tell it “line 41, 44, 70 uncovered” and it reaches 100%. Same model. The only difference is the resolution of feedback.
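The difference in feedback resolution can be illustrated with a small formatter. This is a sketch, not tsma's output format: it assumes a coverage tool has already produced the set of executable lines and the set of lines that ran, and the name `coverage_feedback` is hypothetical.

```python
def coverage_feedback(executable: set[int], executed: set[int]) -> str:
    """Turn raw coverage data into a message that names the exact uncovered lines."""
    missing = sorted(executable - executed)
    if not missing:
        return "100% covered"
    covered = len(executable) - len(missing)
    pct = 100 * covered / len(executable)
    lines = ", ".join(str(n) for n in missing)
    return f"{pct:.0f}% covered; uncovered lines: {lines}"


# Illustrative data: ten executable lines, three of which no test reached.
executable = {40, 41, 42, 44, 70, 71, 72, 73, 74, 75}
executed = executable - {41, 44, 70}

message = coverage_feedback(executable, executed)
print(message)
```

A bare percentage tells the model that something is missing. The line list tells it where, which is the part it can act on.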
In an experiment on a project with 527 functions, an autonomous agent working on its own declared “all done” after 40. With the ratchet applied, all 527 were completed and the TODO count reached 0.
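One way to read the ratchet is that an item leaves the TODO list only when a verifier accepts it, never because the agent says it is finished. The sketch below is an assumption about the pattern, not tsma's implementation; `ratchet`, `flaky_attempt`, and the item names are hypothetical.

```python
from typing import Callable


def ratchet(todo: list[str], attempt: Callable[[str], bool], max_rounds: int = 10) -> list[str]:
    """Retry open items until the verifier accepts all of them or rounds run out.

    Returns the items that are still open.
    """
    remaining = list(todo)
    for _ in range(max_rounds):
        if not remaining:
            break
        # An item is dropped only when attempt() returns True (verifier passed).
        remaining = [item for item in remaining if not attempt(item)]
    return remaining


attempts: dict[str, int] = {}


def flaky_attempt(name: str) -> bool:
    """Stand-in for 'agent writes tests, verifier checks them'."""
    attempts[name] = attempts.get(name, 0) + 1
    # Hypothetical rule: names starting with "hard" need two tries to pass.
    required = 2 if name.startswith("hard") else 1
    return attempts[name] >= required


left = ratchet(["easy_a", "hard_b", "easy_c"], flaky_attempt)
print(f"{len(left)} remaining")
```

The completion condition is the empty list, so "all done" cannot be declared while anything is still open.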
3. Specifications must be cross-verifiable
It must be mechanically verifiable whether API schemas, DB schemas, security policies, and state transitions are consistent with each other. When one changes, you must know before compilation whether it conflicts with the others.
before: 200 endpoints, humans check spec consistency
after: one operationId chains all layers, machines detect drift
yongol solves this problem. It chains 10 single sources of truth (SSOTs) such as OpenAPI, DDL, sqlc, SSaC, Rego, and Hurl through a single operationId and cross-validates them with ~287 rules. For example, user_id is a string in OpenAPI but BIGINT in DDL; existing tools can’t catch cross-layer contradictions like this.
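The user_id example can be turned into a tiny cross-layer check. This is only a sketch of the idea, not one of yongol's rules: the inputs are assumed to be already parsed into plain dictionaries, and the `COMPATIBLE` table and `find_type_drift` function are hypothetical.

```python
# Assumed mapping from OpenAPI primitive types to acceptable SQL column types.
COMPATIBLE = {
    "string": {"TEXT", "VARCHAR", "UUID"},
    "integer": {"INTEGER", "BIGINT", "SMALLINT"},
    "boolean": {"BOOLEAN"},
}


def find_type_drift(openapi_props: dict[str, str], ddl_columns: dict[str, str]) -> list[str]:
    """Report fields whose OpenAPI type and DDL column type disagree."""
    problems = []
    for field, api_type in sorted(openapi_props.items()):
        sql_type = ddl_columns.get(field)
        if sql_type is None:
            problems.append(f"{field}: in OpenAPI but missing from DDL")
        elif sql_type.upper() not in COMPATIBLE.get(api_type, set()):
            problems.append(f"{field}: OpenAPI {api_type} vs DDL {sql_type}")
    return problems


# Illustrative inputs mirroring the contradiction described above.
drift = find_type_drift(
    {"user_id": "string", "email": "string"},
    {"user_id": "BIGINT", "email": "TEXT"},
)
for problem in drift:
    print(problem)
```

Each layer is valid on its own here. The contradiction only appears when the two are compared, which is why it has to be a separate, mechanical check.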
One structure running through all three tools
filefunc, tsma, and yongol are independent tools, but they share a common structure.
filefunc: 22 structural rules → validate → fix → repeat
tsma: measure coverage → feed back uncovered branches → fix → repeat
yongol: cross-validate → detect drift → fix → repeat
All the same loop.
LLM generates → deterministic tool judges → result fed back to LLM → repeat
Symbolic Feedback Loop. A cyclic structure where deterministic tools correct the probabilistic generation of LLMs. Not AI verifying AI — machines verifying AI.
Give it opinions and it flatters. Give it facts and it fixes. Ask “is the code okay?” and it answers “yes, looks great.” Tell it “line 41: field name mismatch” and it fixes it immediately. Feedback with no one to flatter — because numbers and locations aren’t emotions.
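The loop above can be written down generically. The sketch below assumes the generator is any callable that takes the previous round's feedback and returns a candidate; `fake_generate` is a stand-in for an LLM call, and `check`, `REQUIRED`, and `feedback_loop` are hypothetical names.

```python
from typing import Callable

REQUIRED = ("id", "name")


def check(candidate: dict) -> list[str]:
    """Deterministic judge: returns facts about what is wrong, never an opinion."""
    return [f"missing field: {key}" for key in REQUIRED if key not in candidate]


def fake_generate(feedback: list[str]) -> dict:
    """Stand-in for an LLM: starts incomplete, then fixes exactly what feedback names."""
    draft = {"id": 1}
    for message in feedback:
        draft[message.removeprefix("missing field: ")] = "filled"
    return draft


def feedback_loop(
    generate: Callable[[list[str]], dict],
    judge: Callable[[dict], list[str]],
    max_iters: int = 5,
) -> tuple[dict, int]:
    """Generate, judge, feed the findings back, and repeat until the judge is silent."""
    feedback: list[str] = []
    for iteration in range(1, max_iters + 1):
        candidate = generate(feedback)
        feedback = judge(candidate)
        if not feedback:
            return candidate, iteration
    raise RuntimeError(f"still failing after {max_iters} rounds: {feedback}")


result = feedback_loop(fake_generate, check)
print(result)
```

The judge has no model in it, so the same candidate always gets the same verdict. That determinism is what makes the loop converge instead of drifting.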
From legacy to agent-operable
You don’t need to change an existing codebase all at once. This isn’t foundation work — it’s seismic retrofitting. Reinforcing the building without closing the shop.
Step 1 — Make it readable
Start with the longest files. Run filefunc validate and drive violations to zero. All existing tests must pass.
Step 2 — Make it verifiable
Run tsma next repeatedly. Each run adds tests to untested functions and fills uncovered branches. Even if the agent dies mid-run, progress is preserved. A new agent runs tsma next and picks up where it left off.
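Resumability follows from keeping progress outside the agent. The sketch below illustrates that idea with a JSON file; it is an assumption about the approach, not a description of how tsma stores state, and `progress.json`, `next_item`, and `mark_done` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_done(state_path: Path) -> set[str]:
    """Read the set of completed items, or an empty set on first run."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()))


def next_item(state_path: Path, all_items: list[str]) -> Optional[str]:
    """Return the first item not yet recorded as done, or None when finished."""
    done = load_done(state_path)
    for item in all_items:
        if item not in done:
            return item
    return None


def mark_done(state_path: Path, item: str) -> None:
    """Persist completion so a later process can resume from here."""
    done = load_done(state_path)
    done.add(item)
    state_path.write_text(json.dumps(sorted(done)))


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "progress.json"
    items = ["parse", "validate", "render"]

    first = next_item(state, items)
    mark_done(state, first)  # the first agent finishes one item, then "dies"

    # A new agent has no memory; it only reads the state file.
    resumed = next_item(state, items)
    print(first, "->", resumed)
```

Because the second call reads only the file, it does not matter whether the same agent or a fresh one makes it.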
Step 3 — Cross-validate
Introduce SSOTs and run yongol validate. Machines catch cross-layer contradictions.
Each step is independent. You can do step 2 without step 1, or step 1 without step 2. But when all three combine, the scope of autonomous agent work expands dramatically.
Changing the operating system
An agent-operable codebase isn’t just linting or tooling. It’s changing the operating system of the codebase.
| | human-readable | agent-operable |
|---|---|---|
| File size | Scrollable range for humans | One concept |
| Tests | Nice to have; intuition fills the gap | Required for every function |
| Specs | Docs, wikis, verbal handoffs | Declarative, cross-verifiable, machine-readable |
| Feedback | PR review (hours) | Verifier run (seconds) |
| Completion check | Human says “looks good” | Machine says “487 remaining” |
Many people are making the train faster. Bigger models, smarter agents, better prompts.
The faster the train goes, the more the tracks matter. Almost no one is laying tracks yet.
Related articles
- filefunc — One concept per file — A code structure convention that eliminates LLM context pollution with 22 structural rules
- tsma — Regression defense line for legacy code — Ratchet-based test automation that completes 527 functions down to TODO 0
- Why coding agents work and why they break — Structural analysis of the Symbolic Feedback Loop
- Feedback topology over model IQ — Why the same model stops at 40 or completes 527
- whyso — What git blame doesn’t show — Automatic per-file change history extraction
References
- Stanford, “Lost in the Middle: How Language Models Use Long Contexts” (2024) — 30%+ performance drop when relevant info is buried in the middle of the context
- Amazon, “Context Length Alone Hurts LLM Performance” (2025) — 13.9–85% performance drop even when unnecessary tokens are whitespace
- Hono framework case study — 186 files → 626 files, all 4,419 tests passed
- tsma 527-function case study — PASS 246 (46.7%), DONE 281 (53.3%), TODO 0
- Ratchet Pattern experiment — autonomous agent 40/527 (7.6%) vs ratchet CLI 527/527 (100%)