Class 8. Agent Factory — Agent Operable Codebase


Quick Tips — Just Know This and You Can Command AI

The biggest problem when agents modify code is a single file packed with 20 functions. The agent needs one function, opens the file, and 19 unneeded functions come along. Research reports performance drops of 30-85% when irrelevant context is mixed in like this.

The solution is three phrases. The tools behind them, filefunc and tsma, support Go, TypeScript, and Python.

To the agent: “Find the longest file and split it by function. Filename matches function name. All existing tests must pass.”

To the agent: “Run filefunc validate and get violations to 0. All existing tests must pass.”

To the agent: “Repeat tsma next to add tests for all functions. If uncovered branches appear, add tests covering them too. Continue until ‘All functions complete!’ appears.”

Even without reading code, these three phrases suffice. Tools judge, agents execute. You just make decisions.

“Won’t there be too many files?” — Files increase 3-14x. But agents don’t open directories. They search. Whether 500 or 1,000 files, one search command is enough.

“What about legacy code without tests?” — Repeat tsma next. If the agent dies mid-way, progress is preserved. New agent runs tsma next and continues. Verified across 527 functions.

Hands-on Try

Open the Class 1 app with Claude Code:

“What’s the longest file in this project? How many functions are in it?”

Most likely multiple functions are packed in one file. Now command:

“Split each function in that file into separate files. Filename matches function name.”

After splitting, try:

“Find the todo completion handler function and explain it.”

Before splitting, the agent had to read the entire long file. After splitting, it just opens complete_todo.go. The agent’s search cost drops — you see firsthand the effect of “one file, one concept.”

Why You Need to Command This Way

Introduction: Don’t Put Robots in a Human Office

Through Class 7, we learned to prevent drift (Hurl), separate decisions from implementation (yongol), force progress with ratchets (Ratchet Pattern), and reverse-engineer sycophancy bias (IFEval).

Even with all that, one thing remains. The code’s own structure.

When you tell an agent “modify this function,” what does the agent do? Find the file, open the file, read contents, modify. The agent’s exploration unit in this process is the file.

But what if one file has 20 functions?

You need one function so you open the file, and 19 unnecessary functions come along. This is context pollution.

Need only the CrossError type, so you read the file that defines it
→ 19 unnecessary functions come along
→ Context pollution

Don’t put robots in a human office. Build a factory where robots can work. Same for code.

Code Readable by SWEs ≠ Code Operable by Agents

SWEs (software engineers) read code by scrolling through files and grasping context. Even in a 2,000-line file, experience provides intuition for “don’t touch this part.”

Agents have no such intuition.

Aspect	SWE	AI Agent
Exploration	Browse directory tree visually	Search with `grep`
Opening files	Scroll in IDE	`read file` — full load
Context judgment	Intuition + experience	Only knows what’s in context
Unnecessary code	Ignores	Consumes context budget
Key difference	Sees only needed parts even in 2,000 lines	Processes all 2,000 lines

Research confirms this: performance drops 30-85% when irrelevant information is mixed into the context.¹

Shorter context is better. That is an experimental result, not intuition. The fix is to split code structurally so that only the needed parts enter the context. The obstacle was that no tooling existed to enforce it.

filefunc fills that gap. It supports Go, TypeScript, and Python projects. Whatever language you used in Class 1, it applies.

filefunc — One File, One Concept

filefunc’s core principle is just one.

One file, one concept. Filename = concept name.

Whether func, type, interface, or const group — same rule. All rules derive from this one principle.

# Without filefunc
read utils.go → 20 funcs, 19 unnecessary. Context pollution.

# filefunc
read check_one_file_one_func.go → 1 func. Exactly what's needed.
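The "filename = concept name" rule can be sketched in a few lines of Go. This is only an illustration, not filefunc's implementation: `fileNameFor` is a hypothetical helper that handles plain CamelCase names. Names containing acronyms (the document later shows `CheckSSaCOpenAPI` mapped to `check_ssac_openapi.go`) would need extra handling that this sketch does not attempt.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// fileNameFor converts a CamelCase function name into a snake_case Go filename.
// Hypothetical helper for illustration; acronyms are not handled.
func fileNameFor(funcName string) string {
	var b strings.Builder
	for i, r := range funcName {
		if unicode.IsUpper(r) {
			if i > 0 {
				b.WriteByte('_') // word boundary before each capital letter
			}
			r = unicode.ToLower(r)
		}
		b.WriteRune(r)
	}
	return b.String() + ".go"
}

func main() {
	fmt.Println(fileNameFor("CheckOneFileOneFunc")) // check_one_file_one_func.go
	fmt.Println(fileNameFor("CompleteTodo"))        // complete_todo.go
}
```

Because the mapping is mechanical, an agent that knows a function name can usually guess the file to open without listing any directory.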

Validated on the Hono framework (23k+ stars): 186 files were split into 626, and all 4,419 tests still passed. The file count grew 3.4x, but not a single line of logic changed. It was a purely structural refactoring.

Picking the 5-10 you need matters less than not opening the 290 you don’t.

Programs Are Only Three Things

filefunc’s file-splitting criteria aren’t arbitrary. Any program can be made from a combination of three operations:

Sequence — Execute top to bottom
Selection — Choose a fork based on conditions
Iteration — Repeat the same operation multiple times

This was mathematically proven in 1966 (Bohm-Jacopini theorem). You don’t need to know the name. What matters is that filefunc requires each function to use only one of these three flow types.

You don’t need to read this code. What matters is //ff:what has a one-line description of what the function does:

//ff:func feature=validate type=rule control=sequence
//ff:what F1: validates one func per file
func CheckOneFileOneFunc(gf *model.GoFile) []model.Violation {

control	Meaning	Nesting limit
`sequence`	Sequential execution	2 levels
`selection`	Branching (switch)	2 levels
`iteration`	Loop	2-3 levels
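To make the three control values in the table concrete, here is a small hypothetical Go example with one function per flow type. The function names are invented for illustration. In a filefunc project each function would live in its own file; they share one block here only to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// Sequence: statements run top to bottom with no branch and no loop.
func greeting(name string) string {
	trimmed := strings.TrimSpace(name)
	return "Hello, " + trimmed
}

// Selection: a single switch picks one branch.
func severity(level string) int {
	switch level {
	case "error":
		return 2
	case "warning":
		return 1
	default:
		return 0
	}
}

// Iteration: a single loop repeats the same operation.
// The branching is delegated to severity, so this function stays loop-only.
func totalSeverity(levels []string) int {
	total := 0
	for _, level := range levels {
		total += severity(level)
	}
	return total
}

func main() {
	fmt.Println(greeting("  Ana "))
	fmt.Println(totalSeverity([]string{"error", "warning", "info"}))
}
```

The loop function calls the branching function instead of embedding the switch, which is the same move that keeps nesting shallow.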

Why limit nesting? Once a condition sits inside a loop inside another condition (depth 3), both humans and agents lose track. Keeping most functions to depth 2 makes each one simpler, which reduces side effects when agents modify it.

Hono’s Node.search was depth 6. After filefunc refactoring, depth 2. Each piece has only one control flow. The overall algorithm is identical.
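For readers curious how a nesting limit could be checked mechanically, the sketch below measures the deepest nesting of control statements in a Go source string using the standard `go/parser` and `go/ast` packages. It is not filefunc's implementation and the function name `maxNesting` is an assumption. It also simplifies: an `else if` counts as one level deeper, which a real tool may treat differently.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// maxNesting returns the deepest nesting of if/for/range/switch/select in src.
func maxNesting(src string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return 0, err
	}
	depth, deepest := 0, 0
	var stack []bool // one entry per open node; true if it is a control statement
	ast.Inspect(file, func(n ast.Node) bool {
		if n == nil { // Inspect signals "leaving a node" with nil
			if stack[len(stack)-1] {
				depth--
			}
			stack = stack[:len(stack)-1]
			return true
		}
		isControl := false
		switch n.(type) {
		case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt,
			*ast.SwitchStmt, *ast.TypeSwitchStmt, *ast.SelectStmt:
			isControl = true
			depth++
			if depth > deepest {
				deepest = depth
			}
		}
		stack = append(stack, isControl)
		return true
	})
	return deepest, nil
}

// nested has a range loop containing an if containing a switch.
const nested = `package p

func f(xs []int) int {
	total := 0
	for _, x := range xs {
		if x > 0 {
			switch {
			case x > 10:
				total += x
			}
		}
	}
	return total
}
`

// flat has a single if statement.
const flat = `package p

func g(x int) int {
	if x > 0 {
		return 1
	}
	return 0
}
`

func main() {
	a, _ := maxNesting(nested)
	b, _ := maxNesting(flat)
	fmt.Println(a, b) // 3 1
}
```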

22 Validation Rules — You Don’t Need to Memorize

filefunc has 22 rules. That sounds like a lot, but you don’t need to memorize them. Run filefunc validate and the tool finds all violations, and the agent fixes them. You just say “get violations to 0.”

The tables below show a sample of what gets checked, to give the big picture. Skim and move on.

File structure rules:

Rule	On violation
One func per file (filename = function name)	ERROR
One type per file (filename = type name)	ERROR
Methods: 1 file 1 method	ERROR
`_test.go` allows multiple funcs	Exception

Code quality rules:

Rule	On violation
Nesting depth: sequence=2, selection=2, iteration=2-3	ERROR
Max 1,000 lines per func	ERROR
Recommended: sequence/iteration 100 lines, selection 300 lines	WARNING

Running filefunc validate lists all violations. Give this list to the agent and it converges through a `while ERROR > 0: fix` loop.
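The convergence loop can be sketched as follows. Everything here is a stand-in: `validate` represents running the validator and counting ERRORs, `fix` represents one round of agent edits, and the round cap is an assumption added so the loop cannot run forever. None of these names come from filefunc itself.

```go
package main

import (
	"errors"
	"fmt"
)

// converge repeats validate and fix until no errors remain or maxRounds is reached.
// It returns the number of fix rounds that were needed.
func converge(validate func() int, fix func(), maxRounds int) (int, error) {
	for round := 0; round < maxRounds; round++ {
		if validate() == 0 {
			return round, nil
		}
		fix()
	}
	if validate() == 0 {
		return maxRounds, nil
	}
	return maxRounds, errors.New("violations remain after the round limit")
}

func main() {
	violations := 5 // simulated ERROR count
	validate := func() int { return violations }
	fix := func() { // simulated agent: clears up to two violations per round
		violations -= 2
		if violations < 0 {
			violations = 0
		}
	}
	rounds, err := converge(validate, fix, 10)
	fmt.Println(rounds, err) // 3 <nil>
}
```

The tool decides when the loop ends, not the agent, which is what makes the result trustworthy.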

Codebook and Annotations — The Agent’s Map

Splitting files isn’t the end. The agent needs to quickly judge “which file to open.” Codebooks and annotations solve this.

Codebook (codebook.yaml):

required:
  feature: [validate, annotate, chain, parse, codebook, report, cli]
  type: [command, rule, parser, walker, model, formatter, loader, util]

optional:
  pattern: [error-collection, file-visitor, rule-registry]
  level: [error, warning, info]

A codebook is the project’s vocabulary list. The AI agent’s map. You don’t need to create this file yourself either — tell the agent “create codebook.yaml for this project.” With a codebook, the agent finds exact files without exploration.
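As a rough illustration of how a codebook acts as a controlled vocabulary, the sketch below checks a set of tags against the required keys from the example above. The `Codebook` type and `Check` method are hypothetical, the values are hard-coded rather than loaded from YAML, and the real tool's behavior may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Codebook holds the allowed values for each required key.
type Codebook struct {
	Required map[string][]string
}

// exampleCodebook mirrors the required section of the codebook.yaml shown above.
func exampleCodebook() Codebook {
	return Codebook{Required: map[string][]string{
		"feature": {"validate", "annotate", "chain", "parse", "codebook", "report", "cli"},
		"type":    {"command", "rule", "parser", "walker", "model", "formatter", "loader", "util"},
	}}
}

// Check returns one message per missing or out-of-vocabulary required key.
func (c Codebook) Check(tags map[string]string) []string {
	var problems []string
	for key, allowed := range c.Required {
		value, ok := tags[key]
		if !ok {
			problems = append(problems, "missing "+key)
			continue
		}
		if !contains(allowed, value) {
			problems = append(problems, key+"="+value+" not in codebook")
		}
	}
	sort.Strings(problems) // map iteration order is random; sort for stable output
	return problems
}

func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

func main() {
	cb := exampleCodebook()
	fmt.Println(cb.Check(map[string]string{"feature": "validate", "type": "rule"}))
	fmt.Println(cb.Check(map[string]string{"feature": "billing"}))
}
```

A closed vocabulary is what lets a search for `feature=validate` return every relevant file and nothing else.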

Annotations:

Again, you don’t need to know the code itself. The key is one annotation line summarizing the function’s role:

//ff:func feature=validate type=rule control=sequence
//ff:what F1: validates one func per file
//ff:why Primary citizen is AI agent. 1 file 1 concept prevents context pollution.
func CheckOneFileOneFunc(gf *model.GoFile) []model.Violation {

Annotation	Content
`//ff:func`	func file’s feature, type, control metadata
`//ff:what`	1-line description — what this function does
`//ff:why`	Why made this way — user’s decision
`//ff:checked`	LLM verification signature (auto-generated)

These annotations serve as a search index. Without heavy infrastructure like vector embeddings or RAG, one grep yields a precise file list.
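The "annotations as a search index" idea can be sketched in Go. The annotation syntax follows the examples above; the helper names and every filename other than check_one_file_one_func.go are hypothetical. The index is an in-memory map here, whereas an agent would simply run a text search over the files.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseFuncAnnotation turns "//ff:func k=v k=v" into a map.
// ok is false for any line that is not an //ff:func annotation.
func parseFuncAnnotation(line string) (tags map[string]string, ok bool) {
	const prefix = "//ff:func "
	if !strings.HasPrefix(line, prefix) {
		return nil, false
	}
	tags = map[string]string{}
	for _, field := range strings.Fields(strings.TrimPrefix(line, prefix)) {
		key, value, found := strings.Cut(field, "=")
		if found {
			tags[key] = value
		}
	}
	return tags, true
}

// exampleIndex maps filename to its first line (hypothetical project contents).
func exampleIndex() map[string]string {
	return map[string]string{
		"check_one_file_one_func.go": "//ff:func feature=validate type=rule control=sequence",
		"parse_go_file.go":           "//ff:func feature=parse type=parser control=sequence",
		"print_report.go":            "//ff:func feature=report type=formatter control=iteration",
	}
}

// filesWith returns the files whose annotation carries key=value.
func filesWith(firstLines map[string]string, key, value string) []string {
	var hits []string
	for file, line := range firstLines {
		tags, ok := parseFuncAnnotation(line)
		if ok && tags[key] == value {
			hits = append(hits, file)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	fmt.Println(filesWith(exampleIndex(), "feature", "validate"))
}
```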

The Difference Codebook + Annotations Make

With filefunc + codebook, the agent precisely finds only the 5 needed files among 300. The other 295 aren’t even opened.

Without a codebook? The agent opens files one by one, guessing “is this relevant?” Each file it opens holds 20 functions, most of them irrelevant. Exploration time exceeds actual work time.

With a codebook? The agent reads the codebook and immediately constructs a search query. Each file holds one concept, so every opened file is valid context. Reading 30 files is fine if all 30 are valid. The problem is reading one file that drags in 30 files’ worth of unrelated code.

tsma — Regression Defense Line for Legacy Code

Code is now agent-readable (filefunc). Now the agent needs to know if it can modify safely. Modifying a function without tests means nobody knows what breaks.

Imagine inheriting a 100K-line legacy codebase. No tests. You want to refactor but touching it might break anything. Writing tests requires understanding code, understanding code requires documentation, and there’s no documentation.

Nobody touches it. It rots more.

An estimated 60-80% of Fortune 500 IT budgets go to legacy maintenance, locked in this deadlock.

What if LLMs could write the tests? Three problems:

Don’t know where to start. 527 functions — sequentially from #1? Most important first? No criteria.
Can’t verify test quality. LLM-written test passed. Does it actually verify behavior, or is it an empty shell?
Stalls at 60-70% without feedback. Must be told which branches are missing to fill the rest.

tsma solves all three.

tsma next — One Command

The agent needs one command.

$ tsma next

This single command drives the entire loop:

$ tsma next          # Shows next function without tests
  → Write tests
$ tsma next          # Detects new tests, runs, measures coverage
  → 100%? PASS, next function
  → <100%? Reports uncovered branches with line numbers
$ tsma next          # Re-measures modified tests
  → Improved or not, marks DONE and moves on

Repeat until “All functions complete!” appears.
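One way to read the loop above is as a small state machine per function. The sketch below is an interpretation, not tsma's code. The `RETRY` state is invented to represent "uncovered branches were reported once", and the assumption that a retry reaching full coverage is recorded as PASS is mine.

```go
package main

import "fmt"

// Status is the progress state of one function.
type Status string

const (
	Todo  Status = "TODO"  // no tests measured yet
	Retry Status = "RETRY" // hypothetical: feedback given, one more attempt allowed
	Pass  Status = "PASS"  // 100% branch coverage
	Done  Status = "DONE"  // best effort, below 100%
)

// advance applies one coverage measurement to a function's status.
func advance(s Status, covered, total int) Status {
	full := covered >= total
	switch {
	case s == Pass || s == Done:
		return s // finished functions are not revisited
	case full:
		return Pass
	case s == Todo:
		return Retry // first measurement below 100%: report uncovered branches
	default:
		return Done // second measurement still below 100%: move on
	}
}

func main() {
	s := advance(Todo, 11, 17)
	fmt.Println(s) // RETRY
	s = advance(s, 17, 17)
	fmt.Println(s) // PASS
	fmt.Println(advance(Retry, 14, 17)) // DONE
}
```

Capping retries is what guarantees the loop terminates even for functions that cannot reach full coverage.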

Feedback Dramatically Changes LLM’s Tests

tsma’s core value isn’t indexing or coverage measurement. It’s telling uncovered branches precisely with line numbers.

Without feedback:

"Write tests for the ListContracts function"
→ LLM tests only the happy path
→ Coverage 60-70%

With feedback:

"Write tests for the ListContracts function"
→ Coverage 65% (11/17)
→ UNCOVERED:
    line 41 — if params.Status != nil
    line 44 — if params.BuildingId != nil
    line 70 — if err != nil (CountSummary)
→ LLM adds tests covering exactly those branches
→ Coverage 100%

Same LLM. The only difference is feedback. Three lines of line numbers separate 60% from 100%.
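The feedback itself is just structured text. A minimal sketch of composing it, assuming a hypothetical `Branch` record and an output layout loosely modeled on the example above (the exact format tsma prints is not specified here):

```go
package main

import (
	"fmt"
	"strings"
)

// Branch is one uncovered branch: its line number and the condition text.
type Branch struct {
	Line int
	Cond string
}

// feedback renders a coverage summary followed by the uncovered branches.
func feedback(covered, total int, uncovered []Branch) string {
	var b strings.Builder
	pct := 0.0
	if total > 0 {
		pct = 100 * float64(covered) / float64(total)
	}
	fmt.Fprintf(&b, "Coverage %.0f%% (%d/%d)\n", pct, covered, total)
	if len(uncovered) > 0 {
		b.WriteString("UNCOVERED:\n")
	}
	for _, br := range uncovered {
		fmt.Fprintf(&b, "  line %d: %s\n", br.Line, br.Cond)
	}
	return b.String()
}

func main() {
	fmt.Print(feedback(11, 17, []Branch{
		{41, "if params.Status != nil"},
		{44, "if params.BuildingId != nil"},
		{70, "if err != nil (CountSummary)"},
	}))
}
```

Line numbers plus the condition text give the LLM an exact target, so it does not have to guess which paths its tests missed.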

The Ratchet Pattern from Class 6 is realized here. tsma gives feedback, the LLM fixes, tsma re-measures. This is the Symbolic Feedback Loop.

Verified Across 527 Functions

Results from applying tsma to a real project (527 functions):

Result	Count	Ratio
PASS (100% branch coverage)	246	46.7%
DONE (best-effort)	281	53.3%
TODO (unprocessed)	0	0%

246 functions reached 100% branch coverage. The remaining 281 didn’t reach 100% but had tests written to the extent possible.

Why can’t some functions reach 100%? Some are structured in ways that make testing difficult. Original code was written without considering testability. This reflects code’s testability, not tsma’s limitation.

Agents Die But Progress Is Preserved

Agents crash: token limits, network errors, session disconnects. A single session cannot process 527 functions.

tsma persistently stores progress in .tsma/session.json.

$ tsma status

527 functions
PASS:  246 (46.7%)
DONE:  281 (53.3%)
TODO:    0 (0.0%)

Agent dies at function 200? New agent runs tsma next and continues from 201. session.json is the checkpoint. Multiple agents can work in rotation without conflicts. Atomic at the function level.

The session is a cache; the source files are the source of truth. Delete a test file and that function reverts to TODO, even if session.json records PASS. The session never diverges from reality.
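The "cache versus source of truth" rule can be sketched as a reconcile step that runs before any work resumes. The `Session` and `Entry` types are hypothetical, since the real layout of session.json is not shown here, and file existence is injected as a function so the example stays self-contained. A real version would check the disk.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is the cached state for one function (hypothetical layout).
type Entry struct {
	Status   string // "PASS", "DONE", or "TODO"
	TestFile string // path of the test file that earned the status
}

// Session maps function name to its cached entry.
type Session map[string]Entry

// reconcile reverts any finished function whose test file no longer exists.
// It returns the names of the functions that were reverted to TODO.
func reconcile(s Session, exists func(path string) bool) []string {
	var reverted []string
	for name, e := range s {
		if e.Status == "TODO" || exists(e.TestFile) {
			continue
		}
		e.Status = "TODO"
		s[name] = e
		reverted = append(reverted, name)
	}
	sort.Strings(reverted)
	return reverted
}

func main() {
	s := Session{
		"ListContracts": {Status: "PASS", TestFile: "list_contracts_test.go"},
		"CountSummary":  {Status: "DONE", TestFile: "count_summary_test.go"},
	}
	// Simulated disk: the ListContracts test file has been deleted.
	onDisk := map[string]bool{"count_summary_test.go": true}
	fmt.Println(reconcile(s, func(p string) bool { return onDisk[p] }))
	fmt.Println(s["ListContracts"].Status, s["CountSummary"].Status)
}
```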

filefunc + tsma + whyso: Three Tools Combined

filefunc and tsma are independent, but combining creates synergy. Add whyso and the three axes of an agent-friendly codebase are complete.

filefunc → tsma connection:

filefunc enforces one function per file, so tsma’s function indexing becomes equivalent to file indexing. Per-function test = per-file test. Zero tracking cost.

filefunc → whyso connection:

func = file, so per-function change history maps precisely to per-file.

whyso history check_ssac_openapi.go   # Change history for CheckSSaCOpenAPI function

With multiple functions in one file, you must dig through diffs to find which function changed. With filefunc, file change = function change. Zero tracking cost.

whyso’s coupling statistics also become precise thanks to filefunc:

whyso coupling check_ssac_openapi.go

Functions modified together in the same request:
  check_response_fields.go  8 times
  check_err_status.go       5 times
  types.go                  4 times

If there’s no explicit relationship but coupling statistics keep showing it — that’s a hidden dependency signal.
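Coupling counts of this kind reduce to co-occurrence counting over change sets. The sketch below assumes each request is recorded as a list of changed files; the `coupling` function and the sample data are hypothetical and are not whyso's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// coupling counts how often each other file changed in the same request as target.
func coupling(changeSets [][]string, target string) map[string]int {
	counts := map[string]int{}
	for _, set := range changeSets {
		if !contains(set, target) {
			continue
		}
		for _, file := range set {
			if file != target {
				counts[file]++
			}
		}
	}
	return counts
}

func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// exampleChanges is invented data: one inner slice per request.
func exampleChanges() [][]string {
	return [][]string{
		{"check_ssac_openapi.go", "check_response_fields.go", "types.go"},
		{"check_ssac_openapi.go", "check_response_fields.go", "check_err_status.go"},
		{"types.go", "check_err_status.go"}, // target not touched: ignored
	}
}

func main() {
	counts := coupling(exampleChanges(), "check_ssac_openapi.go")
	files := make([]string, 0, len(counts))
	for f := range counts {
		files = append(files, f)
	}
	sort.Strings(files)
	for _, f := range files {
		fmt.Printf("%s  %d times\n", f, counts[f])
	}
}
```

With one function per file, each count refers to a specific function pair rather than to two large files that happen to change together.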

4 Conditions for an Agent Operable Codebase

Summarizing what we’ve learned, the conditions for a codebase where agents can work stably:

Condition	Tool	Effect
1. One concept per file	filefunc	Blocks context pollution
2. Tests for all functions	tsma	Detects post-modification regression
3. Symbolic references for connections	operationId (yongol)	Cross-layer tracking
4. Eliminate implicit coupling	whyso coupling	Detect hidden dependencies

With these four conditions, agents:

Find exact files with one grep
Read only what’s needed with one read
Detect regression with tests after modification
Grasp a feature’s full scope with operationId

Class 8’s core message: Don’t make faster trains. Lay tracks.

Not bigger models, not smarter agents — the structure for agents to work in comes first.

Exercise (filefunc + tsma)

Requirements: a project in Go, TypeScript, or Python (the Class 1 app works), plus filefunc and tsma (have the agent install them).

Goal: Convert an existing project to an agent-operable codebase.

Step 1 — Apply filefunc validate

# Run filefunc validate on current project
filefunc validate

# Check violation list
# Tell agent to fix violations
"Fix all ERRORs from filefunc validate results.
Existing tests must not break."

Step 2 — Split violating functions

Files with multiple functions → split each function to independent files
Match filenames to function names
Tell agent “split without breaking existing code”

Step 3 — Measure coverage with tsma

# Initialize tsma and check status
tsma status

# Tell agent to write tests
"Repeat tsma next to add tests.
If uncovered branches appear, add tests covering them.
Continue until 'All functions complete!' appears."

Step 4 — Verify results

filefunc validate: 0 errors
tsma status: TODO 0
Existing tests: all pass

What to check:

How many rounds did the agent take to reach 0 violations?
How did file count and average LOC change before vs after filefunc?
What’s the coverage difference when tsma gives feedback vs doesn’t?
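For the file count and average LOC question, a small script run before and after the split is enough. The sketch below is one possible approach for a Go project; the `stats` function is hypothetical. It takes an `fs.FS` so it can be demonstrated on an in-memory file system, and on a real project you would pass `os.DirFS(".")` instead. Test files are excluded here by assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// stats counts non-test .go files and their average line count.
func stats(fsys fs.FS) (files int, avgLines float64, err error) {
	totalLines := 0
	err = fs.WalkDir(fsys, ".", func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() || !strings.HasSuffix(path, ".go") || strings.HasSuffix(path, "_test.go") {
			return nil
		}
		data, readErr := fs.ReadFile(fsys, path)
		if readErr != nil {
			return readErr
		}
		files++
		totalLines += bytes.Count(data, []byte("\n")) // newline count as line count
		return nil
	})
	if err != nil || files == 0 {
		return files, 0, err
	}
	return files, float64(totalLines) / float64(files), nil
}

// demoFS is an in-memory project with two source files and one test file.
func demoFS() fs.FS {
	return fstest.MapFS{
		"complete_todo.go":   {Data: []byte("package todo\n\nfunc CompleteTodo() {}\n")},
		"list_todos.go":      {Data: []byte("package todo\n\n// ListTodos lists todos.\nfunc ListTodos() {}\n")},
		"list_todos_test.go": {Data: []byte("package todo\n")},
	}
}

func main() {
	files, avg, err := stats(demoFS())
	fmt.Println(files, avg, err) // 2 3.5 <nil>
}
```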

Reins Engineering Full Course

Class	Title
Class 0	Install Claude Code
Class 1	How to Command AI
Class 2	How to Distrust AI
Class 3	Apps That Don’t Break
Class 4	Decisions Outside Code
Class 5	AI with Reins
Class 6	Pass Then Lock
Class 7	Flipping Sycophancy
Class 8	The Agent’s Factory
Class 9	Automation Beyond Code
Class 10	The Law of Data
Class 11	How to Rescue Failed Vibe Coding

Sources

Stanford, “Lost in the Middle: How Language Models Use Long Contexts” (2024) — 30%+ performance drop when relevant info buried in middle of context
Amazon, “Context Length Alone Hurts LLM Performance” (2025) — 13.9-85% performance drop even when unnecessary tokens are whitespace
Bohm-Jacopini theorem (1966) — Any program expressible as combination of three control structures: sequence, selection, iteration
Hono framework proof — 186 files → 626 files split, all 4,419 tests passed
tsma 527 function proof — PASS 246 (46.7%), DONE 281 (53.3%), TODO 0
Fortune 500 IT budget statistics — 60-80% locked in legacy maintenance

Changelog

2026-05-24: Initial release

¹ Stanford, “Lost in the Middle” (2024): 30%+ drop when relevant info is buried in the middle of the context. Amazon, “Context Length Alone Hurts LLM Performance” (2025): 13.9-85% drop even when the unnecessary tokens are whitespace.