Reins Engineering

[Image: "Reins Engineering: AI with Reins" (AI generated)]

A Horse Without Reins

AI coding tools got fast. Login in 30 seconds. Payments in 2 minutes. An MVP ships in three weeks.

Three months later, it collapses.

The AI “cleans up” payment logic and changes discount calculations. A refactoring request alters public API field names. Adding a new feature breaks authentication. According to Carnegie Mellon research (MSR 2026), code complexity increases by 41% after AI coding tool adoption, and the increase persists. The Google DORA Report (2025) reports a 7.2% decrease in delivery stability for every 25% increase in AI adoption.

The problem isn’t that AI is stupid. It’s that there are no reins.

Harnesses Are Fences

The industry answered with “harness engineering.” Linters, formatters, CI/CD, project structure, coding guidelines. Fences that keep the agent from going outside.

Fences don’t set direction. Whatever the agent does inside the fence — overwriting existing logic, changing types, skipping state transitions — the linter passes. The formatter passes. CI passes. Code arrives at production “clean but wrong.”

The saddle is on. The rider is mounted. But without reins, the rider holds on with their thighs and falls off after three months.

Reins Engineering is an engineering approach that gives AI agents deterministic contracts and blocks progress when contracts are violated.

It consists of three elements:

1. Deterministic Feedback

Give the agent facts, not opinions. Not “this looks weird” but “line 41: field name mismatch, expected ‘user_id’, got ‘userId’.” Feedback with no room for sycophancy. According to the TDAD study (arxiv 2026), procedural “do TDD” instructions worsen regressions (6.08% → 9.94%), while providing specific test files in context reduces regressions by 70% (6.08% → 1.82%).
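As a minimal sketch of what fact-shaped feedback can look like, the Python below compares observed field names against a contract and reports each mismatch with its location, expected value, and actual value. The contract format, the `check_fields` helper, and the line numbers are hypothetical, chosen only to reproduce the style of message quoted above; none of it is taken from a tool named in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    line: int
    expected: str
    got: str

    def message(self) -> str:
        # A fact the agent can act on: location, expected value, actual value.
        return (f"line {self.line}: field name mismatch, "
                f"expected '{self.expected}', got '{self.got}'")


def check_fields(contract: list[str],
                 observed: list[tuple[int, str]]) -> list[Violation]:
    """Compare observed (line, field) pairs against the contract, position by position."""
    violations = []
    for expected, (line, got) in zip(contract, observed):
        if expected != got:
            violations.append(Violation(line, expected, got))
    return violations


# Hypothetical contract and hypothetical fields found in generated code.
contract = ["user_id", "email"]
observed = [(41, "userId"), (42, "email")]

for v in check_fields(contract, observed):
    print(v.message())
# prints: line 41: field name mismatch, expected 'user_id', got 'userId'
```

The same input always yields the same message, so there is nothing for the agent to negotiate with.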

2. Contract Locking (Ratchet Pattern)

When verification passes, lock it. Verification code locked this way is called ratchet code. For example, Hurl tests declare API behavior in plain text and run on every commit in CI. Once ratchet code passes, it cannot be deleted. The agent can freely change code, but cannot change behavior. Drift is structurally suppressed.
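The locking rule can be sketched in a few lines of Python, assuming the CI job can produce a mapping from check name to pass/fail. The function names and the `.hurl` file names are hypothetical; this illustrates the ratchet idea and is not the implementation used by yongol or any other tool.

```python
def ratchet_violations(locked: set[str],
                       current_results: dict[str, bool]) -> list[str]:
    """Return reasons the ratchet refuses this commit; an empty list means proceed."""
    problems = []
    for name in sorted(locked):
        if name not in current_results:
            problems.append(f"{name}: locked check was deleted")
        elif not current_results[name]:
            problems.append(f"{name}: locked check now fails")
    return problems


def advance_ratchet(locked: set[str],
                    current_results: dict[str, bool]) -> set[str]:
    """Lock every check that passes now. The locked set only ever grows."""
    if ratchet_violations(locked, current_results):
        raise ValueError("ratchet violated; refusing to advance")
    return locked | {name for name, passed in current_results.items() if passed}


# Hypothetical check names. Only the passing check gets locked.
locked = advance_ratchet(set(), {"login.hurl": True, "pay.hurl": False})

# A later commit that drops the locked check is rejected with a stated reason.
print(ratchet_violations(locked, {"pay.hurl": True}))
# prints: ['login.hurl: locked check was deleted']
```

A check that has never passed is not locked yet, so the agent stays free to work on it; only behavior that has already been verified becomes immovable.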

3. Separating Decisions from Implementation

Code mixes three things: user decisions, business logic, and implementation details. Reins Engineering separates them. Decisions live in declarative specs (OpenAPI, DDL, state diagrams). Implementation is freely generated by AI. Because the decisions no longer sit in the same text as the details, AI cannot mistake a decision for a detail and overwrite it. Decision survival becomes independent of model size.
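One way to picture the separation, sketched in Python under stated assumptions: the decision (which state transitions are legal) lives in a small declarative table, and a check compares whatever the generated implementation does against that table. The order states and the `undeclared_transitions` helper are hypothetical examples, not a format defined by this article.

```python
# Decision: which order state transitions are legal. Hypothetical example spec.
ORDER_TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),
    "cancelled": set(),
    "refunded": set(),
}


def undeclared_transitions(spec, implemented):
    """Return transitions the implementation performs that the spec never declared."""
    return sorted(
        (src, dst) for src, dst in implemented
        if dst not in spec.get(src, set())
    )


# Transitions extracted from generated code (hypothetical input).
implemented = {("created", "paid"), ("created", "shipped")}

print(undeclared_transitions(ORDER_TRANSITIONS, implemented))
# prints: [('created', 'shipped')]
```

The agent can rewrite the implementation as often as it likes. Skipping the payment step is caught because the decision was never stored in the code it rewrote.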

Evolution

Prompt Engineering      → Say it well and it works
Context Engineering     → Give good context and it works
Harness Engineering     → Contain it with structure
Reins Engineering       → Steer it with direction

Each stage was born from the limitations of the previous one. Prompts alone lacked consistency. Context didn’t stop the agent from going rogue. Fences couldn’t prevent drift inside the perimeter.

Reins Engineering is not a fence — it’s reins. It doesn’t constrain the agent’s freedom; it ensures the agent reaches the destination.

In June 2026, the lineage registered one more name. Loop Engineering — stop being the person who prompts the agent; design loops that generate the prompts instead (Addy Osmani, 2026). The diagnosis is correct. Loops scale generation. But they don’t scale adjudication. Osmani himself wrote down the weak spot — “A loop running unattended is also a loop making mistakes unattended.” As loops become universal, the bottleneck migrates to one place: what do you plug into the loop’s verification slot?

Call that layer verifier engineering, eval engineering, or gate engineering — the substance is one. The loop’s adjudication slot needs a deterministic contract, not an LLM. I call it Reins Engineering. Loops don’t converge without reins.
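A minimal Python sketch of a loop with a deterministic adjudication slot follows. `run_loop`, `fake_generate`, and `verify` are hypothetical names; the generator is a stand-in for an LLM call, and the verifier is an ordinary function that returns a list of violations.

```python
from typing import Callable, Optional


def run_loop(generate: Callable[[list[str]], str],
             verify: Callable[[str], list[str]],
             max_rounds: int = 5) -> tuple[Optional[str], int]:
    """Call generate until verify reports no violations or the budget runs out."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = verify(candidate)  # deterministic: same input, same verdict
        if not feedback:
            return candidate, round_no
    return None, max_rounds


def fake_generate(feedback: list[str]) -> str:
    # Stand-in for an LLM call: corrects itself only when told exactly what is wrong.
    return '{"user_id": 1}' if feedback else '{"userId": 1}'


def verify(candidate: str) -> list[str]:
    # The contract: the payload must contain the field "user_id".
    return [] if '"user_id"' in candidate else ["expected 'user_id'"]


result, rounds = run_loop(fake_generate, verify)
print(result, rounds)
# prints: {"user_id": 1} 2
```

The loop terminates either with a candidate that satisfies the contract or with an explicit failure after `max_rounds`. It never ends on a candidate that merely sounded convincing.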

80 : 20

Reins Engineering does not cover everything. It knows exactly what it covers.

Deque Systems analyzed approximately 300,000 accessibility quality issues across more than 13,000 pages (2021). 57% were fully automatable, 23% required AI assistance, and 20% could only be judged by humans. Accessibility and code are different domains, but they share the same structure: “what proportion can machines judge?”

Through this lens, code quality breaks down as follows:

57% — Ratchet territory. Declare behavior, and machines judge violations without asking. go test, Hurl, yongol check, filefunc validate.
23% — Harness territory. Linters, formatters, CI. The mechanism is deterministic, but verification depth stays at the surface. They cannot catch behavioral correctness, but they enforce structure and style, raising the quality of AI generation.
20% — Human territory. Business fit, UX, architecture direction.

Reins Engineering does not replace the harness. It rides on top of it.

Harness (surface determinism)         23%
+ Ratchet (behavioral determinism)    57%
─────────────────────────────────────────
                                      80%

Humans focus on the remaining 20%.

Why Bigger Models Aren’t the Answer

“GPT-6 will fix it.”

It won’t. The problem isn’t model intelligence — it’s the medium. Code as a medium doesn’t distinguish decisions from implementation. Any model reading code sees decisions and details mixed in the same text.

A 4.5B local model (Gemma4), given deterministic feedback and example context, edits the declarative specs (SSOTs, single sources of truth) with zero errors. A frontier model editing raw code produces drift. The difference is structure, not intelligence.

Don’t change the model. Add a contract.

Evidence

yongol is the implementation of Reins Engineering. It cross-validates the consistency of 10 declarative specs (SSOTs) with 287 rules and generates code.
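To make "cross-validating specs" concrete, here is a toy Python sketch of a single consistency rule: every field an endpoint exposes must exist as a column in the table that endpoint is mapped to. The data shapes, the `cross_validate` function, and the example endpoints are assumptions made for illustration. This is not one of yongol's 287 rules and does not reflect its spec formats.

```python
def cross_validate(api_fields: dict[str, list[str]],
                   table_columns: dict[str, set[str]],
                   endpoint_table: dict[str, str]) -> list[str]:
    """Report every API field that has no matching column in its declared table."""
    errors = []
    for endpoint, fields in sorted(api_fields.items()):
        table = endpoint_table.get(endpoint)
        if table is None or table not in table_columns:
            errors.append(f"{endpoint}: no table declared")
            continue
        for field in fields:
            if field not in table_columns[table]:
                errors.append(
                    f"{endpoint}: field '{field}' has no column in table '{table}'")
    return errors


# Hypothetical specs: an API description and a schema that disagree on one name.
api = {"GET /users": ["user_id", "email"], "GET /orders": ["order_id", "total"]}
tables = {"users": {"user_id", "email"}, "orders": {"order_id", "amount"}}
mapping = {"GET /users": "users", "GET /orders": "orders"}

for err in cross_validate(api, tables, mapping):
    print(err)
# prints: GET /orders: field 'total' has no column in table 'orders'
```

The disagreement is caught before any code is generated, while it is still a one-word fix in a spec.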

ZenFlow benchmark — a multi-tenant workflow automation SaaS. 32 endpoints, 14 tables, 47 Hurl requests. 11/11 stages passed. Adding features didn’t slow down. Existing tests never broke.

A working backend was successfully generated with a local 4.5B model. Cost $0. Offline. Reins bridges the gap that model size leaves.

Not AI Review Automation — Code Review Automation

The mainstream approach is AI review automation. An LLM generates code, and another LLM reviews it. It is like a drunk person asking a drunk friend, “Am I drunk?” Frontier models’ sycophancy capitulation rate is 58%. LLM-as-Judge false pass rate is 36%. Multiply probabilistic generation by probabilistic verification and the errors compound.

Reins Engineering is code review automation. LLMs generate, deterministic code verifies. validate doesn’t flatter. go test doesn’t hallucinate. Coverage measurement doesn’t lie. Pass is pass, fail is fail.

AI review automation:    LLM → LLM verification → flattery → false pass → drift
Code review automation:  LLM → code verification → facts → pass/fail → convergence
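The two pipelines above can be compared with a line of arithmetic, sketched here in Python. The 36% false pass rate is the figure quoted in this section. The 10% defect rate is a hypothetical number picked only to make the multiplication visible, and the 0% rate for the deterministic check holds only for violations the contract actually declares.

```python
def escaped_defect_rate(defect_rate: float, false_pass_rate: float) -> float:
    """Fraction of all changes that are defective and still pass review."""
    return defect_rate * false_pass_rate


LLM_JUDGE_FALSE_PASS = 0.36  # figure quoted in the text
ASSUMED_DEFECT_RATE = 0.10   # hypothetical, for illustration only

llm_reviewed = escaped_defect_rate(ASSUMED_DEFECT_RATE, LLM_JUDGE_FALSE_PASS)
# For violations the contract declares, a deterministic check never false-passes.
code_reviewed = escaped_defect_rate(ASSUMED_DEFECT_RATE, 0.0)

print(f"LLM judge: {llm_reviewed:.1%} of changes ship a defect")
print(f"Deterministic check: {code_reviewed:.1%} of changes ship a declared violation")
```

Under these assumptions roughly 3.6% of all changes ship a defect past an LLM judge. Anything outside the declared contract still needs the harness or a human.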

In an era where AI agents generate dozens of lines per second, humans can’t read all code. But delegating reviews to AI means flattery replaces verification. When code handles the mechanically verifiable parts, humans can focus solely on decisions that machines can’t judge — business fit, UX, architectural direction.

Human review doesn’t go to zero. The pain of human review is reduced. What code can review, code does. What only humans can review, humans do.

A Harness Without Reins Is Just a Fence

AI is already powerful enough. What’s missing is direction.

Build higher fences and the agent drifts faster inside them. Hold the reins and the agent runs to the destination.

Reins Engineering — structured deterministic validation for AI agents.

Independent Convergence

Reins Engineering is not a conclusion reached alone. People who don’t know each other hit the same wall and arrived at the same principle.

episteme — A cognitive control plane for AI agents, built by a UIUC researcher. Forces Reasoning Surface creation at the filesystem level before irreversible actions. Same principle as the ratchet, different implementation.

MagLab — A physics research pipeline built by a KAIST spintronics researcher. Declaration: “LLMs only reason and plan. They do not compute numbers, fabricate citations, or generate figure data.” Deterministic tools produce all numerical outputs.

Manifesto — MEL (Manifesto Expression Language) for declaratively defining frontend state transitions. Core principle: “Agent proposes, World verifies.” The agent only proposes intent; state transitions are verified deterministically.

NEKOWORK — A security gate that scans AI-generated code diffs with deterministic rules before merge. Works regardless of whether code was generated by Claude Code, Cursor, or Codex. The LLM does not judge.

oh-my-kamisama — A multi-CLI conductor that orchestrates Claude, Codex, and Gemini. It reads the actual git diff rather than the workers’ claims (“diffs beat claims”), and only declares done after the project’s tests pass. Every run is left on disk as an auditable artifact — not a disappearing chat.

All five projects are summarized by the same sentence: Generation can be probabilistic. Verification must be deterministic.
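A deterministic merge gate of the kind described above can be sketched in Python as a set of fixed rules applied to the added lines of a unified diff. The two rules, their names, and the `scan_diff` function are hypothetical illustrations, not the rule set of NEKOWORK or any other project listed here.

```python
import re

# Hypothetical rules: (rule id, pattern that must not appear in added lines).
RULES = [
    ("hardcoded-secret",
     re.compile(r"(?i)(api_key|secret|password)\s*=\s*['\"][^'\"]+['\"]")),
    ("eval-call", re.compile(r"\beval\(")),
]


def scan_diff(diff_text: str) -> list[str]:
    """Return one finding per rule hit on an added line of a unified diff."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines matter; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule_id, pattern in RULES:
            if pattern.search(line):
                findings.append(f"diff line {number}: {rule_id}")
    return findings


diff = "\n".join([
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,2 +1,3 @@",
    " import os",
    "+API_KEY = 'abc123'",
    "-x = 1",
])

print(scan_diff(diff))
# prints: ['diff line 5: hardcoded-secret']
```

The gate gives the same verdict no matter which agent produced the diff, because it reads only the diff.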

yongol — The Keel of AI Coding SaaS — The implementation of Reins Engineering.
Hurl Stops Vibe Coding Drift — Hurl + ratchet locks API behavior.
Ratchet Pattern — The theory behind deterministic verification and ratchet locking.
IFEval-Exploiting Ratchet Code — Feedback loops using sycophancy bias.
dry4go — Robert C. Martin’s (Uncle Bob) structural duplication detector for Go. Determines DRY violations deterministically via AST normalization + Jaccard similarity.

References

Cursino, D. et al. (2026). “Speed at the Cost of Quality? The Impact of AI Coding on Software.” MSR 2026. arxiv.org/abs/2511.04427
Google Cloud (2025). DORA Report 2025. cloud.google.com
Wang, Z. et al. (2026). “TDAD: Test-Driven Agentic Development.” ACM AIWare 2026. arxiv.org/abs/2603.17973
Karpathy, A. (2026). “From Vibe Coding to Agentic Engineering.” thenewstack.io
Deque Systems (2021). “Automated Testing Study Identifies 57 Percent of Digital Accessibility Issues.” deque.com
Anthropic (2026). “Demystifying Evals for AI Agents.” anthropic.com
Osmani, A. (2026). “Loop Engineering.” addyosmani.com

Changelog

2026-05-23: Initial publication
2026-05-27: Added “Independent Convergence” section (episteme, MagLab, Manifesto, NEKOWORK)
2026-05-28: “80:20” section — Harness (23%) + Ratchet (57%) = 80%, quantified with Deque empirical data
2026-05-31: Added oh-my-kamisama to Independent Convergence
2026-06-10: Added Loop Engineering paragraphs to “Evolution” — the loop’s adjudication slot, absorbing alias terms (verifier/eval/gate engineering)