reins — Keep Only the Domain in a Quest CLI; Make the Ratchet a Framework

how-make-quest was about building a quest CLI with your bare hands. What a ratchet is, how you hang a gate, how you block cheese. Hand an agent that one article and out comes a cobra-based Go CLI.

But what happens when you build a second quest CLI? You write the same one-way state machine again. You write the same scan/next/submit/status/export again. You write the same PASS lock, the same monotonically decreasing remaining, the same JSONL export again. The only thing that changes is the gate, yet every time you rewrite all the rest. This is the boilerplate tax you pay each time you build one more quest.

The pattern was reusable. The code wasn’t. reins closes that gap.

What Is Invariant and What Is Domain

Lay two quest CLIs side by side and take the difference, and the boundary is sharp.

Invariant (shared by every quest)      Domain (differs per quest)
───────────────────────────────────    ────────────────────────────
ratchet: TODO→PASS irreversible        what counts as one quest
command skeleton: scan/next/submit…    what counts as a "fact"
level tally: Fail/Review→verdict       which cheese must be blocked
progress persistence · resumable
export: emit once

The left side is exactly what how-make-quest proved — whether the domain is a company name, an endpoint, or a function, the ratchet’s teeth catch the same way. Only the right side is known by a human. reins supplies the left as a framework and leaves you only the right.

This isn’t a new claim but an old principle that reins enforces in code: the separation of decision from implementation. The gate is the decision (what is true in this domain); the ratchet, the CLI, and the tally are the implementation. Rewriting the implementation for every quest is what happens when the decision is bound to the implementation.
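
The invariant column is small enough to sketch. What follows is a minimal illustration, not the reins implementation: the names State, Item, Session, Apply, Remaining, and ErrLocked are hypothetical, chosen only to show a one-way lock and a remaining count that can never grow.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical item state; the names are illustrative, not the reins API.
type State int

const (
	Todo State = iota
	Pass
)

// Item is one unit of work tracked by the ratchet.
type Item struct {
	ID    string
	State State
}

// ErrLocked is returned when a caller touches an item that has already passed.
var ErrLocked = errors.New("item already locked as PASS")

// Session holds every item; it only ever moves items from Todo to Pass.
type Session struct {
	Items map[string]*Item
}

// Apply records a gate verdict. A passing verdict locks the item; a failing
// verdict leaves it as Todo. There is no path from Pass back to Todo.
func (s *Session) Apply(id string, passed bool) error {
	it, ok := s.Items[id]
	if !ok {
		return fmt.Errorf("unknown item %q", id)
	}
	if it.State == Pass {
		return ErrLocked
	}
	if passed {
		it.State = Pass
	}
	return nil
}

// Remaining counts items still in Todo. Because Apply never unlocks,
// this number can only stay the same or decrease.
func (s *Session) Remaining() int {
	n := 0
	for _, it := range s.Items {
		if it.State == Todo {
			n++
		}
	}
	return n
}

func main() {
	s := &Session{Items: map[string]*Item{
		"a": {ID: "a"},
		"b": {ID: "b"},
	}}
	fmt.Println(s.Remaining())       // 2
	_ = s.Apply("a", true)           // "a" passes and is locked
	fmt.Println(s.Remaining())       // 1
	fmt.Println(s.Apply("a", false)) // item already locked as PASS
}
```

Nothing in this sketch mentions a company name, an endpoint, or a function, which is the point: the domain enters only through whatever decides the passed argument.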

You Implement Only the Gate

To make a quest with reins is to fill in the four methods of a single interface.

type Definition interface {
    // Seed: input → initial TODO seeds.
    Seed(args []string) ([]*quest.Item, error)
    // Render: the authoring prompt + verification context that next shows.
    Render(s *quest.Session, it *quest.Item) (string, error)
    // Prepare: decode the submission.
    Prepare(s *quest.Session, it *quest.Item, raw []byte) (gate.Context, *quest.Verdict, error)
    // Rules: the gate's violation-rule catalog.
    Rules() []gate.Rule
}

func main() { cli.NewQuestCmd("myquest", myDef{}, cli.Options{}).Execute() }

That one line of main supplies the ratchet, the six commands, the tally, export, and the resumable session — all of it. What you wrote is just the four pieces of domain. The agent still needs to know only two commands — receive with next, submit with submit. The machine decides the rest.

Quick Start — A Skill Scaffolds Even That Gate

You don’t even have to fill in those four methods by hand. reins ships a skill that teaches your coding agent how to build a quest. Install it first:

npx skills add park-jun-woo/reins

Then ask your agent — Claude Code, Codex, etc. — to build a quest:

/reins-quest Build a quest that summarizes Common Crawl news by the 5W1H principle (who/what/when/where/why/how)

The agent reads SKILL.md and scaffolds the gate.Definition for you — the whole scan/next/submit/status/export/rules command skeleton (and the opt-in loop) comes with it. If you’d rather build it by hand, just add the library:

go get github.com/park-jun-woo/reins@latest

From there it’s only the four methods above and the one line of main. Either way, all you own is your domain’s one gate; the ratchet, the CLI, the tally, and export come from reins.

The Gate Is a Catalog of Cheese-Defense Rules

The core of how-make-quest was “design a gate that can’t be cheesed.” reins turns that design into a data structure — gate = rule catalog. One rule is one cheese detector. When it finds a violation it fires (true) and carries a fact (Fact).

// One cheese-defense rule of a news-event-extraction quest.
// "does the who-anchor actually exist in the source" — if the agent invents a person, it's caught.
var whoAnchorPresent = gate.Rule{
    Meta: gate.RuleMeta{ID: "who-anchor-present", Level: gate.LevelFail, Desc: "required who-anchor exists in source"},
    Check: func(ctx gate.Context) (bool, quest.Fact) {
        sub := ctx.Submission.(*Event)
        if miss := textmatch.MissingTokens(ctx.Source, sub.Who.Anchors); len(miss) > 0 {
            return true, quest.Fact{Where: "who.anchors", Expected: "source substring", Actual: miss[0]}
        }
        return false, quest.Fact{}
    },
}

The virtue of this structure is that it grows. Every time you discover new cheese, you add one rule and the gate gets that much harder. And the catalog documents itself — when the rules command prints the rule list, that is “an audit list of the cheese I’m blocking.” There is no gate that doesn’t know what it blocks.
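
To see why a catalog grows well and documents itself, here is a self-contained sketch that uses local stand-in types instead of the reins packages. The Rule struct, the missingTokens helper, the Fired function, and the second rule (who-anchor-nonempty) are all assumptions made for illustration; in particular, missingTokens only guesses at a verbatim-substring behavior and is not the actual textmatch code.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a local stand-in for a catalog entry; the field names are illustrative.
type Rule struct {
	ID    string
	Desc  string
	Check func(source string, anchors []string) (fired bool, detail string)
}

// missingTokens returns the anchors that do not occur verbatim in source.
// This is an assumed behavior, not the actual textmatch implementation.
func missingTokens(source string, anchors []string) []string {
	var miss []string
	for _, a := range anchors {
		if !strings.Contains(source, a) {
			miss = append(miss, a)
		}
	}
	return miss
}

// catalog is the gate: adding a detector means appending one entry.
var catalog = []Rule{
	{
		ID:   "who-anchor-present",
		Desc: "required who-anchor exists in source",
		Check: func(source string, anchors []string) (bool, string) {
			if miss := missingTokens(source, anchors); len(miss) > 0 {
				return true, "not in source: " + miss[0]
			}
			return false, ""
		},
	},
	{
		// Hypothetical second rule, added after discovering a new cheese:
		// submitting no anchors at all would otherwise slip through.
		ID:   "who-anchor-nonempty",
		Desc: "at least one who-anchor is given",
		Check: func(_ string, anchors []string) (bool, string) {
			return len(anchors) == 0, "no anchors submitted"
		},
	},
}

// Fired runs every rule and returns the IDs of those that detected a violation.
func Fired(source string, anchors []string) []string {
	var ids []string
	for _, r := range catalog {
		if fired, _ := r.Check(source, anchors); fired {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	// The same slice that judges submissions also prints the audit list.
	for _, r := range catalog {
		fmt.Printf("%-20s %s\n", r.ID, r.Desc)
	}
	fmt.Println(Fired("Alice spoke in Seoul.", []string{"Alice", "Bob"})) // [who-anchor-present]
}
```

The second entry shows the growth path: an empty anchor list passes the first rule trivially, so closing that hole costs one more element in the slice and nothing else.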

Severity is not a weight but a level. A single Fail means FAIL. A decisive violation is non-negotiable: nine checks that each score 99 points cannot offset one Fail. Evaluate tallies the fired rules by level: if any is Fail, the verdict is FAIL; otherwise, if any is Review, REVIEW; if neither fired, PASS.
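
The tally is a maximum over levels, not a sum of scores. A minimal sketch of that fold follows; the Level and Verdict types and the Tally function are stand-ins written for this example and are not taken from the reins source.

```go
package main

import "fmt"

// Level is a hypothetical stand-in for a rule's severity level.
type Level int

const (
	LevelReview Level = iota + 1
	LevelFail
)

// Verdict is a hypothetical stand-in for the gate's outcome.
type Verdict string

const (
	PASS   Verdict = "PASS"
	REVIEW Verdict = "REVIEW"
	FAIL   Verdict = "FAIL"
)

// Tally folds the levels of the rules that fired into one verdict.
// One Fail decides the outcome on its own; Review only matters if no Fail fired.
func Tally(fired []Level) Verdict {
	v := PASS
	for _, l := range fired {
		switch l {
		case LevelFail:
			return FAIL
		case LevelReview:
			v = REVIEW
		}
	}
	return v
}

func main() {
	fmt.Println(Tally(nil))                               // PASS
	fmt.Println(Tally([]Level{LevelReview, LevelReview})) // REVIEW
	fmt.Println(Tally([]Level{LevelReview, LevelFail}))   // FAIL
}
```

Because there is no arithmetic, there is nothing to tune: no threshold can be nudged until a decisive violation stops counting.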

Authority Asymmetry, Enforced by Type

The single most important line in how-make-quest was “only the machine locks PASS.” reins nails this down not as a convention but as a type.

L1 Machine (deterministic)   the sole authority to lock PASS
L2 AI (skeptic)              REVIEW only — raises doubt but cannot grant completion
L3 Human                     the residue both missed

The machine gate issues PASS. Even if you put an AI verifier into the gate, the most it can do is pull something to REVIEW. The framework makes the wrong thing impossible in the first place: because it offers no API that grants the AI the authority to PASS, you cannot, even by accident, leave the verdict to a drunk friend.
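
One way to picture "enforced by type" is a return type that has no value meaning "approve". The sketch below is an assumption about how such a constraint could look, not the reins API: Doubt, Skeptic, and Decide are hypothetical names.

```go
package main

import "fmt"

// Verdict is a hypothetical stand-in for the gate's outcome.
type Verdict string

const (
	PASS   Verdict = "PASS"
	REVIEW Verdict = "REVIEW"
	FAIL   Verdict = "FAIL"
)

// Doubt is the only thing a skeptic can return: an objection or nothing.
// It has no value that means "approve".
type Doubt struct {
	Raised bool
	Reason string
}

// Skeptic is a hypothetical L2 verifier, for example an LLM-backed reviewer.
type Skeptic interface {
	Inspect(submission string) Doubt
}

// Decide combines the machine verdict with the skeptic's doubt.
// The skeptic can pull a PASS down to REVIEW; it can never lift FAIL or REVIEW.
func Decide(machine Verdict, sk Skeptic, submission string) Verdict {
	if machine != PASS {
		return machine
	}
	if sk.Inspect(submission).Raised {
		return REVIEW
	}
	return PASS
}

// alwaysHappy never objects; alwaysWary always objects.
type alwaysHappy struct{}

func (alwaysHappy) Inspect(string) Doubt { return Doubt{} }

type alwaysWary struct{}

func (alwaysWary) Inspect(string) Doubt {
	return Doubt{Raised: true, Reason: "looks off"}
}

func main() {
	fmt.Println(Decide(FAIL, alwaysHappy{}, "x")) // FAIL
	fmt.Println(Decide(PASS, alwaysWary{}, "x"))  // REVIEW
	fmt.Println(Decide(PASS, alwaysHappy{}, "x")) // PASS
}
```

The first call is the interesting one: a skeptic with no objections still cannot rescue a machine FAIL, because silence is the most approval its type can express.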

A Second Backend — the defeat Graph

For many gates, a level tally of independent rules is enough. But once the rules begin to contend with one another (“this violation only matters when that one is present,” “the root cause of this failure is actually that one”), hand-written if-else guards erode the gate. This is not where a weak gate breaks; it is where a complex gate rots.

reins’s second gate backend moves this contention into a declarative graph — toulmin h-Categoriser. The Toulmin argumentation model becomes the data structure directly:

Warrant — tautology PASS. The grounds for “passes if there’s no rebuttal.”
Counter — a violation attacks the warrant.
Supersedes — priority among rules. Which rebuttal beats which.

Hand-written guard clauses evaporate into Attacks and Supersedes edges. And when there are zero edges, this graph is exactly equivalent to the level tally — complexity is an opt-in cost that turns on only when needed (it turns on when Definition implements gate.Evaluator).
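
For intuition about how attack edges turn into numbers, here is a sketch of the plain (unweighted) h-Categoriser, which assigns each node the strength s(a) = 1 / (1 + Σ s(b)) summed over its attackers b. The Graph type, the Strengths function, and the three node names are hypothetical; pkg/graph uses a weighted variant and its own data model, which this sketch does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"math"
)

// Graph is a hypothetical attack graph: Attackers[x] lists the nodes that attack x.
type Graph struct {
	Nodes     []string
	Attackers map[string][]string
}

// Strengths iterates the unweighted h-Categoriser equation
//
//	s(a) = 1 / (1 + sum of s(b) over the attackers b of a)
//
// starting from s = 1 everywhere, until the values stop changing.
func Strengths(g Graph) map[string]float64 {
	s := make(map[string]float64, len(g.Nodes))
	for _, n := range g.Nodes {
		s[n] = 1
	}
	for iter := 0; iter < 1000; iter++ {
		next := make(map[string]float64, len(g.Nodes))
		delta := 0.0
		for _, n := range g.Nodes {
			sum := 0.0
			for _, b := range g.Attackers[n] {
				sum += s[b]
			}
			next[n] = 1 / (1 + sum)
			delta = math.Max(delta, math.Abs(next[n]-s[n]))
		}
		s = next
		if delta < 1e-12 {
			break
		}
	}
	return s
}

func main() {
	// "counter" attacks "warrant"; "exception" attacks "counter".
	g := Graph{
		Nodes: []string{"warrant", "counter", "exception"},
		Attackers: map[string][]string{
			"warrant": {"counter"},
			"counter": {"exception"},
		},
	}
	s := Strengths(g)
	fmt.Printf("%.3f %.3f %.3f\n", s["warrant"], s["counter"], s["exception"]) // 0.667 0.500 1.000
}
```

With no edges at all, every node keeps strength 1. When the counter is itself attacked, the warrant recovers from 0.5 to about 0.667: this is "a violation that only matters when another condition holds," expressed as an edge instead of an if-else guard.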

The real gift the graph gives is not the verdict but the feedback. Graph evaluation hands the agent a direct strategy guide — Verdict.Feedback: “why you lost, and what to change to win.” Not a bare “FAIL” but a root cause computed from the structure of the argument.

Here the paradox of how-make-quest works again. The model flatters — it obediently follows instructions. For opinions, flattery is poison; but for facts, flattery is an asset. The strategy guide isn’t an opinion (“this seems a bit off”) but a fact (“who.anchors isn’t in the source, change this”). The more sycophantic the model, the more readily it accepts that fact and converges. Deterministic graph + sycophantic LLM = a loop where convergence is guaranteed.

Closing the Loop — Unattended Generate-Verify (`loop`)

If the graph returns a strategy guide for “why you lost and what to change to win,” who receives that guide and generates again? Until now, an external agent drove next→submit by hand. reins’s loop command closes that flow inside the CLI — the LLM generates, the gate judges, and on failure it feeds the strategy guide back and retries.

for each remaining TODO:
  system  = global instructions + per-rule coaching for the last FAIL's root cause
  raw     = LLM.Complete(system, authoring prompt + feedback)   # generation (L0)
  verdict = gate verdict → ratchet Apply → export               # same path as submit
  on FAIL feed the strategy guide back and retry (<MaxTries), else lock → next

What’s decisive is that the authority asymmetry is preserved as-is. The LLM is only a generator (L0); locking PASS is still the gate (the machine). pkg/llm is an ollama/xai/gemini adapter that handles generation only, separated by type from the verdict and the ratchet. Exceeding MaxTries locks the item as DONE, so the loop terminates monotonically — it never spins forever.

The coaching is specialized per rule. Verdict.RootCause deterministically points to the rule last left unpassed (in both the flat tally and the graph), and the system instruction tailored to that rule is fed back. Not “wrong again” but “the who-anchor isn’t in the source, fix here,” narrowing every attempt. Local ollama needs no key, and num_ctx is auto-computed from the prompt length.
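
The shape of that loop fits in a short sketch. Everything named here (Generator, Gate, RunItem, and the scripted fake model) is hypothetical and stands in for pkg/llm and the real gate only to show the control flow: generation proposes, the gate alone accepts, feedback narrows the next try, and the try count is bounded.

```go
package main

import (
	"fmt"
	"strings"
)

// Generator is a hypothetical stand-in for an LLM adapter: it only produces text.
type Generator interface {
	Complete(system, prompt string) string
}

// Gate is a deterministic check. It returns ok, plus feedback naming what to fix.
type Gate func(source, submission string) (ok bool, feedback string)

// RunItem retries generation until the gate passes or maxTries is used up.
// It returns the accepted text (empty if none) and the number of attempts made.
func RunItem(gen Generator, gate Gate, source, prompt string, maxTries int) (string, int) {
	feedback := ""
	for try := 1; try <= maxTries; try++ {
		out := gen.Complete("Follow the gate feedback exactly. "+feedback, prompt)
		ok, fb := gate(source, out)
		if ok {
			return out, try // only the gate's ok accepts the result
		}
		feedback = fb // fed into the next attempt
	}
	return "", maxTries // bounded: the loop always terminates
}

// scripted is a fake generator: it invents a name first, then obeys feedback.
type scripted struct{}

func (scripted) Complete(system, _ string) string {
	if strings.Contains(system, "not in source") {
		return "Alice"
	}
	return "Bob"
}

func main() {
	source := "Alice opened the conference on Monday."
	gate := func(src, sub string) (bool, string) {
		if !strings.Contains(src, sub) {
			return false, "who anchor " + sub + " is not in source"
		}
		return true, ""
	}
	out, tries := RunItem(scripted{}, gate, source, "Who is the subject?", 3)
	fmt.Println(out, tries) // Alice 2
}
```

The generator never sees a verdict type and never returns one. If it keeps producing text the gate rejects, RunItem gives up after maxTries instead of accepting the last attempt.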

ccnews run  --max-warcs 1                 # seed (streaming ingestion)
ccnews loop --model ollama:gemma4:e4b     # gemma4 generates remaining TODOs → gate judges

This command is opt-in. If you don’t turn it on with cli.Options{Loop: …}, it isn’t attached — fully backward-compatible. If an external agent wants to drive, it still uses only next/submit. Either way, the authority to lock PASS lies with the machine alone.

Isolate Side Effects — ground and staged Evaluation

For a gate to be deterministic, the network must not live inside the gate. A rule that calls net/http directly can’t be unit-tested in isolation, and its verdict varies with the state of the network connection.

reins corrals side effects into pkg/ground — primitives like HTTPBody and MXResolves own external lookups via an injectable Resolver and a per-request snapshot. The rule stays pure; ground takes responsibility for the outside world.

And staged evaluation: cheap checks run first, and if they fail the network fetch never happens at all. There’s no reason to do a DNS lookup on a malformed submission. You stand the expensive and shaky behind the cheap and certain.
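
Both ideas, the injected lookup and the staged order, can be shown together. In this sketch the Resolver interface shape, the fakeResolver, and the CheckEmail function are assumptions made for illustration; only the name MXResolves is borrowed from the text, and pkg/ground's real signatures may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Resolver is a hypothetical injection point for external lookups.
// Production code would back it with DNS; tests pass a fake.
type Resolver interface {
	MXResolves(domain string) bool
}

// fakeResolver answers from a fixed table and counts how often it is asked.
type fakeResolver struct {
	known map[string]bool
	calls int
}

func (f *fakeResolver) MXResolves(domain string) bool {
	f.calls++
	return f.known[domain]
}

// CheckEmail runs in stages: the cheap syntactic stage first,
// the lookup stage only if the cheap stage passed.
func CheckEmail(r Resolver, email string) string {
	at := strings.LastIndex(email, "@")
	if at <= 0 || at == len(email)-1 {
		return "FAIL: malformed address" // no lookup was made
	}
	if !r.MXResolves(email[at+1:]) {
		return "FAIL: domain has no MX record"
	}
	return "PASS"
}

func main() {
	r := &fakeResolver{known: map[string]bool{"example.com": true}}

	v := CheckEmail(r, "not-an-email")
	fmt.Println(v, r.calls) // FAIL: malformed address 0

	v = CheckEmail(r, "a@example.com")
	fmt.Println(v, r.calls) // PASS 1

	v = CheckEmail(r, "a@nowhere.invalid")
	fmt.Println(v, r.calls) // FAIL: domain has no MX record 2
}
```

The call counter makes the staging observable: the malformed address is rejected with zero lookups, and the whole check runs in a unit test without touching the network.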

No N=1 Abstraction

One of reins’s conventions reveals the character of this framework most precisely — do not extract an abstraction from a single consumer. A new abstraction is frozen only after it has been validated by a second consumer.

This isn’t fussiness but first principles. An abstraction extracted from one case mistakes that case’s accident for essence. Only when a second domain demands the same abstraction is it proven to be invariant. The framework applies “verification, not claims” even to its own evolution. Just as the gate doesn’t believe the agent’s claims, the abstraction doesn’t believe a single case’s claim.

The Same Sentence, Made a Library

reins stands on eight packages in pkg/: textmatch (anti-hallucination primitives), temporal (time normalization), quest (the ratchet core), gate (the gate contract), graph (the defeat graph), ground (network isolation), llm (generation adapter, L0-only), and cli (the cobra scaffold). The module passes go build and go test, with tests covering every function. And toulmin is coupled one-directionally to the graph backend only, so a consumer that doesn’t use the graph doesn’t even link toulmin.

Code: github.com/park-jun-woo/reins

If how-make-quest was one sentence — generation may be probabilistic, verification must be deterministic — reins is that sentence hardened into a compilable form. The gate re-verifies the domain’s facts, the ratchet locks what passed, the graph returns the reason you lost as a fact, and the sycophantic model complies with that fact.

Next time you need a quest CLI, don’t rewrite the ratchet. Write only your domain’s gate, and borrow the reins.

How to Make a Quest CLI — the methodology reins hardened into a framework. From the principle (why) to the command skeleton (how).
Reins Engineering — AI With Reins — the harness is a fence, the quest is the reins. The separation of decision from implementation that reins nails down in code.
Ratchet Pattern — How to Make an Agent Go All the Way — the main piece on the one-way lock and monotonic decrease that pkg/quest implements.
toulmin — a Rule Engine That Computes Contracts — the h-Categoriser of the defeat graph backend. It treats a claim not as a fact but as a rebuttable claim.
A Triple Is Not a Fact but a Claim — a case applying the same argumentation engine to a knowledge graph. Another stage for Warrant·Counter·Supersedes.
huma — a Ratchet That Doesn’t Skip Endpoints — a domain instance that fills in the four methods of Definition. Proof that swapping only the gate makes it a different tool.
Feedback Topology Over Model IQ — what decides the outcome is not the model but the feedback structure. The theoretical background of the strategy guide the graph returns.
Preconditions for Improving LLM Multi-Agent Accuracy — why L2 AI verification only works when it has independence. The theoretical background of authority asymmetry.

Sources

Toulmin, S. (1958). The Uses of Argument. Cambridge University Press. — the argumentation model from which the defeat graph’s Warrant·Ground·Backing are taken directly.
Dung, P.M. (1995). “On the Acceptability of Arguments and its Fundamental Role in Nonmonotonic Reasoning, Logic Programming and n-Person Games.” Artificial Intelligence, 77(2), 321–357. — the origin of the abstract argumentation framework and the attack (defeat) graph.
Amgoud, L. & Ben-Naim, J. (2013). “Ranking-based semantics for argumentation frameworks.” SUM 2013, LNCS 8078, 134–147. — the weighted h-Categoriser adopted by pkg/graph. The Compensation property by which an attacked node recovers acceptability when defended again, plus convergence guarantees.
Nute, D. (1994). “Defeasible Logic.” In Handbook of Logic in Artificial Intelligence and Logic Programming, Vol. 3. Oxford University Press. — the strict/defeasible/defeater classification. The formal root of reins’s rule levels (Fail/Review) and Supersedes priority.
Modgil, S. & Prakken, H. (2014). “The ASPIC+ Framework for Structured Argumentation: A Tutorial.” Argument & Computation, 5(1), 31–62. — an argumentation system structuring Nute’s classification inside the Dung framework. The lineage of the defeat graph.
Gabriel, V.O. et al. (2020). “Reasoning in BDI agents using Toulmin’s argumentation model.” Theoretical Computer Science, 805, 76–91. — a precedent implementing the Toulmin model in software (BDI agents). reins’s pkg/graph ports this to gate verdicts.
Von Neumann, J. (1956). “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components.” Automata Studies, Princeton University Press. — the principle of putting a reliable protocol on top of unreliable parts (reins’s premise).
Stechly, K., Valmeekam, K., & Kambhampati, S. (2024). “On the Self-Verification Limitations of Large Language Models.” arXiv:2402.08115 — self-verification barely raises performance → why PASS authority must sit with the L1 machine.
McKee-Reid, L. et al. (2024). “Honesty to Subterfuge: In-Context RL Can Make Honest Models Reward Hack.” arXiv:2410.06491 — even an honest model, once it judges its own reward, manipulates it → the grounds for authority asymmetry.
Bondarenko, A. et al. (2025). “Demonstrating Specification Gaming in Reasoning Models.” arXiv:2502.13295 — the more capable, the better it finds gaps in the gate → why gate = rule catalog must grow.
Thaman, K. (2026). “Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use.” arXiv:2605.02964 — deliberately hardening the gate cut exploits by 87.7%.
Fanous, A. et al. (2025). “SycEval: Evaluating LLM Sycophancy.” AAAI/ACM AIES 2025. arXiv:2502.08177 — measuring the sycophancy capitulation rate. The flip side of “for facts, flattery is an asset.”
Shapira, I. et al. (2026). “How RLHF Amplifies Sycophancy.” arXiv:2602.01002 — the theorem that RLHF amplifies sycophancy. The premise of the factual-feedback + flattery = convergence loop.
Deque Systems (2021). “Automated Testing Study Identifies 57 Percent of Digital Accessibility Issues.” — the boundary between the machine-judgeable region (57%) and the human residue (20%).

Changelog

2026-06-17: Added Quick Start — the npx skills add park-jun-woo/reins skill and /reins-quest, by which an agent scaffolds the gate.Definition
2026-06-11: Added the loop unattended generate-verify loop (pkg/llm, ollama/xai/gemini); reflected *quest.Session in the Definition signature; updated package count 7→8
2026-06-05: Initial release