reins: Keep Only the Domain in a Quest CLI; Make the Ratchet a Framework

how-make-quest was about building a quest CLI with your bare hands. What a ratchet is, how you hang a gate, how you block cheese. Hand an agent that one article and out comes a cobra-based Go CLI.

But what happens when you build a second quest CLI? You write the same one-way state machine again. You write the same scan/next/submit/status/export again. You write the same PASS lock, the same monotonically decreasing remaining, the same JSONL export again. The only thing that changes is the gate, yet every time you rewrite all the rest. This is the boilerplate tax you pay each time you build one more quest.

The pattern was reusable. The code wasn’t. reins closes that gap.

What Is Invariant and What Is Domain

Lay two quest CLIs side by side and take the difference, and the boundary is sharp.

Invariant (shared by every quest)     Domain (differs per quest)
─────────────────────────────────     ──────────────────────────
ratchet: TODO→PASS irreversible        what counts as one quest
command skeleton: scan/next/submit…    what counts as a "fact"
level tally: Fail/Review→verdict        which cheese must be blocked
progress persistence · resumable
export: emit once

The left side is exactly what how-make-quest proved — whether the domain is a company name, an endpoint, or a function, the ratchet’s teeth catch the same way. Only the right side is known by a human. reins supplies the left as a framework and leaves you only the right.
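To make the "invariant" column concrete, here is a minimal sketch of a one-way ratchet. It is not reins's actual code: `Item`, `Lock`, `Remaining`, and `ErrLocked` are hypothetical names chosen for illustration. The point is only that the single exported transition goes from TODO to PASS, so the remaining count cannot grow.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical two-position ratchet state (not reins's real type).
type Status int

const (
	TODO Status = iota
	PASS
)

// ErrLocked is returned when something tries to move an item that is already at PASS.
var ErrLocked = errors.New("item already locked at PASS")

// Item is a stand-in for one quest item.
type Item struct {
	ID     string
	status Status // unexported: Lock is the only way to change it
}

// Status reports the current ratchet position.
func (it *Item) Status() Status { return it.status }

// Lock moves TODO to PASS. There is deliberately no method that goes back.
func (it *Item) Lock() error {
	if it.status == PASS {
		return ErrLocked
	}
	it.status = PASS
	return nil
}

// Remaining counts items still at TODO. Because Lock is the only transition,
// this number can never increase between calls.
func Remaining(items []*Item) int {
	n := 0
	for _, it := range items {
		if it.Status() == TODO {
			n++
		}
	}
	return n
}

func main() {
	items := []*Item{{ID: "a"}, {ID: "b"}}
	fmt.Println(Remaining(items)) // prints 2
	_ = items[0].Lock()
	fmt.Println(Remaining(items)) // prints 1
	fmt.Println(items[0].Lock())  // prints the ErrLocked message
}
```

Nothing in this sketch depends on what an item is, which is why it can live in a framework rather than in each quest.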

This isn’t a new claim but an old principle that reins enforces in code: the separation of decision from implementation. The gate is the decision (what is true in this domain); the ratchet, the CLI, and the tally are the implementation. Rewriting the implementation every time is what happens when the decision stays bound to the implementation.

You Implement Only the Gate

To make a quest with reins is to fill in the four methods of a single interface.

type Definition interface {
    Seed(args []string) ([]*quest.Item, error)            // input → initial TODO seeds
    Render(it *quest.Item) (string, error)                // authoring prompt + verification context that next shows
    Prepare(it *quest.Item, raw []byte) (gate.Context, *quest.Verdict, error) // decode the submission
    Rules() []gate.Rule                                   // the gate's violation-rule catalog
}

func main() { cli.NewQuestCmd("myquest", myDef{}, cli.Options{}).Execute() }

That one line of main supplies the ratchet, the six commands, the tally, export, and the resumable session — all of it. What you wrote is just the four pieces of domain. The agent still needs to know only two commands — receive with next, submit with submit. The machine decides the rest.

The Gate Is a Catalog of Cheese-Defense Rules

The core of how-make-quest was “design a gate that can’t be cheesed.” reins turns that design into a data structure — gate = rule catalog. One rule is one cheese detector. When it finds a violation it fires (true) and carries a fact (Fact).

// One cheese-defense rule of a news-event-extraction quest.
// "does the who-anchor actually exist in the source" — if the agent invents a person, it's caught.
var whoAnchorPresent = gate.Rule{
    Meta: gate.RuleMeta{ID: "who-anchor-present", Level: gate.LevelFail, Desc: "required who-anchor exists in source"},
    Check: func(ctx gate.Context) (bool, quest.Fact) {
        sub := ctx.Submission.(*Event)
        if miss := textmatch.MissingTokens(ctx.Source, sub.Who.Anchors); len(miss) > 0 {
            return true, quest.Fact{Where: "who.anchors", Expected: "source substring", Actual: miss[0]}
        }
        return false, quest.Fact{}
    },
}

The virtue of this structure is that it grows. Every time you discover new cheese, you add one rule and the gate gets that much harder. And the catalog documents itself — when the rules command prints the rule list, that is “an audit list of the cheese I’m blocking.” There is no gate that doesn’t know what it blocks.

Severity is not a weight but a level. A single Fail means FAIL. A decisive violation is non-negotiable: nine checks that each score 99 points cannot offset one Fail. Evaluate tallies the fired rules by level: if any fired rule is Fail, the verdict is FAIL; otherwise, if any is Review, it is REVIEW; if no rule fired, it is PASS.
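The tally described above fits in a few lines. The following is a sketch under assumptions: `Level`, `Verdict`, and this `Evaluate` signature are stand-ins written for this example, not the definitions in reins's gate package.

```go
package main

import "fmt"

// Level is the severity of a rule that fired (hypothetical stand-in type).
type Level int

const (
	LevelReview Level = iota + 1
	LevelFail
)

// Verdict is the outcome of one gate evaluation.
type Verdict string

const (
	PASS   Verdict = "PASS"
	REVIEW Verdict = "REVIEW"
	FAIL   Verdict = "FAIL"
)

// Evaluate takes the levels of the rules that fired and returns the verdict.
// Levels are ranked, not summed: one Fail decides the result no matter how
// many other rules stayed silent.
func Evaluate(fired []Level) Verdict {
	verdict := PASS
	for _, lv := range fired {
		switch lv {
		case LevelFail:
			return FAIL
		case LevelReview:
			verdict = REVIEW
		}
	}
	return verdict
}

func main() {
	fmt.Println(Evaluate(nil))                                         // prints PASS
	fmt.Println(Evaluate([]Level{LevelReview}))                        // prints REVIEW
	fmt.Println(Evaluate([]Level{LevelReview, LevelFail, LevelReview})) // prints FAIL
}
```

Because there is no arithmetic, there is nothing for a submission to "average out".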

Authority Asymmetry, Enforced by Type

The single most important line in how-make-quest was “only the machine locks PASS.” reins nails this down not as a convention but as a type.

L1 Machine (deterministic)   the sole authority to lock PASS
L2 AI (skeptic)              REVIEW only — raises doubt but cannot grant completion
L3 Human                     the residue both missed

The machine gate issues PASS. Even if you put an AI verifier into the gate, the most it can do is pull something to REVIEW. It makes the wrong thing impossible in the first place — if the framework offers no API that grants the AI the authority to PASS, you cannot, even by accident, leave the verdict to a drunk friend.
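One way to picture "enforced by type" is sketched below. This is an illustration, not the reins API: `Doubt`, `Skeptic`, and `Decide` are hypothetical names. The AI verifier's return type has no way to say PASS, so the combining function can only ever lower a machine PASS to REVIEW.

```go
package main

import "fmt"

// Verdict is the outcome of one gate evaluation (stand-in type).
type Verdict string

const (
	PASS   Verdict = "PASS"
	REVIEW Verdict = "REVIEW"
	FAIL   Verdict = "FAIL"
)

// Doubt is everything an L2 verifier is allowed to say.
// It has no field or value that could express "PASS".
type Doubt struct {
	Raised bool
	Reason string
}

// Skeptic is the hypothetical L2 interface: it inspects and may raise doubt.
type Skeptic interface {
	Inspect(submission string) Doubt
}

// Decide combines the L1 machine verdict with L2 doubt.
// Doubt can only lower PASS to REVIEW; it never upgrades REVIEW or FAIL.
func Decide(machine Verdict, d Doubt) Verdict {
	if machine == PASS && d.Raised {
		return REVIEW
	}
	return machine
}

// alwaysSuspicious is a toy skeptic used only for the demo below.
type alwaysSuspicious struct{}

func (alwaysSuspicious) Inspect(string) Doubt {
	return Doubt{Raised: true, Reason: "tone looks off"}
}

func main() {
	var s Skeptic = alwaysSuspicious{}
	d := s.Inspect("some submission")
	fmt.Println(Decide(PASS, Doubt{})) // prints PASS
	fmt.Println(Decide(PASS, d))       // prints REVIEW
	fmt.Println(Decide(FAIL, d))       // prints FAIL
}
```

The design choice is that the restriction lives in the signature of `Inspect`, so no amount of prompt wording can widen it.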

A Second Backend — the defeat Graph

For many gates, a level tally of independent rules is enough. But once the rules begin to contend with one another — “this violation only matters when that one is present,” “the root cause of this failure is actually that one” — hand-written if-else guards erode the gate. It’s not where the weak gate breaks, but where the complex gate rots.

reins’s second gate backend moves this contention into a declarative graph — toulmin h-Categoriser. The Toulmin argumentation model becomes the data structure directly:

  • Warrant — tautology PASS. The grounds for “passes if there’s no rebuttal.”
  • Counter — a violation attacks the warrant.
  • Supersedes — priority among rules. Which rebuttal beats which.

Hand-written guard clauses evaporate into Attacks and Supersedes edges. And when there are zero edges, this graph is exactly equivalent to the level tally — complexity is an opt-in cost that turns on only when needed (it turns on when Definition implements gate.Evaluator).

The real gift the graph gives is not the verdict but the feedback. Graph evaluation hands the agent a direct strategy guide — Verdict.Feedback: “why you lost, and what to change to win.” Not a bare “FAIL” but a root cause computed from the structure of the argument.

Here the paradox of how-make-quest works again. The model flatters — it obediently follows instructions. For opinions, flattery is poison; but for facts, flattery is an asset. The strategy guide isn’t an opinion (“this seems a bit off”) but a fact (“who.anchors isn’t in the source, change this”). The more sycophantic the model, the more readily it accepts that fact and converges. Deterministic graph + sycophantic LLM = a loop where convergence is guaranteed.

Isolate Side Effects — ground and staged Evaluation

For a gate to be deterministic, the network must not live inside the gate. A rule that calls net/http directly can’t be unit-tested, and its verdict shakes with the state of the line.

reins corrals side effects into pkg/ground — primitives like HTTPBody and MXResolves own external lookups via an injectable Resolver and a per-request snapshot. The rule stays pure; ground takes responsibility for the outside world.
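The shape of that isolation can be sketched as follows. The names here (`Resolver`, `Snapshot`, `HasMX`, `MissingMX`) are assumptions made for the example; the real signatures in pkg/ground may differ. The only real library call is `net.LookupMX` from the Go standard library, and it sits behind the interface so tests never reach it.

```go
package main

import (
	"fmt"
	"net"
)

// Resolver is a hypothetical seam for external lookups.
type Resolver interface {
	HasMX(domain string) (bool, error)
}

// netResolver is the production side: it is the only code that touches DNS.
type netResolver struct{}

func (netResolver) HasMX(domain string) (bool, error) {
	records, err := net.LookupMX(domain)
	if err != nil {
		return false, err
	}
	return len(records) > 0, nil
}

// Compile-time check that netResolver satisfies the seam.
var _ Resolver = netResolver{}

// fakeResolver answers from a fixed table, so tests are deterministic.
type fakeResolver map[string]bool

func (f fakeResolver) HasMX(domain string) (bool, error) { return f[domain], nil }

// Snapshot memoizes answers for one evaluation, so every rule that asks
// about the same domain sees the same answer.
type Snapshot struct {
	resolver Resolver
	seen     map[string]bool
}

func NewSnapshot(r Resolver) *Snapshot {
	return &Snapshot{resolver: r, seen: map[string]bool{}}
}

func (s *Snapshot) HasMX(domain string) (bool, error) {
	if ok, hit := s.seen[domain]; hit {
		return ok, nil
	}
	ok, err := s.resolver.HasMX(domain)
	if err != nil {
		return false, err
	}
	s.seen[domain] = ok
	return ok, nil
}

// MissingMX is the rule body: it fires when the snapshot reports no MX record.
func MissingMX(s *Snapshot, domain string) (bool, error) {
	ok, err := s.HasMX(domain)
	return !ok, err
}

func main() {
	snap := NewSnapshot(fakeResolver{"example.com": true})
	fired, _ := MissingMX(snap, "example.com")
	fmt.Println(fired) // prints false
	fired, _ = MissingMX(snap, "nope.invalid")
	fmt.Println(fired) // prints true
}
```

Swapping `fakeResolver` for `netResolver{}` changes where the answer comes from without changing a line of the rule.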

And staged evaluation: cheap checks run first, and if they fail the network fetch never happens at all. There’s no reason to do a DNS lookup on a malformed submission. You stand the expensive and shaky behind the cheap and certain.

No N=1 Abstraction

One of reins’s conventions reveals the character of this framework most precisely — do not extract an abstraction from a single consumer. A new abstraction is frozen only after it has been validated by a second consumer.

This isn’t fussiness but first principles. An abstraction extracted from one case mistakes that case’s accident for essence. Only when a second domain demands the same abstraction is it proven to be invariant. The framework applies “verification, not claims” even to its own evolution. Just as the gate doesn’t believe the agent’s claims, the abstraction doesn’t believe a single case’s claim.

The Same Sentence, Made a Library

reins stands on seven packages under pkg/: textmatch (anti-hallucination primitives), temporal (time normalization), quest (the ratchet core), gate (the gate contract), graph (the defeat graph), ground (network isolation), and cli (the cobra scaffold). It passes go build and go test, covering every function. And toulmin is coupled one-directionally to the graph backend only, so a consumer that doesn’t use the graph doesn’t even link toulmin.

Code: github.com/park-jun-woo/reins

If how-make-quest was one sentence — generation may be probabilistic, verification must be deterministic — reins is that sentence hardened into a compilable form. The gate re-verifies the domain’s facts, the ratchet locks what passed, the graph returns the reason you lost as a fact, and the sycophantic model complies with that fact.

Next time you need a quest CLI, don’t rewrite the ratchet. Write only your domain’s gate, and borrow the reins.


Further reading

The principle reins hardened into code — generation is probabilistic, verification is deterministic — is not reins’s own discovery. People who don’t know each other hit the same wall and arrived at the same conclusion. The independent-convergence projects gathered by how-make-quest are the evidence.

  • episteme — forces a Reasoning Surface before irreversible work. The same intuition as reins’s ratchet — PASS verifies before it locks.
  • MagLab — “the LLM reasons only, the numbers come from a deterministic tool.” The same separation as reins isolating side effects into pkg/ground.
  • Manifesto — “Agent proposes, World verifies.” Summarizes reins’s authority asymmetry (only L1 locks PASS) in one sentence.
  • oh-my-kamisama — “diffs beat claims.” The same principle as the gate re-verifying the agent’s fact, not its claim.

And the root of the defeat graph backend is argumentation theory — the Toulmin·Dung·Amgoud lineage in the sources below. reins’s pkg/graph ports that 60-year-old formal logic into a Go data structure.


Sources

  • Toulmin, S. (1958). The Uses of Argument. Cambridge University Press. — the argumentation model from which the defeat graph’s Warrant·Ground·Backing are taken directly.
  • Dung, P.M. (1995). “On the Acceptability of Arguments and its Fundamental Role in Nonmonotonic Reasoning, Logic Programming and n-Person Games.” Artificial Intelligence, 77(2), 321–357. — the origin of the abstract argumentation framework and the attack (defeat) graph.
  • Amgoud, L. & Ben-Naim, J. (2013). “Ranking-based semantics for argumentation frameworks.” SUM 2013, LNCS 8078, 134–147. — the weighted h-Categoriser adopted by pkg/graph. The Compensation property by which an attacked node recovers acceptability when defended again, plus convergence guarantees.
  • Nute, D. (1994). “Defeasible Logic.” In Handbook of Logic in Artificial Intelligence and Logic Programming, Vol. 3. Oxford University Press. — the strict/defeasible/defeater classification. The formal root of reins’s rule levels (Fail/Review) and Supersedes priority.
  • Modgil, S. & Prakken, H. (2014). “The ASPIC+ Framework for Structured Argumentation: A Tutorial.” Argument & Computation, 5(1), 31–62. — an argumentation system structuring Nute’s classification inside the Dung framework. The lineage of the defeat graph.
  • Gabriel, V.O. et al. (2020). “Reasoning in BDI agents using Toulmin’s argumentation model.” Theoretical Computer Science, 805, 76–91. — a precedent implementing the Toulmin model in software (BDI agents). reins’s pkg/graph ports this to gate verdicts.
  • Von Neumann, J. (1956). “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components.” Automata Studies, Princeton University Press. — the principle of putting a reliable protocol on top of unreliable parts (reins’s premise).
  • Stechly, K., Valmeekam, K., & Kambhampati, S. (2024). “On the Self-Verification Limitations of Large Language Models.” arXiv:2402.08115 — self-verification barely raises performance → why PASS authority must sit with the L1 machine.
  • McKee-Reid, L. et al. (2024). “Honesty to Subterfuge: In-Context RL Can Make Honest Models Reward Hack.” arXiv:2410.06491 — even an honest model, once it judges its own reward, manipulates it → the grounds for authority asymmetry.
  • Bondarenko, A. et al. (2025). “Demonstrating Specification Gaming in Reasoning Models.” arXiv:2502.13295 — the more capable, the better it finds gaps in the gate → why gate = rule catalog must grow.
  • Thaman, K. (2026). “Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use.” arXiv:2605.02964 — deliberately hardening the gate cut exploits by 87.7%.
  • Fanous, A. et al. (2025). “SycEval: Evaluating LLM Sycophancy.” AAAI/ACM AIES 2025. arXiv:2502.08177 — measuring the sycophancy capitulation rate. The flip side of “for facts, flattery is an asset.”
  • Shapira, I. et al. (2026). “How RLHF Amplifies Sycophancy.” arXiv:2602.01002 — the theorem that RLHF amplifies sycophancy. The premise of the factual-feedback + flattery = convergence loop.
  • Deque Systems (2021). “Automated Testing Study Identifies 57 Percent of Digital Accessibility Issues.” — the boundary between the machine-judgeable region (57%) and the human residue (20%).

Changelog

  • 2026-06-05: Initial release