Why Your Agent Never Stops

The 24/7 Brag

“I’ve got my agent running 24 hours a day.”

You see this a lot on X. As if the longer an agent runs, the more work it gets done. As if a person who never sleeps is more productive.

But the feeling this sentence stirs isn’t admiration. It’s a question.

“Why isn’t it done yet?”


A Healthy System Is One That Can Stop

I handed an agent the task of writing tests for 527 functions. The result:

Autonomous agent:  declares "done" after 40 / 527
CLI loop:          finishes all 527 / 527, then exits

The CLI loop took one hour. Not 24. It processes one function, verifies it, moves on when it passes, and stops once everything is finished. The key to this loop isn’t speed — it’s that the termination condition is mechanically defined.

TODO → write test → measure coverage → PASS/DONE → next → ... → all done → stop

finite. measurable. monotonic. So it converges. So it stops.

Being able to stop is not a weakness. It means the system is healthy.
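
The gap between the two results above comes down to who computes "done". Here is a minimal sketch in Python, assuming a hypothetical status table keyed by function name. The names `statuses`, `remaining`, and `is_done`, and the `TODO`/`PASS` labels, are illustrative and not taken from the actual tool.

```python
# Hypothetical status table: one entry per function that needs a test.
# The 527 and 40 figures mirror the experiment described in this section.
statuses = {f"func_{i}": "TODO" for i in range(527)}
for i in range(40):
    statuses[f"func_{i}"] = "PASS"


def remaining(statuses: dict[str, str]) -> int:
    """Count the items that have not passed verification yet."""
    return sum(1 for status in statuses.values() if status != "PASS")


def is_done(statuses: dict[str, str]) -> bool:
    """Termination is computed from the table, never self-reported."""
    return remaining(statuses) == 0


print(remaining(statuses), is_done(statuses))  # 487 False
```

The count starts at a finite number and only falls when an item passes, so a loop that exits on `is_done` cannot report "done" at 40 / 527.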


Three Reasons It Never Stops

When an agent runs for a long time, it’s usually one of three things.

1. The verifier is weak

"looks good"
"seems better"
"more scalable"
"clean architecture"

These are not convergence criteria. They are subjective judgments. go test returns pass/fail, but who decides whether something is “clean architecture”? Another LLM? That’s like asking your drunk friend, “Am I drunk?”

The empirical evidence backs this up. LLM judges for code evaluation are swayed by superficial variations of semantically equivalent code, which inflate some scores and unfairly lower others (Moon et al. 2025), and models changed their own answers to agree with the user in 58.19% of cases (SycEval, Fanous et al. 2025). “Looks good” has nothing to do with correctness. Weak criteria also do more than fail to stop the loop: once a measure becomes a target, it stops measuring what it was meant to measure (Goodhart’s law; Manheim & Garrabrant 2018), and capable reasoning models hack the verification procedure itself instead of solving the task head-on (Bondarenko et al. 2025).

Without a convergence criterion, there is no end.

2. There is no task boundary

"improve the codebase"
"make the architecture cleaner"
"keep optimizing"

These are tasks with no termination condition. Even human developers wander endlessly under goals like these. An agent is no different. “Improvement” is a direction, not a destination.

3. Entropy outpaces the rate of correction

This is the most common and most insidious pattern.

As the agent makes edits, it adds abstractions. It introduces indirection. It creates unnecessary generalizations. The code looks like it’s “getting better,” but in reality new entropy accumulates faster than the verifier can remove it.

the abstraction built today → removed again tomorrow → added again the day after

This is non-monotonic optimization. It looks like it’s moving forward, but it’s standing still. It looks like a perpetual motion machine, but it’s only consuming energy. In this case, the energy is tokens.

Large-scale evidence captures this drift. Adopting Cursor raised short-term velocity, but static-analysis warnings and code complexity rose continuously, and this accumulation was the main cause of the long-term slowdown (He et al. 2025, 807 open-source repositories). Of the issues introduced across more than 300,000 AI-written commits, 22.7% survived as technical debt all the way to the latest version (Liu et al. 2026). Correction can’t catch up with entropy.
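
One crude way to make the add, remove, add-again cycle visible is to fingerprint each code snapshot and check whether a state ever recurs. The sketch below is an illustration under strong assumptions: `snapshot_id`, `first_revisit`, and the toy `history` are hypothetical names, and real drift rarely returns to a byte-identical state, so a practical check would more likely track a metric such as warning count or complexity.

```python
import hashlib
from typing import Optional


def snapshot_id(source: str) -> str:
    """Fingerprint a code snapshot so identical states compare equal."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def first_revisit(snapshots: list[str]) -> Optional[int]:
    """Return the index of the first snapshot that repeats an earlier state.

    A repeat means the edits in between cancelled out: tokens were spent
    and nothing was gained. Returns None when every state is new.
    """
    seen: set[str] = set()
    for index, source in enumerate(snapshots):
        fingerprint = snapshot_id(source)
        if fingerprint in seen:
            return index
        seen.add(fingerprint)
    return None


# Hypothetical history: an abstraction is added, removed, then added again.
history = [
    "def f(): ...",
    "class FFactory: ...",
    "def f(): ...",
    "class FFactory: ...",
]
print(first_revisit(history))  # 2
```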


Not a Search Problem, a Constraint Satisfaction Problem

This is where a fundamental difference in perspective surfaces.

“Running the agent longer produces better results” is a view that treats software engineering as a search problem — the expectation that searching a wide space long enough will find a better solution.

But software engineering is, in essence, a constraint satisfaction problem.

  • Types must match
  • Tests must pass
  • Coverage must be met
  • Schemas must align
  • Lint rules must be obeyed

Once all these constraints are satisfied, you’re done. There’s nothing more to “search.” Define the constraints, satisfy them, stop. That’s all there is to it.

Code is already a machine-checkable domain. Compilers, type checkers, tests, coverage, linters, schema validation — all of these are deterministic verifiers. With these verifiers in place, why send an agent searching endlessly?

The learning research points in the same direction. When a deterministic verifier such as a unit test is used as the reward (a verifiable reward), code correctness improves over open-ended generation (CodeRL, Le et al. 2022; RLTF, Liu et al. 2023). The verifier isn’t a tool for narrowing the search. It’s evidence that the problem was never a search problem to begin with, but a satisfaction problem.
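
The constraint list in this section can be read as a set of commands whose exit codes decide pass or fail. The sketch below assumes that framing. The `ok` and `broken` commands are placeholders with fixed exit codes, standing in for a real type checker, test runner, or linter, and `passes` and `unsatisfied` are hypothetical helper names.

```python
import subprocess
import sys


def passes(command: list[str]) -> bool:
    """A deterministic verifier: exit code 0 means pass, anything else fails."""
    return subprocess.run(command, capture_output=True).returncode == 0


# Placeholder commands standing in for real tools. Each is just a Python
# one-liner that exits with a fixed status code.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
broken = [sys.executable, "-c", "raise SystemExit(1)"]

constraints = {"types": ok, "tests": ok, "lint": broken}


def unsatisfied(constraints: dict[str, list[str]]) -> list[str]:
    """Names of the constraints that still fail; an empty list means done."""
    return [name for name, command in constraints.items() if not passes(command)]


print(unsatisfied(constraints))  # ['lint']
```

When `unsatisfied` returns an empty list there is nothing left to search for, which is the stopping rule the section argues for.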


The Conditions of a Good Loop

A good agent loop closes in five steps:

1. Define the task   — what must be achieved (a mechanically decidable goal)
2. Limit the scope   — one unit at a time (function, endpoint, file)
3. Symbolic verify   — a deterministic tool decides pass/fail
4. Converge          — pass → next; fail → retry with feedback
5. Terminate         — no items left → stop

In this structure the LLM handles only generation: it produces the candidate for each unit, and the deterministic tool in step 3 checks it. Everything else is done by machines. The key point is that the machine decides “done.” Leave the termination judgment to the LLM, and you’ll hear “done” at 40/527.

The experiments agree. Attach self-critique to an LLM and its performance on reasoning and planning tasks actually collapses; it improves substantially only when you attach a sound external verifier (Stechly et al. 2024). Intrinsic self-correction without external feedback fails, and performance sometimes gets worse after the correction (Huang et al. 2023). There’s a reason we don’t leave termination to the LLM.
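
The five steps can be sketched as a short Python loop. This is an illustration, not the author's actual CLI: `run_loop`, `toy_generate`, and `toy_verify` are hypothetical names, the `generate` callable stands in for the LLM call, and the `max_attempts` retry budget is an added assumption that keeps the loop finite even when an item never passes.

```python
from typing import Callable, Optional


def run_loop(
    items: list[str],
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> dict[str, str]:
    """Process each item until it passes or runs out of attempts, then stop.

    generate(item, feedback) stands in for the LLM call.
    verify(item, candidate) stands in for a deterministic tool; it returns
    None on pass or an error message on failure.
    """
    results: dict[str, str] = {}
    for item in items:  # scope: one unit at a time
        feedback: Optional[str] = None
        results[item] = "FAILED"
        for _ in range(max_attempts):  # bounded retries keep the loop finite
            candidate = generate(item, feedback)
            feedback = verify(item, candidate)
            if feedback is None:  # the tool, not the model, decides pass
                results[item] = "PASS"
                break
    return results  # no items left: terminate


# Toy stand-ins so the sketch runs: the "model" only gets it right
# after it has seen feedback once.
def toy_generate(item: str, feedback: Optional[str]) -> str:
    return item.upper() if feedback else item


def toy_verify(item: str, candidate: str) -> Optional[str]:
    return None if candidate == item.upper() else "expected upper case"


print(run_loop(["a", "b"], toy_generate, toy_verify))  # {'a': 'PASS', 'b': 'PASS'}
```

The model is never asked whether the work is finished. The loop ends because the item list is exhausted, and an item that keeps failing is recorded as `FAILED` instead of being retried forever.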


Creative Writing and Code Are Different

There is one exception. Not every domain works this way.

Writing, marketing, design — these domains have weak verifiers. You can’t mechanically decide “is this sentence good?” In domains like these, a long search can be meaningful: the agent generates many variants and a human chooses.

But code is different. Code is already a world full of deterministic verifiers. In this world, prolonged wandering is not search — it’s drift.


The Question

How many hours has your agent been running right now?

Is it converging, or is it drifting?

Can it stop?

If it can stop, then why hasn’t it stopped yet?



Sources

Termination decisions · the limits of self-verification

  • Stechly et al. “On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks” (2024, arXiv:2402.08115)
  • Huang et al. “Large Language Models Cannot Self-Correct Reasoning Yet” (2023, arXiv:2310.01798)

LLM-as-judge · the unreliability of self-critique

  • Gu et al. “A Survey on LLM-as-a-Judge” (2024, arXiv:2411.15594)
  • Moon et al. “Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation” (2025, arXiv:2505.16222)
  • Fanous et al. “SycEval: Evaluating LLM Sycophancy” (2025, arXiv:2502.08177)

Drift · rising AI code complexity

  • He et al. “Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects” (2025, arXiv:2511.04427)
  • Liu et al. “Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild” (2026, arXiv:2603.28592)

Verifiable reward · verifier-based code generation

  • Le et al. “CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning” (2022, arXiv:2207.01780)
  • Liu et al. “RLTF: Reinforcement Learning from Unit Test Feedback” (2023, arXiv:2307.04349)

Reward hacking · specification gaming

  • Bondarenko et al. “Demonstrating Specification Gaming in Reasoning Models” (2025, arXiv:2502.13295)
  • McKee-Reid et al. “Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack” (2024, arXiv:2410.06491)
  • Manheim & Garrabrant. “Categorizing Variants of Goodhart’s Law” (2018, arXiv:1803.04585)
  • Amodei et al. “Concrete Problems in AI Safety” (2016, arXiv:1606.06565)