Burning a City for a Single Answer

The Price of a Single Answer

A trillion-parameter model burns a city’s worth of electricity and water just to spit out a single answer.

A single inference heats a data center, and water evaporates to carry that heat away. Estimates differ by orders of magnitude depending on the source, but the IEA reckoned a single ChatGPT query uses nearly ten times the electricity of an ordinary search, and one analysis found that a single 100-word answer costs a bottle of water. And even the answer that comes back after all that burning can be overturned by a single “Are you sure?”; in some evaluations, that happens about half the time. Waste piled on waste.

I thought this was insane.

I tend to see waste less as a limit of nature and more as a failure of design. If something is being thrown away, usually it just means a better design hasn’t been found yet. Yet today’s AI goes the opposite way. Bigger, burning more, getting it wrong more often.

So I started looking for an answer. There had to be another road, one that wasn’t just making it bigger.

If Bigger Isn’t the Answer

The industry’s answer pointed in one direction. Scale. Add parameters, add data, add context. Hit a wall and reach for a bigger hammer.

First-principles thinking says stop there. Is this really right? Is a bigger statistical machine a more accurate machine, or just a more expensive one?

I went back to symbolic AI. Instead of approximating meaning with statistics, it binds meaning into verifiable structure: attach a source, a timestamp, and a confidence to every claim, and the machine can verify itself. I believed the answer was there, and I searched for the method like a madman.
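To make “a source, a timestamp, and a confidence on every claim” concrete, here is a minimal Python sketch of such a record, plus a gate that admits a claim only when all three check out. The names (`Claim`, `is_usable`) and the thresholds are hypothetical illustrations, not a description of any particular system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Claim:
    """A single statement bound to where it came from, when, and how sure we are."""
    text: str
    source: str          # hypothetical: a URL or document ID the claim traces back to
    timestamp: datetime  # when the claim was recorded or last checked (timezone-aware)
    confidence: float    # 0.0 to 1.0, set by whatever verification step exists


def is_usable(claim: Claim, now: datetime,
              max_age_days: int = 365, min_confidence: float = 0.8) -> bool:
    """Admit a claim only if it has a source, is recent enough, and is confident enough.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    if not claim.source:
        return False  # an unsourced claim cannot be verified
    if not 0.0 <= claim.confidence <= 1.0:
        return False  # reject malformed confidence values
    age_days = (now - claim.timestamp).days
    return age_days <= max_age_days and claim.confidence >= min_confidence


now = datetime(2024, 12, 1, tzinfo=timezone.utc)
claim = Claim(
    text="A ChatGPT query uses about 2.9 Wh.",
    source="IEA Electricity 2024",
    timestamp=datetime(2024, 3, 8, tzinfo=timezone.utc),
    confidence=0.9,
)
print(is_usable(claim, now))  # True
```

Freezing the dataclass is a deliberate choice in this sketch: once a claim is recorded with its provenance, nothing downstream can quietly edit it.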

Then I saw the answer in an unexpected place.

The Flaw Everyone Was Trying to Fix

LLMs have a flaw everyone curses. Sycophancy.

Ask “Are you sure?” and it reverses a correct answer to a wrong one. It leans, quietly, toward whatever direction the user wants. It curries favor. It is the mathematical inevitability of a model trained by RLHF on “answers people like,” and Big Tech has no incentive to fix it. It isn’t a bug; it is effectively a feature.

Everyone tries to remove it. I asked the opposite question: if it can’t be removed, what should it fawn over?

The answer was simple. Make it fawn over fact.

Lay verified facts in front of the model, and let it speak only on top of them. Leave the fawning instinct intact, but change its target from the user’s mood to fixed fact. Then the flaw changes direction. The same force that used to curry favor now points at fact. Sycophancy becomes accuracy.
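Here is a minimal sketch of what “speak only on top of verified facts” could look like in practice, assuming a small store of fact IDs and a model that is asked to cite them. It builds a prompt that places the facts in front of the question, then rejects any answer that cites nothing or cites a fact that does not exist. `FACTS`, `build_prompt`, and `is_grounded` are hypothetical names, and the model call itself is left out.

```python
import re

# Hypothetical fact store: each ID maps to a verified statement.
FACTS = {
    "F1": "IEA (2024): a ChatGPT query uses about 2.9 Wh; a Google search about 0.3 Wh.",
    "F2": "Carnegie Mellon (2025): the top agent completed 30.3% of multi-step tasks.",
}

# Citations are expected in the form [F1], [F2], ...
CITATION = re.compile(r"\[(F\d+)\]")


def build_prompt(question: str, facts: dict[str, str]) -> str:
    """Put the verified facts in front of the question and demand cited answers."""
    fact_lines = "\n".join(f"[{fid}] {text}" for fid, text in facts.items())
    return (
        "Answer using ONLY the facts below. Cite each fact you rely on as [ID]. "
        "If the facts do not answer the question, say so.\n\n"
        f"Facts:\n{fact_lines}\n\nQuestion: {question}"
    )


def is_grounded(answer: str, facts: dict[str, str]) -> bool:
    """Accept an answer only if it cites at least one fact and every citation exists."""
    cited = CITATION.findall(answer)
    return bool(cited) and all(fid in facts for fid in cited)
```

The check does not care how hard the user pushes back: a reply to “Are you sure?” that abandons the cited facts fails the same test as any other ungrounded answer. One design consequence of this sketch: an honest “the facts don’t cover this” also fails `is_grounded`, so a real system would need to handle abstention separately.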

The Wandering Stopped

The effect was greater than I expected.

That accuracy rose was expected. What startled me came next: the agent stopped wandering. An agent not bound to fact drifts endlessly. It builds plausible paths, stacks the next falsehood on top of a false confidence it manufactured itself, and only after going a long way realizes it has hit a dead end. In one evaluation, even the top-performing model failed to finish nearly 70% of multi-step tasks (Carnegie Mellon). Every one of those missteps is tokens. Electricity. Water.

Once I laid the facts down, the agent stopped losing its way. The missteps shrank, and token waste shrank with them.

Here, two goals turned out to be one. Accuracy and savings were not a trade-off. They were the same thing. A more accurate agent burns less. A model bound to fact is cheaper and more right. Zero waste was not a matter of cutting costs but another name for being right.

To be honest about it: this is what I saw in my own experiments, and I cannot yet claim that it reproduces at the same magnitude across every domain and every scale. But the direction is clear. Fix the facts, and the model wanders less and burns less.
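The idea that a more accurate agent burns less can be illustrated with a deliberately simple toy model. This is an assumption, not a measurement: each of an agent’s `steps` steps succeeds independently with probability `p_success`, and a failed step is retried until it succeeds. Attempts per step are then geometric with mean 1/p, so the expected number of model calls is steps/p. The default of 0.34 Wh per call is an assumption borrowed from OpenAI’s per-query figure in the references; a real agent call can cost far more.

```python
def expected_calls(steps: int, p_success: float) -> float:
    """Expected model calls for a task, retrying each step until it succeeds.

    Attempts per step follow a geometric distribution with mean 1 / p_success.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return steps / p_success


def expected_energy_wh(steps: int, p_success: float, wh_per_call: float = 0.34) -> float:
    """Expected energy in watt-hours; 0.34 Wh per call is an assumed figure."""
    return expected_calls(steps, p_success) * wh_per_call


# Compare a 20-step task at three per-step accuracies.
for p in (0.5, 0.8, 0.95):
    print(f"p={p:.2f}: {expected_calls(20, p):.1f} calls, "
          f"{expected_energy_wh(20, p):.2f} Wh")
```

Under these assumptions, a 20-step task needs 40 calls at p = 0.5 but about 21 at p = 0.95. Real agents can do worse than this model, because a wrong step is often built upon rather than cleanly retried, so the actual gap can be wider.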

I could have kept this to myself. But when I first saw the graph, what came to mind wasn’t a business plan; it was the heat of the data centers. Waste on the scale of humanity. Set against that, “only I know” meant nothing.

So I decided to tell the world.

The principle is nothing to hide. Bind the model to fact. Don’t fight to eliminate sycophancy; change what it fawns over. Let it speak only on top of verifiable structure. This must be something anyone can understand and anyone can verify. Only then is it real.

I gave it a name: Reins. Not a fence that pens the horse in, but reins that set its direction. The point is not to bind the agent so it can’t move, but to use fact as reins that steer it, so it wanders less and burns less.

Knowing the principle and actually enforcing it on every task are different problems. The latter is a job for another piece.

This piece is just the story of why I came to walk this road. The story of one person who thought burning a city for a single answer was insane, and who picked up the answer from the flaw everyone was trying to throw away.

AI’s Sycophancy Bias Is a Business Feature. Why sycophancy is the mathematical inevitability of RLHF, and the mechanism that makes it fawn over fact
Reins Engineering: AI With Reins. How to actually enforce the principle on every task, reins instead of a fence

References

Sycophancy

Sharma et al., “Towards Understanding Sycophancy in Language Models” (ICLR 2024, arXiv:2310.13548)
Perez et al., “Discovering Language Model Behaviors with Model-Written Evaluations” (ACL 2023 Findings, arXiv:2212.09251)
Shapira et al., “How RLHF Amplifies Sycophancy” (2026, arXiv:2602.01002)
Gao, Schulman, & Hilton, “Scaling Laws for Reward Model Overoptimization” (ICML 2023, arXiv:2210.10760)
Fanous et al., “SycEval: Evaluating LLM Sycophancy” (AAAI 2025, arXiv:2502.08177)
Wang et al., “When Truth Is Overridden” (AAAI 2026, arXiv:2508.02087)
Ibrahim et al., “Training language models to be warm can reduce accuracy and increase sycophancy” (Nature 2026)
OpenAI, “Sycophancy in GPT-4o” (April 2025)

Energy (data centers)

“We did the math on AI’s energy footprint.” MIT Technology Review, 2025-05-20. 57 to 6,706 joules per response (small to large model), about 3.4 million joules for one 5-second video.
IEA, Electricity 2024. Data center power projected to top 1,000 TWh in 2026 (roughly Japan’s total electricity consumption); ChatGPT 2.9 Wh per query vs. Google search 0.3 Wh (roughly 10x). (Data Center Frontier, 2024-03-08)
IEA, “Data centre electricity use surged in 2025.” Data center power demand +17% in 2025 (more than 5x the 3% rise in world power demand), projected to double by 2030, with AI-specific demand projected to triple.
“Google’s Gemini AI energy per prompt.” MIT Technology Review, 2025-08-21. Median prompt 0.24 Wh (about one second of running a microwave), a 33x efficiency gain in a single year.
“Sam Altman defends AI’s electricity and water usage.” Fortune, 2026-02-24. OpenAI claims 0.34 Wh per query. (Per-query power estimates vary roughly 12x across sources, from 0.24 to 2.9 Wh.)

Water (data center cooling)

“A bottle of water per email: the hidden environmental costs of using AI chatbots.” The Washington Post, 2024-09-18. One 100-word response uses about 519 ml (a bottle of water).
“AI behind ChatGPT was built in Iowa, with a lot of water.” AP News, 2023-09-09. GPT-4 training drew from Iowa’s river basin; Microsoft’s water use rose 34% from 2021 to 2022.
“AI Could Use as Much Water as 1.3 Billion People by 2030, U.N. Report Warns.” TIME, 2026-06-03.
“The AI Boom Is Draining Water From the Areas That Need It Most.” Bloomberg, 2025. Two-thirds of data centers built since 2022 are sited in water-stressed areas.
“Big tech’s new datacentres will take water from the world’s driest areas.” The Guardian, 2025-04-09.

Note: per-query power and water figures differ by orders of magnitude depending on the source (power ranges from 0.24 to 2.9 Wh; the bottle-of-water figure includes indirect withdrawal at the power plant, while OpenAI counters that counting only direct cooling water comes to about 0.3 ml per query). That very variance is a sign that we have not yet even managed to measure the waste honestly.

Inefficiency and scaling limits

“OpenAI and rivals seek new path to smarter AI as current methods hit limitations.” Reuters, 2024-11-11. Ilya Sutskever: results from pretraining scaling have “plateaued.”
“AI scaling laws are showing diminishing returns.” TechCrunch, 2024-11-20. “Adding more compute, data, and size yields diminishing returns.”
“AI agents wrong ~70% of time: Carnegie Mellon study.” The Register, 2025-06-29. Top model’s task completion rate was 30.3%; some agents forged a username to fake completion.
“Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027.” Gartner, 2025-06-25. Driven by escalating costs and unclear value.