# Loop Engineering: For me it's about using dumber models and smarter code
*Published: 2026-06-16*
*Tags: ai, agentic-software, encoding-expertise, ai-economics, for-engineers*
*Source: https://chrislema.com/loop-engineering-dumber-models-smarter-code*

---

Let me start with what a "loop" even means here, because the word gets thrown around a lot.

When you give an AI agent a real job, something bigger than a single question, you don't want to babysit it through every step. You want to hand it a goal, walk away, and come back to finished work. The only way that happens is if the agent can check its own progress, notice when it's wrong, fix it, and keep going without you. That cycle of work, check, correct, repeat is the loop. **Loop engineering is the work of building the system around the agent so that cycle actually holds up over hours or days.**

I spent a day in June writing the grading sheets for a system like this. It runs software-delivery work through six specialized agents: one that plans, one that designs the architecture, one that writes the code, one that handles the visual design, one that tests, and one that deploys. For each of those, I wrote two grading sheets. I'll call them rubrics, because that's what they are: a scored checklist that says what "good" looks like, with specific levels and clear pass-or-fail gates. One rubric grades the thing the agent produced. The other grades how it behaved while producing it. Thirteen rubrics in all. (The whole setup, the agents and the rubrics, lives in my [claude-environments](https://github.com/chrislema/claude-environments) repo if you want to see it.)

Somewhere around the engineer's behavior rubric, writing a check that fails any run where the agent claims it tested the code but never actually ran it, I caught myself believing something wrong.

I'd been picturing the loop as the thing that makes the agent smarter. Run it, grade it, hand back the notes, run it again, and the agent climbs. **That picture is everywhere right now, and it's mostly wrong. The agent doesn't climb. The model is frozen.** It is exactly as smart on run one thousand as it was on run one. What climbs, if you build it right, is the system around the model. **What a good loop gives you is a cheaper, more trustworthy machine.**

That sounds like a small distinction. It changes how you build everything that comes after it.

## Where the intelligence lives

Here's the idea the rest of this piece hangs on, and it's simpler than it sounds.

Every decision your system makes gets made in one of three places. Picture a company answering customer questions. Some questions go to a senior consultant who charges by the hour and thinks hard about each one. Some go to a trained junior working from a good checklist, faster and far cheaper. And some get answered by a form the customer fills out, which costs nothing and is instant.

Your AI system has the same three tiers. **A frontier model** (the big, expensive, very capable one) is the senior consultant. **A resident model** (a small, cheap model running close to your application) is the trained junior with a checklist. **And deterministic code** (plain old programming, where the answer is just true or false) is the form. The senior consultant is wonderful, and you can't replace them entirely. But if you send every routine question to them, you go broke, and you wait a long time for answers you didn't need to wait for.

**Loop engineering is the discipline of moving decisions down that list over time.** You start with the frontier model carrying almost everything, because at first you don't know which decisions are routine and which need real thought. Then you watch what happens. Run after run, you move work off the senior consultant: some of it onto the junior with the checklist, some of it onto the form. You keep going until the only thing left for the expensive model is the work that genuinely can't be handled any cheaper.
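Here's a minimal sketch of what "three places a decision can be made" might look like in code. Everything in it is an assumption for illustration: the tier names, the `route` function, and the two stand-in model functions are hypothetical, and a real system would call actual models where the fakes are.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical tier labels: the form, the junior with a checklist, the senior consultant.
DETERMINISTIC, RESIDENT, FRONTIER = "deterministic", "resident", "frontier"


@dataclass
class Decision:
    tier: str
    answer: str


def route(question: str,
          rules: dict[str, str],
          resident: Callable[[str], Optional[str]],
          frontier: Callable[[str], str]) -> Decision:
    """Try the cheapest tier first and escalate only when it cannot answer."""
    if question in rules:                      # the form: a plain lookup
        return Decision(DETERMINISTIC, rules[question])
    draft = resident(question)                 # the junior: returns None when unsure
    if draft is not None:
        return Decision(RESIDENT, draft)
    return Decision(FRONTIER, frontier(question))  # the senior consultant


# Stand-ins so the sketch runs without any model API.
def fake_resident(question: str) -> Optional[str]:
    return "use the standard template" if "template" in question else None


def fake_frontier(question: str) -> str:
    return f"considered answer to: {question}"


rules = {"what is the retry limit?": "3"}

for q in ["what is the retry limit?", "which template?", "how should we shard this?"]:
    print(q, "->", route(q, rules, fake_resident, fake_frontier).tier)
```

Moving a decision "down the list" in this sketch means adding an entry to `rules` or teaching the resident model to stop returning `None` for a class of questions.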

I've made this argument before in plainer business terms. [Your AI bill is probably too high](https://chrislema.com/ai-bill-too-high-architecture-fix-saves-40-percent) for exactly this reason: you're sending routine work to the senior consultant on every single request. And the deeper version of the same point is that [the real shift isn't writing better prompts, it's handing the AI your decision-making](https://chrislema.com/ai-better-prompts-decision-making) so it doesn't have to reinvent your judgment each time. **Judgment you capture once and reuse is an asset. Judgment you pay for again on every run is rent.** The loop is the machine that keeps turning the rent back into an asset.

## The scoreboard

So here's the test I'd put on any long-running agent system. Forget the demo. Ask one question: **how much of the frontier model have you managed to retire?**

Not "is the output good." Good output is the price of admission. A single well-written prompt gets you good output some of the time. The thing that separates an engineered loop from an expensive one is whether the loop is shrinking its own bill. **A loop that produces beautiful work but still sends every judgment to the most expensive model, every run, is really a subscription you pay by the request.** You're paying rent on intelligence you could have bought once.

When I look at one of my runs, I want to see the frontier share going down month over month. The base of cheap, automatic checks under the system gets wider. The junior-with-a-checklist handles more of the grading. The senior consultant gets demoted from author, to editor, to occasional advisor. **If that's not happening, the loop isn't working, no matter how good the demos look.**
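One way to put a number on that scoreboard, sketched under assumptions: suppose every decision is logged as a `(month, tier)` pair. The log format, the function names, and the sample data below are all invented for illustration. This version counts decisions; weighting by tokens or dollars would be a reasonable alternative.

```python
from collections import Counter, defaultdict


def frontier_share_by_month(decisions):
    """decisions: iterable of (month, tier) pairs, e.g. ("2026-04", "frontier")."""
    per_month = defaultdict(Counter)
    for month, tier in decisions:
        per_month[month][tier] += 1
    return {
        month: counts["frontier"] / sum(counts.values())
        for month, counts in sorted(per_month.items())
    }


def is_falling(shares):
    """True when each month's frontier share is lower than the month before."""
    values = list(shares.values())
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Invented sample log: ten decisions per month.
log = (
    [("2026-04", "frontier")] * 8 + [("2026-04", "deterministic")] * 2
    + [("2026-05", "frontier")] * 5 + [("2026-05", "resident")] * 5
    + [("2026-06", "frontier")] * 2 + [("2026-06", "resident")] * 4
    + [("2026-06", "deterministic")] * 4
)

shares = frontier_share_by_month(log)
print(shares)              # {'2026-04': 0.8, '2026-05': 0.5, '2026-06': 0.2}
print(is_falling(shares))  # True
```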

There's a well-known example worth holding onto. A team ran roughly two thousand agent sessions to build a C compiler capable of compiling the Linux kernel. Two thousand is the number everyone quotes. The one I'd watch is the split: how many of those sessions needed real frontier thinking, and how many were running tests, applying a known fix, or checking something a form could check. **The whole art is making that second group enormous.**

## Two loops, and most people build only one

There are actually two loops, and the second is the one that usually gets skipped.

**The inner loop runs inside a single task.** The agent drafts, a checker grades it, the feedback comes back, the agent revises, and the cycle closes when the work passes or gets handed to a human. This is the loop most agent tools give you for free. It's genuinely useful. It keeps a single run from stopping early on a plausible-looking but unfinished answer.
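The inner loop can be sketched in a few lines. This is an illustration under assumptions, not how any particular agent tool implements it: `draft` and `grade` are hypothetical callables, the round limit is arbitrary, and the fakes at the bottom exist only so the sketch runs.

```python
from typing import Callable, NamedTuple


class Grade(NamedTuple):
    passed: bool
    feedback: str


def inner_loop(task: str,
               draft: Callable[[str, str], str],
               grade: Callable[[str], Grade],
               max_rounds: int = 3):
    """Draft, grade, revise. Returns (status, work, notes)."""
    notes = []        # every correction the checker made, kept for later analysis
    feedback = ""
    work = ""
    for _ in range(max_rounds):
        work = draft(task, feedback)
        result = grade(work)
        if result.passed:
            return "passed", work, notes
        feedback = result.feedback
        notes.append(feedback)
    return "escalate_to_human", work, notes   # out of rounds: hand it to a person


# Stand-ins: the drafter only adds tests after being told to.
def fake_draft(task: str, feedback: str) -> str:
    return task + (" with tests" if "tests" in feedback else "")


def fake_grade(work: str) -> Grade:
    return Grade(True, "") if "with tests" in work else Grade(False, "add tests")


print(inner_loop("parser", fake_draft, fake_grade))
# ('passed', 'parser with tests', ['add tests'])
```

Note that `notes` is returned rather than discarded. Whether anything reads those notes afterward is a separate question from whether this run passed.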

But the inner loop is forgetful. Everything it learns about how this kind of work tends to fail vanishes the moment the task ends. The next run makes the same mistake, the checker catches it again, the feedback fixes it again, and **you pay for that same lesson every single time.**

**The outer loop is where the real improvement happens, and it runs across runs, not inside them.** It reads the accumulated history: every run, every check result, every note the inner loop produced. And it hunts for patterns. When the same correction shows up forty times, it's one missing rule showing up forty times. The outer loop's job is to take that recurring correction and bake it into something cheaper, so it never has to be made by hand again. Maybe a sharper line in a rubric. Maybe a new automatic check. Maybe an update to the standing instructions every agent reads before it starts.
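As a rough sketch of the pattern-hunting step, assume each past run left behind a list of correction notes, and that those notes have already been normalized to short labels (real feedback would need grouping first, which this sketch skips). The function name, the threshold, and the sample history are hypothetical.

```python
from collections import Counter


def recurring_corrections(run_history, threshold=3):
    """run_history: one list of correction labels per past run.

    Returns (label, run_count) pairs for corrections seen in at least
    `threshold` runs, most frequent first. These are the candidates for
    promotion into a rubric line, an automatic check, or a standing instruction.
    """
    counts = Counter()
    for notes in run_history:
        counts.update(set(notes))   # count each correction once per run
    return [(label, n) for label, n in counts.most_common() if n >= threshold]


history = [
    ["add tests", "fix imports"],
    ["add tests"],
    ["add tests", "add tests"],
    ["rename variable"],
]
print(recurring_corrections(history))  # [('add tests', 3)]
```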

**The inner loop makes a run succeed. The outer loop makes the next run cheaper to succeed.** That's what self-improvement comes down to, once you set the word aside. The system takes yesterday's expensive judgment and turns it into tomorrow's free check. ([Knowledge isn't stored, it's built](https://chrislema.com/knowledge-isnt-stored-its-built), and this is what building it looks like in practice.)

I'll be honest about the hard part. The outer loop only works if you keep a recording of what the agent actually did, the full sequence of steps, not just the final result. Most systems throw that recording away, and an outer loop with nothing to read can't find the pattern in the first place. **Keeping that record is the price of admission for the whole second loop.**

## The checker is the line between trust and theater

A coding agent will write a plan, build a feature, run a couple of things, and tell you it's done. **That last line is the most confident-sounding text in the whole transcript, and the least reliable thing in it.** The agent is, after all, the one being graded. The entire game is replacing the agent's word with a check it can't talk its way around.

For code, that check is concrete: a test suite, a type checker, a script that either passes or doesn't. For fuzzier work, where there's no clean pass-or-fail, it's a model grading against a rubric. And there's one rule I'd carve into the wall above all the others.

**The thing that did the work does not get to grade the work.**

Think about why that's obvious in any other setting. You wouldn't let a student grade their own exam. You wouldn't let a contractor sign off on their own inspection. The moment the worker is also the judge, the work quietly bends toward whatever makes the judge happy, instead of toward what's actually correct.

So I keep my graders as separate, cheaper models, running in isolated environments, away from the system that ran the agent. Some of the managed agent platforms now offer a "self-evaluation" feature where the agent grades itself. They report real gains, and I believe them. **I still won't use that path for anything that matters, because a loop where the worker grades itself drifts toward whatever the worker finds easy to claim.** This is exactly the discipline I built into my [feedback-loop skill](https://github.com/chrislema/skills/tree/main/feedback-loop): the author of the work, the grader of the work, and the fixer of the work are never the same pass.

## Grade the behavior, not just the result

This is the part that's easiest to get wrong when you set up grading, and it's worth slowing down on.

It's tempting to grade the shape of the finished product. Take the agent's output template, make one grading line for each section, and you've got a rubric. The problem is that **the finished product can be right for the wrong reasons, and a product-only grader can't tell the difference.**

Here's a concrete case. My instructions tell the planning agent to ask me a question only when it's truly stuck, and otherwise to make a sensible assumption and keep moving, rather than stalling the whole job to confirm something obvious. None of that shows up in the plan it hands me. A planner that pestered me about settled details and a planner that quietly made the right call can produce the exact same plan. **If I only grade the plan, those two look identical, and the one that wasted my attention scores just as well as the one that respected it.**

So I grade two things. One rubric reads the finished work. The other reads the recording of what the agent did, step by step. That second one is where the behavior rules live. Did the engineer run the code before claiming it works? The finished code can't tell me; the recording can, and a simple automatic check settles it with no model involved at all. Did the agent stay inside the files it was supposed to touch, or wander off and "tidy up while it was in there"? That's a quick check against the record of what it changed. Did the run end cleanly, or just thrash until it ran out of turns? **The behavior rules are the ones that define what your system is actually trying to be, and they're usually invisible in the output. Grade the place where they live, or admit you're not measuring them.**
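Two of those behavior checks can be sketched as plain code over the recording, with no model involved. The event format here is an assumption made up for illustration (a list of dicts with a `type` field), as are the function names and the list of test commands; a real runtime's transcript will look different.

```python
from pathlib import PurePosixPath


def claimed_tests_without_running(events, test_commands=("pytest", "npm test")):
    """True when the agent says tests pass but no test command ran before the claim."""
    ran_tests = False
    for event in events:
        if event["type"] == "tool_call" and any(
            event["command"].startswith(cmd) for cmd in test_commands
        ):
            ran_tests = True
        elif (event["type"] == "message"
              and "tests pass" in event["text"].lower()
              and not ran_tests):
            return True
    return False


def files_outside_scope(events, allowed_dirs):
    """Paths the agent edited that do not sit under any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    touched = [PurePosixPath(e["path"]) for e in events if e["type"] == "file_edit"]
    return [str(p) for p in touched if not any(a in p.parents for a in allowed)]


honest = [
    {"type": "tool_call", "command": "pytest -q"},
    {"type": "message", "text": "All tests pass."},
]
bluffing = [
    {"type": "file_edit", "path": "src/app.py"},
    {"type": "file_edit", "path": "docs/notes.md"},
    {"type": "message", "text": "Tests pass, done."},
]

print(claimed_tests_without_running(honest))    # False
print(claimed_tests_without_running(bluffing))  # True
print(files_outside_scope(bluffing, ["src"]))   # ['docs/notes.md']
```

The string match on "tests pass" is deliberately crude. The point of the sketch is that the evidence lives in the event sequence, not in the finished code.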

## Why "too hard to automate" is usually wrong

The reason the frontier share can fall so far is a pattern worth naming, because it trips people up.

It's natural to mark a task as "needs the expensive model" because doing it well *in one shot* is hard. But hard-in-one-shot and needs-the-expensive-model are two different claims, and they're easy to run together. **They're worth pulling apart.**

Here's the move. If you already have a way to grade the output (a test, an automatic check, a rubric), then "hard in one shot" comes apart. A cheap model writes a first draft against a thick reference document. Automatic checks throw out the obvious failures. And a short, budgeted pass from the expensive model reviews only what's left. The senior consultant stops writing from scratch and starts editing, and **editing costs a fraction of authoring.** The very existence of a grader is the work confessing that it's gradeable. And gradeable means you can make it cheaper.
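The draft, filter, then edit sequence might be sketched like this. It assumes a batch of tasks and a hard cap on expensive calls; `staged_run` and the three stand-in functions are hypothetical names, and the fakes replace real model calls so the sketch runs on its own.

```python
def staged_run(tasks, cheap_draft, checks, frontier_edit, frontier_budget):
    """Cheap draft, free checks, then a capped number of expensive edits.

    Returns (finished, rejected, frontier_calls).
    """
    finished, rejected, calls = {}, [], 0
    for task in tasks:
        draft = cheap_draft(task)
        if not all(check(draft) for check in checks):
            rejected.append(task)          # thrown out for free, no frontier spend
            continue
        if calls < frontier_budget:
            draft = frontier_edit(draft)   # editing, not authoring
            calls += 1
        finished[task] = draft             # past the budget, the checked draft ships as-is
    return finished, rejected, calls


# Stand-ins: the cheap model fails on task "b" by returning nothing.
def fake_cheap_draft(task):
    return "" if task == "b" else f"draft of {task}"


def not_empty(draft):
    return len(draft) > 0


def fake_frontier_edit(draft):
    return draft + " (edited)"


finished, rejected, calls = staged_run(
    ["a", "b", "c", "d"], fake_cheap_draft, [not_empty], fake_frontier_edit,
    frontier_budget=2,
)
print(finished)  # {'a': 'draft of a (edited)', 'c': 'draft of c (edited)', 'd': 'draft of d'}
print(rejected)  # ['b']
print(calls)     # 2
```

Task `d` shows the floor at work: once the budget is spent, the output is less polished, but it has still passed every cheap check.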

Watch for the same trick on the things that smell impossible to automate. "Taste" often comes apart into generate-a-bunch, then-pick-the-best, where a cheap model makes the candidates and something else does the choosing. "Synthesis," in a field you know well, is usually just recognizing a familiar shape and filling it in from a library you built once. And "judgment," when your own written rules already say *if it's situation X, do Y, never Z*, has already confessed that it's a decision table. Write the table down. Only genuine ambiguity needs to go up to the expensive model. (This is the difference between real automation and [just doing manual labor faster](https://chrislema.com/stop-prompting-ai-for-every-task-thats-not-automation-thats-just-faster-manual-labor).)
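To make "write the table down" concrete, here is a small sketch. The situations, actions, and forbidden set are invented examples, and `decide` and `escalate` are hypothetical names. The "never Z" part of a rule becomes a check that applies even to answers coming back from the expensive model.

```python
# Hypothetical rules of the form "if it's situation X, do Y".
DECISION_TABLE = {
    "failing_test": "fix_code",
    "missing_dependency": "add_to_manifest",
    "flaky_test": "rerun_once",
}

# The "never Z" half: actions no tier is allowed to take.
FORBIDDEN = {"delete_test", "skip_test"}


def decide(situation, escalate):
    """Look the situation up; only unknown situations go to the expensive model."""
    action = DECISION_TABLE.get(situation)
    if action is None:
        action = escalate(situation)       # genuine ambiguity goes up
    if action in FORBIDDEN:
        raise ValueError(f"forbidden action {action!r} for {situation!r}")
    return action


escalations = []


def fake_escalate(situation):
    escalations.append(situation)
    return "ask_architect"


print(decide("flaky_test", fake_escalate))       # rerun_once
print(decide("schema_conflict", fake_escalate))  # ask_architect
print(escalations)                               # ['schema_conflict']
```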

What legitimately survives as frontier work has a tell: mistakes that pass every automatic check and still cost you badly later. A badly-jointed architecture that quietly corrupts everything built on top of it. A flatness in the writing that no checker can see. **Aim your remaining expensive budget at exactly those surfaces, one short editorial pass per run, with a hard limit and a floor of cheap checks underneath.** That way, when the expensive model is absent, the ceiling drops but nothing breaks.

I should name my own influences here, because none of this came from nowhere. The roles, the two-rubric idea, and the design-time discipline are spelled out in my [design-time-audit skill](https://github.com/chrislema/skills/tree/main/design-time-audit) and in the [claude-environments](https://github.com/chrislema/claude-environments) repo. And the six-agent shape, a planner, an architect, an engineer, a designer, a tester, and a deployer, is the one I wrote up a while back when [I "hired" an engineering team where none of them are human](https://chrislema.com/i-hired-an-engineering-manager-architect-developer-designer-and-qa-engineer-none-of-them-are-human).

## What keeps a self-improving loop honest

A loop that improves itself automatically has a strong pull toward making its own grader happy instead of making its work good. If you let one system write the rubric, grade against it, and apply the fixes in one unbroken cycle, it will find the easiest path to a high score, and that path almost never runs through better work. So the guardrails have to sit *outside* the loop. Three of them do most of the work.

**Keep a human on the scope calls.** Not every gap in an output is a defect. Sometimes a gap is the honest cost of a deliberately narrow job, and closing it means changing what the system is even trying to do. That's a decision a person should make, not a number the loop should chase. When my feedback separates a quick fix from a "this would expand the scope" call, the scope calls come to me, not to another round of automation.

**Anchor every improvement to reality you've already labeled.** Before any change to a rubric ships, it has to prove itself against a set of past runs I've graded by hand, and it has to still tell a known-good example apart from a known-bad one. A change that makes the grader happier but does worse on that hand-graded set is a step backward, full stop. **Those saved examples are the regression test for the grader itself.**
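A regression test for the grader could look something like the sketch below. It assumes graders return a score between 0 and 1 and that the hand-graded set is a list of `(run, human_label)` pairs; the function names, the pass mark, and the sample runs are all made up for illustration.

```python
def agreement(grader, labeled_runs, pass_mark=0.5):
    """Fraction of hand-graded runs where the grader's verdict matches the human's."""
    hits = sum((grader(run) >= pass_mark) == label for run, label in labeled_runs)
    return hits / len(labeled_runs)


def should_ship(new_grader, old_grader, labeled_runs, known_good, known_bad):
    """A rubric change ships only if it still separates good from bad
    and agrees with the hand-graded set at least as well as before."""
    separates = new_grader(known_good) > new_grader(known_bad)
    no_worse = agreement(new_grader, labeled_runs) >= agreement(old_grader, labeled_runs)
    return separates and no_worse


# Invented hand-graded set: True means a human judged the run good.
labeled = [
    ({"tests_ran": True, "words": 50}, True),
    ({"tests_ran": False, "words": 900}, False),
    ({"tests_ran": True, "words": 700}, True),
    ({"tests_ran": False, "words": 100}, False),
]


def current_grader(run):
    return 1.0 if run["tests_ran"] else 0.0


def wordy_grader(run):
    # A "happier" grader that rewards length instead of behavior.
    return min(run["words"] / 1000, 1.0)


good, bad = labeled[0][0], labeled[1][0]
print(agreement(current_grader, labeled))                           # 1.0
print(agreement(wordy_grader, labeled))                             # 0.5
print(should_ship(wordy_grader, current_grader, labeled, good, bad))  # False
```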

**Prefer failing loudly over recovering cleverly.** When the cheap model returns garbage, there has to be a plain fallback underneath it (a default, a generic version, a queued human review) so the user is never stuck and the failure is never silent. **A system that hides its own breakage is worse than one that breaks in the open, because you can't fix what you can't see.**
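Here's one hedged sketch of that guardrail: the user always gets an answer, and every failure is both logged and queued for a person. The function name, the validity check, and the in-memory queue are assumptions for illustration. The broad `except` is deliberate in this sketch, since the goal is that no failure of the cheap model reaches the user unhandled or goes unrecorded.

```python
import logging

logger = logging.getLogger("loop")


def answer_with_fallback(question, cheap_model, is_valid, default_answer, review_queue):
    """Return (answer, source). On any failure: log it, queue it, use the default."""
    reason = None
    answer = None
    try:
        answer = cheap_model(question)
        if not is_valid(answer):
            reason = f"invalid output: {answer!r}"
    except Exception as exc:
        reason = f"exception: {exc}"
    if reason is None:
        return answer, "cheap_model"
    logger.error("cheap model failed on %r (%s); using default", question, reason)
    review_queue.append((question, reason))   # a human sees every failure
    return default_answer, "fallback"


queue = []
ok = answer_with_fallback("q1", lambda q: "ok", str.isalnum, "generic", queue)
bad = answer_with_fallback("q2", lambda q: "%%%", str.isalnum, "generic", queue)
print(ok)     # ('ok', 'cheap_model')
print(bad)    # ('generic', 'fallback')
print(queue)  # [('q2', "invalid output: '%%%'")]
```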

## The infrastructure finally caught up

I sat on this design for a long time, because the building blocks didn't exist as off-the-shelf parts. You used to have to assemble the agent runtime, the save-and-resume machinery, the secure sandbox, the per-job memory, and the durable control flow yourself: months of plumbing before you wrote a single line of the part that was actually yours.

That changed this spring. There's now a managed runtime that hands you the agent loop, the tool execution, the secure code sandbox, the save-and-resume, the scoped permissions, and, the part I care about most, a recording of everything the agent did, produced for free. That recording is the behavior log my rubrics need. And the cloud platform underneath it now offers durable workflows that survive crashes, sleep for days, and pick up exactly where they left off, plus isolated sandboxes that start in milliseconds and spending limits that cap the bill before it runs away. **The executor, the loop, the checker, and the memory each have a real home now.**

I'd take the managed runtime for the executor without hesitation. There's no reason to rebuild a sandboxed, recorded, save-and-resume agent loop yourself, and rolling your own from scratch is a path the model providers are actively closing anyway. But I'd keep the checker, the loop's decisions, and the self-improvement on my own ground, transparent and human-gated, because that's where the independence has to live, and that's where the captured judgment is mine.

**Now that the infrastructure is here, the discipline is what's left to get right:** knowing which decisions are stable enough to make cheap, building the checker the worker can't game, grading the behavior and not just the result, and running the outer loop that turns this month's expensive judgment into next month's free check.

Build the loop for the machine, not the agent. Then measure it by how much of the expensive model you managed to retire. **If that number isn't falling, you don't have a loop. You have a rent payment that's due.**
