# Encoded Judgment: How a Cheap Model Performs Like an Expensive One
*Published: 2026-06-18*
*Tags: ai, encoding-expertise, ai-economics, for-engineers*
*Source: https://chrislema.com/encoded-judgment-cheap-model-performs-like-expensive*
---From 2005 to 2013 I ran a central product team. The setup was unusual, and I've been thinking about it again lately because it turns out we were doing something everyone in AI is trying to do right now. We just didn't have the words for it, and it took us nine months at a stretch instead of an afternoon.

Here's how it worked. The company had a handful of business units, each serving its own market. When one of them needed a new product, my central team built it. We made the investment, did the engineering, and got it working. But we didn't hand it over the day it worked. We kept it. We onboarded the first customer ourselves. Then the second. Then the third. Only after three customers were live did we hand the whole thing to the business unit.

Three wasn't a magic number. It was the smallest number that let us tell a pattern from a fluke. With one customer, every question feels unique. With three, you start to hear the same question twice, and the second time you hear it, you know it's not an accident. It's a rule you haven't written down yet.

So we wrote it down. The entire time those first three customers came through the door, we were recording. Every question a prospect asked before they bought. Every objection that came up during the sale. Every problem that surfaced the week they went live. Every issue that showed up a month later when the newness wore off. We caught all of it, and we built it into a playbook. When we finally handed the product to the business unit, we didn't just hand them software and three customers. We handed them the playbook for everything that was going to happen next.

The business unit could then run the product without us. Not because they'd lived through those nine months, but because we'd written down what we learned during them. The playbook let a team that had never sold this product before answer questions like a team that had sold it a hundred times.

That's the whole idea I keep coming back to. And I keep coming back to it because I got to watch it work slowly, before any of the fast tooling existed, which is the thing that made the shape so clear.

## Three places a decision can live

Every decision your system makes gets made in one of three places, and they cost wildly different amounts.

Picture a help desk. Some questions go to a senior expert who charges by the hour and thinks hard about each one. She's almost always right, and she's never cheap. Some questions go to a trained junior working from a good checklist — faster, and a fraction of the cost. And some questions get answered by a form the customer fills out, which costs nothing and answers instantly.

Your AI system has the exact same three tiers. The expensive frontier model is the senior expert. A small, cheap model is the junior with the checklist. And plain code, where the answer is just yes or no, is the form. All three are useful. The trouble starts when you send every single question to the senior expert, because that's the obvious thing to do and it works. The output is good. Nobody warns you about the bill until it arrives.

I want to be clear that starting expensive is the right move, not a mistake. At the beginning you don't yet know which questions are routine and which ones need real thought. So you let the expert handle everything while you watch. The playbook gets written from watching. You can't write it first. (This is the split I've called [the two kinds of AI work](https://chrislema.com/there-are-two-kinds-of-ai-work-what-if-youre-missing-one): the thinking you do once, up front, and the running you do cheaply ever after.)

## The expert never gets smarter

This is the part that surprised me when I finally said it out loud.

In the Emphasys days, my central team didn't get better at selling the product over those three customers. We were the same people on customer three as on customer one. What got better was the playbook. The product got easier to run, the answers got faster, the launches got smoother — and none of that was because anyone got smarter. It was because we'd turned what we learned into something a less experienced team could use.

The same is true with models. A frontier model is exactly as capable on its thousandth run as on its first. It doesn't climb. What climbs, if you build it right, is the system around it — the playbook, the checklist, the form. **You're not making the expert smarter. You're making the expert unnecessary for the routine work.** (I wrote the engineering version of this idea, with rubrics and graders and the two kinds of loops, in [Loop Engineering: Dumber Models, Smarter Code](https://chrislema.com/loop-engineering-dumber-models-smarter-code). This is the same idea told as a story instead of a system.)

The playbook is the thing that does the lifting. Give a cheap model a thick, well-built reference document and it can produce a first draft that's good enough to edit, on work you'd have sworn needed the expensive model. The reference document is your playbook. It's judgment you encoded once, and it's what lets a junior perform like a veteran — or a cheap model perform like an expensive one. The first time this clicked for me with AI, I'd [stopped writing better prompts and started writing down how I actually make decisions](https://chrislema.com/ai-better-prompts-decision-making) — the written-down decisions did more than any clever prompt ever had.

And that points at the difference that actually shows up on your invoice. **Judgment you capture once and reuse is an asset. Judgment you pay for again on every single request is rent.** The senior expert answering the same routine question for the thousandth time is rent. The playbook is the asset. The work of moving decisions from the first column to the cheaper ones is the work of turning rent back into something you own.

## Watching the handoff actually work

The one part of this that's genuinely hard to eyeball is the handoff itself. Does the judgment you encoded actually survive when a cheaper model picks it up? You can reason about it all day. At some point you just want to see it.

So I built a small tool for exactly that. I call it Benchmark, and it does one thing: it takes a single prompt and runs it across a wall of models at once, cheap ones and expensive ones, side by side, so I can read their answers next to each other. There's a slot to attach a reference file, and that slot is the important part. It's where the playbook goes, so every model in the run is working from the same encoded judgment. Then I read what the cheap models did against the bar I already carry in my head for what the expensive model gives me.

![Comparing multiple models using benchmark](https://cms.chrislema.com/api/media/file/benchmark.png)

That's the whole test. In the run above, the fastest answer came back from Llama 4 Scout — a small, cheap model running on Cloudflare — in a little over two seconds, and the paragraph it wrote is one I'd be comfortable shipping. That's the moment the abstract idea turns concrete. When a cheap model, handed the right reference material, lands close enough to the expensive one, that decision is ready to move down a tier. When it doesn't, that one stays with the expert. The tool just lets me see which is which, in the time it takes to read a screen.

I'm not selling it, and it isn't trying to be a product. It's a bench, in the woodworking sense — a place to test whether a piece holds weight before I build on it.

## Why this matters more now than it did then

I sat on this pattern for years, because for a long time the building blocks didn't exist as off-the-shelf parts. You had to assemble the runtime, the memory, the recording, the secure place to run the cheap model, all of it, before you got to the part that was actually yours. That changed this spring. The infrastructure finally showed up, and the part I care about most — a recording of everything the system did — now comes for free.

Which collapses the calendar in a way that's hard to overstate. The Emphasys loop took nine months and a funded team to run once. You had to hold a finished product back from a business unit that wanted it, fund the central team the whole time, and trust that the playbook would be worth more than shipping early. That took real nerve. Most of the cost was patience.

The AI version costs almost none of that. The runs are cheap. The recording is automatic. The pattern that used to take three quarters to surface can surface in a few days. The discipline that used to require holding your nerve for most of a year now requires holding your attention for an afternoon.

So if you're watching your token spend climb and wondering where it's going, you're not doing anything wrong. You're doing the obvious thing, the thing all of us do at the start: sending every question to the senior expert because she gives such good answers. The opportunity is just to start watching. Notice which questions she answers the same way every time. Those are the ones ready to move to the checklist or the form. You don't have to move them all. You just have to stop paying full price for the ones you've already figured out. (I've written about [the two habits that keep my own AI bills flat](https://chrislema.com/stop-burning-ai-budget) without giving anything up, and this is the thinking underneath both of them.)

## What stays true no matter how fast it gets

A few things didn't change when the loop sped up, and they're the things worth getting right.

Keep the recording. The whole second half of this only works if you saved what happened — not just the final result, but the steps along the way. We caught every question those first three customers asked. A system that throws that away has nothing to learn from. The playbook gets written from the recording, and there's no playbook without it.

Don't let the worker grade its own work. When the cheap model hands you a draft, the part of it that says "this is done" is the most confident line in the whole thing, and the least reliable. Check the work with something that didn't produce it — a test, a separate grader, a form. We never let the team that built a product sign off that it was ready; the customers' questions did that.

And when the cheap model gets something wrong, let it fail in the open. Put a plain fallback underneath it so nobody's ever stuck and no failure is ever silent. A breakage you can see is one you can fix.

I know I've made versions of this point before, and I'll probably make them again. It's one idea, and I keep finding new doors into the same room. This time the door was a product team I ran more than fifteen years ago, waiting for a third customer to walk in so we could finish writing down what we'd learned. The pattern hasn't changed. What changed is that the price of running it dropped to almost nothing — and that's worth saying out loud, even one more time.