May 10, 2026
The harness is the craft.
Most engineers pick a model they trust, eyeball a couple of runs, and ship. I don't ship that way. Here's what eval-driven development actually looks like, and the seven principles I'd hand to anyone shipping LLM systems.
If you've ever built anything with AI, not AI writing your code but your code calling AI to do work, you've likely experienced this dynamic: something that worked once didn't work again, or got worse, or just went sideways.
So you jump in and tweak the prompt. Maybe it gets better. Maybe it gets much better. Honestly, as long as it doesn't stay bad, you call it fine. But how do you know it won't get worse the next day? Or the following week?
I can't live in that world. And I'm guessing you can't either.
Why I run every model through an eval first
This weekend, I was building a job-fit analyzer. Six stages: embed the job description and vector-match it, blend into a motivational profile, align against the candidate, score, narrate, generate questions. The blend stage was the most expensive LLM call, so it was the one where model choice mattered most.
I picked gpt-oss-20b. I thought it would do well; the published benchmarks looked strong for this kind of reasoning task. I ran it against a couple of test cases, and the outputs looked plausible.
That's the move most engineers make: pick a model you have reasons to trust, eyeball a couple of runs, ship. If you're reading this, odds are you've done the same thing this week.
I don't ship that way. I run the eval.
What the eval caught in 90 seconds
The eval is simple. Same job description, same candidate profile, four candidate models running in parallel. Compare outputs side by side. Validate structure first, then quality.
gpt-oss-20b returned empty content. Every single time. Not bad content. Empty.
Here's why. It's a reasoning model. Its tokens were going into a reasoning_content field. My code was reading content. The field was empty. The wrapper returned successfully. Nothing logged an error.
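Here's roughly what the guard looks like at the call site, the one that would have made the failure loud. This is a sketch, not my production code: it assumes an OpenAI-compatible response object, and the only field names taken from the real story are content and reasoning_content.

```python
# Sketch: treat "the call succeeded" and "the call produced usable output"
# as two separate questions. Assumes an OpenAI-compatible chat completion
# response; everything beyond content / reasoning_content is illustrative.

def extract_blend_output(response) -> str:
    message = response.choices[0].message
    content = (message.content or "").strip()

    if content:
        return content

    # Reasoning models can pour their tokens into reasoning_content and
    # leave content empty. The wrapper still "succeeds", so make the
    # emptiness loud instead of letting it flow downstream.
    if getattr(message, "reasoning_content", None):
        raise ValueError(
            "Got reasoning_content but no content; "
            "this model isn't producing a usable blend output."
        )
    raise ValueError("Model returned empty content.")
```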
If this had hit production, here's what would have happened. Users would get a job analysis with the blend stage silently missing. Downstream stages would chug along with a blank input. The output would feel "off" in a way nobody could quite name. I'd spend hours debugging "something seems wrong with the rationale" while the actual problem was an empty string two layers up.
The eval caught it in 90 seconds.
That's why I do this. Not because I learned a lesson this weekend. Because every weekend has a version of this moment, and the eval is what stands between it and your users.
What eval-driven development actually is
Eval-driven development is to LLM systems what test-driven development is to deterministic code, with one important difference. Your outputs aren't right or wrong. They're better or worse on several axes at once.
You can't unit-test "did the rationale make sense." But you can build a harness that runs the same input through four candidate models and reports pass-rate, latency, structural conformance, and qualitative comparison side by side. The reason that matters, and the reason I never let a single LLM evaluate its own work, is that differently-calibrated systems catch different things. The comparison is where the architectural decision lives, so you let it make the call instead of your gut.
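Here's a minimal sketch of that harness. It assumes an OpenAI-compatible client; the model names, prompt format, and required JSON fields are placeholders, not the real project's, and it runs the candidates sequentially where the real thing runs them in parallel.

```python
import json
import time
from statistics import median

from openai import OpenAI  # any OpenAI-compatible client works here

client = OpenAI()
CANDIDATES = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names

def validate(raw: str) -> bool:
    # Structure first: does it parse, and does it carry the required fields?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in ("profile", "rationale"))  # placeholder fields

def run_eval(cases: list[dict]) -> None:
    for model in CANDIDATES:
        passes, latencies = 0, []
        for case in cases:
            start = time.perf_counter()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
            )
            latencies.append(time.perf_counter() - start)
            text = (resp.choices[0].message.content or "").strip()
            if validate(text):
                passes += 1
        print(f"{model}: {passes}/{len(cases)} passed structure, "
              f"median latency {median(latencies):.1f}s")
    # Quality comparison happens after this: read the passing outputs side by side.
```

Pass-rate and latency fall out of the loop for free. The side-by-side reading of the passing outputs is where the judgment comes in.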
The shift sounds small. It changes how everything moves.
Instead of choosing a model because you think it'll do well, you choose it because the eval said it earned the slot. And once you've built the harness, every prompt change, every cost optimization, every "this seems better" intuition runs through the same machinery. You stop guessing whether your improvement actually improved anything.
The eval becomes the project's unit of progress.
What the harness then drove
After flagging gpt-oss-20b, the same harness drove the next decision.
I compared three production-viable models on the same five cases: llama-3.1-8b-fast, llama-4-scout-17b, and mistral-3.1-24b.
The 8B model validated 80% of the time at 3 seconds. The bigger models validated 100% at 14 to 15 seconds.
That's not "8B is better" or "8B is worse." That's the data that justified a routing architecture. Cheap fast primary, expensive fallback only when validation fails. After a follow-up prompt change tuned by the same eval, the escalation rate dropped 60 percentage points.
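In code, the routing idea is small. This is a sketch under the same assumptions as the harness above (OpenAI-compatible client, placeholder JSON fields); the model names are the ones from the eval, the rest is illustrative.

```python
import json

from openai import OpenAI

client = OpenAI()
PRIMARY = "llama-3.1-8b-fast"    # fast and cheap, passed structure ~80% of the time
FALLBACK = "llama-4-scout-17b"   # slower and pricier, but passed structure reliably

def call_model(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return (resp.choices[0].message.content or "").strip()

def validate(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in ("profile", "rationale"))  # placeholder fields

def blend(prompt: str) -> dict:
    # Cheap fast primary first.
    raw = call_model(PRIMARY, prompt)
    if validate(raw):
        return json.loads(raw)

    # Escalate only when structural validation fails, so the fallback's
    # 14-to-15-second latency is paid on the minority of cases.
    raw = call_model(FALLBACK, prompt)
    if validate(raw):
        return json.loads(raw)

    raise RuntimeError("Both primary and fallback failed structural validation.")
```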
None of that decision tree exists without the eval. It's a direct output of being able to see the comparison.
Seven principles I'd hand to anyone shipping LLM systems
These aren't theory. They're how I work, and what the eval has been teaching me for months.
One. Build the eval before you ship the model. If you can't measure it, you can't tell if a change helped. Write the comparator first, even a crude one. Then start tuning.
Two. Vary one variable at a time. Pin every stage except the one you're testing. Otherwise effects mix. A "rationale model improvement" might actually be a blend-output change feeding it different inputs.
Three. Validate structure before evaluating quality. "Does it parse as JSON with the required fields" is a yes-or-no question with no judgment. Quality comparisons are only meaningful among outputs that pass structure first. You'll be surprised how often the bigger, more expensive model is the one that fails structure.
Four. Look at distributions, not averages. "Average latency 8 seconds" hides "1 in 5 calls returns empty." The interesting story is almost always in the tails, and the tails are where production breaks (see the sketch after this list).
Five. Persist results. Models change. Prompts change. Tokenizers silently update. Without a record of what was true on what date, you'll re-run the same eval next month with no way to tell whether the answer changed because you did or because the model did.
Six. Tie evals to production telemetry. An eval is a hypothesis. "Model X is the right choice." Production data is the test of that hypothesis. When real-world failure modes don't match what the eval predicted, the eval is wrong. Update it.
Seven. Make the eval cheap to re-run. If running the comparison takes 20 minutes of manual curl-and-grep, you'll do it once and never again. If it's a single command, you'll run it every time you touch a prompt. That's where the leverage is.
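To make four, five, and seven concrete, here's one way the reporting end of a harness can look. Again a sketch: the results format, file layout, and field names are mine for illustration, not the project's.

```python
import json
from datetime import date
from pathlib import Path

def report_and_persist(results: list[dict], log_dir: str = "eval_runs") -> None:
    """Rows look like {"model": ..., "case_id": ..., "passed": bool, "latency_s": float}."""
    # Principle five: persist every run, dated and append-only, so next
    # month's re-run has something to diff against.
    path = Path(log_dir) / f"{date.today().isoformat()}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        for row in results:
            f.write(json.dumps(row) + "\n")

    # Principle four: report the tails, not the mean.
    by_model: dict[str, list[dict]] = {}
    for row in results:
        by_model.setdefault(row["model"], []).append(row)

    for model, rows in by_model.items():
        latencies = sorted(r["latency_s"] for r in rows)
        pass_rate = sum(r["passed"] for r in rows) / len(rows)
        p50 = latencies[len(latencies) // 2]
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        print(f"{model}: pass {pass_rate:.0%}, p50 {p50:.1f}s, p95 {p95:.1f}s")
```

Put that behind one command and principle seven takes care of itself: re-running the eval stops being a decision and starts being a reflex.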
Why this matters for what your job is becoming
There's a version of this story where AI replaces engineering judgment. That's not the version I've lived.
The version I've lived looks like this. AI can write the model call. AI can suggest the prompt. AI can generate a candidate rationale. What AI can't do is decide whether the output earned the slot. That's a judgment about your system, your users, your edge cases, and your tolerance for failure modes.
The eval is where that judgment lives.
Engineers who use AI tools to write code faster are doing fine work. Engineers who ship AI systems are doing different work. They're not writing the call. They're designing the measurement that decides which call gets shipped. They're not picking the model. They're building the harness that picks the model. What I'm describing here, for engineering, is one instance of the 4-part judge-in-the-loop pattern that shows up everywhere AI does real work. The artifact changes. The four parts don't.
The harness is the craft.
And if you're reading this thinking "I should probably have an eval for the LLM call I shipped last month," yeah. You probably should. Start crude. Five cases, four models, structure check, eyeball the rest. You'll know more in 90 seconds than you knew in months of vibes.
About the Author
Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.