# The Key Comes Before the Song
*Published: 2026-05-22*
*Tags: ai, product-work, evals, for-engineers*
*Source: https://chrislema.com/the-key-comes-before-the-song*

---

When Melissa sits down at the piano and I'm about to sing, there's a small moment before anything happens. She picks a key. That isn't the song. The song is what comes after. But if she picks the wrong key, the song doesn't work. I'm reaching for notes I can't hit. My voice strains against a register that wasn't built for it. The melody is still right. The lyrics are still right. But nothing about the performance lands, because the foundational constraint, the one decision made before the first note, was off.

I've been thinking about that small moment a lot lately, because I'm building a new product, and I keep running into the same realization about AI that we'd run into if Melissa just sat down and started playing in whatever key felt right that morning.

## What I'm building

The product is a personal report. You take the Motivation Code assessment, and the report helps you think about how *you specifically* should embrace AI — given the way you're wired. Not generic "here are five tips for using ChatGPT" advice. Something that takes your top motivational dimension and your top five motivations and produces a report that reads like it was written for you, because it was.

The AI does the writing. That's the whole point. That's what makes it possible to deliver a personalized report at a price anyone can afford. If a human had to write each one, the product wouldn't exist.

And right there, in that simple architecture, is where I had to slow down.

## What the obvious approach looks like

If you're building anything where AI generates the content, the obvious instinct is to focus on quality control at the end. Generate the report, then check it. Does it sound right? Did it make anything up? Is the tone where we want it? That instinct is reasonable. I had it too. It's the place most of us start.

The trouble is that if checking at the end is your whole strategy, three things start to bite.

The first is speed. Every report has to go through a heavy review, because nothing upstream constrained what the AI was allowed to produce. So now a human (or another expensive AI call) has to read each one and decide if it's okay. Multiply that by thousands of reports and the margins disappear.

The second is [the trap of reaching for tools that look right but don't hold up](https://chrislema.com/the-harness-is-the-craft). The most common one I tried to talk myself into: vocabulary lists. *We don't want the AI mentioning industries, so let's build a list of industry words and flag any report that contains them.* I sat with that idea for a while before I noticed the problem. "Manager" goes on the list, but "product manager" slips through. Add "analyst" and you miss "senior systems architect." Vocabulary lists either catch things they shouldn't or miss things they should, and the maintenance never ends. It's duct tape on a leak that keeps moving. Looked at honestly, the vocabulary-list approach was a gameable measure — exactly the kind that [makes a scoreboard lie to you](https://chrislema.com/scoreboard-lying-to-you).

The third is the expensive one. You start using AI to check the AI. Every report gets sent to another model that reads the whole thing and scores it. Now you're paying for generation *and* paying for evaluation, and it's surprisingly easy for the evaluation bill to grow faster than the revenue per report. I had to do the math on this twice before I trusted my own answer.

So the real question — and this is the one I think any of us building with AI eventually has to sit with — isn't *should we do quality checks?* It's: **what kind of checks, at what cost, catching what kinds of problems, in what order?**

## Where I landed

What started to make sense was thinking about checks the same way you'd think about security screening at an airport. You don't pull every passenger into a private room and search them thoroughly. You walk them through a metal detector first. Most people pass and move on. Only the ones who set off the cheap check get the expensive check. And the order saves the system from collapsing under its own cost.

That ordering principle turned into four kinds of checks, but they don't really feel like four phases to me. They feel like one system with four checks.

**The first check is code.** Just code, doing exact-match work. Did the report use the right person's name? Are the top five motivations from the input actually the top five in the output, in the right order, with the right scores? Are all the required sections present? Is the output even valid structured data? A computer answers these in milliseconds. The cost is zero per report. The catch rate, in my testing, is enormous — most regressions get caught here before anything more expensive even runs.
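A minimal sketch of what this first layer might look like, assuming the report comes back as a JSON object. The field names (`name`, `motivations`, `sections`), the section list, and the function `check_facts` are all hypothetical, chosen only for illustration.

```python
import json

# Hypothetical section names; the real report's sections aren't listed in this post.
REQUIRED_SECTIONS = ("overview", "motivations", "next_steps")


def check_facts(raw_output: str, expected: dict) -> list[str]:
    """Return exact-match failures. An empty list means the report passed."""
    try:
        report = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(report, dict):
        return ["output is not a JSON object"]

    failures = []
    if report.get("name") != expected["name"]:
        failures.append("wrong or missing name")

    # Motivations are [name, score] pairs; order and scores must match the input.
    if report.get("motivations") != expected["motivations"]:
        failures.append("top motivations do not match the input")

    sections = report.get("sections")
    if not isinstance(sections, dict):
        failures.append("sections missing or malformed")
    else:
        for section in REQUIRED_SECTIONS:
            if not sections.get(section):
                failures.append(f"missing section: {section}")
    return failures
```

Nothing here needs a model call, which is why it can run on every single report.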

**The second check is also code, but it's checking *shape* instead of *facts*.** Word count inside the band of the gold-standard samples. Paragraph length under sixty words on average. No single paragraph over a hundred. Sentence length sane. Headers present. No two sections sharing too many identical five-word phrases (which would mean the AI is looping). None of this tells me if the report is *good*. But it tells me if it's *shaped like* the good ones. And the correlation between "shaped right" and "actually right" is stronger than I expected before I measured it.
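Here is a rough sketch of a few of those shape checks. The sixty-word and hundred-word limits come from the paragraph above; the word band, the `max_shared` threshold of three, and the function names are assumptions made up for the example. Sentence-length and header checks are left out to keep it short.

```python
import re


def words(text: str) -> list[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def five_grams(text: str) -> set[tuple[str, ...]]:
    """Every run of five consecutive words in the text."""
    w = words(text)
    return {tuple(w[i:i + 5]) for i in range(len(w) - 4)}


def check_shape(sections: dict[str, str], word_band: tuple[int, int],
                max_shared: int = 3) -> list[str]:
    """Return shape failures. An empty list means the report is shaped like the samples."""
    failures = []
    paragraphs = [p for body in sections.values() for p in body.split("\n\n") if p.strip()]
    lengths = [len(words(p)) for p in paragraphs]
    total = sum(lengths)

    low, high = word_band  # assumed to be measured from the gold-standard samples
    if not low <= total <= high:
        failures.append(f"word count {total} outside band {low}-{high}")
    if lengths and total / len(lengths) >= 60:
        failures.append("average paragraph length is 60 words or more")
    if any(n > 100 for n in lengths):
        failures.append("a paragraph exceeds 100 words")

    # Shared five-word phrases between two sections suggest the model is looping.
    names = list(sections)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = five_grams(sections[a]) & five_grams(sections[b])
            if len(shared) > max_shared:
                failures.append(f"sections {a!r} and {b!r} share {len(shared)} five-word phrases")
    return failures
```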

**The third check is where AI judgment finally earns its cost.** This is the grounding check — making sure the report didn't invent claims that weren't in the inputs. And this is where the vocabulary list trap really clarified my thinking, because once I'd abandoned the list approach, I had to figure out what to put in its place.

What replaced it was a question reframe. Instead of asking *does this report mention any forbidden words?* you flip it: *is every claim in this report derivable from the inputs?*

That sounds abstract, so here's what the actual prompt looks like for one paragraph of one report:

> You are evaluating whether a paragraph from a generated report is fully grounded in the inputs the generator was given.
> 
> The generator was given ONLY these facts:
> - Person's name: Sarah Chen
> - Top dimension: Visionary
> - Top 5 motivations, in order: Make the Grade (28), Bring to Life (26), Excel (24), Influence Behavior (22), Gain Ownership (21)
> - The MCode data dictionary definition of each of those motivations
> 
> The generator was NOT given the person's industry, role, company, seniority, team size, skill level, age, gender, location, life circumstances, or any motivations or dimensions outside those listed above.
> 
> Paragraph to evaluate:
> "
> {paragraph}
> "
> 
> For each claim the paragraph makes about Sarah, decide whether it's derivable from the facts above. Return JSON with grounded_claims, ungrounded_claims, and a verdict.

That's the whole check. No list of bad words. The judge has the complete universe of allowed facts, and it just asks whether each claim stays inside that universe.
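In code, that check could be wired up roughly like this. It's a sketch under assumptions: `judge` stands in for whatever function sends a prompt to a model and returns its raw reply, the `facts` dictionary keys are invented, and the literal verdict string `"grounded"` is my guess, since the prompt above doesn't pin down the verdict's values.

```python
import json
from typing import Callable

# Copied from the prompt above: things the generator was never given.
NOT_GIVEN = (
    "industry, role, company, seniority, team size, skill level, age, gender, "
    "location, life circumstances, or any motivations or dimensions outside those listed above"
)


def build_grounding_prompt(facts: dict, paragraph: str) -> str:
    """Fill the grounding prompt with one person's facts and one paragraph."""
    ranked = ", ".join(f"{name} ({score})" for name, score in facts["motivations"])
    first_name = facts["name"].split()[0]
    return (
        "You are evaluating whether a paragraph from a generated report is fully "
        "grounded in the inputs the generator was given.\n\n"
        "The generator was given ONLY these facts:\n"
        f"- Person's name: {facts['name']}\n"
        f"- Top dimension: {facts['dimension']}\n"
        f"- Top 5 motivations, in order: {ranked}\n"
        "- The MCode data dictionary definition of each of those motivations\n\n"
        f"The generator was NOT given the person's {NOT_GIVEN}.\n\n"
        f'Paragraph to evaluate:\n"\n{paragraph}\n"\n\n'
        f"For each claim the paragraph makes about {first_name}, decide whether it's "
        "derivable from the facts above. Return JSON with grounded_claims, "
        "ungrounded_claims, and a verdict."
    )


def is_grounded(facts: dict, paragraph: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge about one paragraph and reduce its reply to pass or fail."""
    try:
        result = json.loads(judge(build_grounding_prompt(facts, paragraph)))
    except json.JSONDecodeError:
        return False  # an unparseable verdict counts as a failure
    if not isinstance(result, dict):
        return False
    # Assumption: the judge answers with the literal verdict string "grounded".
    return result.get("verdict") == "grounded" and not result.get("ungrounded_claims")
```

Passing the judge in as a plain function keeps the check testable with a fake model, and treating a malformed reply as a failure means a confused judge can't wave a report through.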

The thing I keep coming back to about this approach is that it's self-maintaining. If next week the AI starts inventing some category of nonsense I never thought to forbid — say, claims about astrological signs or hobbies or political leanings — I don't have to update a list. The check catches it the first time it shows up, because none of those things are in the allowed facts. Anything the report claims about the person that *can't* be traced back to an input is, by definition, ungrounded.

**The fourth check is for tone and feel.** Does the report sound like the gold-standard samples? Is it specific enough that swapping the inputs would make it read wrong, or is it generic enough to apply to anyone? These are judgment calls, and they do need an AI judge. But they only run on a sample of reports, not all of them. The cheaper checks already filtered out the noise.
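One way to pick that sample is to hash the report ID, so the same report always gets the same decision and reruns stay reproducible. This is a sketch, not a description of the actual product: the 5% rate and the name `in_tone_sample` are assumptions.

```python
import hashlib


def in_tone_sample(report_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of all reports for the tone judge."""
    digest = hashlib.sha256(report_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64
```

Compared with calling a random number generator, hashing means a report that was sampled yesterday is still sampled today, which makes regressions easier to compare.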

## Why the ordering carries so much weight

Here's the part I didn't see clearly at first, and it changed how I think about all of this.

The cheap checks aren't only there to catch problems. They're also there to *prevent the expensive checks from being wasted on broken outputs.* If a report fails the schema check (the name is missing, the motivations are in the wrong order, the structure is malformed), there's no point spending tokens having an AI judge evaluate its tone. The report is broken. Skip the expensive check entirely.

When the layers are stacked this way, almost every report ever generated touches only the free checks. The cheap code catches the loud failures. The shape checks catch the structural ones. The grounding check, when it runs, runs on output that's already passed everything else. The tone judge, the most expensive of all, runs only on a sampled subset of reports that have already passed grounding.
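The short-circuit itself is only a few lines. In this sketch the `Check` type, the `evaluate` function, and the three stand-in checks are all hypothetical; the lambdas are placeholders for real checks, and tone sampling is left out.

```python
from typing import Callable, NamedTuple


class Check(NamedTuple):
    name: str
    passes: Callable[[str], bool]


def evaluate(report: str, checks: list[Check]) -> tuple[bool, list[str]]:
    """Run checks in the order given (cheapest first) and stop at the first failure.

    Returns (passed, names of the checks that actually ran).
    """
    ran = []
    for check in checks:
        ran.append(check.name)
        if not check.passes(report):
            return False, ran
    return True, ran


# Stand-in checks for illustration only; real ones would be far more thorough.
PIPELINE = [
    Check("facts", lambda r: "Sarah Chen" in r),
    Check("shape", lambda r: len(r.split()) >= 3),
    Check("grounding", lambda r: "finance" not in r),  # placeholder for the AI judge
]
```

A report that fails `facts` returns immediately, so the grounding placeholder (the only step that would cost money in a real pipeline) is never called for it.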

That's how you keep evaluation from eating the product. Not by skipping it. By ordering it.

## What changed in how I build

Three things shifted in my thinking, and they go beyond this one product.

**The first is that constraints come before content.** Before I write a single line of the prompt that tells the AI what to generate, I write down what the AI is *not allowed to know*. For my report, that means: no industry, no role, no seniority, no skill level, no demographic information. The AI gets the person's name, their top dimension, their top five motivations, and the dictionary. Nothing else. Because if I don't give it those facts, it can't reference them — and if it can't reference them, it can't hallucinate them. The constraint is upstream of the generation, not downstream. That's intelligence spent at design time rather than run time, which is exactly the split between [the two kinds of AI work](https://chrislema.com/there-are-two-kinds-of-ai-work-what-if-youre-missing-one).
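One way to enforce that upstream constraint is an allowlist between the stored profile and the prompt. A minimal sketch, assuming the profile is a dictionary; the field names and `generator_inputs` are invented for the example.

```python
# Hypothetical field names for the only four things the generator may see.
ALLOWED_FIELDS = ("name", "top_dimension", "top_motivations", "definitions")


def generator_inputs(profile: dict) -> dict:
    """Copy only the allowlisted fields; everything else never reaches the prompt."""
    missing = [field for field in ALLOWED_FIELDS if field not in profile]
    if missing:
        raise ValueError(f"profile is missing required fields: {missing}")
    return {field: profile[field] for field in ALLOWED_FIELDS}
```

An allowlist fails safe: a field added to the profile later (job title, say) stays out of the prompt until someone deliberately adds it to `ALLOWED_FIELDS`.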

**The second is that evals are part of the product, not a separate layer bolted on at the end.** The constraints I just described aren't testing decisions. They're product decisions. They shape what the report *can* be. The check that catches a hallucinated industry reference isn't a QA tool — it's the enforcement mechanism for a product promise. *This report will only ever say things that follow from your motivational profile.* That's a real promise to a customer. The eval is how I keep it.

**The third is that I now plan eval cost the same way I plan compute cost.** How many checks run on every report? How many run on a sample? How many only fire when something specific gets flagged? The architecture of *when* and *how often* you check matters as much as *what* you check. Skip that planning and the evaluation budget can quietly outgrow the product.
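That planning can start as back-of-the-envelope arithmetic. The sketch below estimates the average evaluation cost per generated report for a stack of tiers. Every number in `TIERS` is invented for illustration, not a measurement from this product, and the `Tier` fields are my own naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_run: float       # dollars per report checked
    pass_rate: float          # share of checked reports that pass
    sample_rate: float = 1.0  # share of arriving reports that get checked


def expected_eval_cost(tiers: list[Tier]) -> float:
    """Average evaluation cost per generated report, with tiers run in order."""
    reach = 1.0  # share of all reports that get as far as the current tier
    total = 0.0
    for tier in tiers:
        checked = reach * tier.sample_rate
        total += checked * tier.cost_per_run
        reach -= checked * (1.0 - tier.pass_rate)  # failures stop here
    return total


# Hypothetical numbers, chosen only to show the shape of the calculation.
TIERS = [
    Tier("facts", cost_per_run=0.0, pass_rate=0.90),
    Tier("shape", cost_per_run=0.0, pass_rate=0.95),
    Tier("grounding", cost_per_run=0.01, pass_rate=0.98),
    Tier("tone", cost_per_run=0.02, pass_rate=0.90, sample_rate=0.05),
]
```

With these made-up inputs the tiered plan comes to a little under one cent per report, against three cents if both AI judges ran on every report. The real figures will differ, but the calculation shows which knob (ordering, pass rates, or sample rate) moves the bill most.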

## A small invitation

If you're building anything with AI doing content generation — reports, articles, emails, summaries, anything — the temptation is to focus on the prompt and the output. The prompt is the fun part. The output is what your customer sees. So that's where the attention goes.

But the prompt and the output aren't where the work really lives. The song lives there — that's the performance everyone sees. What nobody sees is the key that got chosen before the first note. Get the key right and the singing has a chance. Get it wrong and every new verse — every new use case, every new template, every new edge case — makes the strain worse, not better.

Pick the right key and you can sing as long as you want.

That's what guardrails are really for. Not to stop the AI from being creative. To make creativity *possible at scale*, by removing the failure modes that would otherwise eat your margins, your speed, and your trust.

I'd rather spend a week picking the key than a year apologizing for songs that shouldn't have been sung.

The key comes before the song. Every time.