# Inside the Skill: The Eight Steps That Build an Assessment
*Published: 2026-06-07*
*Tags: ai, product-work*
*Source: https://chrislema.com/inside-the-claude-skill-that-builds-assessments*
---

[Last time I told you](https://chrislema.com/claude-skill-converts-leads-with-a-quiz) about the round trip: a one-page brief in, a working lead-gen quiz out. Today I want to open the hood.

Because the magic isn't that an AI wrote some quiz questions. The magic is the *system* underneath: eight steps that run in order, each one handing the next a little more structure, until what comes out the other end is a finished diagnostic that a checker has already refused to ship if anything's broken.

Let me walk you through it. But first, the one idea that makes the whole thing make sense.

## The trick: build a spec, not a quiz

I've written before about [the difference between using AI at design time versus run time](https://chrislema.com/there-are-two-kinds-of-ai-work-what-if-youre-missing-one). This skill is that idea taken all the way.

The skill doesn't build a quiz. It builds a single file — a spec — that *describes* the quiz completely: every question, every way to score it, every possible result, and the exact words each result says. Then a plain, boring piece of code reads that spec and renders the actual assessment. No AI runs when your customer takes the quiz. The intelligence was all spent up front, once, writing the spec.

That's the bet, and it's a good one. It means every word a customer will ever see was written and approved in advance. Nothing gets improvised in front of them. And because the working quiz, the design document, and the safety checker are all compiled from that *same* spec, they can't drift apart. Fix the spec, everything regenerates in lockstep.
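To make the spec-versus-quiz split concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: the field names (`questions`, `results`, `rung`), the placeholder result text, and the `score` and `render_result` functions are mine, not the skill's actual format. The point is only the shape: plain data in, a lookup out, no model call anywhere.

```python
# Hypothetical spec: everything a customer can ever see lives in plain data.
SPEC = {
    "questions": [
        {
            "id": "q1",
            "prompt": "What happens after a bad first answer?",
            "options": [
                {"text": "We decide AI can't do that kind of task.", "rung": 0},
                {"text": "We add context and iterate until it lands.", "rung": 1},
            ],
        },
    ],
    # One pre-written page per possible result key (placeholder text).
    "results": {
        "low": "You are early. Here is where to point first.",
        "high": "You are iterating well. Here is what becomes possible.",
    },
}


def score(spec: dict, answers: dict) -> str:
    """Map chosen option indexes to a result key. No AI, just arithmetic."""
    total = sum(
        q["options"][answers[q["id"]]]["rung"] for q in spec["questions"]
    )
    return "high" if total >= 1 else "low"


def render_result(spec: dict, answers: dict) -> str:
    """Look up the pre-approved page for the scored result."""
    return spec["results"][score(spec, answers)]


print(render_result(SPEC, {"q1": 1}))
```

Because the renderer only ever returns strings that already sit in `SPEC["results"]`, there's no path by which a customer sees unapproved words.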

So the eight steps are really eight passes that fill in that spec, one section at a time:

**Positioning → Case intake → Model induction → Outcome-space → Instrument → Scoring → Content → Assembly.**

Here's what each one actually does.

## Step 0 — Positioning: cut the map where the money is

Before modeling anything, the skill asks a commercial question, not a clever one: *what is this assessment for?* Who's the customer, what do they call their problem before they meet you, what do you sell, and — the single most useful answer — what transition are you actually paid to move people through?

That last one matters because the whole map gets cut to fit it. A model that's academically accurate but whose boundaries don't line up with where you add value is a beautiful diagnostic that leads nowhere. Accuracy on the boundaries, but the boundaries follow the offer.

This step also captures your voice and a list of banned phrases — the hype words, the competitor names, the tells you never want reaching a customer. That list gets enforced later, automatically. (More on that when we get to the gates.)

## Step 1 — Case intake: collect stories, not theories

This is the step everyone wants to skip, and skipping it is why most assessments are mush.

The skill interviews you about *real, specific clients* — one at a time, past tense. What did you notice first? What was actually going on underneath? What would a novice have thought was wrong? What happened to them?

And here's the rule that makes it work: during intake, the skill is *forbidden* from naming a stage, a category, or a pattern. No theorizing. Just stories. Because the moment you ask an expert for "the three phases clients go through," you get the tidy, rehearsed, slightly-wrong version. Ask for Sarah, and Tuesday, and the thing that surprised you — and you get the truth. The clustering into a model comes later, on purpose, in a separate pass. Mixing the two is exactly when an AI starts leading the witness.

## Step 2 — Model induction: hand them a wrong draft on purpose

Now the skill turns those stories into a map. And it uses a counterintuitive move to do it: it shows you a draft that's *deliberately wrong* — right vocabulary, wrong joints; right stages, wrong order — and lets you correct it.

Why? Because experts can't reliably *generate* their own tacit model, but they recognize a wrong one instantly. "No, you've got that backwards, it's actually—" is the sound of someone reaching past their rehearsed theory into what they really know. Correction is a higher-bandwidth channel than generation. You don't have to be a world-builder. You just have to be a good critic of a bad draft.

Out of that come two things. A **lifecycle**: usually five stages, each named so a leader will say it out loud. For the AI brief, those became *Sidelined → Scattered → Pockets → Coordinated → Embedded.* And a **stuck map**: the specific places people get trapped at each boundary, each one tagged with the *locus* of the problem. Is it a capability hole (you can't do the next thing yet), a currency problem (your old skill stopped keeping up), a value collapse (the work still works but the market moved), or courage (you know the pivot and won't make it)? Those are ranked by depth, because the deepest live one is the one worth naming.
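Here's a rough sketch of "the deepest live one" as code. Treat the depth order as my assumption: I've ranked the four loci in the order they're listed above, and nothing here confirms that's the ranking the skill uses. The `StuckPoint` type and `deepest_live` function are hypothetical names too.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed depth ranking, shallowest to deepest. The four loci come from the
# text; treating the listed order as the ranking is an assumption.
LOCUS_DEPTH = {"capability": 0, "currency": 1, "value": 2, "courage": 3}


@dataclass(frozen=True)
class StuckPoint:
    boundary: str  # e.g. "Scattered -> Pockets"
    locus: str     # one of the LOCUS_DEPTH keys
    live: bool     # whether the respondent's answers show this pattern


def deepest_live(points: List[StuckPoint]) -> Optional[StuckPoint]:
    """Return the live stuck point with the deepest locus, or None."""
    live = [p for p in points if p.live]
    return max(live, key=lambda p: LOCUS_DEPTH[p.locus], default=None)
```

Stuck points that aren't live are ignored no matter how deep they are, so the report only names a problem the answers actually show.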

## Step 3 — Outcome-space: choose the gap

This is the most consequential step in the whole build, because it decides what the report is allowed to honestly say.

Everything turns on the *gap* — the distance between where you are and a benchmark you don't control. The skill picks which kind of gap to lead with. A **normative** gap (where your tenure says you should be versus where you actually operate — the cleanest, because half of it isn't a self-assessment at all). A **predictive** gap (where a model said you'd struggle versus where you do). Or a **correlational** gap (where two of your own answers should track together but don't).

The AI Adoption Mirror runs two at once. A correlational gap between *hands-on skill* and *organizational spread* — when one races ahead of the other, that imbalance is the finding. And a normative gap between how much a company has *invested* and how it actually *works* day to day. That second one is the "we've paid for this for a year, why are we still stuck?" pain, computed instead of asserted.
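"Computed instead of asserted" is easier to see in code. This is a minimal sketch, assuming each of the four dimensions has already been scored on a 0.0 to 1.0 scale. The `Scores` type, the function names, and the 0.2 threshold are all hypothetical; the real skill may weigh these very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    skill: float     # hands-on skill, 0.0 to 1.0 (assumed scale)
    spread: float    # organizational spread
    invested: float  # how much the company has invested
    practice: float  # how the company actually works day to day


def correlational_gap(s: Scores) -> float:
    """Positive when skill races ahead of spread, negative when spread leads."""
    return s.skill - s.spread


def normative_gap(s: Scores) -> float:
    """Positive when investment outruns day-to-day practice."""
    return s.invested - s.practice


def lead_gap(s: Scores, threshold: float = 0.2) -> str:
    """Pick which gap the report leads with. The threshold is an assumption."""
    corr, norm = correlational_gap(s), normative_gap(s)
    if max(abs(corr), abs(norm)) < threshold:
        return "none"  # no gap worth naming
    return "correlational" if abs(corr) >= abs(norm) else "normative"
```

Returning `"none"` when both gaps are small leaves room for a report that doesn't manufacture a problem.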

This step also sets the *resolution* — how finely the report forks. The discipline here is to only split where a reader would actually feel the difference, and to never let the picture be more precise than the words.

## Step 4 — Instrument: ask the symptom, hide the rung

Now the questions — ten to fifteen of them. Short enough that people finish, long enough that the result feels earned.

Two rules do most of the work. First: **never ask people to rate the thing you're measuring.** They can't see it — that's why they need you. So you ask about observable symptoms and let the scoring infer the cause. Here's a real one from the AI Mirror, on what happens after a bad first answer:

*We decide that kind of task just isn't something AI can do. / We hunt for the 'right' wording, as if there's a magic phrase to find. / Someone who's good at it rewrites and tries again. / We add context and examples and iterate until it lands. / We treat it as a back-and-forth by default, and most people know how.*

Nobody's rating their "AI maturity." They're picking what actually happens. But that ladder runs cleanly from worst to best, and the scoring reads it.

Second: **kill the obvious good answer.** If one option visibly screams "pick me," everyone picks it and your signal collapses. So the options are shuffled per session, and they're written so each one feels like a legitimate way to be — no trophy answer sitting in the same spot every time.
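Shuffling the display order without losing the ladder is a small trick worth sketching. In the version below, the option text is the real ladder quoted above; the function names and the idea of seeding from a session ID plus question ID are my assumptions about how it could be done, not a description of the skill's code.

```python
import random

# Options listed worst to best; the list index is the hidden rung.
LADDER = [
    "We decide that kind of task just isn't something AI can do.",
    "We hunt for the 'right' wording, as if there's a magic phrase to find.",
    "Someone who's good at it rewrites and tries again.",
    "We add context and examples and iterate until it lands.",
    "We treat it as a back-and-forth by default, and most people know how.",
]


def shuffled_options(session_id: str, question_id: str) -> list:
    """Return (rung, text) pairs in a display order that is stable per session."""
    rng = random.Random(f"{session_id}:{question_id}")  # seeded, so repeatable
    options = list(enumerate(LADDER))
    rng.shuffle(options)
    return options


def rung_for_choice(session_id: str, question_id: str, position: int) -> int:
    """Translate the position the respondent clicked back to its rung."""
    return shuffled_options(session_id, question_id)[position][0]
```

The respondent sees a different order per session, but the rung travels with the text, so the scoring still reads worst to best.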

## Step 5 — Scoring: every answer lands somewhere real

Scoring's whole job is to be *total* — every possible combination of answers has to resolve to a valid, fully-specified result. No pattern falls through the cracks. The skill turns the answers into a precise coordinate (in the AI Mirror's case, which of the four quadrants, plus how deep into it, plus where the commitment gap sits) and computes the gaps along the way.

Then it validates itself two ways. It enumerates the *entire* space of possible results to catch any dead cell nobody can reach or any result that's reachable but never written. And it runs an **exemplar test**: you feed it a few cases whose result you already know ("this company is obviously a Stocked Toolbox") and confirm the math lands them where you'd put them. That single test catches scoring bugs nothing else will.
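Both checks are mechanical enough to sketch. What follows is a toy, not the skill's scorer: four questions scored 0 to 2, a made-up rule that splits them into skill and spread, and placeholder quadrant keys. The function names are hypothetical.

```python
from itertools import product


def score(answers: tuple) -> str:
    """Toy scorer: first two answers measure skill, last two measure spread."""
    skill = answers[0] + answers[1]
    spread = answers[2] + answers[3]
    skill_side = "high-skill" if skill >= 2 else "low-skill"
    spread_side = "high-spread" if spread >= 2 else "low-spread"
    return f"{skill_side}/{spread_side}"


# Placeholder result keys that have a written page.
WRITTEN_PAGES = {
    "high-skill/high-spread",
    "high-skill/low-spread",
    "low-skill/high-spread",
    "low-skill/low-spread",
}


def check_totality(options_per_question: int, question_count: int) -> dict:
    """Walk every answer combination and compare reachable results to pages."""
    reachable = {
        score(combo)
        for combo in product(range(options_per_question), repeat=question_count)
    }
    return {
        "unwritten": reachable - WRITTEN_PAGES,  # reachable but never written
        "dead": WRITTEN_PAGES - reachable,       # written but unreachable
    }


def check_exemplars(exemplars: dict) -> list:
    """Return (answers, expected, actual) for every exemplar that lands wrong."""
    return [
        (answers, want, score(answers))
        for answers, want in exemplars.items()
        if score(answers) != want
    ]
```

One caveat on scale: brute force works for this toy, but fifteen questions with five options each is 5^15, roughly 30.5 billion combinations. A real build would presumably enumerate the reduced coordinate space rather than raw answers.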

## Step 6 — Content: write every page before anyone arrives

This is where the report gets written — all of it, for every possible result, in advance.

It's built in layers. A fixed *skeleton* carries the flow and never changes — why this exists, the live problem, what becomes possible, where to point first, the close. Then *atoms* drop into the slots: the right opener for your quadrant, the body that diagnoses your specific imbalance, a one-line lead tuned to how far into the result you are. There are even two or three phrasings of each piece, picked by a stable per-session hash, so two companies with the identical result still read a little differently. The one-of-one feel, with zero live generation.
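Here's one way a stable per-session pick could work, as a sketch. The variant text is placeholder, and `pick_variant` is a hypothetical name. I've used `hashlib` rather than Python's built-in `hash()` because the built-in is salted per process for strings, which would break the "stable" part.

```python
import hashlib

# Hypothetical atoms: two or three approved phrasings for one slot.
OPENERS = [
    "Variant A of the opener.",
    "Variant B of the opener.",
    "Variant C of the opener.",
]


def pick_variant(session_id: str, slot: str, variants: list) -> str:
    """Choose one pre-written phrasing, stable for a given session and slot."""
    digest = hashlib.sha256(f"{session_id}:{slot}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]
```

The same session always gets the same phrasing for a slot, so a reloaded report doesn't change under the reader, while different sessions spread across the approved variants.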

And three rules guard the writing. **Claim only what you measured** — no slipping in "you used to be better at this" when nothing in the quiz measured history. **Stay in your voice** — those banned phrases from Step 0 get linted out of every variant. And the **honesty rule**: not every path is allowed to conclude "you need to hire me." A company that's genuinely doing well gets told so. A company at the very start gets validated before it gets a gap, because naming a gap that isn't there just demoralizes.

## Step 7 — Assembly: the checker that refuses to ship

Here's my favorite part, and it's the one that connects to everything I've written about [evals](https://chrislema.com/4-part-loop-eliminates-ai-slop) and [never letting AI grade its own work](https://chrislema.com/why-i-never-let-ai-grade-its-own-work).

When the skill compiles the spec into the finished assessment, it doesn't just render it. It runs a standing checker first — a set of mechanical gates — and it *refuses to emit anything* if a blocking gate fails. The build aborts. Nothing ships.

What does it check? That **every** reachable result has a written page, and every written page is reachable (no orphans either direction). That every result actually leans on a measured gap, and none is faking a benchmark it never collected. That no claim runs past the evidence — there's even a lint that hunts for quiet "you've drifted from…" history the quiz never measured. That no banned phrase slipped through. That every question feeds something. And that no path is rigged to force the sale.
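A couple of those gates are simple enough to sketch. This covers only the orphan check in both directions and the banned-phrase lint; the `Spec` fields, `BuildAborted`, `run_gates`, and `compile_spec` are hypothetical names, and the real checker clearly does more than this.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Spec:
    reachable_results: set
    pages: dict  # result key -> written page text
    banned_phrases: list = field(default_factory=list)


class BuildAborted(Exception):
    """Raised when a blocking gate fails, so nothing is emitted."""


def run_gates(spec: Spec) -> list:
    """Return a list of failure messages; an empty list means all gates pass."""
    failures = []
    written = set(spec.pages)
    for key in sorted(spec.reachable_results - written):
        failures.append(f"reachable result has no page: {key}")
    for key in sorted(written - spec.reachable_results):
        failures.append(f"page is unreachable: {key}")
    for key, text in sorted(spec.pages.items()):
        for phrase in spec.banned_phrases:
            if re.search(re.escape(phrase), text, flags=re.IGNORECASE):
                failures.append(f"banned phrase {phrase!r} in page: {key}")
    return failures


def compile_spec(spec: Spec) -> dict:
    """Emit the pages only if every gate passes; otherwise abort the build."""
    failures = run_gates(spec)
    if failures:
        raise BuildAborted("; ".join(failures))
    return dict(spec.pages)
```

The design choice that matters is that `compile_spec` has no "warn and continue" branch: a failure raises, and the caller gets nothing to ship.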

This is [the harness](https://chrislema.com/the-harness-is-the-craft). The AI does the writing; a separate, dumb, unforgiving checker decides whether the writing is allowed out the door. The author never grades its own work. That's not a nice-to-have bolted on at the end — it's wired onto the seams between every step, and it's the difference between a system and a person who happens to be fast.

## The through-line

Step back and the shape is simple. Every report turns on a gap between your answer and a benchmark you don't control. Behavior, not self-rating, is what makes that gap both real and honest. The map, the questions, the scoring, and every word of the result are authored once, up front, into a single spec. A boring piece of code serves it instantly from the edge. And a checker on the seams refuses to ship anything broken.

The customer meets a report that locates them, names the gap that moves them, speaks to their exact situation, and rarely repeats — and every word of it was written, approved, and checked before they ever arrived.
