June 7, 2026
Inside the Skill: The Eight Steps That Build an Assessment
I open the hood on the Claude Skill that builds my assessments — the eight-step pipeline, the single spec it compiles, and the checker that won't ship broken.
Last time I told you about the round trip — a one-page brief in, a working lead-gen quiz out. Today I want to open the hood.
Because the magic isn't that an AI wrote some quiz questions. The magic is the system underneath: eight steps that run in order, each one handing the next a little more structure, until what comes out the other end is a finished diagnostic that a checker has already refused to ship if anything's broken.
Let me walk you through it. But first, the one idea that makes the whole thing make sense.
The trick: build a spec, not a quiz
I've written before about the difference between using AI at design time versus run time. This skill is that idea taken all the way.
The skill doesn't build a quiz. It builds a single file — a spec — that describes the quiz completely: every question, every way to score it, every possible result, and the exact words each result says. Then a plain, boring piece of code reads that spec and renders the actual assessment. No AI runs when your customer takes the quiz. The intelligence was all spent up front, once, writing the spec.
That's the bet, and it's a good one. It means every word a customer will ever see was written and approved in advance. Nothing gets improvised in front of them. And because the working quiz, the design document, and the safety checker are all compiled from that same spec, they can't drift apart. Fix the spec, everything regenerates in lockstep.
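To make the spec-versus-quiz split concrete, here is a minimal sketch in Python. It is not the skill's actual file format. The names (`Spec`, `Question`, `result_key`, `render`) and the one-axis scoring rule are assumptions made up for illustration. The point it shows is that the runtime path is a score and a lookup, with no model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    qid: str
    prompt: str
    options: tuple[str, ...]  # ordered worst to best; the index is the score


@dataclass(frozen=True)
class Spec:
    questions: tuple[Question, ...]
    results: dict[str, str]  # result key -> report text written in advance


def result_key(spec: Spec, answers: dict[str, int]) -> str:
    """Hypothetical scoring rule: 'high' if the total reaches half the maximum."""
    total = sum(answers[q.qid] for q in spec.questions)
    maximum = sum(len(q.options) - 1 for q in spec.questions)
    return "high" if total * 2 >= maximum else "low"


def render(spec: Spec, answers: dict[str, int]) -> str:
    """The runtime path: score, then look up text. No model call anywhere."""
    return spec.results[result_key(spec, answers)]


DEMO_SPEC = Spec(
    questions=(
        Question(
            "q1",
            "What happens after a bad first answer?",
            ("We give up.", "Someone rewrites it.", "We iterate."),
        ),
    ),
    results={
        "low": "Pre-written page for the low result.",
        "high": "Pre-written page for the high result.",
    },
)
```

Calling `render(DEMO_SPEC, {"q1": 0})` returns the pre-written low page. Every string a customer could see already sits in `results` before anyone takes the quiz.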
So the eight steps are really eight passes that fill in that spec, one section at a time:
Positioning → Case intake → Model induction → Outcome-space → Instrument → Scoring → Content → Assembly.
Here's what each one actually does.
Step 0 — Positioning: cut the map where the money is
Before modeling anything, the skill asks a commercial question, not a clever one: what is this assessment for? Who's the customer, what do they call their problem before they meet you, what do you sell, and — the single most useful answer — what transition are you actually paid to move people through?
That last one matters because the whole map gets cut to fit it. A model that's academically accurate but whose boundaries don't line up with where you add value is a beautiful diagnostic that leads nowhere. Accuracy on the boundaries, but the boundaries follow the offer.
This step also captures your voice and a list of banned phrases — the hype words, the competitor names, the tells you never want reaching a customer. That list gets enforced later, automatically. (More on that when we get to the gates.)
Step 1 — Case intake: collect stories, not theories
This is the step everyone wants to skip, and skipping it is why most assessments are mush.
The skill interviews you about real, specific clients — one at a time, past tense. What did you notice first? What was actually going on underneath? What would a novice have thought was wrong? What happened to them?
And here's the rule that makes it work: during intake, the skill is forbidden from naming a stage, a category, or a pattern. No theorizing. Just stories. Because the moment you ask an expert for "the three phases clients go through," you get the tidy, rehearsed, slightly-wrong version. Ask for Sarah, and Tuesday, and the thing that surprised you — and you get the truth. The clustering into a model comes later, on purpose, in a separate pass. Mixing the two is exactly when an AI starts leading the witness.
Step 2 — Model induction: hand them a wrong draft on purpose
Now the skill turns those stories into a map. And it uses a counterintuitive move to do it: it shows you a draft that's deliberately wrong — right vocabulary, wrong joints; right stages, wrong order — and lets you correct it.
Why? Because experts can't reliably generate their own tacit model, but they recognize a wrong one instantly. "No, you've got that backwards, it's actually—" is the sound of someone reaching past their rehearsed theory into what they really know. Correction is a higher-bandwidth channel than generation. You don't have to be a world-builder. You just have to be a good critic of a bad draft.
Out of that come two things. The first is a lifecycle: usually five stages, each named so a leader will say it out loud. For the AI brief, those became Sidelined → Scattered → Pockets → Coordinated → Embedded. The second is a stuck map: the specific places people get trapped at each boundary, each one tagged with the locus of the problem. Is it a capability hole (you can't do the next thing yet), a currency problem (your old skill stopped keeping up), a value collapse (the work still works but the market moved), or courage (you know the pivot and won't make it)? Those are ranked by depth, because the deepest live one is the one worth naming.
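The "deepest live one" rule can be sketched in a few lines of Python. Treat this as an illustration only: the post names the four loci but does not say how they rank, so the `DEPTH` ordering below is an assumption, and `StuckPoint` and `deepest_live` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering, shallowest to deepest. The four loci come from the text;
# which one counts as deepest is NOT stated there and is a guess here.
DEPTH = {"capability": 0, "currency": 1, "value": 2, "courage": 3}


@dataclass(frozen=True)
class StuckPoint:
    boundary: str  # e.g. "Scattered -> Pockets"
    locus: str     # one of the DEPTH keys
    live: bool     # did the answers show this one is active?


def deepest_live(points: list[StuckPoint]) -> Optional[StuckPoint]:
    """Return the live stuck point with the greatest depth, or None."""
    live = [p for p in points if p.live]
    return max(live, key=lambda p: DEPTH[p.locus], default=None)


DEMO_POINTS = [
    StuckPoint("Scattered -> Pockets", "capability", live=True),
    StuckPoint("Pockets -> Coordinated", "courage", live=False),
    StuckPoint("Pockets -> Coordinated", "currency", live=True),
]
```

With `DEMO_POINTS`, the courage entry is deeper but not live, so the currency problem is the one that gets named.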
Step 3 — Outcome-space: choose the gap
This is the most consequential step in the whole build, because it decides what the report is allowed to honestly say.
Everything turns on the gap — the distance between where you are and a benchmark you don't control. The skill picks which kind of gap to lead with. A normative gap (where your tenure says you should be versus where you actually operate — the cleanest, because half of it isn't a self-assessment at all). A predictive gap (where a model said you'd struggle versus where you do). Or a correlational gap (where two of your own answers should track together but don't).
The AI Adoption Mirror runs two at once. A correlational gap between hands-on skill and organizational spread — when one races ahead of the other, that imbalance is the finding. And a normative gap between how much a company has invested and how it actually works day to day. That second one is the "we've paid for this for a year, why are we still stuck?" pain, computed instead of asserted.
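Here is a rough sketch of what "computed instead of asserted" could look like. It assumes each axis has already been scaled to the range 0 to 1. The function names and that scaling are assumptions for illustration, not the Mirror's real arithmetic.

```python
def correlational_gap(skill: float, spread: float) -> float:
    """Signed imbalance between two of the respondent's own axes (each 0..1).

    Positive means hands-on skill is ahead of organizational spread;
    negative means spread is ahead of skill.
    """
    return skill - spread


def normative_gap(investment: float, operating: float) -> float:
    """How far day-to-day practice trails what the investment level implies.

    Clamped at zero: operating ahead of investment is not reported as a gap.
    """
    return max(0.0, investment - operating)
```

For example, `correlational_gap(0.75, 0.25)` gives `0.5`, a team whose skill has outrun its spread. `normative_gap(0.5, 0.75)` gives `0.0`, because practice is already ahead of spend.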
This step also sets the resolution — how finely the report forks. The discipline here is to only split where a reader would actually feel the difference, and to never let the picture be more precise than the words.
Step 4 — Instrument: ask the symptom, hide the rung
Now the questions — ten to fifteen of them. Short enough that people finish, long enough that the result feels earned.
Two rules do most of the work. First: never ask people to rate the thing you're measuring. They can't see it — that's why they need you. So you ask about observable symptoms and let the scoring infer the cause. Here's a real one from the AI Mirror, on what happens after a bad first answer:
We decide that kind of task just isn't something AI can do. / We hunt for the 'right' wording, as if there's a magic phrase to find. / Someone who's good at it rewrites and tries again. / We add context and examples and iterate until it lands. / We treat it as a back-and-forth by default, and most people know how.
Nobody's rating their "AI maturity." They're picking what actually happens. But that ladder runs cleanly from worst to best, and the scoring reads it.
Second: kill the obvious good answer. If one option visibly screams "pick me," everyone picks it and your signal collapses. So the options are shuffled per session, and they're written so each one feels like a legitimate way to be — no trophy answer sitting in the same spot every time.
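Both rules fit in one small sketch: the ladder is stored worst to best, the display order is shuffled per session, and the hidden rung travels with each option so scoring still works. The function name and the seeding scheme are assumptions. The ladder text is the example quoted above.

```python
import hashlib
import random

# Stored worst to best. The list index is the rung the scorer reads.
LADDER = [
    "We decide that kind of task just isn't something AI can do.",
    "We hunt for the 'right' wording, as if there's a magic phrase to find.",
    "Someone who's good at it rewrites and tries again.",
    "We add context and examples and iterate until it lands.",
    "We treat it as a back-and-forth by default, and most people know how.",
]


def shuffled_options(session_id: str, question_id: str, ladder: list[str]) -> list[tuple[int, str]]:
    """Return (rung, text) pairs in a display order that is stable per session."""
    digest = hashlib.sha256(f"{session_id}:{question_id}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    order = list(range(len(ladder)))
    random.Random(seed).shuffle(order)  # local generator; global state untouched
    return [(rung, ladder[rung]) for rung in order]
```

The page shows only the text. When the respondent clicks an option, the rung that came with it is what gets scored, so the shuffle never touches the math.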
Step 5 — Scoring: every answer lands somewhere real
Scoring's whole job is to be total — every possible combination of answers has to resolve to a valid, fully-specified result. No pattern falls through the cracks. The skill turns the answers into a precise coordinate (in the AI Mirror's case, which of the four quadrants, plus how deep into it, plus where the commitment gap sits) and computes the gaps along the way.
Then it validates itself two ways. It enumerates the entire space of possible results to catch any dead cell nobody can reach or any result that's reachable but never written. And it runs an exemplar test: you feed it a few people whose result you already know ("this company is obviously a Stocked Toolbox") and confirm the math lands them where you'd put them. That single test catches scoring bugs nothing else will.
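Both checks are small enough to sketch. The toy scorer below is an assumption (four questions, three rungs each, two axes, placeholder result keys like `hi-lo`). What carries over is the pattern: enumerate every answer combination, compare the reachable results with the written pages, then pin a few known cases.

```python
from itertools import product

OPTIONS_PER_QUESTION = 3  # assumed; rungs scored 0, 1, 2
PAGES = {"lo-lo", "lo-hi", "hi-lo", "hi-hi"}  # results that have written copy


def score(answers: tuple[int, ...]) -> str:
    """Hypothetical scorer: questions 0-1 feed axis A, questions 2-3 feed axis B."""
    a = "hi" if answers[0] + answers[1] >= 2 else "lo"
    b = "hi" if answers[2] + answers[3] >= 2 else "lo"
    return f"{a}-{b}"


def audit() -> tuple[set[str], set[str]]:
    """Return (reachable but unwritten, written but unreachable)."""
    combos = product(range(OPTIONS_PER_QUESTION), repeat=4)
    reachable = {score(combo) for combo in combos}
    return reachable - PAGES, PAGES - reachable


# Exemplars: answer patterns whose correct result is known in advance.
EXEMPLARS = {(2, 2, 0, 0): "hi-lo", (0, 1, 2, 2): "lo-hi"}


def exemplars_pass() -> bool:
    return all(score(answers) == expected for answers, expected in EXEMPLARS.items())
```

If `audit()` returns two empty sets and `exemplars_pass()` is true, the scoring is total over this toy space and agrees with the known cases.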
Step 6 — Content: write every page before anyone arrives
This is where the report gets written — all of it, for every possible result, in advance.
It's built in layers. A fixed skeleton carries the flow and never changes — why this exists, the live problem, what becomes possible, where to point first, the close. Then atoms drop into the slots: the right opener for your quadrant, the body that diagnoses your specific imbalance, a one-line lead tuned to how far into the result you are. There are even two or three phrasings of each piece, picked by a stable per-session hash, so two companies with the identical result still read a little differently. The one-of-one feel, with zero live generation.
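The "stable per-session hash" is easy to sketch. The names here are assumptions. The one real constraint is to use a fixed hash such as SHA-256 rather than Python's built-in `hash()`, which is salted per process for strings and would pick a different variant after a restart.

```python
import hashlib


def pick_variant(session_id: str, slot: str, variants: list[str]) -> str:
    """Pick one pre-written phrasing, stable for a given session and slot."""
    digest = hashlib.sha256(f"{session_id}:{slot}".encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(variants)
    return variants[index]


# Placeholder copy; real variants would be the approved phrasings.
OPENERS = [
    "Opener, phrasing one.",
    "Opener, phrasing two.",
    "Opener, phrasing three.",
]
```

Mixing the slot name into the hash means one session doesn't land on the same variant index in every slot. Reloading the report still shows the same words.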
And three rules guard the writing. Claim only what you measured — no slipping in "you used to be better at this" when nothing in the quiz measured history. Stay in your voice — those banned phrases from Step 0 get linted out of every variant. And the honesty rule: not every path is allowed to conclude "you need to hire me." A company that's genuinely doing well gets told so. A company at the very start gets validated before it gets a gap, because naming a gap that isn't there just demoralizes.
Step 7 — Assembly: the checker that refuses to ship
Here's my favorite part, and it's the one that connects to everything I've written about evals and never letting AI grade its own work.
When the skill compiles the spec into the finished assessment, it doesn't just render it. It runs a standing checker first — a set of mechanical gates — and it refuses to emit anything if a blocking gate fails. The build aborts. Nothing ships.
What does it check? That every reachable result has a written page, and every written page is reachable (no orphans in either direction). That every result actually leans on a measured gap, and none is faking a benchmark it never collected. That no claim runs past the evidence; there's even a lint that hunts for quiet "you've drifted from…" history the quiz never measured. That no banned phrase slipped through. That every question feeds something. And that no path is rigged to force the sale.
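A few of those gates are mechanical enough to sketch. This is not the skill's checker. The dictionary layout, `run_gates`, `compile_spec`, and `BuildAborted` are assumptions, and only four of the gates are shown: orphans in both directions, banned phrases, and questions that feed nothing.

```python
class BuildAborted(Exception):
    """Raised when a blocking gate fails; nothing is emitted."""


def run_gates(spec: dict) -> list[str]:
    """Return a list of failure messages; an empty list means all gates pass."""
    failures = []
    reachable = set(spec["reachable"])
    written = set(spec["pages"])
    for key in sorted(reachable - written):
        failures.append(f"reachable result has no page: {key}")
    for key in sorted(written - reachable):
        failures.append(f"page can never be reached: {key}")
    for key, text in spec["pages"].items():
        for phrase in spec["banned"]:
            if phrase.lower() in text.lower():
                failures.append(f"banned phrase {phrase!r} in page {key}")
    for qid in sorted(set(spec["questions"]) - set(spec["scored"])):
        failures.append(f"question feeds nothing: {qid}")
    return failures


def compile_spec(spec: dict) -> str:
    """Emit the assembled pages only if every gate passes."""
    failures = run_gates(spec)
    if failures:
        raise BuildAborted("; ".join(failures))
    return "\n\n".join(spec["pages"][key] for key in sorted(spec["pages"]))


GOOD = {
    "reachable": ["a", "b"],
    "pages": {"a": "Page for result a.", "b": "Page for result b."},
    "banned": ["game-changing"],
    "questions": ["q1", "q2"],
    "scored": ["q1", "q2"],
}
```

The design choice worth copying is that `compile_spec` has no partial-success path. It either returns the full output or raises, so a broken build can't leak a half-checked assessment.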
This is the harness. The AI does the writing; a separate, dumb, unforgiving checker decides whether the writing is allowed out the door. The author never grades its own work. That's not a nice-to-have bolted on at the end — it's wired onto the seams between every step, and it's the difference between a system and a person who happens to be fast.
The through-line
Step back and the shape is simple. Every report turns on a gap between your answer and a benchmark you don't control. Behavior, not self-rating, is what makes that gap both real and honest. The map, the questions, the scoring, and every word of the result are authored once, up front, into a single spec. A boring piece of code serves it instantly from the edge. And a checker on the seams refuses to ship anything broken.
The customer meets a report that locates them, names the gap that moves them, speaks to their exact situation, and rarely repeats — and every word of it was written, approved, and checked before they ever arrived.
About the Author
Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.