# How Do You Test Output That Changes Every Time You Run It?
*Published: 2026-06-29*
*Tags: ai, claude-code, vibe-coding, for-engineers*
*Source: https://chrislema.com/how-to-test-ai-generated-code*

---

If you've been managing developers for any length of time, you've likely (eventually) landed on this truth: "I don't want to get into a fight over HOW they do it. I just need it to work."

That saves you a lot of grief.

But now, in the world of AI, with Claude Code, Codex, Factory, or any number of other harnesses, and a pile of models (GPT 5.5, GLM 5.2, Opus 4.8), we're seeing tons of code generated. And the thing is, I see people trying to test that code as if something deterministic had written it.

But the thing writing it is more like a somewhat genius / lazy / absent-minded expert / newbie developer. Crazy definition, right?

So how do you think about testing in this new world?

Ask Claude Code to build a Stripe integration, a payment link plus the webhook handler that gives someone access once their payment clears. You'll get working code. Ask again next week, or let it run twice in a loop, or have a teammate run the same prompt, and you'll get different working code. One version reaches for raw `fetch`. The next pulls in the `stripe` SDK. A third splits the handler across two files, names the function something else, and handles errors its own way. If you run `git diff` between them, it's all noise. Everything changed, and nothing changed.

Go back to that manager's instinct. You don't get to argue over HOW, you just need it to work. So what do you actually check? You can't `assertEqual` the source, because there's no single right answer to compare it against. The thing that changes every run isn't the webhook data, it's the code itself. And generated code is a genuinely useful kind of unpredictable output: you can run it. That's what makes it testable, and it's what sets this apart from every other version of "the AI keeps giving me something different."

Think about that difference for a second. A summary that gets reworded every time only gives you words, and all you can really do is judge them. Generated code gives you something you can actually run and watch. So you stop checking the text it wrote, and you start checking what the code does, against a set of expectations that doesn't care how the code got built.

## You can't pin down the code, so pin down what you check

You're not going to make the generator predictable, and you wouldn't want to. Locking it to one frozen version throws away the whole reason you're using it. What you *can* lock down is what you check the code against.

So spell out the interface in your prompt and treat it as fixed. Not "build a Stripe webhook," but something specific: export `async function handleStripeWebhook(req, env)`; verify the signature; and when a `checkout.session.completed` event comes in marked paid, call `provisionAccess(email, plan)`. Let the inside vary however it wants. The entry point stays put. Now you've got a steady place to connect your tests, no matter which of a thousand versions came back.

That's really the whole move. We let the code itself change, we keep our list of expectations fixed, and everything from here checks against that fixed list. From now on I'll call that list the contract.

## Layer 1: The cheap checks, before you run anything

Start with the fast, exact checks, because a surprising amount of generated code falls over right here, and you don't want to spin up a whole test run to find that out.

Does it compile? Does `tsc` pass? Does the linter? Did it actually export the interface you asked for, with the right shape? These are plain old asserts against the code you got back, and they catch the versions that hallucinated an import or wandered off the spec before you waste time booting a container for them.
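The export-shape check can be done by importing the generated module and inspecting it, without calling anything. Here's a minimal sketch under stated assumptions: `checkContractShape` is a hypothetical helper (not part of any library), and it assumes the spec asked for a two-argument `async function` named `handleStripeWebhook`.

```javascript
// Hypothetical helper: returns a list of contract-shape problems (empty means OK).
// Assumes the spec asked for `export async function handleStripeWebhook(req, env)`.
function checkContractShape(mod) {
  const problems = [];
  const fn = mod.handleStripeWebhook;

  if (typeof fn !== 'function') {
    problems.push('handleStripeWebhook is not exported as a function');
    return problems;
  }
  // Native async functions report "AsyncFunction"; transpiled output may not,
  // so treat this as a hint rather than a hard rule if you compile down.
  if (fn.constructor.name !== 'AsyncFunction') {
    problems.push('handleStripeWebhook is not an async function');
  }
  // Function.length counts parameters before the first default or rest parameter.
  if (fn.length !== 2) {
    problems.push(`expected 2 parameters (req, env), found ${fn.length}`);
  }
  return problems;
}
```

In a pipeline you'd probably call it as `checkContractShape(await import('./generated/handler.js'))` and stop right there if the list isn't empty.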

Then check the things that simply have to be there in any acceptable answer, which you can confirm without running anything:

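> import assert from 'node:assert/strict';
> 
> const src = await readGenerated(); // your helper: returns the generated source as a string
> // Matches both constructEvent( and constructEventAsync(
> assert(/constructEvent(Async)?\(/.test(src), 'must verify the webhook signature');
> assert(/provisionAccess\(/.test(src), 'must call the provisioning boundary');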

Grep is rough; an AST check is cleaner. Either way, you're checking for something every correct version has in common, even when they have nothing else in common. Code that skips signature verification is wrong no matter how clean it looks, and you can catch that without running a single line.

## Layer 2: The behavioral checks, run it instead of reading it

Now the real work. Load the handler you got back through that fixed interface, throw known events at it, and check what it actually does. Your expectations stay the same every time. The code sitting underneath them is whatever Claude wrote on this run.

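> // Assumes Node's built-in test runner. signedRequest, forgedRequest, grantsFor,
> // paidSession, unpaidSession, customer, and env are your own test fixtures.
> import test from 'node:test';
> import assert from 'node:assert/strict';
> import { handleStripeWebhook } from './generated/handler.js';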
> 
> test('paid session provisions access', async () => {
>   const res = await handleStripeWebhook(signedRequest(paidSession), env);
>   assert.equal(res.status, 200);
>   assert.equal(await grantsFor(customer), 1);
> });
> 
> test('forged signature is rejected and provisions nothing', async () => {
>   const res = await handleStripeWebhook(forgedRequest(paidSession), env);
>   assert.equal(res.status, 400);
>   assert.equal(await grantsFor(customer), 0);
> });
> 
> test('unpaid session provisions nothing', async () => {
>   await handleStripeWebhook(signedRequest(unpaidSession), env);
>   assert.equal(await grantsFor(customer), 0);
> });

Any correct version passes this, whatever SDK it picked, however it laid out the files, whatever it named things. Any wrong version fails it, however convincing it looks. You've swapped "does the code match my saved example" for "does the code do what I need," and the second question was the only one that ever mattered.
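Those tests lean on `signedRequest` and `forgedRequest`, so here's one way they might look. This is a sketch, not the only way to build them: the secret value, the URL, and the helper `buildRequest` are made up for illustration. What isn't made up is the signing scheme: Stripe computes an HMAC-SHA256 over the timestamp, a dot, and the raw body, and sends it in the `Stripe-Signature` header.

```javascript
import { createHmac } from 'node:crypto';

// Assumption: a test-only signing secret that the handler also reads from env.
const TEST_WEBHOOK_SECRET = 'whsec_test_only';

// Stripe signs the string `${timestamp}.${rawBody}` with HMAC-SHA256 and sends
// the result in the Stripe-Signature header as `t=<timestamp>,v1=<hex digest>`.
function stripeSignatureHeader(rawBody, secret, timestamp) {
  const digest = createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');
  return `t=${timestamp},v1=${digest}`;
}

// Hypothetical helper: wraps an event in a POST request signed with `secret`.
function buildRequest(event, secret) {
  const rawBody = JSON.stringify(event);
  const timestamp = Math.floor(Date.now() / 1000);
  return new Request('https://example.test/webhooks/stripe', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'stripe-signature': stripeSignatureHeader(rawBody, secret, timestamp),
    },
    body: rawBody,
  });
}

// Correctly signed: a correct handler should accept it.
const signedRequest = (event) => buildRequest(event, TEST_WEBHOOK_SECRET);

// Signed with the wrong secret: a correct handler should reject it.
const forgedRequest = (event) => buildRequest(event, 'whsec_attacker_guess');
```

The handler has to find the same test secret in `env`. Which property it reads is something your spec should pin down, because that's part of the contract too.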

## Layer 3: Metamorphic checks, the relationships a generator has to hold

Two things turn a set of behavioral checks into a real test of the generator itself.

First, does it stay consistent when you regenerate? Run the same spec ten times, and all ten versions should pass your checks. That's not telling you about one piece of code, it's telling you about the generator. You're confirming your spec is tight enough that getting it right isn't just luck.

Second, does it hold up when you reword the prompt? Ask for "handle the Stripe webhook" one time and "process the payment callback" the next, and both should still pass. If changing a few words tanks your pass rate, your prompt was leaning on specific phrasing instead of actual meaning. It's the same kind of check you'd run on a classifier, just aimed at your own prompt this time.
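Both of those checks come down to the same loop: generate, run the checks, count. Here's a rough sketch of how you might compare phrasings side by side. `generateAndCheck` is a stand-in I'm assuming for whatever generates code from a prompt and runs your checks against it; it's passed in so the sketch doesn't care which generator you use.

```javascript
// Hypothetical harness: `generateAndCheck(prompt)` stands in for whatever
// generates code from a prompt, runs the checks, and resolves to
// { allPassed: boolean }. It is injected so the sketch stays generator-agnostic.
async function passRateByPhrasing(prompts, runsPerPrompt, generateAndCheck) {
  const report = [];
  for (const prompt of prompts) {
    let passed = 0;
    for (let i = 0; i < runsPerPrompt; i += 1) {
      const result = await generateAndCheck(prompt);
      if (result.allPassed) passed += 1;
    }
    report.push({ prompt, passRate: passed / runsPerPrompt });
  }
  // The gap between the best and worst phrasing is the number to watch.
  const rates = report.map((r) => r.passRate);
  const spread = rates.length > 0 ? Math.max(...rates) - Math.min(...rates) : 0;
  return { report, spread };
}
```

A small spread says your spec is carrying the meaning. A big one says a particular wording was doing the work.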

And this is where those checks really earn their keep, because the relationships I'm describing are exactly the requirements AI tends to quietly skip. Stripe can deliver the same event more than once, so duplicates will show up. Does the code dedupe on `event.id`? Stripe doesn't promise events arrive in order, so if `payment_intent.succeeded` shows up before `checkout.session.completed`, does the handler still end up in the right place? Most generated versions skip both unless your checks look for them:

> test('duplicate delivery provisions once', async () => {
>   await handleStripeWebhook(signedRequest(paidSession), env);
>   await handleStripeWebhook(signedRequest(paidSession), env);   // same event id
>   assert.equal(await grantsFor(customer), 1);
> });

When that fails across most of your versions, you haven't found a flaky test. You've found out that "build a Stripe webhook" didn't tell the model you needed it to handle duplicates. So you add it to the spec, and you keep the test that proves it stuck.
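For reference, this is roughly the behavior the duplicate-delivery test is looking for. It's a minimal sketch with made-up names (`processedEventIds`, `provisionOnce`), and it uses an in-memory `Set` where real code would need something durable.

```javascript
// In-memory stand-in for a durable store (a database table or key-value store
// in real use). Assumption: remembering processed event ids is enough to dedupe.
const processedEventIds = new Set();

// Hypothetical wrapper: runs `provision` at most once per Stripe event id.
async function provisionOnce(event, provision) {
  if (processedEventIds.has(event.id)) {
    return { duplicate: true };
  }
  await provision(event);
  // Record the id only after provisioning succeeds, so a failed attempt can be
  // retried when Stripe redelivers the event.
  processedEventIds.add(event.id);
  return { duplicate: false };
}
```

A production version would also need the check and the write to be atomic, because two deliveries arriving at the same moment can both get past the `has` check here.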

## Layer 4: The pass rate is your real quality number

This is the layer that matters most, and the one a single run can never give you. Generate the code twenty times. Run each version through the full set of checks. The share that passes is your real quality number for that generator.

> const results = await Promise.all(
>   Array.from({ length: 20 }, () => generateThenRunContract(spec))
> );
> const passRate = results.filter(r => r.allPassed).length / results.length;
> assert(passRate >= 0.9);

Eighteen out of twenty, and you can trust this spec inside an automated loop. Eleven out of twenty, and your spec is too loose. You're getting correct code by luck, and the only reason you know that is because you sampled it. A single run would have shown you a green check and hidden a coin flip.

This becomes your tuning loop, too. Low pass rate? Read the failures, find the requirement that keeps getting dropped, tighten the spec, and measure again. It's the same discipline you'd use to evaluate a model. You're just treating the spec and the model together as the unit, because that pair is what you actually ship. You're checking how often you get correct code, not whether one version happened to work.
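"Find the requirement that keeps getting dropped" is just a tally. Here's a small sketch, assuming each result also carries a `failedChecks` array of check names. That field is my assumption; the only field shown earlier is `allPassed`.

```javascript
// Assumed result shape: { allPassed: boolean, failedChecks: string[] }.
// Returns the checks that failed, most frequently failed first.
function tallyFailures(results) {
  const counts = new Map();
  for (const result of results) {
    for (const name of result.failedChecks) {
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([check, failures]) => ({ check, failures }))
    .sort((a, b) => b.failures - a.failures);
}
```

Whatever lands at the top of that list is usually the sentence missing from your spec.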

## Layer 5: The judge, for what the checks can't see

Your checks prove the code works. They say nothing about whether it's good. Is the error handling sensible, or does it just swallow exceptions? Can the next person read it? Is it secure beyond the specific things you checked for? Does it match how your team writes code? None of that turns into a simple assert, so you hand it to a judge: a model scoring the code against a rubric, or a human reviewing the versions that already cleared the behavioral checks.

Treat that judge as the unpredictable thing it is. Calibrate it against code you've rated by hand, give it concrete criteria instead of "rate this 1 to 10," and run it a few times so its own variation evens out. And yes, notice the loop you're in: you're using one model to grade another model's work. That's fine, as long as the judge isn't your only gate. The behavioral checks catch what the judge misses, and the judge catches what the checks can't measure.
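Two small pieces of arithmetic sit behind that advice: collapse repeated judge runs into one score, and measure how far the judge sits from your own ratings. Here's a sketch, with function names and sample shape invented for illustration. Using the median and a mean absolute gap is one reasonable choice among several.

```javascript
// Median of repeated judge scores, so one outlier run does not decide the grade.
function medianScore(scores) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Mean absolute gap between the judge's median score and a hand rating, over
// samples shaped like { judgeScores: number[], humanScore: number }.
// Both functions expect non-empty input.
function calibrationError(samples) {
  const gaps = samples.map((s) => Math.abs(medianScore(s.judgeScores) - s.humanScore));
  return gaps.reduce((sum, gap) => sum + gap, 0) / gaps.length;
}
```

If that gap is large on code you've rated yourself, fix the rubric before you trust the judge on code you haven't.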

## Putting it into the loop

Compile, contract, duplicate handling, ordering: those are your hard checks. If any generated code fails any of those, we don't move forward, because those are the requirements you can't meet "mostly." Pass rate and judge scores are softer. Those go on a dashboard you watch, with an alert when a prompt change starts dragging them down. And give yourself some margin: if your real pass rate runs around 92%, set the gate at 80%, so it trips on an actual regression and not on the generator just being a generator.
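You can put a number on that margin. The sketch below assumes each generation passes or fails independently with the same true pass rate, which is a simplification, and asks how often a batch of twenty clears a given gate. The function names are mine.

```javascript
// Binomial coefficient C(n, k), built up iteratively so every step stays exact.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i += 1) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that at least `minPasses` of `n` independent generations pass,
// given a true per-generation pass rate `p`.
function probGateHolds(n, p, minPasses) {
  let total = 0;
  for (let k = minPasses; k <= n; k += 1) {
    total += choose(n, k) * p ** k * (1 - p) ** (n - k);
  }
  return total;
}

const atNinety = probGateHolds(20, 0.92, 18); // gate at 90% of 20 runs
const atEighty = probGateHolds(20, 0.92, 16); // gate at 80% of 20 runs
console.log(atNinety.toFixed(2), atEighty.toFixed(2));
```

With a true pass rate of 92%, a 90% gate on twenty runs holds only about 79% of the time, so it trips on roughly one healthy batch in five. At 80% it holds about 98% of the time, and that's the difference between an alert you trust and one you learn to ignore.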

If you're letting an agent write this code on its own, in a pipeline, in a loop, behind a "ship it" button, the contract is what keeps that safe. It's the part that holds still while everything it's checking keeps moving.

## Where this leaves us

The old way of writing tests meant making specific assertions and checking for specific terms, which assumed there was a single right approach. A code generator doesn't give you that. And if you try to force it, by freezing the model or diffing the source, you throw away the value you were paying for, and you don't even get safety back in return. So you stop asking "did it write the exact right code" and start asking "does the code it wrote do the right thing, often enough, while never missing the things that can't be wrong."

Here's the good news: code runs. Most of what AI generates, you can only judge. Generated code, you can run, and that hands you four layers of fast, behavioral, real-world checking before a judge ever has to weigh in. So figure out which kind of output you've got in front of you. That tells you how much the judge has to carry, and how much you can simply prove.
