April 4, 2026
Why I Never Let AI Grade Its Own Work
Most people accept their first AI draft because it exceeds their expectations. But your expectations of AI aren't your standards. Here's what happened when I made Claude Code and Codex evaluate each other using my own voice profile and audience segments as the rubric.
The first time an AI writes something for you, you're going to think it's good.
And you'd be wrong. Not because it's bad. Because you're grading on the wrong curve.
Here's what I mean. When I ask Claude Code to write a 3,000-word podcast episode draft, it comes back in about four seconds. And my brain does something interesting: it compares the output to my expectation of what a machine should produce. That's a low bar. "This is way better than I expected" is not the same as "this is as good as I could write on my best day with unlimited time."
That gap between "surprisingly decent" and "actually excellent" is where most people stop. They accept the draft, maybe tweak a few sentences, and move on. I did that for a while too.
Then I stopped doing that.
The Experiment
I wanted to see what would happen if I treated AI-generated content the way a manager treats a developer's code: with a structured review process and a second pair of eyes.
Here's the setup. I installed a plugin called CC Codex Plugin that lets Claude Code talk to Codex, which runs on GPT-5. So now I've got two different LLMs in the same workflow. Claude Code is sitting in my terminal. Codex is connected through the plugin. And I'm telling them: one of you writes, the other one judges.
But before any of that, I gave Claude Code a resource folder. Four files in it: my voice profile, my story structure document, a podcast series architecture doc, and my audience segments. These aren't long documents, but they're specific. They describe how I talk, how I structure narratives, who I'm talking to, and what those people care about.
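If it helps to picture it, here's roughly what that folder and its loading step look like. This is a sketch in Python with made-up filenames, not the plugin's actual mechanics:

```python
from pathlib import Path

# Hypothetical filenames -- the real folder uses my own naming,
# but these are the four artifacts that matter.
RESOURCES = Path("resources")
ARTIFACTS = [
    "voice-profile.md",         # how I talk: cadence, vocabulary, rhythm
    "story-structure.md",       # therefore/but progressions, opening tension
    "podcast-architecture.md",  # how each episode fits the series
    "audience-segments.md",     # who I'm talking to, what they care about
]

def load_artifacts() -> str:
    """Concatenate the resource files into one context block for a model."""
    sections = []
    for name in ARTIFACTS:
        body = (RESOURCES / name).read_text()
        sections.append(f"## {name}\n\n{body}")
    return "\n\n".join(sections)
```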
This matters more than anything else in the process. I'll come back to why.
The Brief
The first thing Claude Code did was create a brief for the podcast episode. Not the episode itself. The brief. And because it had my resource files, the brief was specific: topic, angle, key decision points, hook, target audience from my segments, story architecture, and the takeaway.
That brief was already better than what most people start with when they ask an AI to "write me a blog post about X." A two-sentence prompt produces two-sentence thinking. A detailed brief produces detailed output.
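If you squint, the brief is just a structured object. Here's a sketch of its fields; the names are my paraphrase, not the plugin's schema:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeBrief:
    """The fields Claude Code filled in before any drafting happened."""
    topic: str
    angle: str
    hook: str
    decision_points: list[str] = field(default_factory=list)
    target_segment: str = ""      # drawn from audience-segments.md
    story_architecture: str = ""  # drawn from story-structure.md
    takeaway: str = ""
```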
Then Claude Code built a rubric. An evaluation scorecard. And here's where it gets interesting, because the rubric wasn't generic. It pulled from my voice profile, from my story structure preferences, from how I define my audience. The rubric was asking questions like: does this sound like Chris? Does the story follow a therefore/but progression instead of and-then? Is the opening specific enough to create tension?
Those are my standards. Not generic "is this well-written" standards. Mine.
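Concretely, you can think of the rubric as weighted questions plus a composite score. The criteria below are the real ones; the weights are my illustration, since Claude Code chose its own:

```python
# Criteria pulled from my artifacts; weights are illustrative, not Claude Code's.
RUBRIC = {
    "voice_match":       {"weight": 0.30, "question": "Does this sound like Chris?"},
    "story_progression": {"weight": 0.25, "question": "Therefore/but beats, not and-then?"},
    "opening_tension":   {"weight": 0.20, "question": "Is the opening specific enough to create tension?"},
    "audience_fit":      {"weight": 0.25, "question": "Does it serve the target segment's real concerns?"},
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average on a 0-10 scale."""
    return sum(RUBRIC[key]["weight"] * scores[key] for key in RUBRIC)
```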
The First Draft
Claude Code sent the brief to Codex. Codex wrote the 3,000-word draft. It came back fast: maybe twenty minutes of spoken content, produced in a fraction of that time.
And it was... pretty good.
That's the trap. It was pretty good. If I'd been the one evaluating it casually, I probably would have said "nice work" and started recording. But I wasn't evaluating it. Claude Code was, using the rubric it had built from my own artifacts.
First eval score: 7.9 out of 10.
Not bad. But 7.9 means there were specific things it missed. Specific places where the voice drifted. Specific story beats that didn't land the way my structure documents say they should. Specific audience considerations that got glossed over.
The rubric didn't just say "try harder." It said exactly where and exactly how.
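To show what "exactly where and exactly how" looks like, here's the shape of that first eval. The 7.9 composite is real; the per-category scores and notes are invented for illustration:

```python
# Illustrative breakdown -- only the 7.9 composite comes from the actual run.
first_eval = {
    "composite": 7.9,  # composite({"voice_match": 8.0, "story_progression": 7.5,
                       #            "opening_tension": 7.5, "audience_fit": 8.5})
    "notes": [
        "Opening states the topic but never names the stakes; tension arrives too late.",
        "Middle section slips into and-then beats during the tooling walkthrough.",
        "Voice flattens in the back half: too formal, missing short declarative sentences.",
    ],
}
```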
Why Two LLMs
A buddy asked me the other day: why use two different AIs? Why not just have Claude Code iterate on its own work?
I said: do you manage developers?
He said yes.
I said: do you ever have one of them QA their own work?
He said no.
Right.
LLMs have defaults. Whatever training data shaped them, whatever patterns they learned, those become the water they swim in. When you ask an LLM to critique its own output, it's evaluating against its own biases. Having a completely different LLM, trained differently, with different defaults, look at the work is like having a senior developer review a junior's code. They catch different things.
It's not that one is smarter. It's that they're differently calibrated. And the argument between two intelligent systems produces a sharper result than either one produces alone.
The Improvement
Claude Code took the eval feedback, packaged it with all my original resource files, and sent everything back to Codex. Essentially saying: here's what you wrote, here's how it scored, here's where it fell short, here's the source material that defines what "good" looks like, now try again.
Second eval came back. The scores jumped. Multiple categories hitting 9. Final composite: 8.9.
From 7.9 to 8.9 in one iteration.
I had told it to do up to five rounds, but by the second pass, Claude Code said: we're good here. Save it to the drafts folder.
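The whole loop fits in a dozen lines. `writer` and `judge` below stand in for the Codex and Claude Code calls the plugin makes; the threshold is my guess, since Claude Code decided on its own when to stop:

```python
MAX_ROUNDS = 5
GOOD_ENOUGH = 8.5  # assumed stopping bar -- Claude Code set its own

def refine(brief, artifacts, writer, judge):
    """Writer drafts, judge scores against the rubric, feedback cycles back."""
    draft = writer(brief, artifacts)
    for _ in range(MAX_ROUNDS):
        score, feedback = judge(draft, artifacts)
        if score >= GOOD_ENOUGH:
            break  # "we're good here" -- save it to the drafts folder
        draft = writer(brief, artifacts, previous=draft, feedback=feedback)
    return draft
```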
Here's the thing about that jump. In school, going from a C+ to a B+ doesn't sound dramatic. But in content quality, 7.9 to 8.9 is the difference between "this sounds like AI wrote it" and "this sounds like me on a good day." The last point is always the hardest to earn, and it's the point that matters most.
The Part Nobody Wants to Hear
The two-LLM loop is cool. The automatic iteration is convenient. The scoring system is elegant.
But none of it works without those four files in the resource folder.
The evals, the rubrics, the scoring, all of that is only as good as the artifacts you feed it. Voice profile. Story structures. Audience segments. The more specific those are, the more specific the feedback loop becomes. The more specific the feedback loop, the faster the content improves.
Stop Playing Favorites
Here's what I think most people get wrong about AI tools: they pick one and marry it.
They find Claude or ChatGPT or Gemini, they get comfortable, and they do everything inside that one model. Writing, editing, evaluating, iterating. One brain doing all the jobs.
That's like hiring one person and making them the writer, the editor, and the quality reviewer. No matter how talented they are, they can't QA their own work. Their blind spots follow them from draft to revision.
What worked in this experiment wasn't that Claude Code is great or that Codex is great. What worked is that they're different. Different training, different defaults, different ways of interpreting "make this better." The tension between two differently calibrated systems is what sharpens the output.
Think about it as a council, not a tool. You wouldn't ask one advisor for the strategy, the critique, and the final approval. You'd want people who see things differently, who push back on each other, who catch what the other one missed.
That's what two LLMs in a loop give you. Not twice the power. A different kind of power. The kind that comes from disagreement, iteration, and external review.
So stop falling in love with your favorite model. Start building a council. Give them your artifacts, point them at each other, and let the argument produce something better than either one could alone.
From 7.9 to 8.9. Not because one AI got smarter. Because two AIs held each other accountable.
About the Author
Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.