# The 4-Part Loop That Eliminates AI Slop (in Your Apps and Your Content)
*Published: 2026-04-17*
*Tags: ai, insights*
*Source: https://chrislema.com/4-part-loop-eliminates-ai-slop*
---

Here's something I've been watching play out across every AI project I touch.

The people getting genuinely good work out of AI aren't the ones with better prompts. They're the ones with better loops.

I've seen this in story refinement. I've seen it in code review. I've seen it in content editing. I've seen it in agent evaluation. Same pattern, everywhere. And once you notice it, you can't stop noticing it.

The ceiling on your AI output quality isn't the model. It's whether there's a scoring loop around the model [with an independent judge](https://chrislema.com/why-i-never-let-ai-grade-its-own-work). If you're prompting and hoping, you're doing the work. **If you're designing the loop that scores the work, you're doing the new work.** That's the shift.

This judge-in-the-loop pattern has four components. Skip any one of them and the loop quietly fails in a specific, diagnosable way.

## The Four Components

**One: an explicit rubric with anchored descriptions.** Not a vibe check. Dimensions, a scale, and concrete descriptions of what a 2 looks like in prose, what a 5 looks like, an 8, a 10. Anchored. Specific. But here's the move most people miss: you don't have to write the rubric yourself. You write the *quality standard*, a document describing what good looks like in this kind of work, and the LLM extracts the rubric from it. Your job is knowing what "good" means. The LLM's job is turning that into dimensions and anchors.
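To make "anchored" concrete, here's what a single dimension of such a rubric might look like once extracted. The dimension and its anchor descriptions are invented for illustration; yours will come from your own quality standard:

```markdown
## Dimension: Argument specificity (1-10)

- **2** — Claims are asserted without evidence; any paragraph could be
  dropped into a different essay without anyone noticing.
- **5** — Claims are supported, but the examples are generic and
  interchangeable; the reader nods without learning anything.
- **8** — Every major claim is tied to a concrete, checkable example;
  a skeptical reader could verify the work.
- **10** — The examples are so specific they could only belong to this
  argument; removing any one of them would visibly weaken a claim.
```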

**Two: a fresh judge.** The scoring pass runs in a subagent that has never seen this conversation, never seen prior versions, never seen the revision history. It loads only the rubric and the current artifact. If you skip this, the judge starts rewarding "improvement" instead of quality, and you can't tell the difference from the outside.

**Three: targeted revision.** The judge returns a scorecard with dimensions, scores, and the top three priorities for the next pass. You revise against those three. You don't rewrite wholesale. You don't argue with the score. You change the thing the judge named.

**Four: stop criteria, including a regression trigger.** Hard stop at N iterations. Stop early if the total score stops moving. And this one is the quiet genius: stop early if any single dimension drops by 2+ points, and revert to the prior version before stopping. Without that trigger, your loop will gladly ship you a version that's more polished but worse.
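Component four is mechanical enough to sketch in code. This is a simplified illustration, not the author's implementation: each iteration's scorecard is assumed to be a dict of dimension name to score, and the plateau check here compares only the last two iterations:

```python
# Stop criteria for the eval loop: hard ceiling, plateau, regression trigger.
MAX_ITERATIONS = 10   # hard stop
PLATEAU = 0.03        # stop if the total improves by less than 3%
REGRESSION = 2        # revert if any single dimension drops this much

def should_stop(history):
    """history: one scorecard dict (dimension -> score) per iteration,
    oldest first. Returns (stop, index_of_version_to_ship)."""
    latest = len(history) - 1
    if len(history) >= MAX_ITERATIONS:
        return True, latest
    if len(history) >= 2:
        prev, curr = history[-2], history[-1]
        # Regression trigger: a single dimension dropped 2+ points.
        # Ship the PRIOR version, not the newer, more "polished" one.
        for dim, score in curr.items():
            if prev.get(dim, score) - score >= REGRESSION:
                return True, latest - 1
        # Plateau: the total score has effectively stopped moving.
        prev_total, curr_total = sum(prev.values()), sum(curr.values())
        if curr_total < prev_total * (1 + PLATEAU):
            return True, latest
    return False, latest
```

The regression branch is the one worth staring at: it returns the *previous* version's index, which is what "revert to the prior version before stopping" means in practice.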

## The Prompt

This is where most people get stuck and quit. "I'd build an eval loop but I don't know how to write a rubric," and they're right to hesitate, because bad rubrics produce worse-than-useless scores. The prompt below sidesteps that. It asks you to point at a source document that describes what quality looks like for this kind of work (an essay, a style guide, a principles doc, a playbook) and it builds the rubric from that. You stay in the role you're actually qualified for: knowing what good looks like. The extraction is mechanical.

Here's a simplified version of the prompt I use. Copy it, point it at your own source document, and watch how the loop behaves.

> **Part A: Build the eval**
> 
> 1. Read the source document I've provided that describes what quality looks like for this kind of work. Extract every distinct quality criterion it describes. Don't collapse criteria that should stay separate.
> 
> 2. Create eval/rubric.md with a scoring rubric:
>    - each criterion as its own dimension
>    - a 1 to 10 scale per dimension
>    - anchored descriptions for scores of 2, 5, 8, and 10 on each dimension, concrete and not vague: a "2" should describe what failure actually looks like in prose, and a "10" should describe what mastery looks like. This is the most important step, because vague anchors equal noisy scores equal a useless loop.
>    - a note at the top: do not reward verbosity, do not reward effort, score the artifact only
> 
> 3. Create eval/score.md, the prompt a fresh subagent will use to score. It takes an artifact as input, references rubric.md, and outputs: dimension name, score, one-sentence justification, one specific revision suggestion. End with a total score and a "top 3 priorities for revision" list.
> 
> 4. Create eval/stop-criteria.md: hard stop at 10 iterations; stop early if total score improves by less than 3% across two consecutive iterations; stop early if any single dimension regresses by 2+ points (revert to prior version and stop); preserve every iteration.
> 
> 5. Pause and show me the rubric before continuing.
> 
> **Part B: Run the loop**
> 
> 1. Score v0 using a fresh subagent invocation that has not seen this conversation. Load only rubric.md, score.md, and v0. Save the scorecard.
> 
> 2. Revise based on the scorecard's top 3 priorities. Save as v1. Keep revisions targeted: don't rewrite wholesale unless the scorecard demands it.
> 
> 3. Score v1 using another fresh subagent. No memory of prior versions or scores. This matters, because it prevents the judge from rewarding "improvement" instead of quality.
> 
> 4. Repeat until a stop criterion triggers.
> 
> 5. Final report: score progression table, which version had the highest total score, which version you'd actually recommend, and a blind comparison of v0 vs. your recommended version. Ignoring rubric scores, which is the better artifact as an artifact?
> 
> **Ground rules**
> 
> Each scoring pass must be a fresh subagent. No carryover context. Non-negotiable. Do not modify the rubric mid-loop. If you find yourself making changes that feel like they're gaming the rubric rather than improving the work, flag it and stop.

That's it. Two parts. Roughly forty lines. You can read it in a minute and run it in an hour.

The fresh subagent in Part B is the whole ballgame. When the same agent scores every version, it knows v3 came after v2 came after v1. Even without meaning to, it starts grading on the curve, rewarding the trajectory rather than the artifact. A fresh judge doesn't know about v2. It sees v3 alone and scores it cold. That's the only way to know if you're actually improving.
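In code, "fresh" just means the judge's context is built from scratch every time. A minimal sketch, assuming a generic chat-style model API (`client.chat` is a stand-in for whatever you actually call):

```python
def fresh_judge(client, rubric_text, artifact_text):
    """Score one artifact in a brand-new context: only the rubric and
    the artifact. No revision history, no prior scores, no trace of the
    conversation that produced the artifact."""
    messages = [
        {"role": "system", "content": rubric_text},
        {"role": "user",
         "content": "Score this artifact against the rubric:\n\n" + artifact_text},
    ]
    return client.chat(messages)
```

The discipline is entirely in what you *don't* pass: if the messages list ever contains a prior version or a prior scorecard, you're back to grading on the curve.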

The stop criteria close the loop honestly. Hard ceiling so you don't grind forever. Early stop if the score stops moving. And the regression trigger: if any single dimension drops two or more points, revert and stop. Because if you're gaining overall score while losing on a specific dimension, the loop is gaming the rubric, and you've learned more from stopping than from continuing.
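Wired together, Part B is a short loop. This is an illustrative skeleton, not the author's code: `revise` and `fresh_judge` stand in for real model calls, and the key property is that `fresh_judge` runs cold each time, with only the rubric and one version as input:

```python
def run_loop(v0, rubric, fresh_judge, revise, max_iters=10):
    """fresh_judge(rubric, version) -> {"scores": {...}, "priorities": [...]}
    revise(version, priorities) -> new version."""
    versions = [v0]
    cards = [fresh_judge(rubric, v0)]             # judge v0 in a clean context
    for _ in range(max_iters - 1):
        prev = cards[-1]
        # Targeted revision: only the top three priorities, no wholesale rewrite.
        versions.append(revise(versions[-1], prev["priorities"][:3]))
        curr = fresh_judge(rubric, versions[-1])  # new judge, no memory
        cards.append(curr)
        # Regression trigger: revert to the prior version and stop.
        if any(prev["scores"].get(d, s) - s >= 2
               for d, s in curr["scores"].items()):
            return versions[-2], cards
        # Plateau: total improved by less than 3%, stop here.
        if sum(curr["scores"].values()) < sum(prev["scores"].values()) * 1.03:
            break
    return versions[-1], cards
```

Note which version the regression branch returns: the one from *before* the drop. That single line is the difference between a loop that protects quality and one that polishes it away.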

## Where the loop fails without each component

Rubric without a fresh judge: judge rewards momentum. You ship worse, feeling better.

Fresh judge without stop criteria: the loop runs until you run out of patience, tokens, or nerve. Whichever comes first.

Stop criteria without a regression trigger: you confidently ship the version where the prose got more polished but something important went missing. The total score went up. The artifact got worse. Nobody flagged it.

All four together: the loop can improve the artifact, can't reward its own momentum, knows when to stop, and knows when it's gaming the rubric on itself.

This is why the components aren't a menu. They're a system. Each one exists because it prevents a specific failure mode the others can't see.

## What this means for your work

If you're a technical specialist watching AI do a passable version of what used to take you all afternoon, here's the part worth sitting with.

The prompting doesn't go to the human. But neither does the rubric plumbing. What goes to the human is the *quality standard itself*, the document that says what good looks like in this kind of work. The LLM turns that into dimensions and anchored descriptions. The *judge construction* goes to the human. The *stop criteria* go to the human, especially the regression trigger, which requires enough taste to know which dimensions are worth protecting when the overall score is rising.

That's your new job. Not better prompts. Better loops. The AI generates. You define what "good" looks like in prose, and the AI turns it into the scoring instrument. You set the rules for when to stop.

Pick something you've been prompting-and-praying on this week. Find or write the document that describes what good looks like in that kind of work. Hand it to the LLM and have it build the rubric. Run an artifact through a fresh judge. See what happens.

Then tell me whether the ceiling was the model.
