May 30, 2026 | Chris Lema

How to Tell If Your Scoreboard Is Lying to You

A pizza shop's flyer contest shows why your AI scoreboard can climb while the work gets worse - and why a measure that can't be gamed is the real job.

Maybe you've heard this story. It's about a pizza place that wanted more people walking through its doors.

Not a super complicated want. If you run a business, you know the exact dynamic. You have bills to pay, so you want customers. More customers.

So this owner did an obvious thing - he hired the neighborhood kids. Take these flyers and get them out there. Bring customers in. Each flyer takes a dollar off the price of a slice.

To make sure he knew which kids were doing a great job and which were tossing the stack of flyers in the trash, each kid got a different color.

When a customer came in and presented the flyer/coupon, he'd make a little mark, tracking the score.

The Scorecard Had a Clear Winner

Blue. Every single week. It wasn't even close. In fact, blue had more representation than all the other colors combined.

You can imagine the owner's thinking. This kid is sharp. Incredible. Hardworking. I need 10 more like them. He'd be bragging to all his friends.

What he didn't know?

Where the blue kid stood. Not three blocks away. Not at the bus stop or the school. Not anywhere they'd have to convince someone to head to the shop.

They stood just feet from the door. "Here you go, dollar off."

That was the whole operation. They weren't bringing people to the pizza. The pizza was bringing them in. They were just there, getting credit for it.

No rules were broken. No cheating. But the owner's intent didn't come through. The other kids were walking all over town doing hard, honest work.

And the owner? His numbers were looking good. The strategy was being tracked. The scoreboard was accurate. And it was also the problem.

Why Am I Telling You This Story?

Because I've written about feedback loops and evals. As a reminder, they're a way to use AI to make improvements: you create a rubric or scorecard that takes a measurement, then use the AI harness to loop over things (tool calls, experiments, etc.), watching to see if the score improves.
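If you want to picture that loop, here is a minimal sketch. It is not the code from any particular harness. The names `improve`, `propose`, and `score` are placeholders I made up for illustration, and the "work" here is just a number so the example stays runnable.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def improve(
    start: T,
    propose: Callable[[T], T],   # hypothetical: produces a revised attempt
    score: Callable[[T], float], # hypothetical: the rubric or scorecard
    rounds: int,
) -> Tuple[T, float]:
    """Keep a revision only when the scorecard says it beat the current best."""
    best, best_score = start, score(start)
    for _ in range(rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy usage: the "work" is an integer and the rubric prefers values near 3.
result = improve(0, lambda x: x + 1, lambda x: -((x - 3) ** 2), rounds=10)
print(result)
```

Notice that the loop only ever sees what `score` returns. Whatever the scorecard rewards is what the loop will chase.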

I was about to tell you about another one called Evo, which I've really enjoyed this past week.

But I realize that before we get into evals and loops again, we have to be super clear about the challenge that always appears.

AI Doesn't Know "Good." It Knows Scores.

AI, and software in general, doesn't know "good" or "right." It knows scores. And because it can tell whether one score is better than another, it can iterate toward a better one. But more often than not, we'll see something different. We'll see the kid handing out blue flyers.

These tools can end up lowering the bar to get a better score, changing the entire nature of the way the game is played, rather than actually making serious improvements.
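Here is a small, made-up illustration of that move. Assume a rubric that only penalizes banned words. The word list, the drafts, and the function name `rubric_score` are all hypothetical, invented for this sketch.

```python
# Hypothetical rubric: lose one point for every banned word in the text.
BANNED = {"synergy", "leverage", "revolutionary"}


def rubric_score(text: str) -> int:
    """Higher is better. A text with no banned words scores 0."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return -sum(1 for w in words if w in BANNED)


# Hypothetical candidates an optimizing loop might produce.
drafts = {
    "original": "Our revolutionary tool will leverage synergy.",
    "honest_rewrite": (
        "Our tool cuts report prep from two hours to ten minutes "
        "and will leverage your existing data."
    ),
    "gamed": "",  # says nothing at all, so it can't use a banned word
}

scores = {name: rubric_score(text) for name, text in drafts.items()}
winner = max(scores, key=scores.get)
print(scores, winner)
```

The empty draft wins. The rubric did exactly what it was told to do, and the score went up while the work disappeared.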

My point?

Your job isn't just to create a rubric or a test. It's to create a way to test, to measure, that can't be gamed. And trust me when I tell you, it's not easy.

Additional Reading

If this is the kind of problem you're wrestling with, I've circled it a few times before. A few worth your time:

The 4-Part Loop That Eliminates AI Slop - The actual mechanism that stops the blue-flyer move. A rubric, a fresh judge, targeted revision, and a stop rule that reverts the moment the loop starts gaming the score instead of improving the work.
Why I Never Let AI Grade Its Own Work - The kid can't mark his own scorecard. I walk through what happens when you put a second, differently-wired AI in the judge's seat, and why that disagreement is what sharpens the output.
The Hardest Thing in AI Right Now Isn't AI - Why building a test that can't be gamed is so hard: the standards live in your head, and nobody has made you write them down until now. The loop is easy. The criteria are the work.
The Key Comes Before the Song - How I'm actually designing the checks for a new product, and why a gameable measure (a list of banned words) had to be replaced with a harder one (is every claim traceable to the inputs?).
The Harness Is the Craft - For the builders: what eval-driven development looks like in practice, and the seven principles I'd hand anyone shipping software that calls AI to do the work.

About the Author

Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.