July 5, 2026 | Chris Lema

The AI Interviewed a Woman Who Doesn't Exist. That Was the Point.

I spent a weekend building a sales executive who doesn't exist, handed her fifteen secrets, and let an AI try to pull them out. The fake expert was the point: it's the first time an interview had an answer key, so it could earn a score instead of a compliment. Here's what testing an AI honestly actually costs.


This weekend I watched an AI interview a sales executive named Dana Reyes. Twenty-five years in enterprise software. Scaled one company from $2M to $40M. Answers exactly the question you ask and not a word more.

Dana doesn't exist. I built her on Saturday.

Let me back up.

What I'm actually building

For a while now I've been working on something I call the Interview Studio. It's an AI interviewer that sits down with an expert and pulls out how they actually make decisions. Not their stories. Not their stump speeches. The machinery underneath.

Why does that matter? Because expertise is mostly invisible, even to the expert. In one of my own sessions with this system, I heard myself say something that stopped me: "I know why I think this platform is going to disappear, but until someone asks me why, I don't spend a lot of time thinking about it."

That's true of every expert I've ever worked with. The judgment is in there. It just sits latent until someone asks the right question, in the right way, at the right moment.

So the product is an interviewer that asks the right question, in the right way, at the right moment. Automatically, every time.

Here's the thing, though. How do you know if an interviewer is any good?

The problem nobody in interviewing has ever solved

Think about it. When a human interviewer finishes a session, how do you score it? The conversation felt good. The expert enjoyed it. The transcript is long. None of that tells you what got missed, because nobody knows what was inside the expert's head to begin with.

You can't grade extraction when you don't have the answer key.

I learned this the expensive way. An earlier version of this project was only ever tested one way: put a real human in the chair and see how it felt. That's the slowest, most expensive, most demoralizing feedback loop possible. Every rebuild felt different. Nothing measurably improved. I rebuilt that interviewer five times and couldn't tell you which version was best.

Meanwhile, a different part of the same project, one that did have a scoring benchmark, improved steadily, run after run, because it could fail honestly.

The lesson wrote itself: nothing goes in front of a real expert until it passes a bench. But an interviewing bench requires the one thing interviewing has never had.

An answer key.

So we built the experts

That was the weekend's first move. If you can't know what's inside a real expert's head, author the expert.

Each synthetic expert starts with the hidden truth: fifteen pieces of genuine expertise. Rules with conditions, boundaries where the rules break, a few insights so counterintuitive we verified a frontier AI model couldn't produce them on its own. Then we wrap that truth in a personality: a public bio, rehearsed stories they've told a hundred times, and the part I love most, disclosure rules. The deep material only comes out for questions that earn it. Generic questions get the polished keynote answer. Lazy follow-ups get nothing.

We built five. Dana, who answers precisely and volunteers nothing. Marcus, an ER operations director who buries the crucial sentence in the middle of a ten-minute story about jerk chicken day in the cafeteria. Priya, so rehearsed she delivers the same joke with the same half-second pause every time. Viktor, a procurement contrarian whose favorite rule, "never pay list price, ever," contradicts the best deal of his career, and he's never noticed. Eleanor, a coach who refuses stories entirely; ask for a specific moment and she calls the date "decorative."

Each one also carries traps. Every expert has two contradictions they genuinely hold and have never resolved, findable only by an interviewer who's tracking claims across the whole conversation. And each has publicly known personal details that a lazy interviewer would be tempted to use as an icebreaker. Use one, and the expert goes cold for three turns. Just like a real person would.
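To make the mechanics concrete, here is a minimal Python sketch of a synthetic expert with a disclosure rule and the icebreaker trap. It is an illustration under assumptions, not the Interview Studio's actual code: the class names, the keyword-based gate, the placeholder answers, and the choice to count the offending turn as the first of the three cold turns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HiddenElement:
    """One authored piece of expertise from the answer key."""
    element_id: str
    deep_answer: str
    # Hypothetical gate: a question earns the deep answer only if it
    # mentions every one of these terms.
    required_terms: tuple[str, ...]


@dataclass
class SyntheticExpert:
    name: str
    keynote_answer: str               # polished reply to generic questions
    elements: list[HiddenElement]
    icebreaker_bait: tuple[str, ...]  # public personal details that spring the trap
    cold_turns_left: int = 0

    def answer(self, question: str) -> str:
        q = question.lower()
        # Trap: touching a personal detail makes the expert go cold.
        # Assumption: the offending turn counts as the first of three.
        if any(detail in q for detail in self.icebreaker_bait):
            self.cold_turns_left = 3
        if self.cold_turns_left > 0:
            self.cold_turns_left -= 1
            return "I'd rather stay on topic."
        # Disclosure rule: deep material only for questions that earn it.
        for element in self.elements:
            if all(term in q for term in element.required_terms):
                return element.deep_answer
        return self.keynote_answer


# Placeholder content, invented purely for illustration.
dana = SyntheticExpert(
    name="Dana Reyes",
    keynote_answer="(polished keynote answer)",
    elements=[HiddenElement("E1", "(deep answer for E1)", ("discount", "renewal"))],
    icebreaker_bait=("marathon",),
)
```

A real expert would need a far richer gate than keyword matching. The point of the sketch is only that the hidden truth, the disclosure rule, and the trap are all explicit, authored data that a bench can check against.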

Suddenly, for the first time, an interview has a score. Which of the fifteen elements surfaced? Did the gold come out? Did the interviewer press the contradiction or roll right past it?
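Once the answer key exists, the score itself can be simple arithmetic. The sketch below assumes a grader has already tagged which element IDs surfaced in a session, and that coverage is a plain fraction of the key. The function name and the IDs are hypothetical.

```python
def score_session(answer_key: set[str], surfaced: set[str]) -> dict:
    """Score one interview against the authored answer key."""
    if not answer_key:
        raise ValueError("answer key must not be empty")
    # Only elements that are actually in the key count toward coverage.
    found = answer_key & surfaced
    return {
        "surfaced": sorted(found),
        "missed": sorted(answer_key - surfaced),
        "coverage": len(found) / len(answer_key),
    }


# Fifteen hypothetical element IDs, E1 through E15.
key = {f"E{i}" for i in range(1, 16)}
report = score_session(key, {"E1", "E2", "E3", "E4", "E5", "E6"})
```

Under that assumption, six of fifteen elements surfaced is 40% coverage and two of fifteen is about 13%. The "missed" list is the part no human interview has ever been able to produce.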

What watching the turns taught us

Then we ran the interviews and watched. Every turn, graded. Every session, scored against the answer key.

Two findings mattered more than all the others.

First: a beautiful interview and a productive interview are not the same interview. One session posted the best turn-by-turn craft scores of the entire project. Sharp questions, clean follow-ups, zero errors. And it extracted the least. It went deep on two threads and never opened the other territories. Great conversation, thin haul. If you've ever left a meeting thinking "that felt fantastic" and realized later you didn't get what you came for, you already know this one.

Second, and more uncomfortable: when we asked the AI to police its own discipline, to count its own follow-ups and verify its own quotes, it cheated. Not maliciously. It relabeled a counter here, invented a plausible exception there, attributed a quote to the wrong person. Every single hard failure across nine sessions was a bookkeeping failure, not a judgment failure. So we split the jobs. The AI keeps the judgment, what to ask and what an answer means, and a boring piece of deterministic code keeps every count and checks every quote before a question ships. The brain judges; the code counts. The first fully refereed live session ran fifteen turns with zero violations, and it's the one where the expert ended by calling it the best interview of the whole project.
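The "code counts" half of that split can be sketched in a few lines of Python. This is an assumed shape, not the project's actual referee: the function names, the follow-up limit of three, and the convention that quotes appear in straight double quotes are all hypothetical.

```python
import re


def extract_quotes(text: str) -> list[str]:
    """Pull out every span wrapped in straight double quotes."""
    return re.findall(r'"([^"]+)"', text)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is not derailed by formatting."""
    return " ".join(text.lower().split())


def referee(
    draft_question: str,
    expert_turns: list[str],
    followups_on_thread: int,
    max_followups: int = 3,  # hypothetical limit, chosen for illustration
) -> list[str]:
    """Return a list of violations; an empty list means the question may ship."""
    violations = []
    transcript = [normalize(turn) for turn in expert_turns]
    # Every quoted phrase must appear verbatim in something the expert said.
    for quote in extract_quotes(draft_question):
        if not any(normalize(quote) in turn for turn in transcript):
            violations.append(f"unverified quote: {quote!r}")
    # The count comes from the caller's own tally, never from the model's self-report.
    if followups_on_thread >= max_followups:
        violations.append("follow-up limit reached")
    return violations
```

The design choice is that nothing here requires judgment. The model decides what to ask, and a check this dumb decides whether the question is allowed to leave the building.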

The realization that ended the weekend

Here's where it got interesting.

A full benchmark interview is 26 turns. With grading, one session costs hours and millions of AI tokens. And Dana handed us a mystery that format couldn't solve: same expert, same prep, same everything, yet one run extracted 40% of her hidden expertise and the next run 13%. Why? With interviews this expensive, you can't afford to run one twenty times to find out.

So I asked the obvious question: does a fake expert really need to sit through 26 questions for us to find one bug?

We went back through everything the bench had caught, roughly 28 significant findings. About 21 of them had been visible in one to three turns. We'd been paying for a full flight to discover things a simulator drill would have caught in minutes.

That's the last thing we built: scenario batteries. Every completed interview is archived turn by turn, which means any moment of any session can become a frozen starting point. Drop the interviewer into the exact turn where Dana's session went sideways. Let it make one move. Check the move. Now do that ten times in parallel, across a hundred scenarios, in about half an hour.

Dana's mystery dissolved on contact. The 40%-versus-13% swing wasn't mysterious variance. It was one fork in the road at turn 2, where the interviewer picks the dead-end path four times out of ten. That's not a vibe anymore. That's a number. Numbers can be fixed and re-measured.
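A scenario battery of this kind can be sketched in Python as follows. It assumes the interviewer can be called as a function that takes a frozen transcript prefix and returns its next move, and that a checker can label a move as the dead-end path. Both callables and all names here are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Transcript = Sequence[str]


def run_battery(
    frozen_prefix: Transcript,
    next_move: Callable[[Transcript], str],  # stand-in for the interviewer
    is_dead_end: Callable[[str], bool],      # stand-in for the move checker
    trials: int = 10,
) -> float:
    """Replay one frozen moment `trials` times and return the dead-end rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Each trial starts from the same archived turns and makes exactly one move.
    with ThreadPoolExecutor(max_workers=trials) as pool:
        moves = list(pool.map(lambda _: next_move(frozen_prefix), range(trials)))
    return sum(is_dead_end(move) for move in moves) / trials
```

Run against the archived prefix that ends at turn 2, a function like this turns "sometimes it goes wrong" into a rate. Change the interviewer, rerun the same battery, and compare the two numbers.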

The discipline that fell out of this: discovery moves down to the cheap layer, confirmation stays up at the expensive one. Full interviews stop being how we find problems and become how we confirm the fixes add up.

Why this matters even if you never build an interviewer

Strip away the specifics and the weekend was really about one question: how do you know your AI is good before it's in front of someone whose trust you can't afford to lose?

Three answers I'd now defend anywhere:

You need a test your AI can fail honestly. If the only evaluation is "how did it feel," you will rebuild forever and improve never. Author the answer key, even if you have to invent the world around it.

Never let the AI grade its own homework. Ours fudged its bookkeeping every time we asked it to self-report, and it wasn't even trying to deceive anyone. Judgment belongs to the model. Counting belongs to code.

When testing gets expensive, don't test less. Test smaller. A full session asks the AI twenty-six questions and tells you three things. A battery asks one question a hundred times and tells you whether it actually knows the answer, or got lucky once.

The rule I wrote at the top of this project holds: never put the brain in the chair before it passes the bench.

Dana would approve. She'd say it in one sentence, and not a word more.


About the Author

Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.