I just had Claude Code build me an entire web application. Thirty thousand lines of code. Eighty files. Seven hours of work. I didn't write a single line.
I wrote words. A few pages explaining the system and how I wanted it to work. Another ten to twenty pages on how I like software authored — my preferred architecture, platforms, patterns, and conventions. Then I handed it off and let the AI do what AI does.
And it did it. The application worked. The code was well structured. The whole thing came together in a way that would have taken a small team weeks to produce.
So I should have been done, right?
I wasn't. Because here's the thing nobody's talking about in the rush to celebrate AI-generated software: the code working is not the same as the code being ready.
Working code that has security holes isn't ready. Working code with functions that are never called isn't ready. Working code where the UI calls a function with userId but the backend expects user_id isn't ready. And all three of those problems were hiding in my thirty thousand lines.
The gap between “it works” and “it's ready”
If you've been writing software for any length of time, you know that “it compiles” or “the tests pass” or even “it works in production” doesn't mean the code is sound. Junior developers learn this the hard way. Seasoned engineers build review processes specifically because they know that working code can still be wrong in a dozen ways that matter.
AI-generated code has this problem in a particularly sneaky way. The code looks professional. It follows patterns. It has consistent naming conventions. It even adds comments. It can lull you into thinking it was written with the kind of intentionality that a senior engineer brings, when in reality, the AI was predicting the next token across eighty files over seven hours without maintaining a single mental model of the whole system.
That last part is what makes the verification step non-negotiable. The AI doesn't hold the full picture in its head the way a human architect would. It solves each piece well enough, but the seams between pieces are where problems hide.
One prompt. Three expert reviews.
After all the code was generated, I ran one more prompt. I think of it as the last prompt — not “last” as in you'll never need another one, but “last” as in the one that comes after everything else is done.
Here's what it does: it tells the AI to split itself into multiple sub-agents and conduct three deep reviews of the entire codebase, each from a different expert perspective.
The first review is security.
I told it to pretend that a junior developer wrote the code and that the best security expert on the team is now reviewing it. That framing matters. Without it, the AI tends to trust its own work — it knows it wrote the code with good intentions, so it doesn't look as hard. By resetting the assumption, you get an adversarial review. The kind where every input is suspect, every endpoint needs validation, and every database query is checked for injection.
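To make that concrete, here's the flavor of finding the adversarial pass surfaces. This is a hypothetical sketch, not code from my project; the Db interface and the orders table exist only for illustration, standing in for whatever database client the app actually uses.

```typescript
// Minimal stand-in for a SQL client; any driver that supports
// parameterized queries works the same way.
interface Db {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// The kind of finding the adversarial pass flags as critical:
// user input interpolated straight into the SQL string, so it's injectable.
async function getOrdersUnsafe(db: Db, customerId: string) {
  return db.query(`SELECT * FROM orders WHERE customer_id = '${customerId}'`);
}

// The remediation it would recommend: validate the input, then use a
// parameterized query so the value is never treated as SQL.
async function getOrders(db: Db, customerId: string) {
  if (!/^\d+$/.test(customerId)) {
    throw new Error("customerId must be numeric");
  }
  return db.query("SELECT * FROM orders WHERE customer_id = $1", [customerId]);
}
```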
I asked for issues categorized into critical, high, medium, and low severity. This isn't just about finding problems — it's about knowing which ones need to be fixed before you go live and which ones can wait.
The second review is optimization.
This one looks for code that's never called, functions that do the same thing in slightly different ways across multiple files, implementations that are too verbose or too terse, and performance bottlenecks.
AI-generated code has a particular tendency toward duplication. If the AI loses context between sessions — or even between files — it will solve the same sub-problem slightly differently in three different places. A human developer would notice they already wrote a utility function for that. The AI might not remember.
This review catches all of it. Dead code, redundant logic, and opportunities to consolidate.
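Here's a hypothetical illustration of that pattern, compressed into one snippet. The file names and functions are invented, but it's representative of what the optimization pass turns up:

```typescript
// Hypothetical example of the duplication pattern: the same
// "format a date for display" problem solved three different ways.

// utils/date.ts
export function formatDate(d: Date): string {
  return d.toISOString().slice(0, 10); // UTC, e.g. "2025-01-31"
}

// components/OrderRow.ts: re-solved inline, local time, no zero-padding
export function orderDateLabel(d: Date): string {
  return `${d.getFullYear()}/${d.getMonth() + 1}/${d.getDate()}`;
}

// reports/summary.ts: re-solved again, this time with manual zero-padding
export function reportDate(d: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

// An optimization review would flag all three, note that they don't even
// agree on the output format, and consolidate them into one utility.
```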
The third review is traceability.
This is the one that catches the most bugs, and it's the one most people would never think to ask for.
Traceability means following every user action from the UI all the way through to the database and back. Every button click, every link, every form submission. Does the UI call the right function? Does it pass the right arguments with the right names and the right types? Does that function call the next layer correctly? And the next? All the way down to the database query, and all the way back up to the response the user sees.
I told the AI to use one sub-agent per layer, because a shallow pass across the whole stack misses the details. You need something looking specifically at the API layer, something else looking at the service layer, something else at the data layer. Each one checking that contracts between layers are honored.
This is where AI-generated code breaks most often. The AI might generate a frontend that calls getUserProfile(userId), and a backend that exposes get_user_profile(user_id). Both work in isolation. Neither works together. The parameter naming is different. Or the type is wrong — a string where an integer is expected. Or the response shape doesn't match what the frontend is trying to destructure.
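Here's a minimal, hypothetical sketch of that failure mode. Both halves are written in TypeScript for brevity, and none of the names come from my actual codebase:

```typescript
// Hypothetical frontend call: camelCase query param, camelCase response field.
async function getUserProfile(userId: string): Promise<string> {
  const res = await fetch(`/api/profile?userId=${encodeURIComponent(userId)}`);
  const data = await res.json();
  return data.displayName; // the frontend expects displayName
}

// Hypothetical backend handler (sketched in TypeScript as well): it reads a
// snake_case param and returns a snake_case field. The param the frontend
// actually sent is never read, and displayName comes back undefined.
function handleProfileRequest(query: Record<string, string>) {
  const id = query["user_id"]; // but the frontend sent "userId"
  return { display_name: lookupDisplayName(id) };
}

function lookupDisplayName(id: string | undefined): string {
  return id ? `user-${id}` : "unknown";
}
```

Each half looks fine on its own. Only a trace that follows the request across the boundary notices that the parameter name and the response shape never line up.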
These aren't the kind of bugs that show up in a demo. They're the kind that show up when a real user does something slightly unexpected on a Tuesday afternoon.
The actual prompt
Here's the prompt itself:
Now I want you to create and use as many sub-agents as makes sense to dig into three areas. You may get away with one sub-agent for each of the first two areas, but you'll need several (one per layer) for the last area.
1. Security — I want you to go deep and do a full and complete security analysis — like a jr developer wrote the code rather than a senior engineer did — and now your best security expert is reviewing and looking for any issues. Categorize them into critical, high, medium and low.
2. Optimization — I want you to go deep and do a full and complete analysis of the code — looking for code that's never called, functions that are replicated in a lot of different places, places where the code could be optimized because it's either too long or too short, performance tweaks and more.
3. Traceability — This is where you'll likely want one sub-agent per layer. I want every link or button in the UI that calls our code, to be reviewed to make sure it actually calls our code. And it calls it with the right signature, passing the right arguments, and that those arguments are the right name / right type. From that layer, to the next, and so on, until you're hitting the database, I want to make sure that we can go from every call to every other call, from one layer to another and then all the way back, without any issues. Functions are called correctly, passed data correctly, and deliver what's expected.
Once you have all of this detail, pull it all together into a report that I will understand, but also that will make it easy for you to create a task markdown artifact that sub-agents can read and pick items to work on from there.
What it found: 88 issues
This is the part that matters. I gave Claude Code thorough guidance. Pages of specifications. Detailed architecture preferences. Everything it needed to do excellent work. It produced thirty thousand lines of well-structured code.
And the verification prompt still found eighty-eight issues.
Four critical security vulnerabilities. Five high-severity security issues. Six completely broken traces — UI elements calling functions that either didn't exist or didn't work as expected. Twenty-one signature mismatches across layers. Eight high-priority optimization issues. Thirty-six optimization items total.
Eighty-eight problems in code that worked, that looked clean, and that followed all my guidelines.
If I had shipped that code without this step, I'd have had four critical security holes in production and six features that would have failed the first time a real user clicked the button.
Why the deliverable format matters
The last line of the prompt asks for two things: a report I can understand and a task artifact that sub-agents can work from.
This is the difference between finding problems and fixing them. The human-readable report tells me the state of my codebase — where the risk is, what needs attention first, and how severe each issue is. The task artifact turns the review into a punch list that the AI can work through autonomously.
You run the verification prompt, read the report, decide what to prioritize, and then let the AI start fixing. It's a pipeline: generate, verify, remediate.
Who this is for
If you're a developer using AI to generate code faster, this prompt is the quality gate that keeps you from shipping problems. You already know that code review matters. This just does it at a scale and speed that matches how fast the code is being produced.
If you're a product person or business leader who's started building with AI tools — and many of you are — this prompt is even more important. You might not know what a type mismatch is or why it matters that a function signature doesn't match between layers. You don't need to. You just need to know that this step catches the things you wouldn't see, and that skipping it is how prototypes turn into production incidents.
The prompt works because it mirrors what good engineering teams already do: security review, code optimization, and integration testing. It just does all three in one shot, at a depth that would normally take a team of reviewers days to complete.
The takeaway
AI can write thirty thousand lines of code in seven hours. That's remarkable, and it's only going to get faster. But the excitement about generation speed has outpaced the conversation about verification depth.
Every piece of AI-generated code deserves this step. Not because the AI did a bad job — mine did a genuinely good job. But because eighty-eight issues were still hiding in that good work. The “last prompt” isn't about doubting the AI. It's about respecting the complexity of real software.
Build fast. Then verify deep. That's the process.