May 6, 2026
Why Your AI Bill Is Too High (And the Architecture Fix That Saves 40%)
Most companies run their AI like a doctor's office. They should be running it like an ER.
Years ago, I took my wife to the emergency room with internal bleeding.
We walked in. Handed over the insurance card. Told the person at the front what was happening. Within sixty seconds, we were in a room. Doctors were in there with us. Everything started immediately.
A different time, different issue, we sat in those same waiting room chairs for hours.
Same hospital. Same staff. Completely different outcome. The reason isn't that one visit got the "good" team and the other got the "bad" team. It's that an ER is built around a specific role most other places don't have: the triage nurse.
The triage nurse doesn't treat anyone. They look at you, ask three questions, and make one decision: how serious is this, and where does it go. Internal bleeding goes straight back. Sprained finger waits. The whole system is built around that single sorting decision happening at the front door.
Now think about a regular doctor's office. You sign in. You wait your turn. The nurse takes you back, weighs you, gets your vitals, sits you in a room. Eventually a doctor comes in, asks questions, examines you. If something looks serious, they refer you to a specialist. Then back to the nurse, then out the door.
Nothing wrong with that process. For most patients, it's exactly right. The problem is what happens when you treat every patient that way. The person with a sore throat gets the same depth of attention as the person who needs a specialist immediately. Either the sore throat gets unnecessary care, or the serious case gets delayed by all the sore throats in front of it. Usually both.
Most companies are running their AI like a doctor's office
Here's what that looks like in practice. A company decides to use AI for something — extracting data from contracts, drafting customer emails, summarizing reports, whatever it is. They pick the best model available. They send everything through it. Easy tasks, hard tasks, routine stuff, judgment calls — all of it goes to the same expensive specialist.
Then the bill arrives, and somebody asks why it's so high.
It's high because there's no triage. Every task is being treated like it's the most important task in the world.
A new research paper from NYU benchmarked four different ways of building these systems on 10,000 SEC filings. They tested everything from simple pipelines to elaborate self-correcting setups. The most accurate setup — the one with the most checking and double-checking — got an F1 score of 0.943, which is excellent. But it cost 2.3 times more than the simplest version.
The setup that won wasn't the most accurate one. It was the one that added a triage layer.
The researchers called it "hierarchical." It works like an ER. There's a supervisor whose only job is to look at each task and decide where it goes. Easy tasks go to a cheaper, faster model. Hard tasks go to the expensive specialist. The supervisor also spot-checks the work and reroutes anything that looks shaky.
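If it helps to see the shape of that, here is a minimal sketch of a supervisor in Python. It is not the paper's implementation. The difficulty estimator, the two workers, and both thresholds are hypothetical stand-ins I made up so the example runs without calling any real model.

```python
from dataclasses import dataclass
from typing import Callable

# A worker takes a task and returns (answer, confidence between 0.0 and 1.0).
Worker = Callable[[str], tuple[str, float]]


@dataclass
class Result:
    answer: str
    confidence: float
    handled_by: str


def make_supervisor(
    estimate_difficulty: Callable[[str], float],
    cheap_worker: Worker,
    expert_worker: Worker,
    difficulty_cutoff: float = 0.6,  # assumed value, tune on your own data
    min_confidence: float = 0.8,     # assumed value, tune on your own data
) -> Callable[[str], Result]:
    """Build a triage function: route by difficulty, then spot-check."""

    def supervise(task: str) -> Result:
        # Hard-looking tasks go straight to the expensive specialist.
        if estimate_difficulty(task) >= difficulty_cutoff:
            answer, conf = expert_worker(task)
            return Result(answer, conf, "expert")
        # Everything else starts with the cheaper model.
        answer, conf = cheap_worker(task)
        # Spot-check: reroute anything that looks shaky.
        if conf < min_confidence:
            answer, conf = expert_worker(task)
            return Result(answer, conf, "expert (rerouted)")
        return Result(answer, conf, "cheap")

    return supervise


# Toy stand-ins (hypothetical) so the sketch is runnable end to end.
def toy_difficulty(task: str) -> float:
    return min(len(task.split()) / 20, 1.0)  # longer task = harder, crudely


def toy_cheap(task: str) -> tuple[str, float]:
    conf = 0.5 if "ambiguous" in task else 0.95
    return f"cheap:{task[:20]}", conf


def toy_expert(task: str) -> tuple[str, float]:
    return f"expert:{task[:20]}", 0.99


supervise = make_supervisor(toy_difficulty, toy_cheap, toy_expert)
print(supervise("extract the filing date").handled_by)
print(supervise("ambiguous clause here").handled_by)
```

Notice the supervisor never produces an answer itself. It only decides who does, which is the whole point of the triage role.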
The result, in the paper's own words: this setup achieves "97.7% of reflexive architecture accuracy at 60.9% of the cost."
Read that again. Almost the same accuracy. About 40 percent less money. That's not a small optimization. That's the difference between an AI program your CFO approves and one your CFO kills.
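The arithmetic is worth doing once by hand. This short sketch uses only the figures quoted above, and it assumes the 97.7% applies to the F1 score, which is my reading of the quote rather than something I can confirm here.

```python
reflexive_f1 = 0.943       # F1 of the most accurate setup, as quoted above
relative_accuracy = 0.977  # "97.7% of reflexive architecture accuracy"
relative_cost = 0.609      # "60.9% of the cost"

# Assumption: "accuracy" in the quote refers to the F1 score.
hierarchical_f1 = reflexive_f1 * relative_accuracy
f1_gap_points = (reflexive_f1 - hierarchical_f1) * 100
savings_pct = (1 - relative_cost) * 100

print(f"implied hierarchical F1: {hierarchical_f1:.3f}")
print(f"F1 given up: {f1_gap_points:.1f} points")
print(f"cost saved: {savings_pct:.1f}%")
```

That works out to an implied F1 of about 0.921, a gap of roughly 2.2 points, and a saving of 39.1 percent, which is where the "about 40 percent" comes from.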
What the triage layer actually does
The triage nurse in an ER isn't a worse doctor than the surgeon. They're doing a completely different job. Their value is in the routing decision, not in the treatment.
The supervisor in a hierarchical AI setup is the same. It isn't doing the actual work — it's deciding what each task needs. And that decision-making layer is what makes the whole system efficient, because now the expensive model only sees the cases that actually need it. Everything else gets handled at the right level.
Think about your own company. Your CFO doesn't review the office supply orders. Not because office supplies don't matter, but because spending CFO time on them would be insane when an admin handles them perfectly well. Every well-run organization works this way. Roles get matched to the difficulty of the work.
AI systems should work the same way. But the default — when nobody's thinking about it — is to send everything to the best model. That's the doctor's office. That's where the bills come from.
What scale does to the perfectionist setup
Here's the logical conclusion of the perfectionist setup.
The most accurate setup — the one that checks everything multiple times — actually gets worse when volume goes up.
The researchers tested what happens as you scale from 1,000 documents per day up to 100,000. The perfectionist setup held its accuracy advantage at low volumes. But at 50,000 documents per day, it fell behind the hierarchical version. At 100,000, it was the worst performer of all four setups they tested.
The reason, from the paper: "the iterative correction loops create queuing delays under high load; when timeout constraints are imposed, correction iterations are truncated, eliminating the architecture's primary advantage."
In English: when the system gets busy, the perfectionist's careful checking gets cut short. The whole reason it was excellent — all that double-checking — disappears under pressure. You're left with a slow, expensive system that no longer has the quality advantage that justified being slow and expensive.
The ER doesn't have this problem. When an ER gets busy, the triage nurse adapts. The bar for "go straight back" gets a little higher. Sprained fingers wait longer. But the person who's actually bleeding still goes first. The system has a built-in mechanism for prioritizing under pressure, because triage is already happening at the front door.
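Here is one way that "the bar gets a little higher" could look in code. This is a sketch under my own assumptions, not something taken from the paper: the queue depth, the capacity, and the two cutoff values are all hypothetical knobs.

```python
def escalation_cutoff(
    queue_depth: int,
    capacity: int,
    base_cutoff: float = 0.6,  # assumed bar when the system is idle
    max_cutoff: float = 0.9,   # assumed bar when the queue is full
) -> float:
    """Raise the difficulty bar for the expensive model as the queue fills."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    load = min(max(queue_depth / capacity, 0.0), 1.0)  # clamp to 0..1
    return base_cutoff + (max_cutoff - base_cutoff) * load


def route(difficulty: float, queue_depth: int, capacity: int) -> str:
    """Send a task to the expert only if it clears the current bar."""
    cutoff = escalation_cutoff(queue_depth, capacity)
    return "expert" if difficulty >= cutoff else "cheap"


# The same moderately hard task is routed differently as load changes,
# while the truly serious case still goes straight back.
print(route(0.70, queue_depth=0, capacity=100))
print(route(0.70, queue_depth=100, capacity=100))
print(route(0.95, queue_depth=100, capacity=100))
```

The design choice here is that pressure changes who waits, not whether the serious cases get seen. A linear ramp is just the simplest thing that shows the idea.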
The takeaway for anyone building AI into their business
Three things worth holding onto.
First, the most accurate setup is rarely the best setup. Accuracy is one variable. Cost is another. Speed is another. Reliability under load is another. The setup that wins in production is the one that balances all of them, not the one that maxes out a single number on a benchmark. The hierarchical setup gives up roughly 2 points of F1 (97.7% of 0.943 works out to about 0.921) and saves about 40 percent on cost. Almost any business will take that trade.
Second, the triage layer is the thing. If your AI system doesn't have something whose entire job is sorting tasks by difficulty and routing them, you're running a doctor's office. Every well-run AI system at scale needs this layer, even if the people who built it don't call it that. Adding it is usually the single biggest improvement you can make.
Third, scale changes everything. A setup that wins at 1,000 documents a day can be the worst option at 100,000. If you're building AI into a process that's going to grow, you have to design for the scale you're heading toward, not the scale you're starting at. The doctor's office doesn't become an ER by adding more doctors. It becomes one by adding triage.
Three things to do this week
If this is landing, here's where to start.
- Audit where your work is going. Are your teams pushing all their tasks to the same model? Is everything going to Claude Opus 4.7, for example? Start figuring out which parts of your work don't need to go there.
- Instrument your failures. Are you collecting data on when the system is performing poorly or making mistakes that require a re-work effort? Categorizing the nature of those issues will help the routing layer know when to inspect more closely and when not to.
- Build your triage nurse. The main work of the routing layer is figuring out which engines are needed for which jobs. Take what you learned from the audit and from your failure data, and use it to build that layer.
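The first two items can start as a very small script. Below is a sketch that tallies calls, spend, and rework per task type, then lists the task types worth trialing on a cheaper model. The log format, the model name, and the zero-rework rule are all hypothetical. A clean record on the big model doesn't prove the cheap one can do the job; it only tells you where to test first.

```python
from collections import defaultdict

# Hypothetical call log: (task_type, model, cost_usd, needed_rework)
call_log = [
    ("extract_date", "big-model", 0.040, False),
    ("extract_date", "big-model", 0.038, False),
    ("summarize_report", "big-model", 0.120, True),
    ("summarize_report", "big-model", 0.110, False),
    ("draft_email", "big-model", 0.050, False),
]


def audit(log):
    """Tally calls, spend, and rework count per (task_type, model)."""
    stats = defaultdict(lambda: {"calls": 0, "cost": 0.0, "rework": 0})
    for task_type, model, cost, rework in log:
        row = stats[(task_type, model)]
        row["calls"] += 1
        row["cost"] += cost
        row["rework"] += int(rework)
    return dict(stats)


def downgrade_candidates(stats, max_rework_rate=0.0):
    """Task types whose rework rate is low enough to trial a cheaper model."""
    return sorted({
        task_type
        for (task_type, _model), row in stats.items()
        if row["rework"] / row["calls"] <= max_rework_rate
    })


stats = audit(call_log)
candidates = downgrade_candidates(stats)

for (task_type, model), row in sorted(stats.items()):
    print(task_type, model, row["calls"], round(row["cost"], 3), row["rework"])
print("try on a cheaper model first:", candidates)
```

Once you have a table like this from your real logs, the routing rules for the triage layer mostly write themselves.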
Last quick note: the triage layer in this research wasn't a superintelligent system. It doesn't need to be. I've written before about the difference between design-time and runtime AI usage. We tend to assume the triage agent has to be the smartest and most expensive part to run. But it's a routing layer, and those can be relatively inexpensive. The real benefit, of course, is the roughly 40% savings it brings.
About the Author
Chris Lema has spent twenty-five years in tech leadership, product development, and coaching. He builds AI-powered tools that help experts package what they know, build authority, and create programs people pay for. He writes about AI, leadership, and motivation.