# How to Stop Burning AI Budget Without Stopping What AI Lets You Do
*Published: 2026-05-24*
*Tags: ai, ai-economics, for-executives*
*Source: https://chrislema.com/stop-burning-ai-budget*

---

Friday, in a Slack group I'm part of, came the question I've gotten in emails and seen online. Different people. Different companies. Different playbooks.

"Are you guys using different models? Or doing everything in Opus?"

The question is showing up in enough places, from enough different angles, that the pattern is hard to miss.

That question is the cleanest signal I've seen that the conversation about AI budgets is shifting. And it's shifting toward something the headlines aren't quite capturing yet.

## The visible problem

You've probably seen the public version. [Uber burned through its entire 2026 AI budget by April](https://www.thestateofbrand.com/news/ai-subscription-price-subsidiation-ending). Four months. Their CTO went on record about being "back to the drawing board." ServiceNow's CIO described the same situation. KPMG's data has U.S. organizations projecting AI spending of $207M over the next 12 months, nearly double a year ago, and Goldman Sachs surveys show large companies already overrunning those projections by orders of magnitude.

The standard reading goes something like this: Anthropic and OpenAI are raising prices, the subsidy era is ending, the bill is coming due. That reading isn't wrong. Pricing is shifting. Agentic usage is being decoupled from base subscriptions. Tokenizer changes have quietly added cost. All real.

But it's not the whole picture, and it's not the part you can actually do something about.

## The other thing happening

Here's what I've been noticing in my own work, and what the Slack thread surfaced.

I haven't been running out of credits on Codex with GPT-5.5 or on Claude Code. Two specific habits are doing the work, and neither one is complicated.

One: I only call an LLM when the work needs judgment. If a task is deterministic (parsing a file, transforming data, hitting an API, running a check), that's code. Not a prompt. Code runs reliably, costs almost nothing, and doesn't consume any model capacity. The temptation when you're moving fast is to [throw everything at the LLM](https://chrislema.com/there-are-two-kinds-of-ai-work-what-if-youre-missing-one) because it can technically do it. But "can do it" and "should do it" are different questions.

Two: I match the model to the work. Most tasks don't need Opus on extended thinking. They need Sonnet without thinking, or Haiku, or a quick model with no reasoning overhead at all. Opus with extended thinking is for the hard judgment calls: architecture decisions, debugging something I don't understand, weighing trade-offs. Routine generation, summarization, structured output, classification? Smaller models, no extended thinking, done.

That's it. Two decisions, made deliberately rather than by default.
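As a rough sketch, those two decisions reduce to one small routing function. The task labels and tier names below are hypothetical placeholders, not real model IDs or any vendor's API, and the fallback for unknown work (start small, thinking off) is an assumption rather than a rule.

```python
from dataclasses import dataclass

# Hypothetical task labels, for illustration only.
DETERMINISTIC = {"parse_file", "transform_data", "call_api", "run_check"}
ROUTINE = {"summarize", "classify", "extract", "structured_output"}
HARD = {"architecture", "debug_unknown", "weigh_tradeoffs"}


@dataclass(frozen=True)
class Route:
    target: str              # "code", "small-model", or "flagship-model"
    extended_thinking: bool  # off unless the task warrants it


def route(task_kind: str) -> Route:
    """Pick the cheapest option that can do the job."""
    if task_kind in DETERMINISTIC:
        return Route("code", False)            # no model call at all
    if task_kind in ROUTINE:
        return Route("small-model", False)
    if task_kind in HARD:
        return Route("flagship-model", True)
    # Assumption: unknown work starts small and gets escalated deliberately.
    return Route("small-model", False)
```

The point of writing it down, even this crudely, is that the choice becomes explicit. A task reaches the flagship tier because someone put it in `HARD`, not because that was the default.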

## Why the question is finally surfacing

For the first year-plus of agentic AI, almost nobody was thinking this way. The flagship model was the safe choice. Extended thinking was on by default because why wouldn't you want the best reasoning. Every step in an agent loop was an LLM call because that's what agents did.

That was the right behavior for the moment. The tools were new, the workflows were experimental, and the bill hadn't shown up yet.

The bill has shown up.

[One developer instrumented 430 hours of his own Claude Code usage](https://x.com/Mnilax/status/2050261839653556522) and found 73% of his tokens were going to overhead he didn't realize was there. Everyone assumes the waste is in obvious places like bad prompts or oversized models for trivial tasks. The patterns he found sit underneath that layer, and the fixes are mostly 30-second commands:

- **CLAUDE.md bloat.** His project rules file had grown to 4,800 tokens over six months, loading on every turn. Run `wc -w` on yours for a quick size check (it counts words, not tokens, so treat the number as a rough proxy). Cut anything you can't remember writing.
- **Conversation history re-reads.** Every follow-up message re-tokenizes the full conversation. By message 30, each turn is paying for messages 1-29 to be read again. Edit prior messages instead of stacking new ones, cap conversations at around 20 messages, and use `/compact` instead of `/clear` when you need continuity.
- **Plugin hook injection.** Plugins quietly inject context on every prompt before the model sees what you asked. Audit what your hooks are doing. Kill anything you can't justify on every prompt.
- **MCP tool schema overhead.** Every connected MCP ships its full tool schema on every request, whether the task uses it or not. Disable the ones you don't reach for in 80% of your work.
- **Extended thinking on by default.** Burns reasoning tokens on tasks that don't need any reasoning. Default it off. Turn it on per task when the work warrants it.
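The history re-read item is the easiest one to underestimate, so here is the arithmetic as a rough sketch. It assumes every message is the same size (the 200-token figure is a made-up average, not a measurement) and it ignores any caching discount a provider might apply to repeated context.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total tokens read across a conversation when every turn
    re-sends all earlier messages plus the new one."""
    # Turn k reads k messages, so the total is 1 + 2 + ... + turns messages.
    return tokens_per_message * turns * (turns + 1) // 2


def new_content_tokens(turns: int, tokens_per_message: int) -> int:
    """Tokens that are actually new across the same conversation."""
    return tokens_per_message * turns


if __name__ == "__main__":
    per_message = 200  # assumed average, not a measured figure
    for turns in (10, 20, 30):
        read = cumulative_input_tokens(turns, per_message)
        new = new_content_tokens(turns, per_message)
        print(f"{turns} turns: {read:,} tokens read for {new:,} tokens of new content")
```

Under those assumptions, 20 turns read 42,000 tokens and 30 turns read 93,000. The last ten messages cost more than the first twenty combined, which is why a cap around 20 messages pays off.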

His piece is worth reading in full because it goes deep on the tactical layer most people haven't audited yet.

That layer sits underneath the layer I'm focused on. He's optimizing the cost within a given model. The two questions I started with are about whether you're in the right model in the first place, and whether the task needs an LLM at all. They stack. His tips reduce waste inside a workflow. The two habits I described decide which workflow runs.

The companies running out of budget aren't running out because AI is too expensive. They're running out because they [treated AI as one category](https://chrislema.com/ai-bill-too-high-architecture-fix-saves-40-percent). Every task got the same tool. The category got defined as broadly as the tool could handle. And the tool happened to be the most expensive option available.

## The false choice

There's a framing I want to push back on, the one where you have two options. Either you let the bill keep climbing because the productivity is worth it. Or you scale back AI usage and lose what made it valuable in the first place.

This framing makes the bill and the productivity look like the same dial. Turn it up, the bill grows and the productivity grows. Turn it down, the bill shrinks and the productivity shrinks. That's the assumption underneath every "we need to cut our AI spend" conversation I've watched play out this year.

The assumption is wrong, but quietly. The bill and the productivity aren't on the same dial. They look like they are when every task gets routed to the most expensive option, because then every unit of productivity does cost the same amount. When the same workflow runs through Opus with extended thinking and through Haiku without thinking, the productivity numbers are often comparable. The bill numbers are not. The dial people think they're turning isn't the dial that actually moves the cost.
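A toy calculation makes the two dials visible. Every number here is a placeholder: the prices are relative units, not real vendor pricing, and the 80% routine share is an assumption chosen only to show the shape of the math.

```python
# Relative price units per step: placeholders, NOT real vendor prices.
PRICE_PER_STEP = {"flagship+thinking": 20.0, "small": 1.0}


def workflow_cost(steps: int, routine_share: float, routed: bool) -> float:
    """Cost of a workflow in relative units.

    routed=False sends every step to the flagship tier;
    routed=True sends only the non-routine steps there.
    """
    routine_steps = round(steps * routine_share)
    hard_steps = steps - routine_steps
    if not routed:
        return steps * PRICE_PER_STEP["flagship+thinking"]
    return (hard_steps * PRICE_PER_STEP["flagship+thinking"]
            + routine_steps * PRICE_PER_STEP["small"])
```

With these made-up inputs, `workflow_cost(100, 0.8, routed=False)` comes to 2,000 units and `workflow_cost(100, 0.8, routed=True)` comes to 480. The number of steps, and so the work done, is identical in both calls. Only the routing changed.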

The agentic capability is what you want. What you want to cut is the assumption that delivering it requires the flagship model on every step. Once you stop overpaying for raw generation, the only expensive resource left is your judgment about what's worth building, [which is its own strategy shift](https://chrislema.com/how-should-your-strategy-change-when-coding-costs-shrink-to-nothing).

## What to actually do

If your team's AI bill is climbing faster than expected, three questions in order:

**Where is an LLM doing the work of code?** Walk through one workflow. For each step, ask whether the task actually requires judgment. Parsing structured data, formatting output, running a check against a rule, hitting an API: none of these need a model. They feel faster to prompt because writing the prompt is faster than writing the code. But you pay for the prompt on every run; you pay for the code once, when you write it. If you can't articulate the judgment the LLM is making, replace the call with a function.
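For a sense of what "replace the call with a function" can look like, here is a minimal sketch of a rule check that might otherwise end up in a prompt. The invoice shape and field names are hypothetical, invented for this example.

```python
import json
from decimal import Decimal


def check_invoice(raw: str) -> list[str]:
    """Return a list of rule violations for one invoice; empty means it passes."""
    # Parse decimals exactly so 19.99 + 5.01 compares equal to 25.00.
    invoice = json.loads(raw, parse_float=Decimal)
    problems = []
    for field in ("id", "currency", "line_items", "total"):
        if field not in invoice:
            problems.append(f"missing field: {field}")
    if problems:
        return problems
    line_sum = sum(Decimal(str(item["amount"])) for item in invoice["line_items"])
    if line_sum != Decimal(str(invoice["total"])):
        problems.append(f"total {invoice['total']} does not match line items {line_sum}")
    return problems
```

There is no judgment anywhere in that function, which is the test. It gives the same answer every time and never touches a model.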

**Where is Opus doing the work of Sonnet?** Walk through your prompts. For each one, ask what would break if you ran it on a smaller model. A lot of work (classification, extraction, structured generation, routine summarization) runs fine on Sonnet or Haiku. Opus earns its price on hard reasoning, ambiguous situations, and judgment calls. If the task has a clear right answer that a smaller model could produce reliably, you're paying for capability you're not using.

**Where is extended thinking on by default?** Extended thinking is for genuinely hard problems. If your team has it on globally, you're burning reasoning tokens on tasks that don't have anything to reason about. Default it off. Turn it on per task when the work warrants it.

These three questions are the entire intervention. None of them require giving up agentic workflows. None require renegotiating contracts. None require waiting for pricing to stabilize.

## What I'd watch for next

Two things.

First, the question that's been showing up in my Slack group and inbox is going to surface in more places. Inside teams, between vendors and customers, in budget meetings. The people who already have answers will look like they were ahead of something, but the only thing they did differently was make two decisions deliberately while everyone else was running defaults.

Second, the pricing changes will keep coming. Anthropic's June 15 split, GitHub's June 1 move to usage-based billing, and OpenAI's two free months of Codex are the start of a year where the gap between teams that route AI work intentionally and teams that don't will widen visibly on the P&L.

The lever everyone is reaching for, more discipline about AI usage, sits one layer below where the cost is actually being created. The cost is created when work gets routed without thinking about which model the task needs or whether the task needs a model at all. Get the routing right, and the bill stops being the thing you have to manage.