Architecting for Token Efficiency

Token economics is not a prompt engineering problem. It is a system design problem.

Most enterprises can build AI. Very few can control what it costs to run it.

Uber burned its entire 2026 AI budget in four months. Claude Code adoption jumped from 32% to 84% of 5,000 engineers in three months. Individual monthly API costs reached $500 to $2,000 per engineer. CTO Praveen Neppalli Naga told The Information: "I am back to the drawing board because the budget I thought I would need is blown away already."

Uber is not alone. The FinOps Foundation's 2026 State of FinOps Report found that an unprecedented 98% of enterprises are now scrambling to manage AI spend, making AI cost management the number one forward-looking priority and skill gap in the industry today.

Some organizations are reporting monthly AI bills in the tens of millions of dollars. Deloitte noted that agentic AI's continuous inference is sending token costs spiraling, with usage-based pricing that looks manageable in pilots becoming ruinous at company-wide scale.

Enterprise finance cycles move on annual rhythms. Developer adoption moves on enthusiasm. That mismatch is producing budget ruptures across industries. Not from model failures. Not from poor use cases. From architectures that were never designed with cost discipline at their core.

Token economics is not a prompt engineering problem. It is a system design problem. And organizations treating it as the former will keep solving the wrong thing at the wrong layer.

The Problem with How Most Teams Approach Cost

Most AI teams apply the same optimization logic across every workload. Write shorter prompts. Reduce output length. Cap tokens globally. These are blunt instruments applied to a problem that requires precision.

Different AI architectures create different cost pressures. The pattern determines the hotspot. The hotspot determines the lever.

The Uber story is not a Claude Code story. It is an architecture governance story. The people who designed the adoption leaderboard were almost certainly not the same people responsible for the AI services budget line. That organizational gap, between the teams driving adoption and the teams managing spend, is the root cause of the overrun.

That gap exists in most enterprises right now.

"Token economics will be the new cloud cost management."

Four Patterns. Four Hotspots. One Engineering Discipline.

Each pattern has one dominant cost pressure. Identify it. Target the hotspot. Apply the right lever.

Pattern 1: Chatbot and Conversational AI (Hotspot: Conversation History)

The Hotspot: Conversation history. Prior turns accumulate with every exchange, inflating the context window session by session. Without active management, per-session cost compounds across every user at every turn.

The Levers: Prompt caching for stable system prompts and tool schemas, rolling summaries that compress prior context, raw history capped at the last two to four turns, and token budget enforcement before the model call.
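To make these levers concrete, here is a minimal sketch of context assembly for a chat session. Everything in it is an assumption for illustration: the `ChatContext` class, the `estimate_tokens` heuristic (roughly four characters per token), the budget value, and `naive_summarize`, which stands in for a real summarization step. The stable system prompt is placed first because provider-side prompt caching generally works on a stable prefix.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def naive_summarize(summary: str, turn: str, limit: int = 200) -> str:
    """Placeholder for a real summarizer: append the turn and keep the tail."""
    return (summary + " " + turn).strip()[-limit:]


@dataclass
class ChatContext:
    system_prompt: str              # stable prefix, a good caching candidate
    max_raw_turns: int = 4          # keep only the last few turns verbatim
    token_budget: int = 2000        # hypothetical per-request ceiling
    summary: str = ""
    turns: list[str] = field(default_factory=list)

    def add_turn(self, text: str, summarize=naive_summarize) -> None:
        self.turns.append(text)
        # Fold turns that fall out of the raw window into the rolling summary.
        while len(self.turns) > self.max_raw_turns:
            oldest = self.turns.pop(0)
            self.summary = summarize(self.summary, oldest)

    def build_prompt(self, user_message: str) -> str:
        parts = [self.system_prompt]
        if self.summary:
            parts.append("Summary so far: " + self.summary)
        parts.extend(self.turns)
        parts.append(user_message)
        prompt = "\n".join(parts)
        # Enforce the budget before any model call is made.
        used = estimate_tokens(prompt)
        if used > self.token_budget:
            raise ValueError(f"prompt needs ~{used} tokens, budget is {self.token_budget}")
        return prompt
```

The point of the design is that the prompt size is bounded by the raw-turn window plus a fixed-length summary, rather than by the length of the session.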

Pattern 2: RAG and Knowledge AI (Hotspot: Retrieved Chunks)

The Hotspot: Retrieved chunks. The retrieval trap is intuitive: retrieve more to answer better. Economically, this fails. More chunks inflate the prompt payload regardless of relevance. The model processes more tokens. The answer is not necessarily better.

The Levers: Metadata pre-filtering at retrieval, not after; strict Top-K limits; semantic reranking before context assembly; context compression before the model call; and chunk deduplication in multi-query systems.
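A small sketch of how filtering, deduplication, and a strict Top-K can fit together during context assembly. The `Chunk` type, the `assemble_context` function, and the threshold values are hypothetical. In a real system the metadata filter would normally be pushed into the vector store query itself, and the `score` would come from a reranker; here both are simulated in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str    # metadata used for pre-filtering
    score: float   # relevance score, assumed to come from a retriever or reranker


def assemble_context(chunks, allowed_sources, top_k=3, min_score=0.5):
    """Return at most top_k distinct, relevant chunks from allowed sources."""
    # 1. Metadata filter and relevance floor.
    candidates = [
        c for c in chunks
        if c.source in allowed_sources and c.score >= min_score
    ]
    # 2. Deduplicate on normalized text, keeping the highest-scoring copy.
    best = {}
    for c in candidates:
        key = " ".join(c.text.lower().split())
        if key not in best or c.score > best[key].score:
            best[key] = c
    # 3. Rank by score and enforce the strict Top-K limit.
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked[:top_k]
```

Whatever the retriever returns, the prompt payload is capped at `top_k` chunks, so retrieval breadth no longer translates directly into token cost.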

Pattern 3: Agentic and Tool-Using AI (Hotspot: Execution Loops)

This is the Uber pattern. Agentic systems plan, call tools, observe results, and plan again. Each cycle adds tokens. Each loop feeds the next. At 5,000 engineers running parallel agent sessions across an enterprise codebase, the cost does not grow linearly. It compounds.

The Hotspot: Execution loops and raw tool outputs. By step five, the model carries the cumulative weight of every prior observation. Unstructured tool responses make this significantly worse.
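A back-of-the-envelope sketch shows why the loop is the hotspot. It assumes a simplified model in which the full context is resent on every step and each tool observation adds a fixed number of tokens; the function name and the figures are illustrative, not measurements.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_observation: int, steps: int) -> int:
    """Total input tokens billed across an agent run that resends its full context."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context                    # the whole context is sent again
        context += tokens_per_observation   # the new observation is appended
    return total


five_steps = cumulative_input_tokens(1000, 2000, 5)    # 25000
ten_steps = cumulative_input_tokens(1000, 2000, 10)    # 100000
```

Under these assumptions, doubling the number of steps roughly quadruples the input tokens, because each step pays again for every earlier observation.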

The Levers: Permission and risk classification before any tool is called; hard step limits and stop conditions; structured tool schema selection and dynamic state compression (such as schema extraction) between planning cycles; and parallel tool calling, where multiple independent tools execute simultaneously to reduce total loop iterations.
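Three of these controls can be sketched in a few lines: an allowlist check before a tool runs, a hard step limit, and compression of raw tool output down to a fixed set of fields. The `run_agent` loop, the `plan_step` callback, and the field names are hypothetical and stand in for whatever agent framework is in use.

```python
def compress_observation(raw: dict, keep_fields: tuple) -> dict:
    """Keep only the fields the planner needs; drop logs and other bulk."""
    return {k: raw[k] for k in keep_fields if k in raw}


def run_agent(plan_step, tools, max_steps=5, keep_fields=("status", "result")):
    """Run a plan/act loop with a hard step limit.

    plan_step(state) returns (tool_name, kwargs), or None when finished.
    tools maps allowed tool names to callables that return dicts.
    """
    state = []
    for _ in range(max_steps):
        action = plan_step(state)
        if action is None:
            return state, "done"
        tool_name, kwargs = action
        if tool_name not in tools:          # permission check before the call
            return state, "blocked"
        raw = tools[tool_name](**kwargs)
        state.append(compress_observation(raw, keep_fields))
    return state, "step_limit"              # stop condition reached
```

The returned status makes the stop reason explicit, which is useful when step-limit hits are tracked as a cost signal.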

Without these controls, agentic AI is the fastest path from pilot approval to budget crisis.

Pattern 4: Batch and Document AI (Hotspot: Volume and Retry Rate)

The Hotspot: Volume multiplied by retry rate. A ten percent validation failure rate on one million monthly documents generates 100,000 additional LLM calls, even if each failure is retried only once. The retry multiplier becomes the dominant cost line at scale.
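The arithmetic is easy to check. The sketch below uses a hypothetical helper, `monthly_llm_calls`, and assumes that retries fail at the same rate as first attempts, which is a simplification.

```python
def monthly_llm_calls(documents: int, failure_rate: float, max_retries: int = 1) -> int:
    """Total LLM calls when every failed document is retried up to max_retries times."""
    calls = documents
    failing = round(documents * failure_rate)
    for _ in range(max_retries):
        calls += failing                          # one retry call per failed document
        failing = round(failing * failure_rate)   # assume retries fail at the same rate
    return calls


one_retry = monthly_llm_calls(1_000_000, 0.10, max_retries=1)     # 1,100,000 calls
three_retries = monthly_llm_calls(1_000_000, 0.10, max_retries=3) # 1,111,000 calls
```

With a single retry the overhead is the 100,000 calls cited above. Allowing more retries, or a higher failure rate on hard documents, pushes the multiplier up further.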

The Levers: Preprocess before the model sees the document, route simple cases to cheaper models, apply deterministic validation to catch exceptions before triggering LLM retries, and audit and monitor retries continuously to catch failure patterns before they compound.
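As an illustration of routing and deterministic validation, the sketch below checks an extraction result with plain code, attempts a cheap deterministic repair, and only then falls back to an LLM retry. The field names, the model labels `small-model` and `large-model`, the size threshold, and the retry cap are all assumptions made for the example.

```python
import re

REQUIRED_FIELDS = ("invoice_id", "total")


def choose_model(document: str, small_limit: int = 2000) -> str:
    """Route short documents to a cheaper model (labels are placeholders)."""
    return "small-model" if len(document) <= small_limit else "large-model"


def validate(extracted: dict) -> list:
    """Deterministic checks; returns a list of error strings (empty means valid)."""
    errors = [f"missing {k}" for k in REQUIRED_FIELDS if not extracted.get(k)]
    total = extracted.get("total")
    if total and not re.fullmatch(r"\d+(\.\d{2})?", str(total)):
        errors.append("total is not a plain decimal amount")
    return errors


def repair(extracted: dict) -> dict:
    """Cheap fix-ups that do not need a model call."""
    fixed = dict(extracted)
    if isinstance(fixed.get("total"), str):
        fixed["total"] = fixed["total"].replace("$", "").replace(",", "").strip()
    return fixed


def next_action(extracted: dict, attempts: int, max_attempts: int = 2) -> str:
    """Decide what to do with an extraction result after `attempts` LLM calls."""
    if not validate(extracted):
        return "accept"
    if not validate(repair(extracted)):
        return "accept_after_repair"    # fixed without another LLM call
    return "retry_llm" if attempts < max_attempts else "human_review"
```

Every case resolved by `repair` is a retry that never reaches the model, and the cap on attempts keeps a hard document from looping indefinitely.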

"The prompt is where teams start optimizing. The architecture is where the cost actually compounds."

The Universal Rule Across All Four Patterns

Minimize what you send. Cache what repeats. Cap what comes back.

Then measure it like the production metric it is: cost-per-inference, token budget adherence, retry rate, output length, and latency. These belong on the same operational dashboard as availability and error rate. They are architecture health signals, not finance inputs.
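As a rough sketch, these signals can be computed from per-call records. The `CallRecord` fields, the `summarize_calls` function, and the per-1,000-token prices passed in are hypothetical; real prices vary by provider and model, and latency is omitted here for brevity.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    workload: str
    input_tokens: int
    output_tokens: int
    retried: bool
    budget_tokens: int   # token budget that applied to this call


def summarize_calls(records, price_in_per_1k: float, price_out_per_1k: float) -> dict:
    """Aggregate cost and budget signals for a non-empty list of call records."""
    n = len(records)
    total_cost = sum(
        r.input_tokens / 1000 * price_in_per_1k
        + r.output_tokens / 1000 * price_out_per_1k
        for r in records
    )
    within_budget = sum(
        1 for r in records if r.input_tokens + r.output_tokens <= r.budget_tokens
    )
    return {
        "cost_per_inference": total_cost / n,
        "budget_adherence": within_budget / n,
        "retry_rate": sum(1 for r in records if r.retried) / n,
        "avg_output_tokens": sum(r.output_tokens for r in records) / n,
    }
```

Grouping the records by `workload` before calling the function gives the per-workload view that the four patterns call for.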

The shift from seat-based to token-based pricing is not a billing detail. It is a governance model change. Seat-based software costs are predictable and linear. Token-based AI costs are consumption-driven and compound with adoption. Most enterprise finance teams are still using the wrong model to budget for the wrong pricing structure.

What Executive Leaders Must Recognize

Consumption-based pricing that looks manageable in pilots becomes ruinous at company-wide scale. Uber's engineers were not doing anything wrong. They were using a productive tool exactly as designed. The problem was that nobody connected adoption velocity to budget exposure before the bill arrived.

AI programs without token budgets are not programs. They are open-ended spending commitments.

The CFO conversation has arrived. The questions are specific: What is the cost-per-inference per workload? What is the token budget per use case? Which teams own AI spend and what are their limits? Organizations that cannot answer these questions will not just face budget scrutiny. They will face budget freezes.

Token cost accumulates pattern by pattern, workload by workload, team by team, until the quarterly infrastructure report surfaces a number that requires an explanation nobody prepared. The organizations that built cost discipline into architecture early will be in a materially better position when that conversation happens.

Six Decisions AI Leaders Need Before the CFO Asks

1. Map every active AI workload to its architecture pattern: chatbot, RAG, agentic, or batch. Treat each differently. Agentic workloads carry the highest budget risk at scale.
2. Identify the dominant token hotspot for each workload using production telemetry, not pilot estimates. Pilot data consistently underestimates production cost.
3. Establish cost-per-inference baselines before the next budget cycle. Without a baseline, optimization has no reference point and no defensibility.
4. Implement token budgets per request, session, and use case at the architecture level, not through manual review after spending occurs.
5. Add retry rate, output length, and token consumption to your AI operations dashboard alongside latency and error rate.
6. Assign cost ownership by use case and team. The Uber lesson: the team driving adoption and the team managing spend must be in the same room. Without explicit chargeback or showback, no one is accountable.
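Budgets at the architecture level and ownership by team can share one enforcement point. The sketch below is one possible shape, not a reference design: the `TokenBudget` class, its limits, and the identifiers are all hypothetical. A call is authorized only if it fits the per-request, per-session, and per-use-case limits, and usage is attributed to a team for showback.

```python
from collections import defaultdict


class TokenBudget:
    """Checks a call against three limits and records usage per team."""

    def __init__(self, per_request: int, per_session: int, per_use_case: int):
        self.per_request = per_request
        self.per_session = per_session
        self.per_use_case = per_use_case
        self.session_used = defaultdict(int)
        self.use_case_used = defaultdict(int)
        self.team_used = defaultdict(int)    # basis for showback or chargeback

    def authorize(self, tokens: int, session_id: str, use_case: str, team: str) -> bool:
        """Return True and record usage if the call fits every limit."""
        if tokens > self.per_request:
            return False
        if self.session_used[session_id] + tokens > self.per_session:
            return False
        if self.use_case_used[use_case] + tokens > self.per_use_case:
            return False
        self.session_used[session_id] += tokens
        self.use_case_used[use_case] += tokens
        self.team_used[team] += tokens
        return True

    def showback(self) -> dict:
        """Tokens consumed per team so far."""
        return dict(self.team_used)
```

Calling `authorize` before each model call means the limit is enforced before the spend occurs, and `showback` gives each team a number it owns.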

Which architecture pattern in your AI portfolio has no token budget defined today? That answer tells you where to start.

I expand this framework in the full Substack article, including deeper implementation guidance for each pattern, telemetry design, and the operating cost implications at enterprise volume. Link in bio.

"AI is not a technology problem. It is an operating model problem."
