List price per token is published. Realized cost per useful output is not. This report reads the gap across the five major model vendors and gives buyers a working benchmark for the 2026 AI bill.
List price per token is published. Realized cost per useful output, after gateway markup, model routing, retries, and fine tuning, is not. This report reads the gap across the five major model vendors and gives buyers a working benchmark.
About this report
The Redress Compliance AI Token Economics Report is a directional benchmark, not a price list. It draws on three inputs.
We report bands and directions, not precise rates. Where a single number appears, treat it as the middle of a range rather than a guarantee. Where vendor list moves, the bands move with it.
The published API list is the floor, not the middle of the band. At enterprise volume on frontier tier models, realized blended cost lands between $3 and $12 per million tokens once input and output are mixed in production ratios and the gateway layer is included.
The spread is wide for a simple reason. Two buyers can pay the same rate card and end up two or three times apart on cost per outcome because routing, prompt size, retries, and tier choice do more work than the headline rate ever does.
The vendor list cards have converged in shape and parted in detail. Tier mix, context window, and per token output rate now matter more than the headline input rate. The published rates are the input, not the answer.
At production volume, the realized cost per blended million tokens runs in three bands: roughly $0.50 to $2 on small or mini models for chat, $2 to $5 on midweight reasoning models for most enterprise tasks, and $5 to $12 on frontier tier for agentic and complex reasoning workloads.
These are blended bands. A single workload can sit cleanly inside one. A mixed estate that uses small models for retrieval and frontier tier for synthesis will span all three and weight the bill toward the most expensive layer.
Two buyers on the same vendor and the same model rarely pay the same realized rate. The drivers are not the rate card. They are token count discipline, model selection per task, retry policy, and whether the gateway layer is a thin pass through or a managed wrapper with its own markup.
Buyers who treat each request as a costed transaction land in the lower half of the band. Buyers who treat tokens as free until the invoice arrives land in the upper half, often double their plan, and rarely connect the two until a routing audit forces the math into the open.
Most enterprises do not call the model APIs directly in production. A gateway, a reseller, or a managed wrapper sits in between. That layer has its own price model, and it almost always adds a measurable percentage to the raw API rate.
In the contracts we read, gateway and reseller markups cluster in three bands. A thin pass through sits at 5 to 10 percent above raw API. A standard managed gateway lands at 15 to 25 percent. A managed wrapper with a feature layer can run 30 to 50 percent or more above raw API, sometimes much more.
A thin pass through bills at API rate plus a small platform fee. The buyer keeps the routing decisions and the prompt structure. Markup is low because the layer adds little beyond authentication and observability.
This is the cleanest structure for an enterprise that already has the engineering capacity to manage models directly. The cost lever stays with the buyer rather than the platform vendor.
A managed gateway adds routing, retries, caching, and policy on top of the raw API. The 15 to 25 percent markup pays for those features and for an opinionated platform. Buyers who use the features see real return. Buyers who do not pay the same markup for a wrapper they rarely call.
Read the actual call distribution before signing. If 80 percent of traffic flows through one route with no advanced policy, the managed layer is providing less than the rate suggests, and a thinner gateway would be cheaper without losing capability.
The highest markups sit on managed wrappers that bundle the model with a UI, a workflow, or a vertical capability. Markup ratios above 50 percent are common on these contracts, sometimes much higher when the feature layer is priced per outcome rather than per token.
That is not always wrong. A workflow vendor that handles compliance, audit, and grounding for a regulated industry can earn the premium. The buyer side question is whether the premium is being paid for the workflow or being silently applied to underlying token volume that does not need it.
Hyperscaler enterprise agreements bury part of the markup inside committed spend tiers and credit mechanics rather than a published rate. The cost still moves, but it shows up as slower credit burn or tighter commit utilization rather than a per token figure on an invoice.
Treat these mechanisms as part of the realized rate. A commit that consumes faster than planned is a price increase even if the rate card never changed, because the buyer reaches the next overage tier sooner.
Gateway and reseller markup bands above raw API list
| Gateway shape | Typical markup band | What the markup pays for | Where the buyer sits |
|---|---|---|---|
| Thin pass through | 5 to 10% | Auth, logging, basic policy | Buyer keeps routing and prompt control |
| Standard managed gateway | 15 to 25% | Routing, retries, caching, observability | Buyer trades flexibility for features |
| Managed wrapper with workflow | 30 to 50%+ | UI, workflow, vertical features, compliance | Vendor controls most of the stack |
| Hyperscaler EA, embedded | Variable, in credit burn | Commit tier mechanics and credits | Hidden in commit utilization math |
Commit tiers differ more in structure than in headline rate. Each vendor offers a discount path against a multi year, multi million dollar commit, but the shape of the commit and the flexibility across models inside it can move the realized cost by a wider margin than the published rate.
The right question is not which vendor offers the lowest list. It is which commit structure best matches the buyer workload mix over the term. A 20 percent discount on a tier the buyer cannot use is worth less than a smaller discount on the tier that holds the workload.
OpenAI commits trade list price for multi year volume. The Azure variant adds enterprise agreement terms, regional residency, and the credit mechanic that comes with a hyperscaler commit. For most enterprises with an existing Microsoft EA, the Azure path is the dominant deal shape.
The trade is flexibility. Once a model and a tier are locked into the commit, swapping into a newer or cheaper model can be slower than on a direct API account, particularly when the commit was structured against a specific deployment family.
Anthropic prices Claude on a direct API and through Bedrock. The economic difference is the commit structure: Bedrock places Claude inside the hyperscaler commit and inherits the AWS commercial framework. Direct contracts are simpler but rarely match Bedrock on overall discount when the buyer already has an AWS commit.
The buyer side test is whether the AWS commit absorbs Anthropic workload at a useful discount, or whether Anthropic direct gives a better headline rate without giving up the cross workload flexibility a cloud commit provides.
Vertex sits inside the broader Google Cloud commit. Discounts compound across Gemini, third party models, and underlying compute, which can be a real advantage for buyers with multi service workloads. For single workload buyers the picture is closer to the published list.
The Vertex grounding line is the place to pay attention. It is priced separately and can dominate the bill for retrieval heavy workloads regardless of the model rate negotiated.
Bedrock prices per model and per provisioned throughput. The provisioned throughput option changes the unit economics for any workload with predictable peak demand. At consistent load, provisioned can land well below on demand per token rates. At spiky load it almost always loses money.
Buyers who can characterize their workload precisely benefit from provisioned. Buyers who cannot should price the on demand path and accept the higher per token rate as the cost of optionality.
The clauses that move realized cost most are not in the rate card. They sit in the commit terms: reallocation across models, true ups, swap rights into newer model versions, and termination for convenience on the wrapper layer when the underlying model is replaced.
A 5 percent discount with no reallocation right is worth less than a 3 percent discount with full reallocation. The buyer who reads only the headline pays for the difference.
The standard reseller advice is to compare per token list rates and pick the cheapest. We disagree. Across roughly 25 to 35 enterprise AI workloads we benchmarked in 2024 and 2025, realized cost per useful output depended on gateway markup, routing, retry waste, and prompt discipline far more than on the headline rate. We saw 2 to 4x gaps in cost per outcome between stacks that had chased the cheapest list and stacks that paid a higher rate but routed well. The buyer side move is to benchmark per outcome at production volume, negotiate the commit and routing terms that compound, and accept that the cheapest sticker often ships the most expensive bill.
Source: Redress Compliance advisory engagement file, 2024 to 2025.
Fine tuning and routing change the bill more than the list rate does. They are also the two levers least likely to be modeled in a buyer business case, because both are usually treated as engineering decisions rather than commercial ones.
That separation is a mistake. Routing decisions move realized cost by double digit percentages on most workloads. Fine tuning, used well, reduces token consumption per task. Used poorly, it locks the buyer to a vendor with no measurable quality gain.
Routing chooses which model serves which request. A request that needs frontier tier reasoning should not run on a frontier tier model end to end if a small model can do the retrieval and a midweight model can do the synthesis. Stacks that route well typically sit one tier cheaper on average than stacks that do not.
The savings are real and recurring. A routing change that drops average cost per request from $0.012 to $0.006 compounds across millions of requests per quarter. That is more than any discount negotiation will deliver on the same volume, and it costs the buyer nothing but engineering attention.
Retries are the quiet cost. A failed call, a malformed response, or a tool call timeout produces a retry that consumes tokens with no user value. On agent workloads, retry and reroute waste runs 10 to 25 percent of total token volume in the deployments we audit.
Some retry is unavoidable. Most can be cut by better prompt structure, better tool schemas, and stricter output validation. None of it appears on a rate card, and all of it appears on the invoice.
Fine tuning is sold as a quality move. It can also be a cost move. A small model fine tuned for a specific task can outperform a frontier tier model on that task at a fraction of the per token cost, sometimes more than tenfold cheaper at production volume.
The catch is that fine tuning is sticky. A tuned model creates switching cost into the vendor. Buyers who use fine tuning as a cost lever should be deliberate about which workloads earn it and should keep base model parity as a fallback.
Vendor prompt caching, where supported, lowers the realized rate on repeated context substantially. For agent workloads with stable system prompts, this can take 30 to 70 percent of the input token cost out of the bill. It is also the lever most often left switched off in early deployments.
If the vendor offers caching, enable it, measure the hit rate, and design prompts to maximize it. It is the lowest effort, highest return cost lever in the stack.
The vendors publish the rate. The buyer pays the route. In 2026, the bill is set by the route, not the rate.
A token is not a unit a business plans around. A useful output is. Benchmarking on cost per token optimizes a number no executive can act on. Benchmarking on cost per useful output, against a clear quality bar, optimizes the bill the business actually pays.
The unit of useful output depends on the workload. For a support assistant it is a resolved conversation. For an analyst agent it is a completed research brief. For a coding assistant it is an accepted pull request. The unit must be specific to the job, because cost only means anything against value.
Define the unit first. Without an agreed definition of what counts as a useful output, the cost comparison drifts into per token math and ends back where it started. Spend the time on the definition and the benchmark becomes easy to run.
This is the part most enterprises skip. The result is a procurement conversation about per token rates and an operations conversation about per output cost, with no shared number that lets finance, engineering, and procurement agree on the bill.
Cost per output only means something at a fixed quality bar. A 50 cent answer that fails review is more expensive than a one dollar answer that ships. The quality bar is what makes the cost numbers comparable across vendors and stacks.
Buyers who fix the bar first and compare cost second land on the right vendor. Buyers who reverse that order land on the cheapest list and the highest realized bill.
Pilot economics rarely match production economics. Caching hit rates, retry waste, and tier mix all shift at scale. The benchmark only matters at the volume the business will run, which means running it long enough and broadly enough to see the steady state.
This is uncomfortable for vendors. It is the most useful thing a buyer can do, because the only bill that matters is the steady state one. Pilot pricing is a window. Production pricing is the floor.
The realized cost per outcome moves with workload type more than with vendor. Chat, agents, and batch sit in three different bands. Treating them as one category guarantees the buyer pays the agent rate on chat traffic, which is the most common form of overpaying.
Chat is the cheapest workload to run. Inputs are short, outputs are short, retries are rare, and context reuse is high. Cost per session typically sits in single cents on midweight models, less on small models with caching.
This is the workload that hides waste. Because each session is cheap, buyers run frontier tier models that the workload does not need. A move to midweight or small models on chat usually reduces cost by half or more with no measurable quality change.
Agent workloads are the most expensive band. Tool calls multiply token consumption, retries are common, and end to end reasoning chains can run frontier tier model calls many times per outcome. Cost per outcome on agents can sit two orders of magnitude above chat.
The cost lever on agents is structural. Better tool schemas, stricter output formats, prompt caching on stable system prompts, and routing the reasoning step to a frontier model only where needed. The buyer who treats agents as a chat workload with extras will overpay every quarter.
Batch sits in the middle. Long inputs and long outputs raise token volume, but vendor batch tiers and provisioned throughput often reduce per token rate. For predictable batch with high concurrency, this can land below standard on demand rates by a meaningful margin.
The math depends on consistency. Provisioned throughput rewards steady load and punishes spikes. Buyers with truly predictable batch workloads should price it. Buyers with mixed load should not.
Most enterprises run all three. The mistake is to price the estate as one number rather than three. A workload aware view of the bill, with chat, agent, and batch separated, exposes which line is growing fastest and which is the easiest to optimize.
Procurement should ask for the workload split before signing any vendor commit. The split usually tells the buyer which model tier and which discount path is the right one, and where the next quarter cost surprise is going to come from.
Prompt caching is the cheapest cost lever in the stack. When the vendor supports it, repeated input context is billed at a small fraction of the standard rate. For agent workloads with a stable system prompt, it can take 30 to 70 percent of the input token cost out of the bill.
The catch is that caching is not on by default in every stack. Each vendor implements it differently, and the hit rate depends on prompt structure as much as on the cache layer itself. Buyers who design prompts to maximize cache hits see the biggest savings.
OpenAI, Anthropic, and Google publish caching mechanics with different price points and TTLs. The differences matter for routing decisions, because a workload that caches well on one vendor may not cache as well on another, which shifts the realized rate.
The buyer side test is to measure the cache hit rate on representative traffic for each vendor before locking a commit. A vendor with a lower headline rate and a worse caching outcome can be more expensive in practice.
Cache hit rate depends on stable prefixes. Variable content at the start of a prompt invalidates the cache for everything that follows, so the order matters as much as the content. The pattern is stable system context first, then variable user content.
Teams that adopt this pattern see hit rates above 70 percent on common agent workloads. Teams that do not often see hit rates below 20 percent on the same traffic. The same vendor, the same rate card, very different bills.
Cache effectiveness drifts as workloads evolve. A workload that caches well today may not cache well next quarter if the prompt structure changes. Quarterly measurement keeps the lever working and surfaces regressions before the bill does.
Treat the cache hit rate as a KPI alongside cost per outcome. They move together. A falling hit rate without a falling outcome rate is a signal that the workload has changed and the prompts need to be redesigned, not that the model needs to be swapped.
The AI token bill grows quietly. A small change in prompt structure, a new tool call in an agent loop, or a routing decision pushed by an engineering team can move realized cost by double digit percentages without anyone in procurement seeing the change. Governance is the discipline that closes that gap.
The point is not to block engineering. It is to make the cost visible to the people who plan the budget. A governance layer that surfaces cost per outcome in real time keeps the conversation honest, and it lets finance plan AI spend with the same confidence as any other software line.
One person should own the AI token bill the way procurement owns a vendor renewal. The role usually sits in software asset management or a FinOps function, with measurement on realized cost per useful output rather than on shipping speed.
That measure matters. An owner rewarded for keeping engineering happy approves whatever the team asks for. An owner rewarded for realized cost asks the routing and caching questions every quarter, which is how the bill stays controlled.
The bill must split by workload, not just by vendor. Chat, agents, and batch each carry different unit economics, and a single line item that hides the split is a budget that cannot be planned.
Most gateways and platforms support per workload tagging. Turn it on, set the tags before the workload ships, and report by workload to finance. The visibility itself changes behavior, because engineers who can see the cost per workload route differently than engineers who cannot.
AI spend should sit on its own line, not bundled into the cloud bill or the EA. A separate line forces a separate plan, and a separate plan means the budget owner can defend or grow the line on the merits of the workload rather than as a cloud overrun.
This is the discipline that turns AI from a quiet cost growth into a managed software category. Without it, finance approves the cloud bill in aggregate and discovers the AI growth only after the year is closed.
AI commits should run on the same renewal cadence as other major software contracts, with the same 9 to 12 month preparation window. Treating an AI commit as an engineering decision rather than a procurement decision is how buyers lock in rates and structures that were right for last year and not for next.
Bring the calendar forward. Audit the workload mix in month three of the commit, not in month eleven. The earlier the audit, the more options the buyer has at renewal.
Up on volume, calmer on rate, hotter on routing. We expect token volume to keep growing as agent workloads move into production, while published list rates settle into a slower cadence. The growth in the bill will hide inside routing decisions and wrapper layers rather than rate card moves.
That makes 2027 harder to read from a published price tracker alone. The story will be how the workload is routed, not what the model costs per token. Buyers who plan around list will miss the real movement, which is in commit utilization, wrapper fees, and the agent stack.
Headline per token rates are now low enough that further list cuts produce small absolute savings on most workloads. We expect vendors to compete more on capability and on commit structure than on rate, with frontier tier output rates the only line likely to move materially.
This is good news for buyers who already focus on routing and outcome. It is bad news for buyers who are still benchmarking on list, because the lever they are pulling is the one with the least left to give.
Agent workloads will absorb most of the new token volume. They are inherently more expensive per outcome than chat or batch, and they generate retry and reroute waste that compounds with deployment depth. The agent stack is where 2027 cost growth will land.
Buyers who put the discipline of cost per outcome around agent workloads this year will keep the line bounded. Buyers who treat agents as a chat extension will see the bill compound quietly until a renewal forces the math into the open.
Vendor commits will get more complex, not simpler. Reallocation rights, swap rights, and termination clauses will carry more value because the model landscape continues to shift inside the term. The buyers who negotiate flexibility will outperform the buyers who chase the lowest committed rate.
The 2026 lesson holds into 2027. A smaller discount with full reallocation across the vendor catalog usually beats a larger discount on a single model family that is one release away from being replaced.
The managed wrapper market is crowded. We expect consolidation in 2027 as buyers tire of paying 30 to 50 percent above raw API for layers that did not earn the premium. Wrappers that survive will be the ones that own a workflow or a compliance posture, not the ones that mostly route traffic.
Buyers signing a wrapper today should price the exit, not just the entry. A wrapper that is a market leader this year may be a legacy line next year, and the contract that locks in pricing without locking in an exit will look very different in eighteen months.
Sector matters less for the rate card and more for the workload mix. Regulated industries run more retrieval and audit heavy workloads. Technology firms run heavier agent workloads. The same rate card produces different bills.
Financial services workloads concentrate in retrieval and summarization with strict audit and data residency requirements. The realized rate sits above the median because of grounding cost and provisioned throughput on dedicated infrastructure, but the absolute volume per worker is smaller than in technology firms.
The buyer side priority is data terms and residency on the commit, not headline rate. A 5 percent better rate on a contract that does not meet the regulator posture is not a usable discount.
Life sciences workloads pair heavy research summarization with strict data control. Grounding cost and dedicated capacity dominate the bill, and the rate card is rarely the lever that moves the line. The lever is the data control posture and the model choice that meets it.
Sector specific wrappers earn their markup more often here than in other sectors, because the compliance and audit capabilities are real. Buyers should still price the underlying tokens separately, to make the wrapper premium visible.
Technology firms run the heaviest agent workloads and the highest token volume per worker. They also have the most engineering capacity to route, cache, and fine tune, which means realized cost per outcome can land below median despite higher volume.
The 2026 best in class technology buyers pay less per outcome than buyers in any other sector at the same scale, because they treat the token bill as a product engineering problem rather than only a procurement problem. The lesson transfers, but it requires the engineering capacity to apply.
Public sector buyers face procurement structures that do not match short term AI commit cycles. Multi year framework agreements are common, and the residency and compliance terms are heavier than in any other sector. The realized rate often sits above the commercial median because of those constraints.
The right move is to negotiate the residency and compliance terms first and accept that the rate will sit above the broader market. Trying to match commercial pricing without matching the terms is a path that leads to procurement disputes rather than a usable contract.
At enterprise volume, realized cost per million blended tokens runs roughly $3 to $12 on frontier tier models, $2 to $5 on midweight reasoning, and $0.50 to $2 on small models. The API list is the floor of the band. Gateway markup, routing, and retries push the realized number above the rate card.
Because the API list is the raw rate and the bill is the raw rate plus gateway markup, retry waste, routing overhead, and any managed wrapper fee. Most enterprises pay 15 to 35 percent above raw API once the gateway layer is included, and more again once retries and reroutes are counted.
Gateways add a measurable percentage above raw API list. Thin pass through gateways sit at 5 to 10 percent. Standard managed gateways land at 15 to 25 percent. Managed wrappers with workflow features can run 30 to 50 percent or more. Read the gateway contract as a separate cost layer, not as part of the model price.
OpenAI, Anthropic, Google Vertex, Azure OpenAI, and AWS Bedrock each offer a multi year volume commit path. The differences are in structure rather than headline rate. Reallocation across models, swap rights into newer versions, true up terms, and termination for convenience matter more than the published discount.
It can, when used as a cost lever rather than only a quality lever. A small model fine tuned for a specific task often outperforms a frontier tier model on that task at a fraction of the per token cost. The trade is switching cost, because a tuned model is sticky to the vendor that hosts it.
Define the unit of useful output first, then hold the quality bar fixed, then measure cost at production volume. Per token math is misleading because it ignores routing, retries, and wrapper layers. Cost per resolved conversation, completed brief, or accepted change is the only number a business can plan around.
Chat is the cheapest workload band, sitting in single cents per session on midweight or small models. Agents are the most expensive, often two orders of magnitude above chat per outcome because of tool calls and retries. Batch sits in the middle, and provisioned throughput can make it cheaper than on demand at steady load.
Negotiate the structural terms before the headline discount. Reallocation across models, swap rights into newer versions, termination for convenience on the wrapper layer, and a clean exit from the gateway are the clauses that move realized cost the most. A smaller discount with these rights usually beats a larger discount without them.
Hyperscaler enterprise agreements often bury part of the AI markup in commit utilization rather than a published rate. Faster credit burn against a fixed commit is a price increase even when the rate card is unchanged, because the next overage tier arrives sooner. Read the commit math, not just the model line.
Start with routing and prompt caching. Both move realized cost by double digit percentages, neither requires a vendor negotiation, and both compound across every quarter the workload runs. After that, audit retry waste on agent workloads and consider fine tuning where the volume earns the switching cost.
The vendor by vendor commit shape, the gateway markup bands, the routing decision matrix, the prompt caching guide, and the contract clause checklist that holds realized cost down.
Used across more than five hundred enterprise engagements. Independent. Buyer side. Built for procurement, finance, and engineering leaders sizing the 2027 AI budget.
The vendors publish the rate. The buyer pays the route. In 2026, the bill is set by the route, not the rate.