AI Token Economics Report 2026: Real Cost

List price per token is published. Realized cost per useful output, after gateway markup, model routing, retries, and fine tuning, is not. This report reads the gap across the five major model vendors and gives buyers a working benchmark.

The report at a glance

$3 to $12

Realized cost per million blended tokens at enterprise volume, frontier tier

15 to 35%

Typical gateway and reseller markup above raw API list

2 to 4x

Cost per useful output gap between cheapest list and best routed stack

Outcome

The benchmark unit. Not the token.

Key takeaways

Enterprises pay roughly $3 to $12 per million blended tokens at production volume on frontier tier models. The list rate is the floor of that band, not the middle.
Gateway and reseller layers add 15 to 35 percent above raw API list on most enterprise contracts we benchmark, sometimes more on managed wrappers.
Commit tiers across OpenAI, Anthropic, Google Vertex, Azure OpenAI, and AWS Bedrock differ more in discount structure and term than in headline rate.
Fine tuning and routing change the bill more than the list rate does. Retry waste alone can run 10 to 25 percent of token consumption on agent workloads.
The cheapest list price often ships the most expensive bill. Cost per useful output, not cost per token, is the only benchmark a CFO can act on.
Workload type drives most of the realized cost. Chat is the cheapest, agents are the most expensive, and batch sits in the middle.
The buyer side move is to benchmark per outcome at production volume, negotiate the commit and routing terms that compound, and treat the gateway layer as a contract, not a tool.

About this report

The Redress Compliance AI Token Economics Report is a directional benchmark, not a price list. It draws on three inputs.

Our advisory engagement file. Anonymized, aggregated ranges from enterprise AI gateway reviews, model contract negotiations, and routing audits supported between 2024 and 2025.
Vendor public pricing. The dated, on the record per token rates and commit tiers published by each model vendor, cited inline throughout the report.
A benchmarking panel. A rolling set of comparable enterprise AI workloads used to separate raw API list from realized cost per outcome.

We report bands and directions, not precise rates. Where a single number appears, treat it as the middle of a range rather than a guarantee. Where vendor list moves, the bands move with it.

What do enterprises actually pay per million tokens at scale?

The published API list is the floor, not the middle of the band. At enterprise volume on frontier tier models, realized blended cost lands between $3 and $12 per million tokens once input and output are mixed in production ratios and the gateway layer is included.

The spread is wide for a simple reason. Two buyers can pay the same rate card and end up two or three times apart on cost per outcome because routing, prompt size, retries, and tier choice do more work than the headline rate ever does.
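
The blended figure can be reproduced with simple arithmetic. The sketch below is a minimal illustration, not a pricing model: the input rate, output rate, output share, and markup are all hypothetical values chosen for the example, and none of them is a vendor quote.

```python
def blended_rate(input_rate: float, output_rate: float, output_share: float) -> float:
    """Blend per-million input and output rates by the share of output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_rate * (1.0 - output_share) + output_rate * output_share


def realized_rate(list_blend: float, gateway_markup: float) -> float:
    """Apply a gateway markup (0.20 means 20 percent) to the blended list rate."""
    return list_blend * (1.0 + gateway_markup)


# Hypothetical frontier tier rates in dollars per million tokens (illustrative only).
list_blend = blended_rate(input_rate=3.0, output_rate=15.0, output_share=0.25)
print(f"blended list: ${list_blend:.2f} per million")
print(f"realized with 20% markup: ${realized_rate(list_blend, 0.20):.2f} per million")
```

With these assumed inputs the blended list rate is $6.00 per million and the realized rate after a 20 percent markup is $7.20, before any retry or routing overhead is counted.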

The published list, vendor by vendor

The vendor list cards have converged in shape and parted in detail. Tier mix, context window, and per token output rate now matter more than the headline input rate. The published rates are the input, not the answer.

OpenAI: the published API pricing spans cheap small models in the low cents per million range up through frontier tier output rates in the tens of dollars per million.
Anthropic: the Claude rate card mirrors the same shape, with a clear gap between fast tier and frontier tier output.
Google Vertex: Vertex pricing is published per character and per token across Gemini tiers, with a separate line for grounding and tool calls.
AWS Bedrock: Bedrock pricing publishes per model rates across the catalog, plus a provisioned throughput line that changes the math at scale.
Azure OpenAI: the same OpenAI models priced inside the Azure commercial framework, with enterprise agreement terms that often replace the public rate at scale.

The realized rate at enterprise volume

At production volume, the realized cost per blended million tokens runs in three bands: roughly $0.50 to $2 on small or mini models for chat, $2 to $5 on midweight reasoning models for most enterprise tasks, and $5 to $12 on frontier tier for agentic and complex reasoning workloads.

These are blended bands. A single workload can sit cleanly inside one. A mixed estate that uses small models for retrieval and frontier tier for synthesis will span all three and weight the bill toward the most expensive layer.
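
The weighting effect in a mixed estate can be sketched as follows. The token volumes and per tier rates are hypothetical, picked from inside the three bands above purely for illustration.

```python
# Hypothetical monthly estate: (tier, millions of tokens, realized $ per million).
estate = [
    ("small", 800.0, 1.0),
    ("midweight", 300.0, 3.5),
    ("frontier", 200.0, 9.0),
]


def bill_by_tier(rows):
    """Return {tier: dollars} for each (tier, millions_of_tokens, rate) row."""
    return {tier: volume * rate for tier, volume, rate in rows}


def estate_blended_rate(rows):
    """Volume weighted realized rate across the whole estate."""
    total_tokens = sum(volume for _, volume, _ in rows)
    total_cost = sum(volume * rate for _, volume, rate in rows)
    return total_cost / total_tokens


costs = bill_by_tier(estate)
total = sum(costs.values())
for tier, cost in costs.items():
    print(f"{tier}: ${cost:,.0f} ({cost / total:.0%} of bill)")
print(f"estate blended rate: ${estate_blended_rate(estate):.2f} per million")
```

In this assumed mix the frontier tier carries about 15 percent of the tokens and roughly half of the bill, which is the weighting the paragraph above describes.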

Why two buyers on the same vendor pay differently

Two buyers on the same vendor and the same model rarely pay the same realized rate. The drivers are not the rate card. They are token count discipline, model selection per task, retry policy, and whether the gateway layer is a thin pass through or a managed wrapper with its own markup.

Buyers who treat each request as a costed transaction land in the lower half of the band. Buyers who treat tokens as free until the invoice arrives land in the upper half, often at double their planned spend, and rarely connect the two until a routing audit forces the math into the open.

Blended realized cost per million tokens by model tier at enterprise volume, compared to the equivalent raw API list. Realized includes gateway markup, retries, and routing overhead. Source: Redress Compliance advisory engagement file, 2024 to 2025.

How does gateway and reseller markup move the realized rate?

Most enterprises do not call the model APIs directly in production. A gateway, a reseller, or a managed wrapper sits in between. That layer has its own price model, and it almost always adds a measurable percentage to the raw API rate.

In the contracts we read, gateway and reseller markups cluster in three bands. A thin pass through sits at 5 to 10 percent above raw API. A standard managed gateway lands at 15 to 25 percent. A managed wrapper with a feature layer can run 30 to 50 percent or more above raw API, sometimes much more.

Thin pass through gateways

A thin pass through bills at API rate plus a small platform fee. The buyer keeps the routing decisions and the prompt structure. Markup is low because the layer adds little beyond authentication and observability.

This is the cleanest structure for an enterprise that already has the engineering capacity to manage models directly. The cost lever stays with the buyer rather than the platform vendor.

Managed gateways

A managed gateway adds routing, retries, caching, and policy on top of the raw API. The 15 to 25 percent markup pays for those features and for an opinionated platform. Buyers who use the features see real return. Buyers who do not use them pay the same markup for a layer they rarely exercise.

Read the actual call distribution before signing. If 80 percent of traffic flows through one route with no advanced policy, the managed layer is providing less than the rate suggests, and a thinner gateway would be cheaper without losing capability.
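
One way to read the call distribution is to count requests per route from a gateway log export. The sketch below assumes a hypothetical log reduced to one route name per request; real gateway exports will have their own schema and field names.

```python
from collections import Counter


def route_shares(route_log):
    """Return each route's share of total calls, largest first."""
    counts = Counter(route_log)
    total = sum(counts.values())
    return [(route, n / total) for route, n in counts.most_common()]


# Hypothetical gateway log: one route name per request.
log = ["default"] * 80 + ["fallback"] * 15 + ["policy_review"] * 5

shares = route_shares(log)
for route, share in shares:
    print(f"{route}: {share:.0%} of calls")

top_route, top_share = shares[0]
if top_share >= 0.80:
    print(f"'{top_route}' carries most traffic; check which managed features it actually uses")
```

The 80 percent threshold mirrors the example in the text and is a starting point for review, not a rule.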

Managed wrappers with feature layers

The highest markups sit on managed wrappers that bundle the model with a UI, a workflow, or a vertical capability. Markup ratios above 50 percent are common on these contracts, sometimes much higher when the feature layer is priced per outcome rather than per token.

That is not always wrong. A workflow vendor that handles compliance, audit, and grounding for a regulated industry can earn the premium. The buyer side question is whether the premium is being paid for the workflow or being silently applied to underlying token volume that does not need it.

The quiet markups inside an EA

Hyperscaler enterprise agreements bury part of the markup inside committed spend tiers and credit mechanics rather than a published rate. The cost still moves, but it shows up as slower credit burn or tighter commit utilization rather than a per token figure on an invoice.

Treat these mechanisms as part of the realized rate. A commit that burns down faster than planned is a price increase even if the rate card never changed, because the buyer reaches the next overage tier sooner.
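
The burn effect is easy to make visible. This is a minimal sketch assuming a constant monthly burn and a hypothetical 12 month, $1.2 million commit; real commits often carry ramp schedules and credits that this ignores.

```python
def months_until_exhausted(commit_total: float, monthly_burn: float) -> float:
    """Months a prepaid commit lasts at a constant monthly burn."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return commit_total / monthly_burn


def overage_months(commit_total: float, monthly_burn: float, term_months: int) -> float:
    """Months of the term left uncovered once the commit is exhausted."""
    return max(0.0, term_months - months_until_exhausted(commit_total, monthly_burn))


COMMIT = 1_200_000.0   # hypothetical commit in dollars
TERM_MONTHS = 12

for label, burn in (("planned", 100_000.0), ("actual", 150_000.0)):
    lasts = months_until_exhausted(COMMIT, burn)
    uncovered = overage_months(COMMIT, burn, TERM_MONTHS)
    print(f"{label}: commit lasts {lasts:.1f} months, {uncovered:.1f} months at overage rates")
```

Under these assumptions, a burn 50 percent above plan exhausts the commit in month eight and leaves four months of the term at overage pricing.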

Gateway and reseller markup bands above raw API list

Gateway shape	Typical markup band	What the markup pays for	Where the buyer sits
Thin pass through	5 to 10%	Auth, logging, basic policy	Buyer keeps routing and prompt control
Standard managed gateway	15 to 25%	Routing, retries, caching, observability	Buyer trades flexibility for features
Managed wrapper with workflow	30 to 50%+	UI, workflow, vertical features, compliance	Vendor controls most of the stack
Hyperscaler EA, embedded	Variable, in credit burn	Commit tier mechanics and credits	Hidden in commit utilization math

How do commit tiers across OpenAI, Anthropic, Google, Azure, and Bedrock compare?

Commit tiers differ more in structure than in headline rate. Each vendor offers a discount path against a multi year, multi million dollar commit, but the shape of the commit and the flexibility across models inside it can move the realized cost by a wider margin than the published rate.

The right question is not which vendor offers the lowest list. It is which commit structure best matches the buyer workload mix over the term. A 20 percent discount on a tier the buyer cannot use is worth less than a smaller discount on the tier that holds the workload.
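
The point about usable discounts can be checked by weighting each offer against the workload mix. The spend figures and the two offers below are hypothetical and exist only to show the calculation.

```python
def effective_discount(spend_by_tier, discount_by_tier):
    """Discount realized across the workload mix, as a fraction of total list spend."""
    total = sum(spend_by_tier.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    saved = sum(
        spend * discount_by_tier.get(tier, 0.0)
        for tier, spend in spend_by_tier.items()
    )
    return saved / total


# Hypothetical annual list spend by tier, in dollars.
spend = {"frontier": 100_000.0, "midweight": 900_000.0}

offer_a = {"frontier": 0.20}    # 20 percent on a tier that holds little of the workload
offer_b = {"midweight": 0.08}   # 8 percent on the tier that holds most of it

print(f"offer A effective discount: {effective_discount(spend, offer_a):.1%}")
print(f"offer B effective discount: {effective_discount(spend, offer_b):.1%}")
```

With this assumed mix the 20 percent headline is worth 2 percent of total spend, while the 8 percent offer is worth 7.2 percent.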

OpenAI and Azure OpenAI

OpenAI commits trade list price for multi year volume. The Azure variant adds enterprise agreement terms, regional residency, and the credit mechanic that comes with a hyperscaler commit. For most enterprises with an existing Microsoft EA, the Azure path is the dominant deal shape.

The trade is flexibility. Once a model and a tier are locked into the commit, swapping into a newer or cheaper model can be slower than on a direct API account, particularly when the commit was structured against a specific deployment family.

Anthropic on Bedrock and direct

Anthropic prices Claude on a direct API and through Bedrock. The economic difference is the commit structure: Bedrock places Claude inside the hyperscaler commit and inherits the AWS commercial framework. Direct contracts are simpler but rarely match Bedrock on overall discount when the buyer already has an AWS commit.

The buyer side test is whether the AWS commit absorbs Anthropic workload at a useful discount, or whether Anthropic direct gives a better headline rate without giving up the cross workload flexibility a cloud commit provides.

Google Vertex

Vertex sits inside the broader Google Cloud commit. Discounts compound across Gemini, third party models, and underlying compute, which can be a real advantage for buyers with multi service workloads. For single workload buyers the picture is closer to the published list.

The Vertex grounding line is the place to pay attention. It is priced separately and can dominate the bill for retrieval heavy workloads regardless of the model rate negotiated.

AWS Bedrock

Bedrock prices per model and per provisioned throughput. The provisioned throughput option changes the unit economics for any workload with predictable peak demand. At consistent load, provisioned can land well below on demand per token rates. At spiky load it almost always loses money.

Buyers who can characterize their workload precisely benefit from provisioned. Buyers who cannot should price the on demand path and accept the higher per token rate as the cost of optionality.
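
The break even between provisioned and on demand can be sketched with a simplified capacity model. The hourly cost, capacity, and on demand rate are hypothetical, and the model (a flat hourly fee for a fixed token capacity) is an assumption for illustration rather than a description of how Bedrock bills provisioned throughput.

```python
def provisioned_rate_per_million(
    hourly_cost: float, capacity_millions_per_hour: float, utilization: float
) -> float:
    """Effective $ per million tokens on provisioned capacity at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / (capacity_millions_per_hour * utilization)


def break_even_utilization(
    hourly_cost: float, capacity_millions_per_hour: float, on_demand_rate: float
) -> float:
    """Utilization at which provisioned and on demand cost the same per token."""
    return hourly_cost / (capacity_millions_per_hour * on_demand_rate)


HOURLY_COST = 40.0      # hypothetical $ per hour for the reserved capacity
CAPACITY = 10.0         # hypothetical millions of tokens per hour
ON_DEMAND_RATE = 8.0    # hypothetical $ per million tokens

print(f"break even utilization: {break_even_utilization(HOURLY_COST, CAPACITY, ON_DEMAND_RATE):.0%}")
for utilization in (0.25, 0.50, 0.90):
    rate = provisioned_rate_per_million(HOURLY_COST, CAPACITY, utilization)
    print(f"utilization {utilization:.0%}: ${rate:.2f} per million vs ${ON_DEMAND_RATE:.2f} on demand")
```

A spiky workload that averages 25 percent utilization pays twice the on demand rate in this example, while a steady workload at 90 percent pays a little over half of it.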

Cross vendor commit terms that matter

The clauses that move realized cost most are not in the rate card. They sit in the commit terms: reallocation across models, true ups, swap rights into newer model versions, and termination for convenience on the wrapper layer when the underlying model is replaced.

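A 5 percent discount with no reallocation right is often worth less than a 3 percent discount with full reallocation, because committed spend that cannot follow the workload to a newer model is discount the buyer never collects. The buyer who reads only the headline pays for the difference.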

Where the common advice on chasing the lowest list rate is wrong

The standard reseller advice is to compare per token list rates and pick the cheapest. We disagree. Across roughly 25 to 35 enterprise AI workloads we benchmarked in 2024 and 2025, realized cost per useful output depended on gateway markup, routing, retry waste, and prompt discipline far more than on the headline rate. We saw 2 to 4x gaps in cost per outcome between stacks that had chased the cheapest list and stacks that paid a higher rate but routed well. The buyer side move is to benchmark per outcome at production volume, negotiate the commit and routing terms that compound, and accept that the cheapest sticker often ships the most expensive bill.

Photograph: an engineering team measuring AI inference cost against output quality. List price per token is published. Realized cost per useful output is not, and it depends on gateway markup and routing far more than the rate card.

$3 to $12

Per million tokens, realized band

15 to 35%

Gateway markup above raw API

2 to 4x

Cost per outcome spread, same vendor

Source: Redress Compliance advisory engagement file, 2024 to 2025.

What does fine tuning and routing add to the bill?

Fine tuning and routing change the bill more than the list rate does. They are also the two levers least likely to be modeled in a buyer business case, because both are usually treated as engineering decisions rather than commercial ones.

That separation is a mistake. Routing decisions move realized cost by double digit percentages on most workloads. Fine tuning, used well, reduces token consumption per task. Used poorly, it locks the buyer to a vendor with no measurable quality gain.

Routing as a cost lever

Routing chooses which model serves which request. A request that needs frontier tier reasoning should not run on a frontier tier model end to end if a small model can do the retrieval and a midweight model can do the synthesis. Stacks that route well typically sit one tier cheaper on average than stacks that do not.

The savings are real and recurring. A routing change that drops average cost per request from $0.012 to $0.006 halves the bill for that traffic and compounds across millions of requests per quarter. A cut of that size is more than most discount negotiations deliver on the same volume, and it costs the buyer little beyond engineering attention.
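
The compounding is plain multiplication. The sketch below uses the per request figures from the paragraph above; the quarterly request volume is a hypothetical value.

```python
def routing_savings(cost_before: float, cost_after: float, requests: int) -> float:
    """Dollar savings from a lower average cost per request over a number of requests."""
    return (cost_before - cost_after) * requests


REQUESTS_PER_QUARTER = 5_000_000  # hypothetical volume

saved = routing_savings(0.012, 0.006, REQUESTS_PER_QUARTER)
reduction = 1.0 - 0.006 / 0.012
print(f"quarterly savings: ${saved:,.0f} ({reduction:.0%} reduction per request)")
```

At five million requests per quarter the change is worth about $30,000 per quarter, and the figure scales linearly with volume.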

Retries and reroute waste

Retries are the quiet cost. A failed call, a malformed response, or a tool call timeout produces a retry that consumes tokens with no user value. On agent workloads, retry and reroute waste runs 10 to 25 percent of total token volume in the deployments we audit.

Some retry is unavoidable. Most can be cut by better prompt structure, better tool schemas, and stricter output validation. None of it appears on a rate card, and all of it appears on the invoice.
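
Retry waste translates directly into a higher price per useful token. As a minimal sketch, assuming waste is measured as a share of total token volume (as in the 10 to 25 percent range above), the multiplier on the effective rate is 1 / (1 - waste).

```python
def useful_token_multiplier(waste_share: float) -> float:
    """Factor by which cost per useful token exceeds the billed rate."""
    if not 0.0 <= waste_share < 1.0:
        raise ValueError("waste_share must be in [0, 1)")
    return 1.0 / (1.0 - waste_share)


for waste in (0.10, 0.25):
    multiplier = useful_token_multiplier(waste)
    print(f"{waste:.0%} waste: each useful token costs {multiplier:.2f}x the billed rate")
```

At 25 percent waste the buyer pays one third more per useful token than the rate card implies, which is a larger effect than most negotiated discounts.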

Fine tuning, used as a cost lever

Fine tuning is sold as a quality move. It can also be a cost move. A small model fine tuned for a specific task can outperform a frontier tier model on that task at a fraction of the per token cost, sometimes more than tenfold cheaper at production volume.

The catch is that fine tuning is sticky. A tuned model creates switching cost into the vendor. Buyers who use fine tuning as a cost lever should be deliberate about which workloads earn it and should keep base model parity as a fallback.
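
Whether a workload earns fine tuning can be framed as a payback calculation. The tuning cost and per task costs below are hypothetical, and the sketch ignores hosting fees, retraining, and the switching cost described above.

```python
import math


def break_even_tasks(tuning_cost: float, base_cost_per_task: float, tuned_cost_per_task: float) -> int:
    """Number of tasks after which a one time tuning cost is recovered."""
    saving_per_task = base_cost_per_task - tuned_cost_per_task
    if saving_per_task <= 0:
        raise ValueError("tuned model must be cheaper per task to pay back")
    return math.ceil(tuning_cost / saving_per_task)


# Hypothetical figures: a tuned small model at one tenth of the frontier cost per task.
tasks = break_even_tasks(tuning_cost=20_000.0, base_cost_per_task=0.05, tuned_cost_per_task=0.005)
print(f"tuning pays back after {tasks:,} tasks")
```

A workload that runs well past the break even volume each quarter is a candidate; one that does not is better left on the base model.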

Prompt caching and context reuse

Vendor prompt caching, where supported, lowers the realized rate on repeated context substantially. For agent workloads with stable system prompts, this can take 30 to 70 percent of the input token cost out of the bill. It is also the lever most often left switched off in early deployments.

If the vendor offers caching, enable it, measure the hit rate, and design prompts to maximize it. It is the lowest effort, highest return cost lever in the stack.

The vendors publish the rate. The buyer pays the route. In 2026, the bill is set by the route, not the rate.

How should you benchmark per useful output, not per token?

A token is not a unit a business plans around. A useful output is. Benchmarking on cost per token optimizes a number no executive can act on. Benchmarking on cost per useful output, against a clear quality bar, optimizes the bill the business actually pays.

The unit of useful output depends on the workload. For a support assistant it is a resolved conversation. For an analyst agent it is a completed research brief. For a coding assistant it is an accepted pull request. The unit must be specific to the job, because cost only means anything against value.

Define the unit before the benchmark

Define the unit first. Without an agreed definition of what counts as a useful output, the cost comparison drifts into per token math and ends back where it started. Spend the time on the definition and the benchmark becomes easy to run.

This is the part most enterprises skip. The result is a procurement conversation about per token rates and an operations conversation about per output cost, with no shared number that lets finance, engineering, and procurement agree on the bill.

Hold the quality bar fixed

Cost per output only means something at a fixed quality bar. A 50 cent answer that fails review is more expensive than a one dollar answer that ships. The quality bar is what makes the cost numbers comparable across vendors and stacks.

Buyers who fix the bar first and compare cost second land on the right vendor. Buyers who reverse that order land on the cheapest list and the highest realized bill.
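
The fixed quality bar can be folded into the cost figure by dividing cost per attempt by the pass rate. This is a minimal sketch with hypothetical pass rates, assuming a failed output is simply redone at the same cost.

```python
def cost_per_accepted_output(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected cost of one output that clears the quality bar."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Hypothetical stacks: a cheap one that often fails review, a dearer one that usually ships.
cheap = cost_per_accepted_output(cost_per_attempt=0.50, pass_rate=0.40)
dear = cost_per_accepted_output(cost_per_attempt=1.00, pass_rate=0.95)
print(f"cheap stack: ${cheap:.2f} per accepted output")
print(f"dear stack:  ${dear:.2f} per accepted output")
```

With these assumed pass rates the 50 cent answer costs $1.25 per accepted output and the one dollar answer costs about $1.05, so the higher sticker price is the cheaper stack.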

Benchmark at production volume

Pilot economics rarely match production economics. Caching hit rates, retry waste, and tier mix all shift at scale. The benchmark only matters at the volume the business will run, which means running it long enough and broadly enough to see the steady state.

This is uncomfortable for vendors. It is the most useful thing a buyer can do, because the only bill that matters is the steady state one. Pilot pricing is a window. Production pricing is the floor.

Cost per useful output across six stack shapes at production volume, indexed to a band midpoint. Routed stacks beat end to end stacks on the same model tier. Managed wrappers are the most expensive band. Source: Redress Compliance advisory engagement file, 2024 to 2025.

How does token economics vary by workload?

The realized cost per outcome moves with workload type more than with vendor. Chat, agents, and batch sit in three different bands. Treating them as one category guarantees the buyer pays the agent rate on chat traffic, which is the most common form of overpaying.

Chat workloads

Chat is the cheapest workload to run. Inputs are short, outputs are short, retries are rare, and context reuse is high. Cost per session typically sits in single cents on midweight models, less on small models with caching.

This is the workload that hides waste. Because each session is cheap, buyers run frontier tier models that the workload does not need. A move to midweight or small models on chat usually reduces cost by half or more with no measurable quality change.

Agent workloads

Agent workloads are the most expensive band. Tool calls multiply token consumption, retries are common, and end to end reasoning chains can run frontier tier model calls many times per outcome. Cost per outcome on agents can sit two orders of magnitude above chat.

The cost lever on agents is structural: better tool schemas, stricter output formats, prompt caching on stable system prompts, and routing the reasoning step to a frontier model only where needed. The buyer who treats agents as a chat workload with extras will overpay every quarter.

Batch workloads

Batch sits in the middle. Long inputs and long outputs raise token volume, but vendor batch tiers and provisioned throughput often reduce per token rate. For predictable batch with high concurrency, this can land below standard on demand rates by a meaningful margin.

The math depends on consistency. Provisioned throughput rewards steady load and punishes spikes. Buyers with predictable batch workloads should price the provisioned option. Buyers with mixed load should stay on demand.

Mixing workloads inside one estate

Most enterprises run all three. The mistake is to price the estate as one number rather than three. A workload aware view of the bill, with chat, agent, and batch separated, exposes which line is growing fastest and which is the easiest to optimize.

Procurement should ask for the workload split before signing any vendor commit. The split usually tells the buyer which model tier and which discount path is the right one, and where the next quarter cost surprise is going to come from.

How does prompt caching change the math?

Prompt caching is the cheapest cost lever in the stack. When the vendor supports it, repeated input context is billed at a small fraction of the standard rate. For agent workloads with a stable system prompt, it can take 30 to 70 percent of the input token cost out of the bill.

The catch is that caching is not on by default in every stack. Each vendor implements it differently, and the hit rate depends on prompt structure as much as on the cache layer itself. Buyers who design prompts to maximize cache hits see the biggest savings.
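
The relationship between hit rate and savings can be sketched in one line of arithmetic. The cached price ratio of 0.1 is a hypothetical stand in for "a small fraction of the standard rate", hit rate is taken here as the share of input tokens served from cache, and the sketch ignores any cache write surcharge a vendor may apply.

```python
def input_savings_share(hit_rate: float, cached_price_ratio: float) -> float:
    """Share of input token cost removed by caching.

    hit_rate: fraction of input tokens served from cache (0 to 1).
    cached_price_ratio: cached token price divided by the standard input price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if not 0.0 <= cached_price_ratio <= 1.0:
        raise ValueError("cached_price_ratio must be between 0 and 1")
    return hit_rate * (1.0 - cached_price_ratio)


CACHED_PRICE_RATIO = 0.1  # hypothetical: cached tokens billed at 10 percent of standard

for hit_rate in (0.2, 0.5, 0.7):
    saved = input_savings_share(hit_rate, CACHED_PRICE_RATIO)
    print(f"hit rate {hit_rate:.0%}: {saved:.0%} of input cost removed")
```

Under these assumptions a 70 percent hit rate removes 63 percent of input cost and a 20 percent hit rate removes 18 percent, which is why prompt structure matters as much as the vendor price point.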

Vendor support varies

OpenAI, Anthropic, and Google publish caching mechanics with different price points and TTLs. The differences matter for routing decisions, because a workload that caches well on one vendor may not cache as well on another, which shifts the realized rate.

The buyer side test is to measure the cache hit rate on representative traffic for each vendor before locking a commit. A vendor with a lower headline rate and a worse caching outcome can be more expensive in practice.

Design prompts for the cache

Cache hit rate depends on stable prefixes. Variable content at the start of a prompt invalidates the cache for everything that follows, so the order matters as much as the content. The pattern is stable system context first, then variable user content.

Teams that adopt this pattern see hit rates above 70 percent on common agent workloads. Teams that do not often see hit rates below 20 percent on the same traffic. The same vendor, the same rate card, very different bills.
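
The ordering effect can be illustrated by measuring how much of a prompt two consecutive requests share from the start. This sketch uses characters as a rough proxy for tokens and a hypothetical system context; real caches operate on tokens and apply vendor specific minimum lengths and granularity.

```python
import os

# Hypothetical stable context, repeated to stand in for a long system prompt.
SYSTEM_CONTEXT = "You are a support assistant. Follow the policy manual. " * 20


def stable_first(user_message: str) -> str:
    """Stable system context first, variable user content last."""
    return SYSTEM_CONTEXT + "\nUser: " + user_message


def variable_first(user_message: str) -> str:
    """Variable user content first, which breaks the shared prefix early."""
    return "User: " + user_message + "\n" + SYSTEM_CONTEXT


def shared_prefix_share(prompt_a: str, prompt_b: str) -> float:
    """Share of prompt_b's characters that match prompt_a from the start."""
    common = os.path.commonprefix([prompt_a, prompt_b])
    return len(common) / len(prompt_b)


first, second = "Where is my order?", "Reset my password"
print(f"stable first:   {shared_prefix_share(stable_first(first), stable_first(second)):.0%} reusable")
print(f"variable first: {shared_prefix_share(variable_first(first), variable_first(second)):.0%} reusable")
```

The same content in a different order moves the reusable share from nearly all of the prompt to almost none of it.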

Measure the hit rate quarterly

Cache effectiveness drifts as workloads evolve. A workload that caches well today may not cache well next quarter if the prompt structure changes. Quarterly measurement keeps the lever working and surfaces regressions before the bill does.

Treat the cache hit rate as a KPI alongside cost per outcome. They move together. A falling hit rate while outcome volume holds steady is a signal that the workload has changed and the prompts need to be redesigned, not that the model needs to be swapped.

What governance should sit around the AI token bill?

The AI token bill grows quietly. A small change in prompt structure, a new tool call in an agent loop, or a routing decision pushed by an engineering team can move realized cost by double digit percentages without anyone in procurement seeing the change. Governance is the discipline that closes that gap.

The point is not to block engineering. It is to make the cost visible to the people who plan the budget. A governance layer that surfaces cost per outcome in real time keeps the conversation honest, and it lets finance plan AI spend with the same confidence as any other software line.

A single accountable owner

One person should own the AI token bill the way procurement owns a vendor renewal. The role usually sits in software asset management or a FinOps function, with measurement on realized cost per useful output rather than on shipping speed.

That measure matters. An owner rewarded for keeping engineering happy approves whatever the team asks for. An owner rewarded for realized cost asks the routing and caching questions every quarter, which is how the bill stays controlled.

Cost visibility per workload

The bill must split by workload, not just by vendor. Chat, agents, and batch each carry different unit economics, and a single line item that hides the split is a budget that cannot be planned.

Most gateways and platforms support per workload tagging. Turn it on, set the tags before the workload ships, and report by workload to finance. The visibility itself changes behavior, because engineers who can see the cost per workload route differently than engineers who cannot.
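
Once requests are tagged, the per workload report is a small aggregation. The record schema below (`workload`, `cost_usd`, `outcomes`) is hypothetical; actual gateway exports will use their own field names.

```python
from collections import defaultdict


def cost_per_outcome_by_workload(records):
    """Aggregate tagged usage records into cost per useful output for each workload."""
    cost = defaultdict(float)
    outcomes = defaultdict(int)
    for record in records:
        cost[record["workload"]] += record["cost_usd"]
        outcomes[record["workload"]] += record["outcomes"]
    # Workloads with no recorded outcomes are reported as None rather than dividing by zero.
    return {
        workload: (cost[workload] / outcomes[workload] if outcomes[workload] else None)
        for workload in cost
    }


# Hypothetical tagged usage records for one reporting period.
records = [
    {"workload": "chat", "cost_usd": 120.0, "outcomes": 6000},
    {"workload": "agent", "cost_usd": 900.0, "outcomes": 450},
    {"workload": "batch", "cost_usd": 300.0, "outcomes": 1500},
    {"workload": "chat", "cost_usd": 80.0, "outcomes": 4000},
]

for workload, unit_cost in cost_per_outcome_by_workload(records).items():
    print(f"{workload}: ${unit_cost:.2f} per useful output")
```

A report in this shape gives finance, engineering, and procurement one number per workload instead of a single blended line.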

A separate AI line in the software plan

AI spend should sit on its own line, not bundled into the cloud bill or the EA. A separate line forces a separate plan, and a separate plan means the budget owner can defend or grow the line on the merits of the workload rather than as a cloud overrun.

This is the discipline that turns AI from a quiet cost growth into a managed software category. Without it, finance approves the cloud bill in aggregate and discovers the AI growth only after the year is closed.

Renewal cadence on AI commits

AI commits should run on the same renewal cadence as other major software contracts, with the same 9 to 12 month preparation window. Treating an AI commit as an engineering decision rather than a procurement decision is how buyers lock in rates and structures that were right for last year and not for next.

Bring the calendar forward. Audit the workload mix in month three of the commit, not in month eleven. The earlier the audit, the more options the buyer has at renewal.

What does AI token economics do in 2027?

Up on volume, calmer on rate, hotter on routing. We expect token volume to keep growing as agent workloads move into production, while published list rates settle into a slower cadence. The growth in the bill will hide inside routing decisions and wrapper layers rather than rate card moves.

That makes 2027 harder to read from a published price tracker alone. The story will be how the workload is routed, not what the model costs per token. Buyers who plan around list will miss the real movement, which is in commit utilization, wrapper fees, and the agent stack.

List rates settle

Headline per token rates are now low enough that further list cuts produce small absolute savings on most workloads. We expect vendors to compete more on capability and on commit structure than on rate, with frontier tier output rates the only line likely to move materially.

This is good news for buyers who already focus on routing and outcome. It is bad news for buyers who are still benchmarking on list, because the lever they are pulling is the one with the least left to give.

Agent workloads dominate growth

Agent workloads will absorb most of the new token volume. They are inherently more expensive per outcome than chat or batch, and they generate retry and reroute waste that compounds with deployment depth. The agent stack is where 2027 cost growth will land.

Buyers who put the discipline of cost per outcome around agent workloads this year will keep the line bounded. Buyers who treat agents as a chat extension will see the bill compound quietly until a renewal forces the math into the open.

Commit structures get more complex

Vendor commits will get more complex, not simpler. Reallocation rights, swap rights, and termination clauses will carry more value because the model landscape continues to shift inside the term. The buyers who negotiate flexibility will outperform the buyers who chase the lowest committed rate.

The 2026 lesson holds into 2027. A smaller discount with full reallocation across the vendor catalog usually beats a larger discount on a single model family that is one release away from being replaced.

Wrapper consolidation

The managed wrapper market is crowded. We expect consolidation in 2027 as buyers tire of paying 30 to 50 percent above raw API for layers that have not earned the premium. Wrappers that survive will be the ones that own a workflow or a compliance posture, not the ones that mostly route traffic.

Buyers signing a wrapper today should price the exit, not just the entry. A wrapper that is a market leader this year may be a legacy line next year, and the contract that locks in pricing without locking in an exit will look very different in eighteen months.

How does AI token economics vary by sector?

Sector matters less for the rate card and more for the workload mix. Regulated industries run more retrieval and audit heavy workloads. Technology firms run heavier agent workloads. The same rate card produces different bills.

Financial services

Financial services workloads concentrate in retrieval and summarization with strict audit and data residency requirements. The realized rate sits above the median because of grounding cost and provisioned throughput on dedicated infrastructure, but the absolute volume per worker is smaller than in technology firms.

The buyer side priority is data terms and residency on the commit, not headline rate. A 5 percent better rate on a contract that does not meet the regulator posture is not a usable discount.

Life sciences and healthcare

Life sciences workloads pair heavy research summarization with strict data control. Grounding cost and dedicated capacity dominate the bill, and the rate card is rarely the lever that moves the line. The lever is the data control posture and the model choice that meets it.

Sector specific wrappers earn their markup more often here than in other sectors, because the compliance and audit capabilities are real. Buyers should still price the underlying tokens separately, to make the wrapper premium visible.

Technology and digital native firms

Technology firms run the heaviest agent workloads and the highest token volume per worker. They also have the most engineering capacity to route, cache, and fine tune, which means realized cost per outcome can land below median despite higher volume.

In 2026, best in class technology buyers pay less per outcome than buyers in any other sector at the same scale, because they treat the token bill as a product engineering problem rather than only a procurement problem. The lesson transfers, but it requires the engineering capacity to apply it.

Public sector

Public sector buyers face procurement structures that do not match short term AI commit cycles. Multi year framework agreements are common, and the residency and compliance terms are heavier than in any other sector. The realized rate often sits above the commercial median because of those constraints.

The right move is to negotiate the residency and compliance terms first and accept that the rate will sit above the broader market. Trying to match commercial pricing without matching the terms is a path that leads to procurement disputes rather than a usable contract.

What should a buyer do next?

Define a unit of useful output for each AI workload, agreed across finance, engineering, and procurement.
Measure realized cost per useful output at production volume, not per token in a pilot.
Separate the gateway markup from the underlying API rate on every contract, and price each on its own.
Map workloads to model tiers. Chat to small or midweight. Frontier tier only where it earns its rate.
Enable prompt caching where the vendor supports it and measure the hit rate every quarter.
Audit retry and reroute waste on agent workloads. Cut it before negotiating the rate.
Treat fine tuning as a cost lever, not only a quality lever. Use it where it pays back.
Negotiate commit terms on reallocation, swap rights, and termination for convenience, not just headline rate.
Engage independent benchmarking and advisory before the first vendor proposal, not after the spend has compounded.

GenAI Knowledge Hub. The full set of buyer side research on the model vendors.
Enterprise Software Price Increase Index 2026. The blended view across eleven vendors, including the AI premium.
Shadow AI Spend Report 2026. The off books bill on corporate cards across the same workloads.
Vendor Benchmark Program. Subscription benchmarking across the model vendors and AI wrappers.

Frequently asked questions

What does enterprise AI really cost per million tokens?

At enterprise volume, realized cost per million blended tokens runs roughly $3 to $12 on frontier tier models, $2 to $5 on midweight reasoning, and $0.50 to $2 on small models. The API list is the floor of the band. Gateway markup, routing, and retries push the realized number above the rate card.

Why is realized AI cost higher than the API list price?

The API list is the raw rate. The bill is the raw rate plus gateway markup, retry waste, routing overhead, and any managed wrapper fee. Most enterprises pay 15 to 35 percent above raw API once the gateway layer is included, and more again once retries and reroutes are counted.
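
A rough way to see how the layers stack is to compute a rate per million useful tokens. The sketch below assumes the gateway markup applies as a percentage of list and that retry waste is a share of billed tokens. Both are simplifying assumptions, and the input values are illustrative picks from inside the ranges quoted in this report, not benchmarks.

```python
def realized_rate(list_rate: float, gateway_markup: float, retry_share: float) -> float:
    """Cost per million useful tokens.

    list_rate       raw API rate per million tokens
    gateway_markup  markup as a fraction of list (0.25 means 25 percent)
    retry_share     fraction of billed tokens that were retries or reroutes
    """
    if not 0 <= retry_share < 1:
        raise ValueError("retry_share must be in [0, 1)")
    billed_rate = list_rate * (1 + gateway_markup)
    # Only (1 - retry_share) of billed tokens do useful work, so the
    # cost of the wasted tokens is spread across the useful ones.
    return billed_rate / (1 - retry_share)


# Hypothetical inputs: $3.00 list, 25 percent markup, 20 percent retry waste.
rate = realized_rate(list_rate=3.00, gateway_markup=0.25, retry_share=0.20)
print(f"${rate:.2f} per million useful tokens")  # $4.69
```

Under these assumptions a $3.00 list rate becomes roughly $4.69 per million useful tokens, before any wrapper fee or routing overhead is added.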

How does gateway markup affect AI cost?

Gateways add a measurable percentage above raw API list. Thin pass through gateways sit at 5 to 10 percent. Standard managed gateways land at 15 to 25 percent. Managed wrappers with workflow features can run 30 to 50 percent or more. Read the gateway contract as a separate cost layer, not as part of the model price.
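
Separating the layers on an invoice is simple arithmetic once the raw list rate is known. The sketch below is one way to do it. The band thresholds follow the ranges in the answer above, but the quoted bands leave gaps (10 to 15 and 25 to 30 percent), and assigning those gaps to the lower band is an assumption of this example. The function names and sample rates are hypothetical.

```python
def gateway_markup(billed_rate: float, list_rate: float) -> float:
    """Markup of the billed rate over raw API list, as a fraction of list."""
    if list_rate <= 0:
        raise ValueError("list_rate must be positive")
    return (billed_rate - list_rate) / list_rate


def gateway_band(markup: float) -> str:
    """Map a markup to the bands quoted above.

    Assumption: values that fall between two quoted bands are
    assigned to the lower band.
    """
    if markup < 0.15:
        return "thin pass through"
    if markup < 0.30:
        return "standard managed gateway"
    return "managed wrapper"


# Hypothetical invoice: billed at $3.60 per million against a $3.00 list rate.
m = gateway_markup(billed_rate=3.60, list_rate=3.00)
print(f"markup {m:.0%}: {gateway_band(m)}")  # markup 20%: standard managed gateway
```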

How do commit tiers work across the model vendors?

OpenAI, Anthropic, Google Vertex, Azure OpenAI, and AWS Bedrock each offer a multi year volume commit path. The differences are in structure rather than headline rate. Reallocation across models, swap rights into newer versions, true up terms, and termination for convenience matter more than the published discount.

Does fine tuning save money?

It can, when used as a cost lever rather than only a quality lever. A small model fine tuned for a specific task often outperforms a frontier tier model on that task at a fraction of the per token cost. The trade is switching cost, because a tuned model is sticky to the vendor that hosts it.
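
Whether tuning pays back is a break even question. The sketch below assumes a one off tuning cost and a flat per token saving. It ignores hosting fees and the switching cost named above, so treat the result as a lower bound on the volume needed. All figures are hypothetical, with the two rates chosen from inside the bands quoted earlier in this FAQ.

```python
def breakeven_million_tokens(tuning_cost: float, frontier_rate: float, tuned_rate: float) -> float:
    """Millions of tokens at which a one off tuning cost is recovered.

    Rates are per million tokens. Hosting fees and switching cost
    are deliberately left out of this simplified model.
    """
    saving = frontier_rate - tuned_rate
    if saving <= 0:
        raise ValueError("the tuned model must be cheaper per token")
    return tuning_cost / saving


# Hypothetical: $20,000 to tune, $6.00 frontier rate, $1.00 tuned small model rate.
volume = breakeven_million_tokens(tuning_cost=20_000, frontier_rate=6.00, tuned_rate=1.00)
print(f"break even at {volume:,.0f} million tokens")  # 4,000 million tokens
```

A workload that will not reach the break even volume within the contract term is a poor candidate for tuning as a cost lever.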

How should I benchmark AI cost per outcome?

Define the unit of useful output first, then hold the quality bar fixed, then measure cost at production volume. Per token math is misleading because it ignores routing, retries, and wrapper layers. Cost per resolved conversation, completed brief, or accepted change is the only number a business can plan around.
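
The comparison can be sketched as follows, with the quality bar held fixed by counting only accepted outputs. The `StackRun` record and the two sample stacks are hypothetical and exist only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class StackRun:
    """One stack measured over the same period at production volume."""
    name: str
    total_cost: float  # all in spend: API, gateway, retries, wrapper
    attempted: int     # units of output attempted
    accepted: int      # units that met the fixed quality bar


def cost_per_useful_output(run: StackRun) -> float:
    """All in cost divided by outputs that met the quality bar."""
    if run.accepted == 0:
        return float("inf")
    return run.total_cost / run.accepted


# Illustrative numbers only.
runs = [
    StackRun("cheapest list", total_cost=900.0, attempted=1000, accepted=300),
    StackRun("routed stack", total_cost=1200.0, attempted=1000, accepted=800),
]

for run in runs:
    print(f"{run.name}: ${cost_per_useful_output(run):.2f} per accepted output")

best = min(runs, key=cost_per_useful_output)
print(f"lowest cost per useful output: {best.name}")
```

In this made up example the stack with the lower total bill costs twice as much per accepted output, which is the pattern per token math hides.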

How does token cost vary by workload?

Chat is the cheapest workload band, at single digit cents per session on midweight or small models. Agents are the most expensive, often two orders of magnitude above chat per outcome because of tool calls and retries. Batch sits in the middle, and provisioned throughput can make it cheaper than on demand at steady load.

How do I negotiate model vendor commits?

Negotiate the structural terms before the headline discount. Reallocation across models, swap rights into newer versions, termination for convenience on the wrapper layer, and a clean exit from the gateway are the clauses that move realized cost the most. A smaller discount with these rights usually beats a larger discount without them.

What is the AI premium hiding in my hyperscaler bill?

Hyperscaler enterprise agreements often bury part of the AI markup in commit utilization rather than a published rate. Faster credit burn against a fixed commit is a price increase even when the rate card is unchanged, because the next overage tier arrives sooner. Read the commit math, not just the model line.
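
The commit math can be checked with a simple runway calculation. The sketch below assumes a fixed commit and a constant monthly burn, which real usage rarely follows. The commit size and burn figures are hypothetical.

```python
def months_until_exhausted(commit: float, monthly_burn: float) -> float:
    """Months a fixed commit lasts at a constant monthly burn."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return commit / monthly_burn


# Hypothetical: a $1.2M commit, same workload, burn rising 25 percent
# after AI features are metered against the commit.
commit = 1_200_000.0
before = months_until_exhausted(commit, 100_000.0)
after = months_until_exhausted(commit, 125_000.0)

print(f"runway before: {before:.1f} months")  # 12.0
print(f"runway after:  {after:.1f} months")   # 9.6
```

The rate card is identical in both cases, yet the overage tier arrives more than two months sooner in the second.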

Where should we focus first to reduce token spend?

Start with routing and prompt caching. Both move realized cost by double digit percentages, neither requires a vendor negotiation, and both compound across every quarter the workload runs. After that, audit retry waste on agent workloads and consider fine tuning where the volume earns the switching cost.
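
The effect of prompt caching on the input rate can be estimated from the hit rate. The sketch below assumes cache hits are billed at a flat discount to the base input rate and ignores any cache write surcharge. Discount levels and surcharges are vendor specific, so the 50 percent figure here is a placeholder, not a quoted price.

```python
def effective_input_rate(base_rate: float, hit_rate: float, cached_discount: float) -> float:
    """Blended input rate per million tokens with prompt caching.

    hit_rate         fraction of input tokens served from cache
    cached_discount  fraction off the base rate for cached tokens
                     (vendor specific; an assumption in this sketch)
    """
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be in [0, 1]")
    cached_rate = base_rate * (1 - cached_discount)
    return hit_rate * cached_rate + (1 - hit_rate) * base_rate


# Hypothetical: $3.00 base input rate, 60 percent hit rate, 50 percent discount.
base = 3.00
blended = effective_input_rate(base_rate=base, hit_rate=0.60, cached_discount=0.50)
saving = 1 - blended / base
print(f"blended input rate ${blended:.2f}, saving {saving:.0%}")  # $2.10, 30%
```

Rerunning the calculation with the measured hit rate each quarter shows whether the saving is holding as prompts and workloads change.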