Vertex AI and Gemini: the buyer side framework for your next AI commitment
Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, and the moment a request crosses 200,000 tokens of context the input rate doubles to $2.50 and the output rate rises by half to $15.00. Token tiering, the committed use overlay, and five clauses decide whether your AI budget holds.
Prepared by Redress Compliance · June 2026 · Representative Vertex AI estate scenario (benchmark scenario, not a quote)
Executive summary
The Vertex AI and Gemini bill is decided in three places, and the discount is the least important of them.
The first is model tiering. Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, while Gemini 2.5 Flash Lite lists at $0.10 and $0.40. Routing the wrong workload to Pro is a larger overpay than any commit can recover.
The second is the consumption shape. Batch prediction runs at a 50 percent discount, context caching at up to 90 percent off cached input, and provisioned throughput reserves capacity that bills whether or not you use it.
The third is the commit. A spend based committed use discount reaches about 40 percent at three years on eligible Vertex spend, and it discounts the dollar, not the token rate.
The trap most teams miss is the long context price step. A request that crosses 200,000 tokens moves Gemini 2.5 Pro to $2.50 input and $15.00 output, doubling the input rate and raising the output rate by half without a line item to warn you.
On a representative $6.0M annual Vertex AI estate, a tiering, consumption, and commit program cuts run rate to $4.2M, a saving of $1.8M or about 30 percent, without committing a dollar above verified steady inference. The leverage is in the architecture and five clauses, not the headline discount.
How is Vertex AI and Gemini priced in 2026?
Vertex AI bills generative usage per token, and the token rate depends on the model and the size of the request. The published rates on the Gemini Enterprise Agent Platform pricing page are the ceiling. They are the rates you pay before any commit, and they are the reference points the account team will anchor you to.
Four meters carry most of an enterprise bill: the input token rate, the output token rate, the long context surcharge, and the consumption mode you choose. Output tokens are the line that hurts, because on the models below the output rate runs 4 to 8 times the input rate.
| Model | Input / 1M | Output / 1M | Best fit |
|---|---|---|---|
| Gemini 2.5 Flash Lite | $0.10 | $0.40 | High volume classification, extraction, routing |
| Gemini 3.5 Flash | $1.50 | $9.00 | Mainstream agents and retrieval workloads |
| Gemini 2.5 Pro (to 200K context) | $1.25 | $10.00 | Complex reasoning at standard context |
| Gemini 2.5 Pro (over 200K context) | $2.50 | $15.00 | Long context, input doubles and output rises 50% automatically |
Chart A. List token price per million tokens by model. The red pair is Gemini 2.5 Pro over 200K context, where input doubles and output rises by half.
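The price step is easiest to see on a single request. The sketch below uses the Gemini 2.5 Pro list rates from the table and assumes that once the prompt exceeds 200,000 tokens the whole request, input and output, bills at the long context rate. The function name and the sample token counts are illustrative, not taken from any Google SDK.

```python
# List rates in USD per 1M tokens, copied from the pricing table above.
# Assumption: once the prompt exceeds 200,000 tokens, the whole request
# (input and output) bills at the long context rate.
LONG_CONTEXT_THRESHOLD = 200_000

PRO_STANDARD = {"input": 1.25, "output": 10.00}
PRO_LONG = {"input": 2.50, "output": 15.00}


def pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the list cost in USD of one Gemini 2.5 Pro request."""
    rates = PRO_LONG if input_tokens > LONG_CONTEXT_THRESHOLD else PRO_STANDARD
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


below = pro_request_cost(200_000, 2_000)
above = pro_request_cost(200_001, 2_000)
print(f"prompt of 200,000 tokens: ${below:.4f}")
print(f"prompt of 200,001 tokens: ${above:.4f}")
print(f"ratio: {above / below:.2f}x")
```

One extra prompt token nearly doubles the cost of an input heavy request, which is why a context cap belongs in the architecture rather than in a billing review.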
Why does model tiering decide most of the bill?
Model tiering is the single largest lever, and it costs you nothing to negotiate because it is an architecture choice, not a contract term. The gap between Flash Lite and Pro on output is 25 times. Most enterprise traffic does not need Pro.
The discipline is to route by task, not by default. Classification, extraction, routing, and summarization run on Flash Lite or Flash. Reserve Pro for the share of traffic that genuinely needs deep reasoning or long context.
- Commodity tier: Flash Lite for structured extraction and routing, where $0.40 output per million is the right meter.
- Workhorse tier: 3.5 Flash for the bulk of agent and retrieval traffic, balancing quality and cost.
- Reasoning tier: Pro only where the task fails on Flash, and with a hard context cap to avoid the 200K doubling.
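The three tiers above can be sketched as a small routing table. This is a minimal illustration, assuming hypothetical task labels and tier names; it is not a Vertex AI feature, and the mapping should come from your own quality tests.

```python
# Hypothetical task labels and tier names, for illustration only.
TIER_BY_TASK = {
    "classification": "flash-lite",
    "extraction": "flash-lite",
    "routing": "flash-lite",
    "summarization": "flash",
    "agent": "flash",
    "retrieval": "flash",
    "deep_reasoning": "pro",
}

# Hard cap so that Pro requests stay on the standard context rate.
PRO_CONTEXT_CAP = 200_000


def choose_tier(task: str, input_tokens: int) -> str:
    """Pick a model tier by task; unknown tasks fall back to the workhorse tier."""
    tier = TIER_BY_TASK.get(task, "flash")
    if tier == "pro" and input_tokens > PRO_CONTEXT_CAP:
        raise ValueError("prompt exceeds the Pro context cap; trim or chunk it first")
    return tier


print(choose_tier("extraction", 1_200))
print(choose_tier("deep_reasoning", 50_000))
```

The design choice that matters is the fallback: an unlabelled task lands on the workhorse tier, so Pro is reached only by an explicit decision.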
The consumption modes that change the rate
Beyond the model, the way you call it changes the price. Three modes move the meter materially, per Google's published mechanics.
- Batch prediction: a flat 50 percent discount against real time, with a 24 hour service window, for any workload that is not interactive.
- Context caching: up to 90 percent off cached input tokens for repeated long prompts, though cache storage carries its own per hour charge.
- Provisioned throughput: reserved capacity in generative AI scale units for latency stability, described in Google's provisioned throughput documentation.
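As a rough illustration of how the first two modes move the meter, the sketch below blends an input rate when part of the prompt is served from cache and halves a rate for batch. It assumes the full 90 percent cache discount and leaves out the per hour cache storage charge, so treat the result as a best case rather than a quote. Function names are illustrative.

```python
def cached_input_rate(list_rate: float, cached_share: float,
                      cache_discount: float = 0.90) -> float:
    """Blend the input rate when a share of input tokens is read from cache.

    Excludes cache storage, which is billed separately per hour.
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    cached = cached_share * list_rate * (1.0 - cache_discount)
    uncached = (1.0 - cached_share) * list_rate
    return cached + uncached


def batch_rate(list_rate: float, batch_discount: float = 0.50) -> float:
    """Return the per million token rate under batch prediction."""
    return list_rate * (1.0 - batch_discount)


# Gemini 2.5 Pro list rates: $1.25 input, $10.00 output per 1M tokens.
print(f"input, 80% cached: ${cached_input_rate(1.25, 0.80):.2f} per 1M")
print(f"output, batch:     ${batch_rate(10.00):.2f} per 1M")
```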
How do committed use discounts and provisioned throughput apply to Vertex AI?
Vertex AI spend folds into Google Cloud committed use, and that is where the contract leverage lives. A spend based committed use discount runs about 20 percent at one year and up to about 40 percent at three years on eligible usage, the same band as compute.
The mechanic that matters: a spend based commit discounts the dollar, not the token rate. If Google cuts the price of a model mid term, that cut flows to you only if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price on a falling market.
| Instrument | What it commits | 1 year | 3 year | Idle risk |
|---|---|---|---|---|
| Spend based CUD (Vertex) | A dollar amount of eligible Vertex usage | ~20% | up to ~40% | Unused commit is billed and lost |
| Provisioned throughput | Reserved GSU capacity, monthly or annual | 20 to 45% | n/a | Idle GSU is pure shelfware |
| Batch prediction | Nothing, per call mode | flat 50% off real time | flat 50% off real time | None, no commit |
| Context caching | Nothing, cache reuse | up to 90% off cached input | up to 90% off cached input | Cache storage per hour |
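The price protection point can be worked through with one rate. The sketch below takes the Gemini 2.5 Pro output list rate and a 40 percent discount from this section; the mid term cut to $8.00 is a hypothetical figure chosen only to show the gap.

```python
def effective_rate(list_rate: float, discount: float) -> float:
    """Rate actually paid per 1M tokens after a percentage discount."""
    return list_rate * (1.0 - discount)


old_list = 10.00   # Gemini 2.5 Pro output, USD per 1M tokens, at signing
new_list = 8.00    # hypothetical mid term price cut, for illustration
discount = 0.40    # three year spend based CUD

pinned = effective_rate(new_list, discount)    # discount follows current list
unpinned = effective_rate(old_list, discount)  # stuck at the signing price

print(f"with the pin:    ${pinned:.2f} per 1M")
print(f"without the pin: ${unpinned:.2f} per 1M")
print(f"overpay without the pin: {100 * (unpinned / pinned - 1):.0f}%")
```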
What does the fine tuning and customization spend really cost?
Fine tuning is priced in two parts, and buyers routinely scope only the first. The first part is the tuning job itself, billed on the tokens processed during training. The second is the ongoing cost of serving the tuned model, which carries its own inference meter for the life of the deployment.
The serving cost is the one that compounds. A tuned model you stand up in 2026 keeps billing inference every month until you retire it, so a customization that looked cheap as a one off training run becomes a standing line item.
- Training spend: a finite cost, sized to the tuning dataset and epochs.
- Serving spend: a recurring inference cost on the tuned endpoint, the part that needs a retirement plan.
- Alternative first: test prompt engineering, grounding, and context caching before paying to tune, since the cheaper path often closes the quality gap.
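A short sketch shows how the serving line overtakes the training line. All dollar figures here are hypothetical placeholders, not published rates; substitute your own tuning quote and endpoint cost.

```python
def tuned_model_cost(training_cost: float, monthly_serving_cost: float,
                     months_live: int) -> float:
    """Total cost of a tuned model: one training job plus serving until retirement."""
    return training_cost + monthly_serving_cost * months_live


# Hypothetical figures, for illustration only.
training = 20_000.0
serving_per_month = 5_000.0

for months in (1, 12, 24):
    total = tuned_model_cost(training, serving_per_month, months)
    serving_share = serving_per_month * months / total
    print(f"{months:>2} months live: ${total:,.0f} total, {serving_share:.0%} serving")
```

On these placeholder numbers serving passes training in month five, which is the argument for writing a retirement date next to every tuned endpoint.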
For the published per token and per node mechanics, work from the Gemini API pricing reference and confirm the Vertex equivalent before you commit, because the two surfaces meter differently.
How do you build a consumption baseline that survives Google scrutiny?
The baseline is the whole negotiation. A commit sized to a Google growth forecast is sized to Google's interest. A commit sized to your verified consumption floor is sized to yours, and it is the only number you can defend when the account team pushes back.
Build the floor from twelve months of detailed billing export, split by model, by mode, and by project. Commit only the steady inference that runs every day in production. Keep training, experimentation, and seasonal spikes on pay as you go.
- Baseline: export twelve months of Vertex billing. Split by model, mode, and project. Set the committable floor at the p25 to p50 of steady production inference, stripped of training and one off spikes.
- Architect and clause: re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and size GSUs to the floor. Draft the five clauses before pricing.
- Negotiate and stage: benchmark the opening quote, commit the verified floor at three years, keep the variable tier on one year or pay as you go, and hold the discount counter until the clauses close.
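The baseline step can be sketched as a percentile calculation over daily spend. This is a minimal illustration that assumes the billing export has already been reduced to rows with hypothetical `day`, `workload`, and `usd` fields; a real export has a different schema and needs that mapping first.

```python
from collections import defaultdict
from statistics import quantiles


def committable_floor(rows):
    """Return (p25, p50) of daily production inference spend in USD.

    Rows tagged with any other workload (training, experiments) are ignored.
    """
    daily = defaultdict(float)
    for row in rows:
        if row["workload"] == "production_inference":
            daily[row["day"]] += row["usd"]
    if len(daily) < 2:
        raise ValueError("need at least two days of production spend")
    p25, p50, _p75 = quantiles(daily.values(), n=4)
    return p25, p50


# Hypothetical sample: eight days of production spend plus one training spike.
sample = [
    {"day": d, "workload": "production_inference", "usd": 100.0 + d}
    for d in range(1, 9)
]
sample.append({"day": 3, "workload": "training", "usd": 5_000.0})

low, high = committable_floor(sample)
print(f"committable floor: ${low:,.2f} to ${high:,.2f} per day")
```

The training spike never reaches the floor because it is filtered out before the percentiles are taken, which is the point of splitting the export by workload.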
Applied to a representative estate, the program looks like the table below. Each workload is optimized by tier and mode first, then the steady remainder is committed.
| Workload | Unmanaged annual | Primary lever | Optimized annual |
|---|---|---|---|
| Production inference (Gemini serving) | $3.00M | Model tiering plus 3 year spend CUD | $2.04M |
| Batch and offline generation | $1.20M | Batch prediction, 50% off | $0.78M |
| Fine tuning and tuned model hosting | $0.80M | Right size, retire idle endpoints | $0.62M |
| Provisioned throughput (latency critical) | $0.60M | GSU sized to measured floor | $0.45M |
| Embeddings and grounding | $0.40M | Context caching on repeated input | $0.31M |
| Total estate | $6.00M | tier, mode, commit | $4.20M |
Benchmark scenario, not a quote. Benchmark ranges: Redress Compliance advisory engagement file, 2024 to 2025.
Chart B. Representative $6.0M estate. Optimized run rate $4.20M, a $1.8M or about 30 percent reduction, with no commit above verified steady inference.
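The table arithmetic can be checked in a few lines. The figures below are copied from the representative estate table, in millions of dollars per year; the per workload percentages are derived from them and are not separately benchmarked.

```python
# (unmanaged, optimized) annual spend in $M, from the representative estate table.
ESTATE = {
    "Production inference (Gemini serving)": (3.00, 2.04),
    "Batch and offline generation": (1.20, 0.78),
    "Fine tuning and tuned model hosting": (0.80, 0.62),
    "Provisioned throughput (latency critical)": (0.60, 0.45),
    "Embeddings and grounding": (0.40, 0.31),
}

total_unmanaged = sum(unmanaged for unmanaged, _ in ESTATE.values())
total_optimized = sum(optimized for _, optimized in ESTATE.values())
saving = total_unmanaged - total_optimized
saving_pct = 100 * saving / total_unmanaged

for name, (unmanaged, optimized) in ESTATE.items():
    reduction = 100 * (unmanaged - optimized) / unmanaged
    print(f"{name:<44} {reduction:5.1f}% lower")

print(f"total: ${total_unmanaged:.2f}M to ${total_optimized:.2f}M, "
      f"saving ${saving:.2f}M ({saving_pct:.0f}%)")
```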
Which five contract clauses decide whether the commitment protects the budget?
The discount is one line on the order form. The five clauses below are what stand between you and a year end true up or a renewal at list. Negotiate them before you trade the discount, because Google concedes terms more easily while the headline rate is still open.
| Clause | What it protects | Buyer side ask |
|---|---|---|
| Price protection | You keep mid term model price cuts | Discount pinned to net spend at then current published rates |
| Model and mode re mix | Freedom to shift across Gemini tiers, batch, and PT | Quarterly re mix at no penalty, in writing |
| Provisioned throughput true down | Relief on idle GSU as traffic shifts | Quarterly GSU adjustment to measured throughput |
| Renewal rate | No snap to list at the end of term | Renew at the original negotiated rate, defined notice |
| Shortfall and attainment | Grace if the AI ramp slips | Carry forward window, Vertex and Marketplace count to commit |
The auto renewal trap
Spend commitments renew automatically at the end of the term. Many agreements default that renewal to then current list rather than the rate you negotiated, and the cancellation window is short. Put the renewal rate and a clear cancellation notice in writing, or you re sign your discount away by silence.
What discount benchmarks hold across renewal and exit scenarios?
Benchmarks from the engagement file frame what good looks like across scenarios. The deepest savings come from architecture, not the commit, and the worst outcomes come from over committing a volatile AI forecast.
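- Batch prediction, 50 percent: flat discount on any workload that tolerates a 24 hour window.
- Context caching, up to 90 percent: on repeated long prompts, net of cache storage.
- Spend based CUD, up to about 40 percent at three years: on eligible Vertex usage committed to the verified floor.
- Worked estate, about 30 percent: the $6M estate at $4.2M, tier and mode led.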
Chart C. Effective discount achieved by scenario. Source: Redress Compliance advisory engagement file, 2024 to 2025.
What buyer side moves neutralize Google's standard tactics?
The account team runs a familiar playbook on AI deals. Each move has a counter that costs you nothing but preparation.
| Google move | What it sounds like | Buyer counter |
|---|---|---|
| Anchor to AI growth | Commit to your projected AI ramp for the deepest rate | Commit verified production floor only, ramp on demand |
| Lead with Pro | Default the architecture to the flagship model | Tier by task, reserve Pro for traffic that needs it |
| Sell PT as savings | Provisioned throughput stabilizes your cost | Size GSU to measured floor, keep burst on demand |
| Discount for silence on terms | Lead with the headline percentage | Hold the discount counter until the five clauses close |
| Bundle the renewal | Roll AI into the broader Google agreement | Keep Vertex terms itemized and separately exitable |
Where the common advice on Vertex AI and Gemini deals is wrong
The standard reseller and account team pitch is to lock a large three year Vertex spend commitment now to capture the 40 percent rate. We disagree. In the AI deals we have benchmarked, a forecast anchored three year AI commit is the most reliable way to overpay.
Two reasons. First, model prices are falling, not rising. New models keep arriving cheaper, so a three year commit on today's rates locks you above a falling market unless the price protection clause is airtight.
Second, AI consumption is the least predictable line in the estate. Committing a ramp you cannot hold turns into unrecoverable waste under the consumption model, because an unused committed dollar is billed and gone.
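The over commit risk has a simple break even. The sketch below is a simplified model, not Google's billing logic: it assumes the commit is paid in full at the discounted rate whether or not it is used, and that any usage above the commit bills at list. Amounts are in millions of dollars and are illustrative.

```python
def cost_with_commit(usage_at_list: float, committed_at_list: float,
                     discount: float) -> float:
    """Annual cost under a spend commit, in the same units as the inputs.

    The commit is billed in full at the discounted rate; overage bills at list.
    """
    commit_bill = committed_at_list * (1.0 - discount)
    overage = max(0.0, usage_at_list - committed_at_list)
    return commit_bill + overage


committed = 3.0   # $M of list value committed, illustrative
discount = 0.40

for usage in (1.5, 1.8, 3.0):
    with_commit = cost_with_commit(usage, committed, discount)
    print(f"usage ${usage:.1f}M at list: commit ${with_commit:.2f}M "
          f"vs on demand ${usage:.2f}M")
```

Under these assumptions a 40 percent discount breaks even at 60 percent attainment. Below that, the commit costs more than staying on demand.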
The buyer side move is to win the architecture first. Tiering, batch, and caching deliver more than the commit, and they need no three year promise. Commit only the verified inference floor, and keep the volatile layer free.
The contrarian line: on AI, the architecture beats the discount. A Flash Lite workload at list costs less than a Pro workload at 40 percent off. Route the traffic before you sign the commit.
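The comparison can be run on the list rates from the pricing table. The sketch below prices one hypothetical monthly workload on Flash Lite at list and on Pro at standard context with a 40 percent discount; the token volumes are placeholders.

```python
# List rates in USD per 1M tokens, from the pricing table.
FLASH_LITE = {"input": 0.10, "output": 0.40}
PRO = {"input": 1.25, "output": 10.00}


def workload_cost(rates: dict, input_millions: float, output_millions: float,
                  discount: float = 0.0) -> float:
    """Cost in USD for a volume given in millions of tokens."""
    gross = input_millions * rates["input"] + output_millions * rates["output"]
    return gross * (1.0 - discount)


# Hypothetical monthly volume: 1,000M input tokens and 200M output tokens.
flash_lite_at_list = workload_cost(FLASH_LITE, 1_000, 200)
pro_discounted = workload_cost(PRO, 1_000, 200, discount=0.40)

print(f"Flash Lite at list: ${flash_lite_at_list:,.0f}")
print(f"Pro at 40% off:     ${pro_discounted:,.0f}")
```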
What is your BATNA and the side letter language?
An AI negotiation without a credible alternative is a price taker negotiation. Build the BATNA before the first call, and make sure the account team knows it exists. Gemini is strong, but it is not the only model that clears most enterprise tasks.
- Azure AI Foundry: OpenAI and partner models on an Azure consumption commitment, sized to the same workload.
- AWS Bedrock: Anthropic Claude and Amazon Nova for portable inference, priced for comparison.
- Direct model APIs: Anthropic and OpenAI direct as a clean price reference outside the hyperscaler bundle.
- Open weight self host: Llama or Gemma on GKE for the commodity tier, capping the price of the low end.
- Stay on demand: the always available alternative that caps your downside if the terms are wrong.
Side letter language we use
Pin the protections in a side letter when the standard order form will not carry them. Three clauses do most of the work.
- Price protection: the discount applies to net spend at then current published rates, so model price cuts flow to you.
- GSU true down: provisioned throughput adjustable quarterly to measured throughput, no idle penalty.
- Attainment: Vertex spend and eligible Marketplace purchases count toward the commit, with a carry forward window on a shortfall.
Get these in writing or treat the discount as unprotected.
Recommendation. Win the architecture, size the commit to your verified floor, then trade the discount.
- Do this first: re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and cap context windows below 200K to avoid the silent doubling.
- Do this next: commit only the verified production inference floor, and close the five clauses (price protection, re mix, GSU true down, renewal rate, and shortfall) before you accept the headline percentage.
We work on the buyer side only, and we are glad to tie a meaningful part of the fee to delivered value.
Frequently asked questions
Why does Gemini 2.5 Pro cost more above 200K tokens?
Gemini 2.5 Pro carries a long context tier that doubles the input rate to $2.50 and raises the output rate by half to $15.00 per million tokens once a single request crosses 200,000 tokens. It is automatic, with no separate SKU and no alert, so an input heavy retrieval pipeline that grows its context can nearly double its own bill. Cap context or price the cap.
Does a spend based CUD discount the token price or the dollar?
It discounts the dollar of eligible spend, not the underlying token rate. That means a mid term model price cut only benefits you if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price as the market falls.
Is provisioned throughput cheaper than pay as you go?
Only on traffic that actually uses it. Provisioned throughput reserves capacity in generative AI scale units and bills that capacity whether or not requests arrive, so idle GSU is shelfware. Size it to the measured floor of latency critical traffic and leave burst on pay as you go.
What is the single biggest lever on a Vertex AI bill?
Model tiering, not the discount. Output on Flash Lite is $0.40 per million tokens against $10.00 on Pro, a 25 times gap. Routing commodity traffic off Pro saves more than any commit, and it needs no contract. Win the architecture before you negotiate the rate.