Google Cloud | Vertex AI and Gemini White Paper

Vertex AI and Gemini: the buyer side framework for your next AI commitment

Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, and the moment a request crosses 200,000 tokens of context the input rate doubles and the output rate rises by half. Token tiering, the committed use overlay, and five clauses decide whether your AI budget holds.

Prepared by Redress Compliance · June 2026 · Representative Vertex AI estate scenario (benchmark scenario, not a quote)

Executive summary

The Vertex AI and Gemini bill is decided in three places, and the discount is the least important of them.

The first is model tiering. Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, while Gemini 2.5 Flash Lite lists at $0.10 and $0.40. Routing the wrong workload to Pro is a larger overpay than any commit can recover.

The second is the consumption shape. Batch prediction runs at a 50 percent discount, context caching at up to 90 percent off cached input, and provisioned throughput reserves capacity that bills whether or not you use it.

The third is the commit. A spend based committed use discount reaches about 40 percent at three years on eligible Vertex spend, and it discounts the dollar, not the token rate.

The trap most teams miss is the long context price step. A request that crosses 200,000 tokens moves Gemini 2.5 Pro to $2.50 input and $15.00 output, doubling the input rate and raising the output rate by half, without a line item to warn you.

On a representative $6.0M annual Vertex AI estate, a tiering, consumption, and commit program cuts run rate to $4.2M, a saving of $1.8M or about 30 percent, without committing a dollar above verified steady inference. The leverage is in the architecture and five clauses, not the headline discount.

  • $10.00: Gemini 2.5 Pro output per million tokens, standard context, before any commit
  • 40%: three year spend committed use discount on eligible Vertex AI usage
  • $1.8M: annual overpay on a representative $6M Vertex estate left unmanaged
  • 5: contract clauses that decide whether the AI commitment protects the budget

1.

How is Vertex AI and Gemini priced in 2026?

Vertex AI bills generative usage per token, and the token rate depends on the model and the size of the request. The published rates on the Gemini Enterprise Agent Platform pricing page are the ceiling. They are the rates you pay before any commit, and they are the reference points the account team will anchor you to.

Four meters carry most of an enterprise bill: the input token rate, the output token rate, the long context surcharge, and the consumption mode you choose. Output tokens are the line that hurts, because output lists at four to eight times the input rate on the models below.

| Model | Input / 1M tokens | Output / 1M tokens | Best fit |
| --- | --- | --- | --- |
| Gemini 2.5 Flash Lite | $0.10 | $0.40 | High volume classification, extraction, routing |
| Gemini 3.5 Flash | $1.50 | $9.00 | Mainstream agents and retrieval workloads |
| Gemini 2.5 Pro (to 200K context) | $1.25 | $10.00 | Complex reasoning at standard context |
| Gemini 2.5 Pro (over 200K context) | $2.50 | $15.00 | Long context, the higher rate applies automatically |

Mechanic the quote will not explain. The long context tier is not a separate SKU you opt into. When a single request crosses 200,000 tokens, Gemini 2.5 Pro silently moves to $2.50 input and $15.00 output: double the standard input rate and one and a half times the standard output rate. A retrieval pipeline that grows its context window over time can double its own bill with no config change and no alert. Cap context, or price the cap.

[Chart A: bar chart of input and output list price per million tokens, on a $0 to $15 axis, for 2.5 Flash Lite, 3.5 Flash, 2.5 Pro standard, and 2.5 Pro over 200K.]

Chart A. List token price per million tokens by model. The red pair is Gemini 2.5 Pro over 200K context, where input doubles and output rises by half.
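
The long context step is easy to check by hand. The sketch below prices one Gemini 2.5 Pro request at the list rates in the table, under the assumption that the higher rate applies to the whole request, input and output, once the prompt exceeds 200,000 tokens. The function name and the sample token counts are illustrative, not Google's.

```python
# List rates in USD per million tokens, taken from the pricing table.
STANDARD = {"input": 1.25, "output": 10.00}
LONG_CONTEXT = {"input": 2.50, "output": 15.00}
LONG_CONTEXT_THRESHOLD = 200_000  # tokens


def pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the list cost in USD of one Gemini 2.5 Pro request.

    Assumption: the long context rate applies to the whole request,
    input and output, once the prompt exceeds the threshold.
    """
    rates = LONG_CONTEXT if input_tokens > LONG_CONTEXT_THRESHOLD else STANDARD
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


below = pro_request_cost(199_000, 2_000)
above = pro_request_cost(201_000, 2_000)
print(f"199K prompt: ${below:.4f}")
print(f"201K prompt: ${above:.4f}")
print(f"step: {above / below:.2f}x")
```

A prompt that grows from 199,000 to 201,000 tokens, with 2,000 output tokens, moves from about $0.27 to about $0.53 at list: a step of roughly 1.98 times on a 1 percent increase in context.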

2.

Why does model tiering decide most of the bill?

Model tiering is the single largest lever, and it costs you nothing to negotiate because it is an architecture choice, not a contract term. The gap between Flash Lite and Pro on output is 25 times. Most enterprise traffic does not need Pro.

The discipline is to route by task, not by default. Classification, extraction, routing, and summarization run on Flash Lite or Flash. Reserve Pro for the share of traffic that genuinely needs deep reasoning or long context.

  • Commodity tier: Flash Lite for structured extraction and routing, where $0.40 output per million is the right meter.
  • Workhorse tier: 3.5 Flash for the bulk of agent and retrieval traffic, balancing quality and cost.
  • Reasoning tier: Pro only where the task fails on Flash, and with a hard context cap to avoid the 200K price step.
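
To see what routing is worth, the sketch below prices the same monthly traffic twice: once with everything on Pro, once routed by task. The rates are the list rates from the pricing table. The task names, the routing table, and the token volumes are hypothetical, and every request is assumed to stay at standard context.

```python
# List rates in USD per million tokens (input, output), from the pricing table.
RATES = {
    "flash_lite": (0.10, 0.40),
    "flash": (1.50, 9.00),
    "pro": (1.25, 10.00),
}

# Hypothetical routing table: task type -> model tier.
TIER_BY_TASK = {
    "classification": "flash_lite",
    "agent": "flash",
    "deep_reasoning": "pro",
}

# Hypothetical monthly volume: task -> (input, output) in millions of tokens.
TRAFFIC = {
    "classification": (4000, 200),
    "agent": (1000, 300),
    "deep_reasoning": (200, 100),
}


def monthly_cost(traffic, tier_for_task):
    """Sum list cost in USD, pricing each task on the tier chosen for it."""
    total = 0.0
    for task, (input_m, output_m) in traffic.items():
        input_rate, output_rate = RATES[tier_for_task(task)]
        total += input_m * input_rate + output_m * output_rate
    return total


all_pro = monthly_cost(TRAFFIC, lambda task: "pro")
routed = monthly_cost(TRAFFIC, lambda task: TIER_BY_TASK[task])
print(f"all on Pro: ${all_pro:,.0f}")
print(f"routed:     ${routed:,.0f}")
print(f"saving:     {1 - routed / all_pro:.0%}")
```

On this mix the routed bill is $5,930 against $12,500 on Pro alone, and almost all of the saving comes from moving classification to Flash Lite.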

The consumption modes that change the rate

Beyond the model, the way you call it changes the price. Three modes move the meter materially, per Google's published mechanics.

  • Batch prediction: a flat 50 percent discount against real time, with a 24 hour service window, for any workload that is not interactive.
  • Context caching: up to 90 percent off cached input tokens for repeated long prompts, though cache storage carries its own per hour charge.
  • Provisioned throughput: reserved capacity in generative AI scale units for latency stability, described in Google's provisioned throughput documentation.
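
The first two modes reduce to simple multipliers on the list rate. A minimal sketch, assuming the full 90 percent applies to every cached token (the published figure is an upper bound) and ignoring the hourly cache storage charge. The function names and the 80 percent cached share are illustrative.

```python
BATCH_DISCOUNT = 0.50  # flat discount against real time
CACHE_DISCOUNT = 0.90  # upper bound on cached input tokens


def batch_rate(list_rate: float) -> float:
    """Rate per million tokens when the job runs as batch prediction."""
    return list_rate * (1 - BATCH_DISCOUNT)


def cached_input_rate(list_rate: float, cached_share: float) -> float:
    """Blended input rate when a share of input tokens is served from cache.

    Cache storage is billed separately and is not included here.
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    return list_rate * (1 - cached_share * CACHE_DISCOUNT)


# Gemini 2.5 Pro list rates at standard context, USD per million tokens.
print(f"batch output: ${batch_rate(10.00):.2f}")
print(f"cached input: ${cached_input_rate(1.25, cached_share=0.8):.2f}")
```

The cached figure is a best case. Net it against the storage charge for the hours the cache is held before counting it as a saving.
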
3.

How do committed use discounts and provisioned throughput apply to Vertex AI?

Vertex AI spend folds into Google Cloud committed use, and that is where the contract leverage lives. A spend based committed use discount runs about 20 percent at one year and up to about 40 percent at three years on eligible usage, the same band as compute.

The mechanic that matters: a spend based commit discounts the dollar, not the token rate. If Google cuts the price of a model mid term, that cut flows to you only if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price on a falling market.
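
The effect of the pin can be shown with one line of arithmetic. The sketch below follows the mechanic as described above: with the pin, the discount applies to the current published rate; without it, the buyer keeps paying the discounted rate from signing. The price cut from $10.00 to $7.00 is hypothetical.

```python
def effective_rate(list_at_signing, list_now, discount, pinned):
    """Net rate per million tokens after the commit discount.

    pinned=True models a clause that applies the discount to net spend at
    then current published rates; pinned=False models the old effective price.
    """
    base = list_now if pinned else list_at_signing
    return base * (1 - discount)


with_pin = effective_rate(10.00, 7.00, discount=0.40, pinned=True)
without_pin = effective_rate(10.00, 7.00, discount=0.40, pinned=False)
print(f"with pin:    ${with_pin:.2f}")
print(f"without pin: ${without_pin:.2f}")
```

Without the pin the buyer pays $6.00 against a new list of $7.00, so the 40 percent discount has shrunk to about 14 percent.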

| Instrument | What it commits | Discount, 1 year | Discount, 3 year | Idle risk |
| --- | --- | --- | --- | --- |
| Spend based CUD (Vertex) | A dollar amount of eligible Vertex usage | ~20% | up to ~40% | Unused commit is billed and lost |
| Provisioned throughput | Reserved GSU capacity, monthly or annual | 20 to 45% | n/a | Idle GSU is pure shelfware |
| Batch prediction | Nothing, per call mode | flat 50% off real time | flat 50% off real time | None, no commit |
| Context caching | Nothing, cache reuse | up to 90% off cached input | up to 90% off cached input | Cache storage per hour |

Lever the quote will not surface. Provisioned throughput is sold as cost stability, but a generative AI scale unit (GSU) bills the reserved capacity whether or not traffic uses it, the same trap as an over sized compute commit. Size GSUs to the measured floor of latency critical traffic only, and leave burst on pay as you go. Size from measurement, using Google's throughput calculation guidance, never from a forecast.
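
The break even point follows directly from the discount. A minimal sketch, assuming the provisioned throughput discount is quoted against pay as you go cost for the same capacity at full utilization. Under that assumption the reservation only wins when utilization exceeds one minus the discount. The function names and the $100,000 figure are illustrative.

```python
def breakeven_utilization(pt_discount: float) -> float:
    """Utilization above which reserved capacity beats pay as you go."""
    return 1 - pt_discount


def compare(payg_cost_at_full_use: float, pt_discount: float, utilization: float):
    """Return (provisioned cost, pay as you go cost) for one billing period."""
    provisioned = payg_cost_at_full_use * (1 - pt_discount)  # billed regardless of use
    on_demand = payg_cost_at_full_use * utilization          # billed only for traffic
    return provisioned, on_demand


for discount in (0.20, 0.45):
    print(f"{discount:.0%} discount breaks even at "
          f"{breakeven_utilization(discount):.0%} utilization")

provisioned, on_demand = compare(100_000, pt_discount=0.30, utilization=0.50)
print(f"provisioned ${provisioned:,.0f} vs on demand ${on_demand:,.0f}")
```

Across the 20 to 45 percent band in the table, the reservation needs 55 to 80 percent sustained utilization to pay for itself.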

4.

What does the fine tuning and customization spend really cost?

Fine tuning is priced in two parts, and buyers routinely scope only the first. The first part is the tuning job itself, billed on the tokens processed during training. The second is the ongoing cost of serving the tuned model, which carries its own inference meter for the life of the deployment.

The serving cost is the one that compounds. A tuned model you stand up in 2026 keeps billing inference every month until you retire it, so a customization that looked cheap as a one off training run becomes a standing line item.

  • Training spend: a finite cost, sized to the tuning dataset and epochs.
  • Serving spend: a recurring inference cost on the tuned endpoint, the part that needs a retirement plan.
  • Alternative first: test prompt engineering, grounding, and context caching before paying to tune, since the cheaper path often closes the quality gap.
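
The two part cost is worth writing down before approving a tuning job. The sketch below is a simple model with hypothetical figures: a one off training cost plus a flat monthly serving cost for as long as the endpoint stays live.

```python
import math


def tuned_model_cost(training_cost: float, monthly_serving_cost: float, months_live: int) -> float:
    """Total cost of a tuned model: one off training plus recurring serving."""
    return training_cost + monthly_serving_cost * months_live


def months_until_serving_exceeds_training(training_cost: float, monthly_serving_cost: float) -> int:
    """First month in which cumulative serving spend is larger than training spend."""
    return math.floor(training_cost / monthly_serving_cost) + 1


# Hypothetical: a $20,000 tuning job served at $6,000 per month.
print(tuned_model_cost(20_000, 6_000, months_live=12))
print(months_until_serving_exceeds_training(20_000, 6_000))
```

In this example serving overtakes training in month four, and a year of hosting costs more than three and a half times the tuning job.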

For the published per token and per node mechanics, work from the Gemini API pricing reference and confirm the Vertex equivalent before you commit, because the two surfaces meter differently.

5.

How do you build a consumption baseline that survives Google scrutiny?

The baseline is the whole negotiation. A commit sized to a Google growth forecast is sized to Google's interest. A commit sized to your verified consumption floor is sized to yours, and it is the only number you can defend when the account team pushes back.

Build the floor from twelve months of detailed billing export, split by model, by mode, and by project. Commit only the steady inference that runs every day in production. Keep training, experimentation, and seasonal spikes on pay as you go.

Weeks 1 to 4

Baseline

Export twelve months of Vertex billing. Split by model, mode, and project. Set the committable floor at the p25 to p50 of steady production inference, stripped of training and one off spikes.

Weeks 5 to 8

Architect and clause

Re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and size GSUs to the floor. Draft the five clauses before pricing.

Weeks 9 to 12

Negotiate and stage

Benchmark the opening quote, commit the verified floor at three years, keep the variable tier on one year or pay as you go, and hold the discount counter until the clauses close.
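
The baseline step in weeks 1 to 4 can be sketched as follows. The row layout (day, kind, usd) and the kind labels are hypothetical stand ins for whatever your billing export provides. The quartiles come from Python's statistics.quantiles with the inclusive method, which is one of several reasonable percentile conventions.

```python
from collections import defaultdict
import statistics


def committable_floor(rows):
    """Return (p25, p50) of daily production inference spend in USD.

    rows: iterable of (day, kind, usd) tuples. Only rows whose kind is
    "production_inference" count toward the floor, so training and
    experimentation never inflate the commit.
    """
    daily = defaultdict(float)
    for day, kind, usd in rows:
        if kind == "production_inference":
            daily[day] += usd
    p25, p50, _p75 = statistics.quantiles(daily.values(), n=4, method="inclusive")
    return p25, p50


# Hypothetical export: nine days of production spend plus one training spike.
rows = [(f"2026-01-0{d}", "production_inference", 90.0 + 10 * d) for d in range(1, 10)]
rows.append(("2026-01-05", "training", 5000.0))

p25, p50 = committable_floor(rows)
print(f"commit between ${p25:,.0f} and ${p50:,.0f} per day")
```

The training spike on day five never reaches the floor, because it is filtered out before the quartiles are taken.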

Applied to a representative estate, the program looks like the table below. Each workload is optimized by tier and mode first, then the steady remainder is committed.

| Workload | Unmanaged annual | Primary lever | Optimized annual |
| --- | --- | --- | --- |
| Production inference (Gemini serving) | $3.00M | Model tiering plus 3 year spend CUD | $2.04M |
| Batch and offline generation | $1.20M | Batch prediction, 50% off | $0.78M |
| Fine tuning and tuned model hosting | $0.80M | Right size, retire idle endpoints | $0.62M |
| Provisioned throughput (latency critical) | $0.60M | GSU sized to measured floor | $0.45M |
| Embeddings and grounding | $0.40M | Context caching on repeated input | $0.31M |
| Total estate | $6.00M | tier, mode, commit | $4.20M |

Benchmark scenario, not a quote. Benchmark ranges: Redress Compliance advisory engagement file, 2024 to 2025.

[Chart B: bar chart of unmanaged versus optimized annual spend by workload, on a $0 to $3M axis, for Production, Batch, Tuning, Throughput, and Embeddings.]

Chart B. Representative $6.0M estate. Optimized run rate $4.20M, a $1.8M or about 30 percent reduction, with no commit above verified steady inference.
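
The table totals can be checked in a few lines. The figures below are copied from the estate table, in $M. Only the variable names are added.

```python
# (unmanaged, optimized) annual spend in $M, from the estate table.
ESTATE = {
    "Production inference": (3.00, 2.04),
    "Batch and offline generation": (1.20, 0.78),
    "Fine tuning and hosting": (0.80, 0.62),
    "Provisioned throughput": (0.60, 0.45),
    "Embeddings and grounding": (0.40, 0.31),
}

unmanaged = sum(before for before, _ in ESTATE.values())
optimized = sum(after for _, after in ESTATE.values())
saving = unmanaged - optimized

for name, (before, after) in ESTATE.items():
    print(f"{name}: {1 - after / before:.1%} lower")
print(f"total: ${unmanaged:.2f}M -> ${optimized:.2f}M, "
      f"saving ${saving:.2f}M ({saving / unmanaged:.0%})")
```

Production and batch together account for $1.38M of the $1.8M saving, which is why tiering and mode come before the commit.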

6.

Which five contract clauses decide whether the commitment protects the budget?

The discount is one line on the order form. The five clauses below are what stand between you and a year end true up or a renewal at list. Negotiate them before you trade the discount, because Google concedes terms more easily while the headline rate is still open.

| Clause | What it protects | Buyer side ask |
| --- | --- | --- |
| Price protection | You keep mid term model price cuts | Discount pinned to net spend at then current published rates |
| Model and mode re mix | Freedom to shift across Gemini tiers, batch, and PT | Quarterly re mix at no penalty, in writing |
| Provisioned throughput true down | Relief on idle GSU as traffic shifts | Quarterly GSU adjustment to measured throughput |
| Renewal rate | No snap to list at the end of term | Renew at the original negotiated rate, defined notice |
| Shortfall and attainment | Grace if the AI ramp slips | Carry forward window, Vertex and Marketplace count to commit |

The auto renewal trap

Spend commitments renew automatically at the end of the term. Many agreements default that renewal to then current list rather than the rate you negotiated, and the cancellation window is short. Put the renewal rate and a clear cancellation notice in writing, or you sign your discount away by silence.

7.

What discount benchmarks hold across renewal and exit scenarios?

Benchmarks from the engagement file frame what good looks like across scenarios. The deepest savings come from architecture, not the commit, and the worst outcomes come from over committing a volatile AI forecast.

50%
Batch prediction vs real time

Flat discount on any workload that tolerates a 24 hour window.

90%
Context caching on cached input

Up to this on repeated long prompts, net of cache storage.

40%
Spend CUD, three year

On eligible Vertex usage committed to the verified floor.

30%
Blended estate saving

The worked $6M estate at $4.2M, tier and mode led.

Effective discount achieved by scenario, on a 0 to 60 percent axis:

  • Greenfield AI commit: 25 to 40%
  • Renewal re commit: 12 to 30%
  • Multi cloud BATNA: 25 to 45%
  • Over committed shortfall: 5 to 15%

Chart C. Effective discount achieved by scenario. Source: Redress Compliance advisory engagement file, 2024 to 2025.

8.

What buyer side moves neutralize Google's standard tactics?

The account team runs a familiar playbook on AI deals. Each move has a counter that costs you nothing but preparation.

| Google move | What it sounds like | Buyer counter |
| --- | --- | --- |
| Anchor to AI growth | Commit to your projected AI ramp for the deepest rate | Commit verified production floor only, ramp on demand |
| Lead with Pro | Default the architecture to the flagship model | Tier by task, reserve Pro for traffic that needs it |
| Sell PT as savings | Provisioned throughput stabilizes your cost | Size GSU to measured floor, keep burst on demand |
| Discount for silence on terms | Lead with the headline percentage | Hold the discount counter until the five clauses close |
| Bundle the renewal | Roll AI into the broader Google agreement | Keep Vertex terms itemized and separately exitable |

9.

Where the common advice on Vertex AI and Gemini deals is wrong

The standard reseller and account team pitch is to lock a large three year Vertex spend commitment now to capture the 40 percent rate. We disagree. In the AI deals we have benchmarked, a forecast anchored three year AI commit is the most reliable way to overpay.

Two reasons. First, model prices are falling, not rising. New models keep arriving cheaper, so a three year commit on today's rates locks you above a falling market unless the price protection clause is airtight.

Second, AI consumption is the least predictable line in the estate. Committing a ramp you cannot hold turns into unrecoverable waste under the consumption model, because an unused committed dollar is billed and gone.
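
A simplified model shows how fast an over sized commit erodes the headline rate. It assumes the committed amount is paid in full regardless of use, that each committed dollar covers 1 / (1 - discount) dollars of list value, and that usage beyond that is billed at list. Real commit mechanics are more granular, so treat this as an illustration. The figures are hypothetical.

```python
def effective_discount(commit_paid: float, headline_discount: float, list_value_used: float) -> float:
    """Discount actually realized against the list value of what was consumed."""
    if list_value_used <= 0:
        raise ValueError("list_value_used must be positive")
    covered = commit_paid / (1 - headline_discount)   # list value the commit buys
    overage = max(0.0, list_value_used - covered)     # usage beyond the commit, at list
    paid = commit_paid + overage                      # the commit is billed even if unused
    return 1 - paid / list_value_used


# Hypothetical: $3.0M committed at a 40 percent headline rate covers $5.0M of list value.
for used in (5.0, 3.5, 3.0):
    print(f"list value used ${used:.1f}M -> effective discount "
          f"{effective_discount(3.0, 0.40, used):.1%}")
```

When consumption reaches only $3.5M of list value, the 40 percent headline rate falls to about 14 percent, and at $3.0M it is gone.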

The buyer side move is to win the architecture first. Tiering, batch, and caching deliver more than the commit, and they need no three year promise. Commit only the verified inference floor, and keep the volatile layer free.

The contrarian line: on AI, the architecture beats the discount. A Flash Lite workload at list costs less than a Pro workload at 40 percent off: $0.40 against $6.00 per million output tokens. Route the traffic before you sign the commit.

10.

What is your BATNA and the side letter language?

An AI negotiation without a credible alternative is a price taker negotiation. Build the BATNA before the first call, and make sure the account team knows it exists. Gemini is strong, but it is not the only model that clears most enterprise tasks.

  • Azure AI Foundry: OpenAI and partner models on an Azure consumption commitment, sized to the same workload.
  • AWS Bedrock: Anthropic Claude and Amazon Nova for portable inference, priced for comparison.
  • Direct model APIs: Anthropic and OpenAI direct as a clean price reference outside the hyperscaler bundle.
  • Open weight self host: Llama or Gemma on GKE for the commodity tier, capping the price of the low end.
  • Stay on demand: the always available alternative that caps your downside if the terms are wrong.

Side letter language we use

Pin the protections in a side letter when the standard order form will not carry them. Three clauses do most of the work.

  • Price protection: the discount applies to net spend at then current published rates, so model price cuts flow to you.
  • GSU true down: provisioned throughput adjustable quarterly to measured throughput, no idle penalty.
  • Attainment: Vertex spend and eligible Marketplace purchases count toward the commit, with a carry forward window on a shortfall.

Get these in writing or treat the discount as unprotected.

Recommendation. Win the architecture, size the commit to your verified floor, then trade the discount.

  • Do this first: re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and cap context windows below 200K to avoid the silent price step.
  • Do this next: commit only the verified production inference floor, close price protection, re mix, GSU true down, renewal rate, and shortfall before you accept the headline percentage.

We work on the buyer side only, and we are glad to tie a meaningful part of the fee to delivered value.

11.

Frequently asked questions

Why does Gemini 2.5 Pro cost more above 200K tokens?

Gemini 2.5 Pro carries a long context tier that raises the rate to $2.50 input and $15.00 output per million tokens, double the standard input rate and one and a half times the standard output rate, once a single request crosses 200,000 tokens. It is automatic, with no separate SKU and no alert, so a retrieval pipeline that grows its context can double its own bill. Cap context or price the cap.

Does a spend based CUD discount the token price or the dollar?

It discounts the dollar of eligible spend, not the underlying token rate. That means a mid term model price cut only benefits you if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price as the market falls.

Is provisioned throughput cheaper than pay as you go?

Only on traffic that actually uses it. Provisioned throughput reserves capacity in generative AI scale units and bills that capacity whether or not requests arrive, so idle GSU is shelfware. Size it to the measured floor of latency critical traffic and leave burst on pay as you go.

What is the single biggest lever on a Vertex AI bill?

Model tiering, not the discount. Output on Flash Lite is $0.40 per million tokens against $10.00 on Pro, a 25 times gap. Routing commodity traffic off Pro saves more than any commit, and it needs no contract. Win the architecture before you negotiate the rate.

Prepared by Redress Compliance · redresscompliance.com · Vertex AI and Gemini Negotiation · June 2026