Google Cloud | Vertex AI and Gemini White Paper

Vertex AI and Gemini: the buyer side framework for your next AI commitment

Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, and the input rate doubles, with output rising 50 percent, the moment a request crosses 200,000 tokens of context. Token tiering, the committed use overlay, and five clauses decide whether your AI budget holds.

Prepared by Redress Compliance · June 2026 · Representative Vertex AI estate scenario (benchmark scenario, not a quote)

Executive summary

The Vertex AI and Gemini bill is decided in three places, and the discount is the least important of them.

The first is model tiering. Gemini 2.5 Pro lists at $1.25 input and $10.00 output per million tokens, while Gemini 2.5 Flash Lite lists at $0.10 and $0.40. Routing the wrong workload to Pro is a larger overpay than any commit can recover.

The second is the consumption shape. Batch prediction runs at a 50 percent discount, context caching at up to 90 percent off cached input, and provisioned throughput reserves capacity that bills whether or not you use it.

The third is the commit. A spend based committed use discount reaches about 40 percent at three years on eligible Vertex spend, and it discounts the dollar, not the token rate.

The trap most teams miss is the long context price step. A request that crosses 200,000 tokens moves Gemini 2.5 Pro to $2.50 input and $15.00 output, doubling the input rate and raising the output rate by 50 percent without a line item to warn you.

On a representative $6.0M annual Vertex AI estate, a tiering, consumption, and commit program cuts run rate to $4.2M, a saving of $1.8M or about 30 percent, without committing a dollar above verified steady inference. The leverage is in the architecture and five clauses, not the headline discount.

$10.00: Gemini 2.5 Pro output per million tokens, standard context, before any commit
40%: Three year spend committed use discount on eligible Vertex AI usage
$1.8M: Annual overpay on a representative $6M Vertex estate left unmanaged
5: Contract clauses that decide whether the AI commitment protects the budget

How is Vertex AI and Gemini priced in 2026?

Vertex AI bills generative usage per token, and the token rate depends on the model and the size of the request. The published rates on the Gemini Enterprise Agent Platform pricing page are the ceiling. They are the rates you pay before any commit, and they are the reference points the account team will anchor you to.

Four meters carry most of an enterprise bill: the input token rate, the output token rate, the long context surcharge, and the consumption mode you choose. Output tokens are the line that hurts, because output lists at four to eight times the input rate on every model in the table below.

Model	Input / 1M	Output / 1M	Best fit
Gemini 2.5 Flash Lite	$0.10	$0.40	High volume classification, extraction, routing
Gemini 3.5 Flash	$1.50	$9.00	Mainstream agents and retrieval workloads
Gemini 2.5 Pro (to 200K context)	$1.25	$10.00	Complex reasoning at standard context
Gemini 2.5 Pro (over 200K context)	$2.50	$15.00	Long context, input doubles and output rises 50% automatically

Mechanic the quote will not explain. The long context tier is not a separate SKU you opt into. When a single request crosses 200,000 tokens, Gemini 2.5 Pro silently moves to $2.50 input and $15.00 output, double the standard input rate and 1.5 times the standard output rate. A retrieval pipeline that grows its context window over time can come close to doubling its own bill with no config change and no alert. Cap context, or price the cap.
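The price step can be sketched as a small cost function. This is a minimal illustration built on the list rates in the table above. The function name `pro_request_cost` is hypothetical, and the sketch assumes the 200,000 token threshold is tested against the request's input tokens and that the higher rates then apply to the whole request. Confirm both assumptions against Google's current pricing page before relying on it.

```python
# List rates from the pricing table, in USD per million tokens.
PRO_STANDARD = {"input": 1.25, "output": 10.00}
PRO_LONG = {"input": 2.50, "output": 15.00}

# Assumption: the tier is decided by the size of the request's input.
LONG_CONTEXT_THRESHOLD = 200_000


def pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the list cost in USD of one Gemini 2.5 Pro request."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rates = PRO_LONG
    else:
        rates = PRO_STANDARD
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


if __name__ == "__main__":
    below = pro_request_cost(190_000, 2_000)
    above = pro_request_cost(210_000, 2_000)
    print(f"190K input, 2K output: ${below:.4f}")  # $0.2575
    print(f"210K input, 2K output: ${above:.4f}")  # $0.5550
```

In this example a 10 percent growth in context more than doubles the cost of the request, because every input token is re-priced once the request crosses the threshold.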

Chart A. List token price per million tokens by model. The red pair is Gemini 2.5 Pro over 200K context, where input doubles and output rises by half.

Why does model tiering decide most of the bill?

Model tiering is the single largest lever, and it costs you nothing to negotiate because it is an architecture choice, not a contract term. The gap between Flash Lite and Pro on output is 25 times. Most enterprise traffic does not need Pro.

The discipline is to route by task, not by default. Classification, extraction, routing, and summarization run on Flash Lite or Flash. Reserve Pro for the share of traffic that genuinely needs deep reasoning or long context.

Commodity tier: Flash Lite for structured extraction and routing, where $0.40 output per million is the right meter.
Workhorse tier: 3.5 Flash for the bulk of agent and retrieval traffic, balancing quality and cost.
Reasoning tier: Pro only where the task fails on Flash, and with a hard context cap to avoid the 200K price step.
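The effect of routing by task can be estimated before any traffic moves. The sketch below prices the same token volume under different tier mixes, using the list rates from the pricing table. The function name `blended_cost`, the tier keys, the token volumes, and the 60/30/10 mix are all hypothetical, chosen only to illustrate the arithmetic.

```python
# (input rate, output rate) in USD per million tokens, standard context list prices.
RATES = {
    "flash_lite": (0.10, 0.40),
    "flash_3_5": (1.50, 9.00),
    "pro": (1.25, 10.00),
}


def blended_cost(input_mtok: float, output_mtok: float, mix: dict) -> float:
    """Price a token volume (in millions of tokens) across a tier mix.

    `mix` maps a tier name to its share of traffic. Shares should sum to 1.
    """
    total = 0.0
    for tier, share in mix.items():
        in_rate, out_rate = RATES[tier]
        total += share * (input_mtok * in_rate + output_mtok * out_rate)
    return total


if __name__ == "__main__":
    # Hypothetical month: 1,000M input tokens and 200M output tokens.
    all_pro = blended_cost(1000, 200, {"pro": 1.0})
    routed = blended_cost(
        1000, 200, {"flash_lite": 0.6, "flash_3_5": 0.3, "pro": 0.1}
    )
    print(f"All traffic on Pro: ${all_pro:,.0f}")
    print(f"Routed by task:     ${routed:,.0f}")
```

With these assumed volumes the routed mix comes to roughly $1,423 against $3,250 on Pro alone, a reduction larger than a 40 percent commit would deliver on the unrouted bill.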

The consumption modes that change the rate

Beyond the model, the way you call it changes the price. Three modes move the meter materially, per Google's published mechanics.

Batch prediction: a flat 50 percent discount against real time, with a 24 hour service window, for any workload that is not interactive.
Context caching: up to 90 percent off cached input tokens for repeated long prompts, though cache storage carries its own per hour charge.
Provisioned throughput: reserved capacity in generative AI scale units for latency stability, described in Google's provisioned throughput documentation.
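The first two modes reduce to simple arithmetic, sketched below under stated assumptions. The function names are hypothetical, the 50 percent and 90 percent figures are the headline discounts quoted above, and the cache storage charge is passed in as a plain dollar amount because its per hour rate is not given here. The sketch does not assume that batch and caching discounts stack.

```python
def batch_cost(realtime_cost: float, batch_discount: float = 0.50) -> float:
    """Cost of a workload moved from real time to batch prediction."""
    return realtime_cost * (1.0 - batch_discount)


def cached_input_cost(
    input_mtok: float,
    list_rate: float,
    cached_share: float,
    cache_discount: float = 0.90,
    storage_cost: float = 0.0,
) -> float:
    """Input cost when a share of input tokens is served from the context cache.

    `storage_cost` is the cache storage charge for the period, in USD.
    It is a hypothetical input: take the real figure from your billing export.
    """
    cached = input_mtok * cached_share * list_rate * (1.0 - cache_discount)
    uncached = input_mtok * (1.0 - cached_share) * list_rate
    return cached + uncached + storage_cost


if __name__ == "__main__":
    # Hypothetical: 1,000M input tokens on Pro at $1.25, 70 percent cache hits,
    # and $50 of cache storage for the period.
    list_cost = 1000 * 1.25
    with_cache = cached_input_cost(1000, 1.25, 0.7, storage_cost=50.0)
    print(f"Input at list:      ${list_cost:,.2f}")
    print(f"Input with caching: ${with_cache:,.2f}")
    print(f"Offline job in batch: ${batch_cost(1200.0):,.2f} against $1,200.00 real time")
```

The storage term matters: a cache with a low hit rate can cost more to hold than it saves, so the comparison should always be run net of storage.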

How do committed use discounts and provisioned throughput apply to Vertex AI?

Vertex AI spend folds into Google Cloud committed use, and that is where the contract leverage lives. A spend based committed use discount runs about 20 percent at one year and up to about 40 percent at three years on eligible usage, the same band as compute.

The mechanic that matters: a spend based commit discounts the dollar, not the token rate. If Google cuts the price of a model mid term, that cut flows to you only if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price on a falling market.
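A worked example shows what the pin is worth. The sketch below is a simplified model, not Google's billing logic: it assumes a fixed token volume, a hypothetical 30 percent mid term cut to the Pro output rate, and that without the pin you continue to pay the discount applied to the old rate, as described above. The function name `net_cost` is hypothetical.

```python
def net_cost(volume_mtok: float, list_rate: float, discount: float) -> float:
    """Discounted cost of a token volume (millions of tokens) at a given list rate."""
    return volume_mtok * list_rate * (1.0 - discount)


VOLUME_MTOK = 100   # hypothetical monthly output volume, millions of tokens
OLD_RATE = 10.00    # Gemini 2.5 Pro output list rate at signing
NEW_RATE = 7.00     # hypothetical rate after a 30 percent mid term cut
DISCOUNT = 0.40     # three year spend based CUD

# With the pin, the discount applies to the then current published rate.
pinned = net_cost(VOLUME_MTOK, NEW_RATE, DISCOUNT)

# Without the pin, the old effective price persists.
unpinned = net_cost(VOLUME_MTOK, OLD_RATE, DISCOUNT)

# What the unpinned deal is worth against the new list price.
effective_discount_unpinned = 1.0 - unpinned / (VOLUME_MTOK * NEW_RATE)

if __name__ == "__main__":
    print(f"Pinned:   ${pinned:,.2f}")
    print(f"Unpinned: ${unpinned:,.2f}")
    print(f"Effective discount without the pin: {effective_discount_unpinned:.1%}")
```

Under these assumptions the unpinned commit pays about $600 where the pinned one pays about $420, and the nominal 40 percent discount shrinks to roughly 14 percent against the new list price.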

Instrument	What it commits	1 year	3 year	Idle risk
Spend based CUD (Vertex)	A dollar amount of eligible Vertex usage	~20%	up to ~40%	Unused commit is billed and lost
Provisioned throughput	Reserved GSU capacity, monthly or annual	20 to 45%	n/a	Idle GSU is pure shelfware
Batch prediction	Nothing, per call mode	flat 50% off real time		None, no commit
Context caching	Nothing, cache reuse	up to 90% off cached input		Cache storage per hour

Lever the quote will not surface. Provisioned throughput is sold as cost stability, but a generative AI scale unit bills the reserved capacity whether or not traffic uses it, the same trap as an over sized compute commit. Size GSUs to the measured floor of latency critical traffic only, and leave burst on pay as you go. Size to measured throughput, using Google's throughput calculation guidance, never to a forecast.
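The idle risk can be expressed as a break even utilization. The sketch below assumes the 20 to 45 percent provisioned throughput discount in the table is measured against pay as you go at full use of the reserved capacity, which is a simplification. The function names and the 75 percent utilization figure are hypothetical.

```python
def break_even_utilization(pt_discount: float) -> float:
    """Share of reserved capacity that must be used for PT to match pay as you go."""
    return 1.0 - pt_discount


def shelfware_cost(reserved_cost: float, utilization: float) -> float:
    """Dollars paid for reserved capacity that carried no traffic."""
    return reserved_cost * (1.0 - utilization)


if __name__ == "__main__":
    for discount in (0.20, 0.45):
        needed = break_even_utilization(discount)
        print(f"PT discount {discount:.0%}: break even at {needed:.0%} utilization")

    # Hypothetical: a $600,000 annual reservation running at 75 percent utilization.
    idle = shelfware_cost(600_000, 0.75)
    print(f"Idle GSU spend: ${idle:,.0f}")
```

Below the break even line the reservation costs more than the same traffic would on pay as you go, which is why the sizing input should be a measured floor.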

What does the fine tuning and customization spend really cost?

Fine tuning is priced in two parts, and buyers routinely scope only the first. The first part is the tuning job itself, billed on the tokens processed during training. The second is the ongoing cost of serving the tuned model, which carries its own inference meter for the life of the deployment.

The serving cost is the one that compounds. A tuned model you stand up in 2026 keeps billing inference every month until you retire it, so a customization that looked cheap as a one off training run becomes a standing line item.

Training spend: a finite cost, sized to the tuning dataset and epochs.
Serving spend: a recurring inference cost on the tuned endpoint, the part that needs a retirement plan.
Alternative first: test prompt engineering, grounding, and context caching before paying to tune, since the cheaper path often closes the quality gap.
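The two part cost is easy to scope once both terms are written down. The sketch below is illustrative only: the function names and all dollar figures are hypothetical, and real serving cost varies with traffic rather than holding flat each month.

```python
import math


def tuned_model_total_cost(
    training_cost: int, monthly_serving_cost: int, months_live: int
) -> int:
    """One off training spend plus recurring serving spend over the deployment."""
    return training_cost + monthly_serving_cost * months_live


def months_until_serving_exceeds_training(
    training_cost: int, monthly_serving_cost: int
) -> int:
    """First month in which cumulative serving spend passes the training spend.

    Requires monthly_serving_cost > 0.
    """
    return math.floor(training_cost / monthly_serving_cost) + 1


if __name__ == "__main__":
    # Hypothetical: a $20,000 tuning job served at $5,000 per month for two years.
    print(tuned_model_total_cost(20_000, 5_000, 24))             # 140000
    print(months_until_serving_exceeds_training(20_000, 5_000))  # 5
```

In this example the training run is one seventh of the two year cost, which is the case for setting a retirement date when the endpoint is deployed.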

For the published per token and per node mechanics, work from the Gemini API pricing reference and confirm the Vertex equivalent before you commit, because the two surfaces meter differently.

How do you build a consumption baseline that survives Google scrutiny?

The baseline is the whole negotiation. A commit sized to a Google growth forecast is sized to Google's interest. A commit sized to your verified consumption floor is sized to yours, and it is the only number you can defend when the account team pushes back.

Build the floor from twelve months of detailed billing export, split by model, by mode, and by project. Commit only the steady inference that runs every day in production. Keep training, experimentation, and seasonal spikes on pay as you go.

Weeks 1 to 4, Baseline: Export twelve months of Vertex billing. Split by model, mode, and project. Set the committable floor at the p25 to p50 of steady production inference, stripped of training and one off spikes.
Weeks 5 to 8, Architect and clause: Re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and size GSUs to the floor. Draft the five clauses before pricing.
Weeks 9 to 12, Negotiate and stage: Benchmark the opening quote, commit the verified floor at three years, keep the variable tier on one year or pay as you go, and hold the discount counter until the clauses close.
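The p25 to p50 floor from the baseline step can be computed directly from the billing export. The sketch below is a minimal version using Python's standard library. The record layout, the `production_inference` label, and the sample figures are hypothetical stand-ins for whatever your export actually contains, and the inclusive quartile method is one reasonable choice among several.

```python
from statistics import quantiles


def committable_floor_band(daily_records):
    """Return (p25, p50) of daily steady production inference spend.

    `daily_records` is an iterable of (kind, amount) pairs. Only records
    labeled "production_inference" count. Training, experimentation, and
    one off spikes are excluded before the percentiles are taken.
    """
    steady = [amount for kind, amount in daily_records if kind == "production_inference"]
    p25, p50, _p75 = quantiles(steady, n=4, method="inclusive")
    return p25, p50


def annual_commit_range(daily_records, days=365):
    """Scale the daily floor band to an annual commit range."""
    low, high = committable_floor_band(daily_records)
    return low * days, high * days


# Hypothetical daily spend in USD, with non committable lines mixed in.
sample_records = [
    ("production_inference", 100), ("production_inference", 110),
    ("production_inference", 120), ("production_inference", 130),
    ("production_inference", 140), ("production_inference", 150),
    ("production_inference", 160), ("production_inference", 170),
    ("production_inference", 180),
    ("training", 900), ("experimentation", 50),
]

if __name__ == "__main__":
    print(committable_floor_band(sample_records))  # (120.0, 140.0)
    print(annual_commit_range(sample_records))     # (43800.0, 51100.0)
```

Committing inside this band rather than at the mean keeps the training spike out of the commit entirely, which is the point of building the floor from filtered daily data.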

Applied to a representative estate, the program looks like the table below. Each workload is optimized by tier and mode first, then the steady remainder is committed.

Workload	Unmanaged annual	Primary lever	Optimized annual
Production inference (Gemini serving)	$3.00M	Model tiering plus 3 year spend CUD	$2.04M
Batch and offline generation	$1.20M	Batch prediction, 50% off	$0.78M
Fine tuning and tuned model hosting	$0.80M	Right size, retire idle endpoints	$0.62M
Provisioned throughput (latency critical)	$0.60M	GSU sized to measured floor	$0.45M
Embeddings and grounding	$0.40M	Context caching on repeated input	$0.31M
Total estate	$6.00M	tier, mode, commit	$4.20M

Benchmark scenario, not a quote. Benchmark ranges: Redress Compliance advisory engagement file, 2024 to 2025.

Chart B. Representative $6.0M estate. Optimized run rate $4.20M, a $1.8M or about 30 percent reduction, with no commit above verified steady inference.
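The totals in the estate table can be checked with a few lines. The figures below are taken directly from the table, expressed in thousands of dollars to keep the arithmetic exact. The variable names are illustrative.

```python
# (unmanaged, optimized) annual spend per workload, in thousands of USD.
ESTATE = {
    "Production inference (Gemini serving)": (3000, 2040),
    "Batch and offline generation": (1200, 780),
    "Fine tuning and tuned model hosting": (800, 620),
    "Provisioned throughput (latency critical)": (600, 450),
    "Embeddings and grounding": (400, 310),
}

unmanaged_total = sum(unmanaged for unmanaged, _ in ESTATE.values())
optimized_total = sum(optimized for _, optimized in ESTATE.values())
saving = unmanaged_total - optimized_total
saving_pct = 100 * saving / unmanaged_total

if __name__ == "__main__":
    for name, (unmanaged, optimized) in ESTATE.items():
        reduction = 100 * (unmanaged - optimized) / unmanaged
        print(f"{name}: {reduction:.1f}% reduction")
    print(f"Total: ${unmanaged_total}K to ${optimized_total}K, "
          f"saving ${saving}K ({saving_pct:.0f}%)")
```

The per workload reductions range from about 22 to 35 percent, so the 30 percent headline is a blend rather than a uniform cut.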

Which five contract clauses decide whether the commitment protects the budget?

The discount is one line on the order form. The five clauses below are what stand between you and a year end true up or a renewal at list. Negotiate them before you trade the discount, because Google concedes terms more easily while the headline rate is still open.

Clause	What it protects	Buyer side ask
Price protection	You keep mid term model price cuts	Discount pinned to net spend at then current published rates
Model and mode re mix	Freedom to shift across Gemini tiers, batch, and PT	Quarterly re mix at no penalty, in writing
Provisioned throughput true down	Relief on idle GSU as traffic shifts	Quarterly GSU adjustment to measured throughput
Renewal rate	No snap to list at the end of term	Renew at the original negotiated rate, defined notice
Shortfall and attainment	Grace if the AI ramp slips	Carry forward window, Vertex and Marketplace count to commit

The auto renewal trap

Spend commitments renew automatically at the end of the term. Many agreements default that renewal to then current list rather than the rate you negotiated, and the cancellation window is short. Put the renewal rate and a clear cancellation notice period in writing, or your silence signs the discount away.

What discount benchmarks hold across renewal and exit scenarios?

Benchmarks from the engagement file frame what good looks like across scenarios. The deepest savings come from architecture, not the commit, and the worst outcomes come from over committing a volatile AI forecast.

50%, batch prediction vs real time: Flat discount on any workload that tolerates a 24 hour window.
90%, context caching on cached input: Up to 90 percent on repeated long prompts, net of cache storage.
40%, spend CUD, three year: On eligible Vertex usage committed to the verified floor.
30%, blended estate saving: The worked $6M estate at $4.2M, tier and mode led.

Chart C. Effective discount achieved by scenario. Source: Redress Compliance advisory engagement file, 2024 to 2025.

What buyer side moves neutralize Google's standard tactics?

The account team runs a familiar playbook on AI deals. Each move has a counter that costs you nothing but preparation.

Google move	What it sounds like	Buyer counter
Anchor to AI growth	Commit to your projected AI ramp for the deepest rate	Commit verified production floor only, ramp on demand
Lead with Pro	Default the architecture to the flagship model	Tier by task, reserve Pro for traffic that needs it
Sell PT as savings	Provisioned throughput stabilizes your cost	Size GSU to measured floor, keep burst on demand
Discount for silence on terms	Lead with the headline percentage	Hold the discount counter until the five clauses close
Bundle the renewal	Roll AI into the broader Google agreement	Keep Vertex terms itemized and separately exitable

Where the common advice on Vertex AI and Gemini deals is wrong

The standard reseller and account team pitch is to lock a large three year Vertex spend commitment now to capture the 40 percent rate. We disagree. In the AI deals we have benchmarked, a forecast anchored three year AI commit is the most reliable way to overpay.

Two reasons. First, model prices are falling, not rising. New models keep arriving cheaper, so a three year commit on today's rates locks you above a falling market unless the price protection clause is airtight.

Second, AI consumption is the least predictable line in the estate. Committing a ramp you cannot hold turns into unrecoverable waste under the consumption model, because an unused committed dollar is billed and gone.
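The cost of a missed ramp can be put in numbers. The sketch below uses a simplified model: a commit covering usage worth V at list costs V times (1 minus the discount) whether or not the usage arrives, and attainment is the share of V actually consumed. It ignores carry forward and any negotiated shortfall relief. The function name `effective_discount` is hypothetical.

```python
def effective_discount(commit_discount: float, attainment: float) -> float:
    """Discount actually realized against on demand, given commit attainment.

    attainment is actual usage divided by committed usage, both at list value.
    A negative result means the commit cost more than staying on demand.
    """
    if attainment <= 0:
        raise ValueError("attainment must be positive")
    return 1.0 - (1.0 - commit_discount) / attainment


if __name__ == "__main__":
    for attainment in (1.0, 0.8, 0.6, 0.5):
        realized = effective_discount(0.40, attainment)
        print(f"Attainment {attainment:.0%}: effective discount {realized:.0%}")
```

Under this model a 40 percent commit that is only 80 percent attained is worth 25 percent, breaks even at 60 percent attainment, and below that costs more than paying list on demand.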

The buyer side move is to win the architecture first. Tiering, batch, and caching deliver more than the commit, and they need no three year promise. Commit only the verified inference floor, and keep the volatile layer free.

The contrarian line: on AI, the architecture beats the discount. A Flash Lite workload at list costs less than a Pro workload at 40 percent off: $0.40 output per million tokens against $6.00. Route the traffic before you sign the commit.

What is your BATNA and the side letter language?

An AI negotiation without a credible alternative is a price taker negotiation. Build the BATNA before the first call, and make sure the account team knows it exists. Gemini is strong, but it is not the only model that clears most enterprise tasks.

Azure AI Foundry: OpenAI and partner models on an Azure consumption commitment, sized to the same workload.
AWS Bedrock: Anthropic Claude and Amazon Nova for portable inference, priced for comparison.
Direct model APIs: Anthropic and OpenAI direct as a clean price reference outside the hyperscaler bundle.
Open weight self host: Llama or Gemma on GKE for the commodity tier, capping the price of the low end.
Stay on demand: the always available alternative that caps your downside if the terms are wrong.

Side letter language we use

Pin the protections in a side letter when the standard order form will not carry them. Three clauses do most of the work.

Price protection: the discount applies to net spend at then current published rates, so model price cuts flow to you.
GSU true down: provisioned throughput adjustable quarterly to measured throughput, no idle penalty.
Attainment: Vertex spend and eligible Marketplace purchases count toward the commit, with a carry forward window on a shortfall.

Get these in writing or treat the discount as unprotected.

Recommendation. Win the architecture, size the commit to your verified floor, then trade the discount.

Do this first: re tier traffic to Flash where Pro is not needed, move offline jobs to batch, cache repeated context, and cap context windows below 200K to avoid the silent long context price step.
Do this next: commit only the verified production inference floor, close price protection, re mix, GSU true down, renewal rate, and shortfall before you accept the headline percentage.

We work on the buyer side only, and we are glad to tie a meaningful part of the fee to delivered value.

Frequently asked questions

Why does Gemini 2.5 Pro cost more above 200K tokens?

Gemini 2.5 Pro carries a long context tier that doubles the input rate to $2.50 and raises the output rate by 50 percent to $15.00 per million tokens once a single request crosses 200,000 tokens. It is automatic, with no separate SKU and no alert, so a retrieval pipeline that grows its context can come close to doubling its own bill. Cap context or price the cap.

Does a spend based CUD discount the token price or the dollar?

It discounts the dollar of eligible spend, not the underlying token rate. That means a mid term model price cut only benefits you if your clause pins the discount to net spend at then current published rates. Without that pin, you keep paying the old effective price as the market falls.

Is provisioned throughput cheaper than pay as you go?

Only on traffic that actually uses it. Provisioned throughput reserves capacity in generative AI scale units and bills that capacity whether or not requests arrive, so idle GSU is shelfware. Size it to the measured floor of latency critical traffic and leave burst on pay as you go.

What is the single biggest lever on a Vertex AI bill?

Model tiering, not the discount. Output on Flash Lite is $0.40 per million tokens against $10.00 on Pro, a 25 times gap. Routing commodity traffic off Pro saves more than any commit, and it needs no contract. Win the architecture before you negotiate the rate.

Prepared by Redress Compliance · redresscompliance.com · Vertex AI and Gemini Negotiation · June 2026
