At a glance: 107× price range across models; 5 distinct billing modes; 50% batch inference discount; up to 90% prompt cache savings.

AWS Bedrock is not a model. It is a billing system that wraps a model marketplace. That distinction matters because Bedrock’s pricing complexity is not about understanding token rates — those are listed on a pricing page. The complexity is in the five separate billing modes, the hidden cost multipliers from ancillary services, the regional pricing variations, and the interaction between Bedrock consumption and your broader AWS committed spend. This guide is written for FinOps teams, cloud architects, and procurement leaders who need to understand not just what Bedrock costs per token, but how Bedrock costs compound at enterprise scale and where the optimisation leverage actually sits.

The unique challenge with Bedrock is that it layers AWS’s traditional cloud pricing complexity on top of AI model economics. You are not just choosing between Claude and Llama — you are choosing between on-demand, batch, provisioned throughput, cached, and fine-tuned variants of each model, each with different unit economics, commitment requirements, and architectural implications. Getting this wrong does not mean overpaying by 10%. It means overpaying by 300%.

The Five Billing Modes: A Decision Framework

Every Bedrock cost decision starts with choosing the right billing mode for each workload. Bedrock offers five distinct pricing mechanisms, and most enterprises should be using at least three of them simultaneously across different workloads.

Mode 1: On-Demand (Pay-Per-Token)

On-demand is the default mode and the one most teams start with. You pay per 1,000 input tokens and per 1,000 output tokens, with no commitment and no minimum spend. Pricing varies by model, with output tokens consistently costing 2–5× more than input tokens.

On-demand is the right choice for variable workloads, development and testing environments, proof-of-concept projects, and any application where usage patterns are unpredictable or still stabilising. It is the wrong choice for production workloads with predictable, high-volume traffic — where provisioned throughput or batch processing almost always delivers better unit economics.

The hidden risk of on-demand: throttling. AWS enforces rate limits on on-demand inference, and during peak demand periods your requests may be queued or rejected. If your application has latency SLAs, on-demand provides no guarantee that those SLAs will be met. This operational risk is invisible in the pricing but material in production.

Mode 2: Batch Inference (50% Discount)

Batch inference processes requests asynchronously within a 24-hour window at exactly half the on-demand per-token rate. For Claude Sonnet 4.5 on Bedrock, batch pricing drops from $3/$15 to $1.50/$7.50 per million tokens. For Amazon Nova Micro, batch drops from $0.035/$0.14 to $0.018/$0.07.

Batch is the single largest cost lever available to any Bedrock customer, and it is dramatically underutilised. In our experience, fewer than 25% of enterprise Bedrock deployments have implemented batch processing for eligible workloads. Any task that does not require a real-time response — document processing, bulk classification, report generation, data extraction, test generation, content moderation queues, nightly analytics — should be running on batch.

The practical constraint: batch has a 24-hour SLA, not a 24-hour average. Some batch jobs complete in minutes; others take the full window. Applications that need results within 2–4 hours cannot reliably depend on batch processing. This gap between “does not need to be real-time” and “needs results within 4 hours” is where many eligible workloads remain on on-demand unnecessarily.

Mode 3: Provisioned Throughput (Reserved Capacity)

Provisioned throughput works like EC2 Reserved Instances for AI inference. You purchase model units (MUs) that guarantee a specific throughput level, billed hourly regardless of actual utilisation. Commitment terms are 1-month or 6-month, with 6-month terms offering lower hourly rates.

Provisioned throughput pricing varies significantly by model. Meta Llama models start at approximately $21 per hour per model unit on a 1-month commitment. Anthropic Claude models and Cohere models are priced higher, with some exceeding $49 per hour per model unit. At $49/hour, a single model unit costs approximately $35,280 per month — $423,360 per year.

When provisioned throughput makes economic sense: Only when your on-demand spend at consistent utilisation exceeds the provisioned throughput cost. The break-even calculation: if your on-demand cost at 70%+ utilisation exceeds the hourly MU cost × 24 hours × 30 days, provisioned throughput saves money. Below 70% utilisation, you are paying for idle capacity.
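The break-even arithmetic can be sketched in a few lines. The $49/hour model-unit rate comes from the example above; the token volume, input/output split, and Sonnet rates below are illustrative assumptions:

```python
# Break-even sketch: provisioned throughput vs on-demand.
# The $49/hour MU rate is from the example above; the workload
# figures passed in at the bottom are illustrative assumptions.

HOURS_PER_MONTH = 24 * 30

def provisioned_monthly_cost(mu_hourly_rate: float, units: int = 1) -> float:
    """MU hourly rate is billed for every hour, used or idle."""
    return mu_hourly_rate * units * HOURS_PER_MONTH

def on_demand_monthly_cost(tokens_per_month: float,
                           input_share: float,
                           input_rate_mtok: float,
                           output_rate_mtok: float) -> float:
    """Blended on-demand cost for a monthly token volume."""
    in_tok = tokens_per_month * input_share
    out_tok = tokens_per_month - in_tok
    return in_tok / 1e6 * input_rate_mtok + out_tok / 1e6 * output_rate_mtok

pt = provisioned_monthly_cost(49.0)  # $35,280 per month
od = on_demand_monthly_cost(5_000_000_000, 0.8, 3.00, 15.00)  # 5B tokens/month
print(f"provisioned: ${pt:,.0f}  on-demand: ${od:,.0f}")
print("provisioned wins" if od > pt else "stay on-demand")
```

At this assumed volume, on-demand is still cheaper; the crossover only arrives when sustained utilisation pushes on-demand spend past the fixed MU cost.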

Provisioned throughput is also the only option for running custom (fine-tuned) models on Bedrock. If you fine-tune a model, you cannot serve it via on-demand — you must purchase provisioned throughput, which creates a minimum cost floor for any fine-tuning project regardless of inference volume.

Mode 4: Prompt Caching (Up to 90% Input Savings)

Bedrock supports prompt caching that stores frequently used prompt prefixes for up to five minutes. Subsequent requests with matching prefixes receive up to 90% cost savings on cached input tokens and up to 85% latency reduction. Cache performance varies by model and prompt characteristics, and all caches are isolated to individual AWS accounts.

Prompt caching on Bedrock mirrors the mechanics available through direct Anthropic API access (for Claude models) but is managed within the Bedrock infrastructure. The architectural implications are identical: front-load static content (system prompts, few-shot examples, knowledge base context) into the prompt prefix, place variable per-request content at the end. The cached prefix pays a small write premium; every subsequent read saves 90%.

For RAG applications making 100+ requests per minute with a shared 4,000-token system prompt, each cache hit saves roughly $0.011 on Sonnet input alone (4,000 tokens at the $2.70/MTok difference between the standard and cached input rates), or about $10.80 per 1,000 requests. At enterprise scale (1 million requests per day), that is on the order of $320,000 per month in input token savings from a single architectural decision.
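A sketch of that calculation, assuming a 4,000-token cached prefix, Sonnet's $3/MTok input rate, and cache reads billed at 10% of the standard rate (the one-off cache-write premium is ignored here):

```python
# Prompt-cache savings sketch. Assumptions: 4,000-token cached prefix,
# Sonnet input at $3.00/MTok, cache reads at 10% of the input rate;
# the cache-write premium on the first request is ignored.

PREFIX_TOKENS = 4_000
INPUT_RATE = 3.00                       # $ per million input tokens
CACHE_READ_RATE = INPUT_RATE * 0.10     # 90% discount on cache hits

saving_per_hit = PREFIX_TOKENS / 1e6 * (INPUT_RATE - CACHE_READ_RATE)
requests_per_day = 1_000_000
monthly_saving = saving_per_hit * requests_per_day * 30

print(f"saving per cache hit: ${saving_per_hit:.4f}")
print(f"monthly saving at 1M req/day: ${monthly_saving:,.0f}")
```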

Mode 5: Model Customisation (Fine-Tuning)

Fine-tuning Bedrock models incurs three separate cost layers: training costs (per token processed during fine-tuning, calculated as tokens × epochs), model storage fees (per month per stored model, typically $1.95/month for text models), and inference costs via mandatory provisioned throughput.

The total cost of a fine-tuning project often surprises teams. A fine-tuning run on a mid-size model processing 10 million tokens across 3 epochs costs approximately $120 in training fees — trivial. But the provisioned throughput required to serve the fine-tuned model costs $15,000–$35,000 per month minimum, depending on the base model. The training cost is the tip of the iceberg; the ongoing inference cost is the iceberg.
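A rough first-year model of the three cost layers, using the training figures from the example above and an assumed $21/hour provisioned-throughput floor (one MU, the cheapest tier mentioned in this guide):

```python
# Three-layer fine-tuning cost sketch. The $4/MTok training rate is
# implied by the example above ($120 for 10M tokens x 3 epochs); the
# $21/hour PT floor and 24/7 operation are assumptions.

def fine_tune_first_year_cost(train_tokens: float, epochs: int,
                              train_rate_mtok: float = 4.00,
                              storage_per_month: float = 1.95,
                              pt_hourly: float = 21.00) -> dict:
    training = train_tokens * epochs / 1e6 * train_rate_mtok
    storage = storage_per_month * 12
    inference = pt_hourly * 24 * 30 * 12   # 1 MU running 24/7 for a year
    return {"training": training, "storage": storage,
            "inference": inference,
            "total": training + storage + inference}

costs = fine_tune_first_year_cost(10_000_000, 3)
print(costs)   # training is trivial; provisioned throughput dominates
```

Even at the cheapest assumed PT rate, inference is three orders of magnitude larger than the training fee.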

Before committing to fine-tuning, benchmark whether few-shot prompting or retrieval-augmented generation achieves comparable quality improvements at a fraction of the cost. In our experience, 70% of enterprise fine-tuning projects could have achieved equivalent results through prompt engineering alone — without the provisioned throughput commitment.

Model Pricing: The 107× Range That Determines Your Bill

Bedrock’s model marketplace spans an extraordinary price range. Amazon Nova Micro output costs $0.14 per million tokens. Claude Opus 4.1 (legacy) output costs $75 per million tokens. That is a 535× difference on output. Even within the current generation, Nova Micro versus Claude Sonnet 4.5 represents a 107× output price difference.

Anthropic Claude on Bedrock

Claude Sonnet 4.5: $3.00 input / $15.00 output per MTok. Claude Haiku 4.5: $1.00 input / $5.00 output per MTok. Claude Opus 4.5: $5.00 input / $25.00 output per MTok. These rates match Anthropic’s direct API pricing exactly — Bedrock does not add a markup to Anthropic models. The value proposition of running Claude through Bedrock is not cheaper per-token pricing but AWS integration: unified billing, IAM controls, VPC endpoints, CloudWatch monitoring, and the ability to fund Claude consumption from existing AWS committed spend.

Amazon Nova Models

Nova Micro: $0.035 input / $0.14 output per MTok. Nova Lite: $0.06 input / $0.24 output per MTok. Nova Pro: $0.80 input / $3.20 output per MTok. Nova Premier: $2.00 input / $8.00 output per MTok. Amazon’s own models are priced aggressively below third-party offerings. Nova Micro is roughly 86× cheaper than Sonnet on input ($3.00 vs $0.035) and 107× cheaper on output. For high-volume, low-complexity tasks (classification, routing, entity extraction, simple Q&A), Nova models represent dramatic cost savings versus Claude or Llama.

Meta Llama Models

Llama 3.3 Instruct (70B): $0.72 input / $0.72 output per MTok. Llama 3.2 (90B Vision): $2.00 input / $2.00 output per MTok. Llama 3.1 (405B): $5.32 input / $16.00 output per MTok. The Llama family on Bedrock is priced at a premium over self-hosted alternatives. Running Llama 3 70B on your own EC2 GPU instances through platforms like TrueFoundry or vLLM can reduce inference costs by 60–70% compared to Bedrock on-demand rates — but requires operational expertise and infrastructure management that Bedrock eliminates.

Other Models

Mistral Large (2): $2.00 input / $6.00 output per MTok. Cohere Command R+: $2.50 input / $10.00 output per MTok. AI21 Jamba 1.5 Large: $2.00 input / $8.00 output per MTok. Stability AI (image models): per-image pricing ranging from $0.01 to $0.08 depending on resolution and quality. Each model has different batch, caching, and provisioned throughput availability — not all billing modes are supported for all models.

Compare AI Vendor Costs

Model Bedrock, Azure OpenAI, and Vertex AI costs side-by-side with our free comparison calculator.

Launch the vendor comparison calculator →

The Hidden Cost Multipliers: Services That Double Your Bill

The most common mistake in Bedrock cost forecasting is modelling only inference costs. In production deployments, ancillary Bedrock services routinely add 40–100% to the base inference bill.

Knowledge Bases (RAG Infrastructure)

Bedrock Knowledge Bases provide managed RAG capabilities, but the underlying infrastructure carries significant costs. The default vector store is Amazon OpenSearch Serverless, which has a minimum cost of approximately $345 per month for the required OpenSearch Compute Units (OCUs) — regardless of data volume. This minimum alone often exceeds the inference costs for low-to-moderate RAG workloads.

Additional Knowledge Base costs include: S3 storage for source documents, embedding model inference for document ingestion (charged at the embedding model’s per-token rate), and the retrieval query costs for each user request. A RAG application processing 10,000 queries per day with a 50,000-document knowledge base can incur $800–$1,500 per month in Knowledge Base infrastructure costs before a single inference token is processed.

Bedrock Agents

Agents orchestrate multi-step workflows by chaining model calls with tool use. Each agent invocation generates multiple inference calls — typically 3–8 model calls per user request, as the agent reasons about which tools to use, executes them, and synthesises results. An agent workflow that appears to cost $0.01 per user request in direct inference may actually cost $0.05–$0.10 when accounting for the full chain of model calls, tool result processing, and orchestration overhead.

Agent costs are particularly difficult to predict because the number of model calls per request varies dynamically based on task complexity. A simple agent query might require 2 model calls; a complex one might require 12. This variance makes budget forecasting unreliable without extensive production traffic analysis.
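One way to budget despite that variance is to model the call-count distribution explicitly and work from an expected value. The distribution and per-call cost below are illustrative assumptions, not measured figures:

```python
# Expected per-request cost of an agent whose chain length varies.
# Both the call-count distribution and the blended per-call cost
# are illustrative assumptions; replace with production telemetry.

call_distribution = {2: 0.40, 4: 0.35, 8: 0.20, 12: 0.05}  # calls -> probability
cost_per_call = 0.012   # assumed blended cost of one model call, $

expected_calls = sum(n * p for n, p in call_distribution.items())
expected_cost = expected_calls * cost_per_call
print(f"expected {expected_calls:.1f} calls/request, ${expected_cost:.4f}/request")
```

The tail matters: in this assumed distribution, the 5% of requests needing 12 calls contribute almost as much cost as the 40% needing 2.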

Guardrails

Bedrock Guardrails add content filtering, PII detection, topic avoidance, and hallucination grounding to model responses. Guardrails are priced per 1,000 text units (approximately 1,000 characters) processed. At scale, Guardrails costs can add 5–15% to your total inference bill. For applications with strict compliance requirements where every input and output must be scanned, Guardrails becomes a non-trivial cost layer.

CloudWatch and Monitoring

Bedrock usage monitoring through CloudWatch generates custom metrics, logs, and alarms. At enterprise scale with dozens of model endpoints, hundreds of agent workflows, and detailed token-level logging, CloudWatch costs can reach $500–$2,000 per month. This is often overlooked in cost models because it falls under the general AWS monitoring budget rather than the AI budget.

Bedrock vs Direct API: The Real Cost Comparison

A common question: is Bedrock more expensive than going directly to each model provider’s API?

For Anthropic Claude: Bedrock on-demand pricing matches Anthropic’s direct API rates exactly. There is no Bedrock markup on Claude models. The choice between Bedrock and direct Anthropic API is not about per-token cost — it is about operational integration. Bedrock provides AWS-native billing, IAM, VPC endpoints, and the ability to fund Claude usage from AWS EDPs (Enterprise Discount Programs) or Savings Plans. Direct API provides access to Anthropic’s latest features (which may lag on Bedrock by days to weeks) and direct commercial negotiation for volume discounts.

For Meta Llama: Bedrock charges a 10–40% premium over alternative hosting providers like Together AI, Groq, or self-hosted solutions. Llama 3.3 70B on Bedrock costs $0.72/$0.72 per MTok; the same model on Together AI costs approximately $0.54/$0.54. Self-hosted on EC2 Spot Instances can reduce costs by 60–70%. The Bedrock premium pays for managed infrastructure, but cost-sensitive organisations with GPU operations expertise should evaluate self-hosting for high-volume Llama workloads.

For Amazon Nova: Nova models are exclusive to Bedrock. There is no alternative provider comparison. The pricing is aggressive, positioning Nova as the default choice for cost-sensitive workloads within the AWS ecosystem.

The strategic consideration for enterprises: if your organisation has a significant AWS EDP (Enterprise Discount Program), Bedrock consumption counts toward your committed spend. This means that Bedrock usage can be funded from budget that is already committed to AWS, rather than requiring incremental AI budget. For organisations with $10M+ annual AWS commitments and unspent EDP capacity, Bedrock is effectively “free” relative to direct API alternatives that require new budget allocation.

Need help optimising your AWS Bedrock spend?

Our advisory team specialises in cloud AI cost modelling, committed-spend negotiation, and multi-model deployment strategy. Fixed-fee, vendor-independent engagement.

Learn about our GenAI advisory services →

Enterprise Optimisation Playbook: Eight Steps to Reduce Bedrock Spend

Step 1: Implement Model Routing (Savings: 50–70%). The 107× price range across Bedrock models means that routing decisions dominate your cost structure. Build a classification layer that directs simple requests to Nova Micro ($0.035/$0.14), moderate requests to Sonnet ($3/$15), and complex requests to Opus ($5/$25). In a well-tuned routing system, 50–60% of requests go to Nova or Haiku, 30–40% to Sonnet, and under 10% to Opus. The blended cost drops by 50–70% compared to an all-Sonnet deployment.
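A minimal routing sketch using the published rates. The keyword classifier is a placeholder for the small classification model a production router would use, and the traffic mix is an assumption:

```python
# Model-routing sketch. Rates are $ per MTok (input, output) from this
# guide; the classifier and traffic mix are illustrative placeholders.

MODELS = {
    "nova-micro": (0.035, 0.14),
    "haiku":      (1.00, 5.00),
    "sonnet":     (3.00, 15.00),
    "opus":       (5.00, 25.00),
}

def classify(prompt: str) -> str:
    """Toy complexity classifier; a production router uses a small model."""
    if len(prompt) < 200:
        return "nova-micro"
    if any(k in prompt.lower() for k in ("prove", "architecture", "multi-step")):
        return "opus"
    return "sonnet"

def blended_cost_per_mtok(mix: dict, output_share: float = 0.25) -> float:
    """Blended $/MTok for a traffic mix {model: share of tokens}."""
    total = 0.0
    for model, share in mix.items():
        in_rate, out_rate = MODELS[model]
        total += share * ((1 - output_share) * in_rate + output_share * out_rate)
    return total

tuned = blended_cost_per_mtok({"nova-micro": 0.55, "sonnet": 0.37, "opus": 0.08})
all_sonnet = blended_cost_per_mtok({"sonnet": 1.0})
print(f"blended ${tuned:.2f}/MTok vs all-Sonnet ${all_sonnet:.2f}/MTok")
```

With this assumed mix the blended rate falls to roughly half the all-Sonnet rate, consistent with the 50–70% range above.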

Step 2: Shift Eligible Workloads to Batch (Savings: 50% on shifted volume). Audit every Bedrock workload and classify it as real-time-required or batch-eligible. Target 30–50% of total token volume for batch migration. The 50% discount applies uniformly across all batch-eligible models.

Step 3: Enable Prompt Caching (Savings: 60–80% on input tokens). Restructure prompts to front-load static content. Enable 5-minute caching for all applications with more than 2 requests per minute sharing common context. For high-frequency endpoints, input token costs drop by 80–90%.

Step 4: Optimise Output Tokens (Savings: 15–30%). Output tokens are 2–5× more expensive than input. Implement structured JSON output schemas, set explicit max_tokens limits per endpoint, use stop sequences, and design response templates that minimise verbose generation. Every 100 output tokens saved is equivalent to saving 200–500 input tokens.
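With boto3's bedrock-runtime Converse API, the output ceiling and stop sequences sit in `inferenceConfig`. The model ID below is a placeholder and the request is only constructed, not sent:

```python
# Capping output spend via a Converse-style request (boto3 bedrock-runtime).
# The request is constructed but not sent; the model ID is a placeholder
# and should be replaced with a real Bedrock model ID for your region.

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    return {
        "modelId": "anthropic.claude-sonnet-4-5",   # placeholder ID
        "messages": [{"role": "user",
                      "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,        # hard ceiling on output tokens
            "stopSequences": ["\n\nEND"],   # cut generation at a template marker
            "temperature": 0.2,
        },
    }

req = build_request("Summarise this contract clause in 3 bullet points.")
# client = boto3.client("bedrock-runtime"); client.converse(**req)
print(req["inferenceConfig"]["maxTokens"])
```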

Step 5: Right-Size Knowledge Base Infrastructure (Savings: 20–40% on RAG costs). Evaluate whether OpenSearch Serverless (minimum $345/month) is necessary for your vector store, or whether a more cost-effective alternative like Pinecone, PostgreSQL with pgvector, or Amazon Aurora with vector search provides adequate performance at lower cost. For smaller knowledge bases (under 100,000 documents), the OpenSearch minimum is often the dominant cost — exceeding inference costs.

Step 6: Monitor Agent Token Consumption (Savings: 10–25%). If you use Bedrock Agents, instrument each agent to track the number of model calls per user request. Identify agents with excessive reasoning loops (8+ model calls per request) and optimise tool descriptions, agent instructions, and orchestration logic to reduce chain length. Shorter agent chains mean fewer inference calls per user interaction.

Step 7: Evaluate Provisioned Throughput at Scale (Savings: variable). For workloads running at 70%+ utilisation on on-demand, model the cost of provisioned throughput. Compare the monthly MU cost against your current on-demand spend. Additionally, provisioned throughput eliminates throttling risk, which may justify the commitment even at slightly higher cost if your application has latency SLAs.

Step 8: Leverage AWS Committed Spend (Savings: 15–30%). If your organisation has an AWS Enterprise Discount Program, Savings Plan, or other committed-spend agreement, confirm that Bedrock consumption counts toward your commitment. Route AI workloads through Bedrock rather than direct API alternatives to consume already-committed budget. For organisations with unspent AWS commitments, this is effectively a 100% discount on the incremental cost of AI — because the money was already committed to AWS regardless.

Cost Modelling: What Enterprise Bedrock Actually Costs

The following model illustrates annual Bedrock costs for a mid-size enterprise running a multi-model AI platform with RAG, agent workflows, and a mix of real-time and batch processing.

Assumptions: 500 million tokens per month total consumption. Model mix: 40% Haiku/Nova (low-complexity), 45% Sonnet (mid-tier), 15% Opus (complex reasoning). 35% of volume eligible for batch processing. Prompt caching enabled with 60% cache hit rate on input tokens. Two Knowledge Bases with OpenSearch Serverless. Five Bedrock Agents in production.

Monthly Cost Breakdown:

On-demand inference (65% of volume, post-routing): $4,200. Batch inference (35% of volume, 50% discount): $1,100. Prompt caching savings (net): –$1,800. Knowledge Base infrastructure (2 × OpenSearch + embeddings): $1,200. Agent orchestration overhead (5 agents, ~4 calls/request average): $1,600. Guardrails (applied to all customer-facing endpoints): $400. CloudWatch monitoring: $300. Total monthly Bedrock spend: $7,000.

Annual: $84,000. Of that, direct inference (on-demand plus batch, net of caching savings) is $42,000 (50%) and infrastructure and services (Knowledge Bases, agent orchestration overhead, Guardrails, and monitoring) are $42,000 (50%). The 50/50 split between inference and infrastructure is typical of production Bedrock deployments and is precisely why cost models that only account for per-token pricing miss half the bill.

Without optimisation (all Sonnet, no batch, no caching, no routing): estimated monthly spend $18,500. Annual: $222,000. The optimised architecture saves $138,000 per year — a 62% reduction.
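The breakdown above can be reproduced directly:

```python
# Reproducing the monthly breakdown and annual comparison above.
monthly = {
    "on_demand_inference": 4_200,
    "batch_inference": 1_100,
    "prompt_caching_savings": -1_800,
    "knowledge_bases": 1_200,
    "agent_overhead": 1_600,
    "guardrails": 400,
    "cloudwatch": 300,
}
total = sum(monthly.values())
annual = total * 12
unoptimised_annual = 18_500 * 12
print(total, annual, unoptimised_annual - annual)   # 7000 84000 138000
```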

Client Result

A San Francisco financial institution cut projected Azure OpenAI spend and gained strategic flexibility through independent contract negotiation.

Read the case study →

Need Expert AI Licensing Guidance?

Redress Compliance provides independent AI licensing advisory services — fixed-fee, no vendor affiliations. Our specialists help enterprises negotiate AI vendor contracts, benchmark pricing across providers, and avoid lock-in across OpenAI, AWS Bedrock, Azure AI, and Google Vertex.

Explore AI Licensing Advisory Services →

Cross-Region Inference and Data Residency

Bedrock supports cross-region model inference, allowing requests to be routed to available capacity across AWS regions without additional cross-region charges. Pricing is based on the source region where the request originates, not where the inference executes. This is advantageous for handling traffic spikes but introduces data residency considerations.

For organisations with regulatory requirements around data processing location (GDPR, data sovereignty laws, industry-specific regulations), cross-region inference may route prompts and responses through regions outside your compliance boundary. Enterprise customers should configure Bedrock inference profiles to restrict inference to specific regions, even if this means accepting potential throttling during peak demand rather than routing to non-compliant regions.

Anthropic Claude models on Bedrock also support a US-only inference option at 1.1× standard pricing, ensuring all processing occurs within US data centres. This 10% premium is a compliance cost that should be factored into the total cost model for regulated workloads.

Bedrock vs Google Vertex AI vs Azure AI: Platform Comparison

AWS Bedrock vs Google Vertex AI: Both platforms offer managed access to third-party models (Claude, Llama) alongside first-party models (Nova vs Gemini). Vertex AI’s primary advantage is Gemini model pricing ($1.25/$10 for Gemini 2.5 Pro vs $3/$15 for Sonnet on Bedrock) and native integration with Google Workspace. Bedrock’s advantage is the broader model marketplace, deeper AWS ecosystem integration, and the ability to fund consumption from AWS committed spend. For AWS-centric enterprises, Bedrock typically delivers better total value; for Google Cloud shops, Vertex AI is the natural choice.

AWS Bedrock vs Azure AI (OpenAI): Azure provides exclusive access to OpenAI models (GPT-4o, GPT-4.1, o3) through Azure OpenAI Service, which Bedrock does not offer. If your AI strategy requires OpenAI models, Azure is mandatory for that portion. Bedrock provides access to Anthropic, Meta, Amazon, and other providers that Azure does not host. Many enterprises use both platforms, routing OpenAI workloads through Azure and Claude/Llama workloads through Bedrock. This multi-platform strategy adds operational complexity but avoids single-vendor lock-in for AI model access.

The strategic question is not “which platform is cheapest” but “which platform aligns with your existing cloud commitment and provides access to the models your applications need.” For organisations with $10M+ annual AWS spend, Bedrock consumption from EDP commitments almost always beats the standalone economics of any alternative platform.

The EDP Arbitrage: How AWS Committed Spend Changes the Calculus

Enterprise Discount Programs (EDPs) are the single most important variable in Bedrock pricing economics, yet they are absent from virtually every public pricing guide. An EDP is a multi-year commitment to spend a specified minimum amount with AWS in exchange for a blanket discount (typically 5–15%) across all AWS services. Most enterprises with $5M+ annual AWS spend have some form of EDP or committed-spend agreement.

The critical Bedrock implication: Bedrock consumption counts toward your EDP commitment. This means that every dollar spent on Bedrock inference, Knowledge Bases, Agents, and Guardrails reduces your remaining EDP obligation. For organisations that are on track to underspend their EDP commitment (a common scenario when cloud migrations slow or workloads are optimised), routing AI workloads through Bedrock rather than direct API alternatives effectively converts “wasted” committed spend into productive AI consumption.

Consider a practical scenario: an enterprise with a $20M annual EDP commitment that is forecasting $17M in organic AWS consumption. The $3M gap represents spend that the organisation is contractually obligated to make regardless. By routing AI workloads through Bedrock, the organisation can consume $3M in AI inference from budget that was already committed — making the effective incremental cost of that AI consumption zero relative to going direct to Anthropic, OpenAI, or other providers that would require new budget allocation.

This EDP arbitrage fundamentally changes the vendor selection calculus. An enterprise evaluating Claude through Bedrock versus Claude through Anthropic’s direct API sees identical per-token rates. But if Bedrock consumption fills an EDP gap, the economic argument for Bedrock is overwhelming regardless of any feature or latency differences.

How to Model the EDP Benefit

To quantify the EDP arbitrage for your organisation, you need three numbers: your total annual EDP commitment, your projected organic AWS consumption (excluding AI), and your projected annual AI API spend across all providers. If organic consumption falls short of the EDP commitment, the gap represents the maximum AI spend that can be absorbed by the EDP at zero incremental cost. Any AI spend beyond that gap is priced at standard Bedrock rates minus your EDP discount percentage.

For example: $20M EDP commitment, $17M organic spend, $1.5M projected AI spend. The $3M EDP gap fully absorbs the $1.5M AI budget. Effective AI cost: $0 incremental (already committed). Savings versus direct API: $1.5M. Even if the EDP gap only partially covers AI spend — say $1M gap against $1.5M AI budget — you still save $1M and pay only $500K at Bedrock rates (minus EDP discount).
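The gap arithmetic generalises to a few lines; the figures below are the ones from the example:

```python
# EDP-gap absorption sketch (figures from the example above).
def edp_absorption(commitment: float, organic: float, ai_spend: float):
    gap = max(commitment - organic, 0.0)
    absorbed = min(gap, ai_spend)       # AI spend funded by committed budget
    incremental = ai_spend - absorbed   # new budget actually required
    return absorbed, incremental

print(edp_absorption(20e6, 17e6, 1.5e6))   # (1500000.0, 0.0)
print(edp_absorption(20e6, 19e6, 1.5e6))   # (1000000.0, 500000.0)
```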

This analysis should be the first step in any enterprise AI platform evaluation, before comparing per-token rates, model quality, or feature availability. If the EDP math works, Bedrock wins on economics regardless of the per-token comparison.

📊 Free AI Spend Benchmarking Assessment

How does your AWS Bedrock spend compare to industry benchmarks? Our free assessment analyses your billing mode mix, model selection, and ancillary service costs — and identifies the optimisation moves that cut enterprise AI spend by up to 62%.

Take the Free Assessment →

Governance and Cost Controls: Preventing Bedrock Bill Shock

Enterprise Bedrock deployments require proactive cost governance. Without controls, AI workloads exhibit a “tragedy of the commons” dynamic: multiple teams share an AWS account, each team experiments aggressively with models, and the consolidated bill grows faster than any single team anticipates.

Budget Segmentation by API Key and Tag

Use AWS IAM policies and resource tags to segment Bedrock consumption by team, application, and environment. Assign separate API keys or IAM roles to each application, then track spending per key through AWS Cost Explorer. Tag every Bedrock invocation with application name, team owner, and environment (dev/staging/production). This enables cost attribution at the granularity needed to hold teams accountable for their consumption.

CloudWatch Alarms and Spending Thresholds

Configure CloudWatch alarms on Bedrock usage metrics at 50%, 75%, and 90% of monthly budget. Set token consumption alarms per endpoint to detect runaway workloads — an agent stuck in a reasoning loop can generate millions of tokens in hours. Alarm on both absolute token consumption and rate-of-change: a sudden 10× increase in tokens per request indicates a misconfigured prompt, an injection attack, or an infinite tool-use loop.

Service Control Policies

For multi-account AWS organisations, use Service Control Policies (SCPs) to restrict which models can be invoked from which accounts. Limit development accounts to Haiku and Nova models (preventing developers from accidentally running Opus workloads at $25/MTok output during testing). Reserve Opus and Sonnet access to production accounts with approved budgets. SCPs cannot enforce token-level limits, but they prevent the most common source of bill shock: high-cost models running in low-governance environments.
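A sketch of such an SCP for development accounts; the model ARN patterns are assumptions and should be verified against the actual model IDs available in your regions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyHighCostModelsInDev",
      "Effect": "Deny",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-opus*",
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet*"
      ]
    }
  ]
}
```

Attached to the development OU, this denies Opus and Sonnet invocation while leaving Haiku and Nova available.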

Automated Model Routing with Cost Guardrails

Build a routing middleware that classifies incoming requests and enforces cost ceilings per request. If a request’s estimated cost (based on input token count and expected output) exceeds a configurable threshold, the router downgrades to a cheaper model or rejects the request with a cost-limit error. This prevents individual requests from consuming disproportionate budget — particularly important for agent workflows where a single complex query can trigger 10+ model calls at Opus pricing.
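A minimal sketch of the ceiling check, using per-MTok rates from this guide; the token estimates, ceiling value, and downgrade path are illustrative assumptions:

```python
# Per-request cost ceiling sketch. Rates are $/MTok from this guide;
# the ceiling and the downgrade path are illustrative assumptions.

RATES = {"sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}

def estimate_cost(model: str, input_tokens: int, expected_output: int) -> float:
    in_rate, out_rate = RATES[model]
    return input_tokens / 1e6 * in_rate + expected_output / 1e6 * out_rate

def route_with_ceiling(input_tokens: int, expected_output: int,
                       ceiling: float = 0.05) -> str:
    """Downgrade to a cheaper model when the estimate breaches the ceiling."""
    if estimate_cost("sonnet", input_tokens, expected_output) <= ceiling:
        return "sonnet"
    if estimate_cost("haiku", input_tokens, expected_output) <= ceiling:
        return "haiku"
    raise ValueError("request exceeds cost ceiling")

print(route_with_ceiling(5_000, 1_000))    # sonnet
print(route_with_ceiling(20_000, 4_000))   # haiku
```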

Related Guides

AI Token Pricing Calculator
Google Vertex AI Pricing Guide 2026
OpenAI API Pricing Guide 2026
AI Procurement Checklist: 20 Questions Before Signing
AI Vendor Selection Framework
AI Vendor Lock-In Risk Assessment

Frequently Asked Questions

Does Bedrock add a markup to third-party model pricing?

For Anthropic Claude models, Bedrock on-demand pricing matches Anthropic’s direct API rates. For Meta Llama models, Bedrock pricing is 10–40% higher than alternative hosting providers. For Amazon Nova models, Bedrock is the only access point, so there is no comparison. The markup (or lack thereof) varies by model provider and should be evaluated per model.

Can I pay for Bedrock with existing AWS credits or EDP commitments?

Yes. Bedrock charges appear on your standard AWS bill and can be applied against AWS credits, EDP commitments, and Savings Plans (subject to your specific agreement terms). This makes Bedrock attractive for organisations with unspent AWS committed spend, as AI consumption can be funded from already-committed budget rather than incremental allocation.

What is the minimum cost for running a fine-tuned model on Bedrock?

Fine-tuned models require provisioned throughput for inference. The minimum is one model unit with no long-term commitment, but hourly rates range from $7 to $50+ per hour depending on the base model. Even the cheapest option (approximately $7/hour) costs over $5,000 per month for 24/7 operation. Fine-tuning projects should budget for ongoing inference costs, not just one-time training costs.

How does cross-region inference affect my bill?

Cross-region inference does not incur additional charges. You are billed at the source region’s rates regardless of where inference actually executes. However, cross-region routing may introduce data residency compliance risks. Configure inference profiles to restrict processing to compliant regions if required by your regulatory framework.

Is batch processing available for all models on Bedrock?

No. Batch inference is available for select foundation models from Anthropic, Meta, Mistral, and Amazon. Not all models or model versions support batch processing. Check the current Bedrock documentation for the supported model list before designing batch workflows, and have a fallback plan for models that may lose batch support in future updates.

How do I get enterprise pricing for Bedrock?

Contact your AWS account team or reach out to AWS enterprise sales. Enterprise Bedrock pricing is negotiated as part of your broader AWS relationship, typically within the context of an Enterprise Discount Program or custom commitment agreement. Organisations spending $50,000+ monthly on Bedrock have the strongest negotiation leverage. Come prepared with 90 days of usage data showing stable consumption patterns, model utilisation metrics, and a clear growth forecast.

Should I use Bedrock or SageMaker for AI inference?

Bedrock is a managed API service — you call a model endpoint and pay per token. SageMaker is an ML platform where you deploy models to infrastructure you manage. For foundation model inference using third-party models (Claude, Llama, Mistral), Bedrock is almost always the right choice because it eliminates infrastructure management overhead. For custom-trained models, large-scale open-source model deployment, or workloads where you need fine-grained control over GPU instance types and scaling behaviour, SageMaker provides more flexibility at potentially lower cost for teams with ML operations expertise. Many enterprises use both: Bedrock for third-party model access and SageMaker for custom model deployment.

What monitoring should I set up on day one of a Bedrock deployment?

At minimum: enable AWS Cost Explorer with daily granularity for Bedrock services; create CloudWatch alarms on total daily token consumption (input and output separately) at 2× your expected baseline; tag every Bedrock invocation with application name and team owner; set up a weekly cost report delivered to the engineering lead and FinOps team. For production deployments, add per-endpoint latency alarms, throttling rate monitoring, and per-request token consumption tracking to detect runaway agent loops or prompt injection attacks that inflate token consumption.

GenAI Licensing Hub This pillar page is part of our GenAI Licensing Knowledge Hub — 25+ expert guides covering AI token pricing, contract risks, data privacy, and enterprise negotiation strategies.