At a glance:

- Model licence fee: $0
- True annual run cost: $50K–$500K+
- MAU threshold: 700M
- Licence classification: Not open source
📘 This guide is part of our GenAI Licensing Knowledge Hub — your comprehensive resource for enterprise AI licensing, contract negotiation, and cost optimization.

Meta Llama is the most misunderstood AI model in enterprise procurement. Executives hear “open source” and “free” and assume they can deploy it like Linux — download it, run it, and owe nothing. That is wrong on every count. Llama is not open source. It is governed by a bespoke commercial licence agreement that imposes attribution requirements, usage restrictions, indemnification obligations, and a 700-million-user threshold that can terminate your rights if your organisation grows past it. The model weights are free; everything required to make those weights useful in production is not. This guide is written for enterprise procurement leaders, general counsel, and CIOs who need to understand what “free” actually costs, what obligations the Llama licence creates, and how the total cost of a Llama deployment compares to the API alternatives it claims to replace.

The Licence: What You Are Actually Agreeing To

The Llama Community Licence Agreement is not an Apache 2.0, MIT, or GPL licence. It is a bilateral commercial contract between Meta and each organisation that downloads the model weights. Understanding its specific provisions is essential because the licence governs what you can build, how you must brand it, who bears liability, and under what circumstances Meta can terminate your rights.

Royalty-Free Commercial Use — With Conditions

The licence grants a royalty-free, worldwide, non-exclusive, non-transferable right to use, reproduce, distribute, and modify the Llama model weights and documentation. You can build revenue-generating products and services on Llama without paying Meta a per-token fee, a subscription, or a licence royalty. This is the genuine commercial advantage of Llama relative to proprietary API models.

But the grant is conditional. Violating any of the licence conditions — the acceptable use policy, the attribution requirement, the user threshold, or the naming convention — does not just expose you to a breach claim. It can terminate the licence entirely, retroactively removing your right to use the model in production systems you have already deployed.

The 700 Million MAU Threshold

If your organisation (or its affiliates) has 700 million or more monthly active users across all products and services, you must request a separate commercial licence from Meta before using Llama. Meta grants these licences at its sole discretion. This provision exists to prevent hyperscale competitors — primarily Google, Apple, Amazon, and Microsoft — from free-riding on Meta’s R&D investment. For 99.9% of enterprises, this threshold is irrelevant. But for large consumer-facing platforms, media companies, or conglomerates with diverse product portfolios, the aggregate MAU calculation across all subsidiaries and affiliates requires careful analysis.
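The threshold test itself is simple arithmetic, even if the legal analysis of what counts as an affiliate is not. A minimal sketch, with hypothetical affiliate names and MAU figures:

```python
def requires_separate_licence(mau_by_affiliate: dict, threshold: int = 700_000_000) -> bool:
    """Sum MAU across all products, subsidiaries, and affiliates,
    then compare the aggregate against the Llama licence threshold."""
    return sum(mau_by_affiliate.values()) >= threshold

# Hypothetical conglomerate: no single product crosses the line,
# but the aggregate (720M) does.
portfolio = {
    "consumer_app": 450_000_000,
    "media_subsidiary": 180_000_000,
    "b2b_platform": 90_000_000,
}
print(requires_separate_licence(portfolio))  # True
```

The hard part in practice is the input: deciding which entities and products belong in that dictionary is a legal question, not a coding one.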

Mandatory Attribution

Every product or service built using Llama must prominently display “Built with Llama” on a related website, in user-facing documentation, or within the application interface. This is not optional. It is a contractual obligation that applies to internal tools, customer-facing applications, and B2B platforms alike. For enterprises that consider their AI technology stack proprietary or competitive, this mandatory branding disclosure is a material consideration.

Derivative Model Naming

If you fine-tune a Llama model and distribute the derivative, its name must begin with “Llama.” This requirement extends Meta’s brand into your product naming. If you plan to distribute a fine-tuned model to customers, partners, or as part of a commercial product, your model carries Meta’s branding regardless of how much proprietary data and engineering you invested in the fine-tuning.

Indemnification and Liability

The Llama materials are provided “AS IS” without any warranty. Your organisation assumes all risks of using the model, and you must indemnify and hold Meta harmless from any third-party claims arising from your use or distribution of Llama. This means that if an employee uses Llama to generate content that infringes a third party’s intellectual property, if Llama produces harmful output that causes legal exposure, or if your deployment violates a regulation, your organisation bears the full legal and financial liability. Meta bears none.

EU Multimodal Restriction

Multimodal versions of Llama (those that process images, video, or audio alongside text) may not be used to provide services within the European Union. Text-only Llama models remain available for EU deployment. This restriction creates a significant gap for enterprises building AI applications in Europe that require image understanding, video analysis, or audio processing. If your European operations need multimodal AI capabilities, Llama cannot serve that requirement — you need a different model for the EU multimodal use case regardless of what you deploy elsewhere.

Acceptable Use Policy: The Unilateral Amendment Risk

The licence requires compliance with Meta’s Acceptable Use Policy (AUP), which prohibits using Llama for illegal activities, generating harmful content, disinformation, unauthorised professional advice, and building competing AI models. The critical risk: Meta can update the AUP at any time. If Meta adds new restrictions to the AUP that conflict with your current use case, you must comply or lose your licence. Your legal team should monitor the AUP on an ongoing basis and assess the business impact of potential future amendments.

The True Cost of “Free”: What Enterprise Llama Actually Costs

The Llama model weights cost $0 to download. Turning those weights into a production-grade enterprise AI system costs $50,000 to $500,000+ annually, depending on scale, infrastructure choices, and operational maturity. Here is where the money goes.

GPU Infrastructure: The Dominant Cost

Running Llama inference requires GPU compute. The specific GPU requirements depend on the model size:

Llama 3.3 70B (the most common enterprise deployment): Requires at minimum 2× NVIDIA A100 80GB or equivalent GPUs for inference. A single GPU instance capable of serving Llama 70B on AWS costs approximately $3–$8 per hour depending on instance type and commitment. Running a single inference endpoint 24/7 costs $2,200–$5,800 per month. Most production deployments require 2–4 inference replicas for redundancy and load handling, bringing the infrastructure cost to $4,400–$23,200 per month ($53,000–$278,000 annually).

Llama 3.1 405B (the largest model): Requires 8× A100 80GB or equivalent. A single inference endpoint costs approximately $20–$35 per hour. Running 24/7 with redundancy, annual infrastructure costs reach $350,000–$600,000+. Very few enterprises can justify this cost for inference alone.

Smaller models (Llama 3.2 1B, 3B, 8B): These models run on single GPUs or even CPU-only instances for low-throughput use cases. Infrastructure costs are dramatically lower — $500–$3,000 per month for a single inference endpoint. These smaller models are where Llama’s cost advantage is most compelling.
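The monthly figures above follow from a simple always-on calculation at roughly 730 hours per month; the hourly rates are the assumed ranges quoted earlier, not vendor quotes:

```python
HOURS_PER_MONTH = 730  # 24 hours x ~30.4 days

def monthly_endpoint_cost(hourly_rate: float, replicas: int = 1) -> float:
    """Cost of keeping `replicas` GPU inference endpoints running 24/7."""
    return hourly_rate * HOURS_PER_MONTH * replicas

print(monthly_endpoint_cost(3.0))     # 2190.0, the ~$2,200/month lower bound
print(monthly_endpoint_cost(8.0, 4))  # 23360.0, the ~$23,200/month upper bound
```

Reserved-instance and committed-use discounts change the hourly rate, not the shape of the calculation.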

MLOps and Engineering: The Overlooked Labour Cost

Self-hosting an LLM is not a deploy-and-forget operation. Llama requires ongoing engineering investment across:

- Model serving infrastructure (vLLM, TGI, or custom serving frameworks)
- GPU autoscaling to match traffic patterns
- Model version management when Meta releases updates
- Prompt template optimisation
- Guardrail implementation (Llama Guard integration and custom safety filters)
- Monitoring and observability (latency tracking, throughput measurement, error rate analysis)
- Incident response for inference failures

A minimal MLOps team for a production Llama deployment typically requires 0.5–1.5 dedicated ML engineers, representing $80,000–$225,000 annually in fully loaded labour cost. Larger deployments with multiple models, fine-tuned variants, and complex routing require 2–4 engineers ($160,000–$600,000).

Fine-Tuning: The Project That Becomes a Programme

Many enterprises adopt Llama specifically because fine-tuning is possible — a capability that most proprietary API providers do not offer (Anthropic offers no fine-tuning; OpenAI offers limited fine-tuning). Fine-tuning a Llama 70B model requires access to high-end GPU clusters (8× A100 or H100 for several hours to several days depending on dataset size). The compute cost for a single fine-tuning run ranges from $500 to $10,000. But fine-tuning is iterative: you run multiple experiments, evaluate results, adjust hyperparameters, curate additional training data, and repeat. A realistic enterprise fine-tuning project involves 10–30 training runs over 2–4 months, with total compute costs of $5,000–$50,000.

Then comes the ongoing cost: as your training data evolves, as Meta releases new base model versions, and as your use case requirements change, fine-tuning becomes a recurring programme, not a one-time project. Budget for quarterly re-training cycles at a minimum.
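As a planning aid, programme compute is runs × GPUs × hours × rate. A sketch with illustrative figures; the $4/GPU-hour rate and the run profile are assumptions, not quotes:

```python
def programme_compute_cost(runs: int, gpus: int, hours_per_run: float,
                           gpu_hour_rate: float) -> float:
    """Total GPU compute cost across an iterative fine-tuning programme."""
    return runs * gpus * hours_per_run * gpu_hour_rate

# 20 experimental runs on 8x H100, 10 hours each, at an assumed $4/GPU-hour:
print(programme_compute_cost(20, 8, 10, 4.0))  # 6400.0, within the $5K-$50K range above
```

Rerun the same arithmetic per quarter to budget the recurring re-training cycles.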

The Total Cost of Ownership Table

For a mid-size enterprise running Llama 3.3 70B in production with 500,000 requests per day:

| Cost component | Monthly cost |
| --- | --- |
| GPU infrastructure (3 replicas, reserved instances) | $8,000–$15,000 |
| MLOps engineering (1 FTE) | $12,000–$18,000 |
| Fine-tuning compute (amortised quarterly) | $1,500–$4,000 |
| Monitoring and observability tools | $500–$1,500 |
| Safety and guardrail infrastructure | $500–$1,000 |
| **Total** | **$22,500–$39,500** ($270,000–$474,000 annually) |

Compare this to running the same 500,000 requests per day through an API (assuming 2,000 input / 500 output tokens per request): Claude Sonnet 4.5 at $3/$15 per MTok would cost approximately $202,500/month. Claude Haiku 4.5 at $1/$5 would cost approximately $67,500/month. Llama 3.3 70B on AWS Bedrock at $0.72/$0.72 per MTok would cost approximately $27,000/month.
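Per-request token pricing makes these comparisons easy to recompute. A sketch you can rerun with your own volumes and prices (per-MTok rates as listed, 30-day month assumed):

```python
def monthly_api_cost(req_per_day: int, in_tokens: int, out_tokens: int,
                     in_price_mtok: float, out_price_mtok: float,
                     days: int = 30) -> float:
    """Monthly API spend from per-million-token (MTok) pricing."""
    dollars = req_per_day * days * (in_tokens * in_price_mtok +
                                    out_tokens * out_price_mtok)
    return dollars / 1_000_000

print(monthly_api_cost(500_000, 2000, 500, 3.00, 15.00))  # ~202500 (Sonnet 4.5)
print(monthly_api_cost(500_000, 2000, 500, 1.00, 5.00))   # ~67500  (Haiku 4.5)
print(monthly_api_cost(500_000, 2000, 500, 0.72, 0.72))   # ~27000  (Llama on Bedrock)
```

Swap in your actual token counts per request; output tokens usually dominate the bill at frontier-model prices.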

The managed route wins at low and moderate volume because you pay only for tokens consumed, with no infrastructure, engineering, or operational overhead. Llama self-hosting only achieves cost parity with API alternatives at extremely high scale (millions of requests per day) or when privacy, customisation, and control requirements justify the premium.

Compare AI Vendor Total Costs

Model open-source hosting costs against commercial API pricing with our free comparison tool.

Launch the vendor comparison calculator →

When Self-Hosted Llama Makes Economic Sense

Despite the cost disadvantage at moderate scale, there are specific enterprise scenarios where self-hosted Llama is the correct choice.

Data sovereignty requirements that prohibit sending data to third-party APIs. Regulated industries (healthcare, defence, financial services in certain jurisdictions) may require that all AI inference occurs within the organisation’s own infrastructure or a single-tenant cloud environment. Self-hosted Llama on your own VPC satisfies this requirement without relying on a vendor’s data handling commitments.

Custom model behaviour that requires deep fine-tuning. If your use case requires extensive fine-tuning with proprietary data — a domain-specific medical AI, a legal analysis tool trained on case law, or a financial model calibrated to your firm’s methodology — Llama’s open weights are the only path among leading models. The ability to own the fine-tuned model weights (under the Llama licence terms) is a genuine competitive advantage.

Extreme throughput at predictable volume. At 5+ million requests per day with predictable traffic, the fixed cost of GPU infrastructure is amortised across enough requests that the per-request cost drops below API pricing. The break-even point varies by model size, GPU pricing, and utilisation rate, but for Llama 70B on reserved GPU instances, it typically occurs at 2–5 million daily requests.
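The break-even depends on every input, so model it rather than assume it. A deliberately simplified sketch; replica capacity, replica cost, fixed ops overhead, and the API per-request price are all assumptions to replace with your own figures:

```python
import math

def self_host_monthly(req_per_day: int, req_per_replica_day: int = 250_000,
                      replica_monthly: float = 4_000.0,
                      fixed_ops_monthly: float = 15_000.0) -> float:
    """Self-hosting cost: replicas scale stepwise with load (minimum 2 for
    redundancy), plus a fixed monthly engineering/ops overhead."""
    replicas = max(2, math.ceil(req_per_day / req_per_replica_day))
    return replicas * replica_monthly + fixed_ops_monthly

def api_monthly(req_per_day: int, cost_per_request: float, days: int = 30) -> float:
    """Managed API cost: purely linear in volume."""
    return req_per_day * cost_per_request * days

# Scan projected volumes to see where the two curves cross for a given
# per-request API price (here an assumed $0.0018/request):
for daily in (100_000, 500_000, 1_000_000, 5_000_000):
    print(daily, self_host_monthly(daily), api_monthly(daily, 0.0018))
```

The fixed-cost curve is stepwise and the API curve is linear, so the crossover moves sharply with utilisation: idle reserved GPUs are pure loss, while an API bill falls to zero with traffic.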

Latency requirements below 200ms. Self-hosted inference with optimised serving frameworks (vLLM, TensorRT-LLM) can achieve lower and more predictable latency than shared API endpoints, which are subject to throttling and queuing during peak demand. For real-time applications where every millisecond matters, self-hosting provides architectural control over the latency profile.

Multi-model strategies that require model routing control. Enterprises running multiple Llama variants (different sizes, different fine-tunes) alongside other models benefit from self-hosted infrastructure that allows fine-grained routing based on task complexity, cost, and latency requirements. This routing architecture is more flexible when you control the entire serving stack.

The Managed Alternative: Llama Through Cloud Providers

Every major cloud provider now offers Llama as a managed API service, providing the model’s capabilities without the infrastructure and operational burden of self-hosting.

AWS Bedrock: Llama 3.3 70B at $0.72 input / $0.72 output per MTok. Llama 3.1 405B at $5.32 input / $16.00 output per MTok. Batch processing at 50% discount. Bedrock pricing includes a markup over self-hosted costs but eliminates infrastructure management. Bedrock consumption counts toward AWS EDP commitments, making it attractive for organisations with unspent AWS committed spend.

Azure AI (Microsoft Foundry): Llama models available through Azure’s model catalogue. Pricing is similar to Bedrock. Azure adds its own terms on top of the Llama licence, including content filtering obligations and attribution requirements. The January 2026 Product Terms clarify that the EU multimodal restriction applies to Llama served through Azure.

Google Cloud Vertex AI: Llama available through Vertex AI’s model garden. Pricing aligned with other providers. Google adds its own acceptable use policies on top of Meta’s licence terms.

Together AI, Groq, Fireworks: Specialised inference providers offer Llama at lower per-token rates than hyperscalers — Llama 3.3 70B for $0.54–$0.60 per MTok on some platforms. These providers compete on price and latency for teams that want managed Llama inference without the hyperscaler premium. The trade-off is a smaller provider with less enterprise sales infrastructure, potentially weaker SLAs, and limited compliance certifications.

Need Expert AI Licensing Advisory?

Redress Compliance provides independent GenAI licensing advisory — fixed-fee, no vendor affiliations.

Explore GenAI Advisory Services →

The managed API approach is the right default for most enterprises evaluating Llama. You get the model’s capabilities without the $270K–$474K annual self-hosting cost, while retaining the option to migrate to self-hosted infrastructure if your scale justifies it. Start with managed, migrate to self-hosted only when the economics and operational requirements demand it.

Need help evaluating open-source vs commercial AI?

Our advisory team provides independent cost modelling, infrastructure analysis, and vendor-neutral recommendations across Llama, GPT, Claude, and Gemini deployments. Fixed-fee engagement.

Learn about our GenAI advisory services →

Llama vs Proprietary Models: The Honest Comparison

The procurement decision between Llama and proprietary alternatives (Claude, GPT-4o, Gemini) should be based on four dimensions, not just model quality benchmarks.

Capability: Llama 3.3 70B is competitive with Claude Sonnet and GPT-4o on many enterprise tasks, particularly coding, structured data extraction, and multilingual text generation. It lags behind Claude and GPT-4o on complex multi-step reasoning, nuanced document analysis, and tasks requiring very large context windows (Llama’s 128K context is smaller than Claude’s 200K–1M). For simple tasks (classification, routing, extraction), smaller Llama models match or exceed proprietary offerings at a fraction of the cost.

Cost at your scale: At low-to-moderate scale (under 1 million requests/day), managed API alternatives are almost always cheaper than self-hosted Llama. At very high scale (5M+ requests/day), self-hosted Llama is cheaper. In the middle, the comparison depends on GPU pricing, engineering costs, and utilisation rate. Run the actual numbers for your projected volume before assuming Llama is the cheaper option.

Control and customisation: Llama is the only leading model where you can access and modify the weights. If fine-tuning, safety customisation, or architectural modification is essential to your use case, Llama (or other open-weight models like Mistral and Qwen) is your only option among frontier-class models. This advantage is decisive for a narrow set of use cases and irrelevant for the majority.

Risk profile: Proprietary API providers (Anthropic, OpenAI, Google) bear operational responsibility for model serving, including uptime, security, and compliance certifications. With self-hosted Llama, your organisation bears all operational risk. The Llama licence further shifts legal liability to the customer through the indemnification clause. Managed Llama through Bedrock or Azure sits in the middle: the cloud provider bears operational risk while the Llama licence terms still apply to the underlying model.

Seven Contract and Procurement Recommendations

1. Do not treat the Llama licence as “open source” in procurement. Route the Llama licence through your legal team for formal review, just as you would any commercial software agreement. The attribution requirement, indemnification clause, AUP compliance obligation, and EU multimodal restriction all have operational and legal implications that require assessment.

2. Monitor the Acceptable Use Policy as an ongoing compliance obligation. Assign a member of your legal or compliance team to review Meta’s AUP quarterly for amendments. Establish an internal process for assessing the impact of AUP changes on your deployed Llama use cases.

3. Model the full TCO before committing to self-hosted deployment. Include GPU infrastructure, MLOps engineering, fine-tuning compute, safety tooling, monitoring, and the opportunity cost of engineering resources diverted from other projects. Compare against managed API alternatives at your projected volume. Self-hosting is only the right choice when the numbers support it.

4. Start with managed Llama through your existing cloud provider. Use Bedrock, Azure, or Vertex AI to validate Llama’s fit for your use cases before investing in self-hosted infrastructure. Managed deployment lets you test performance, measure token consumption, and build production experience at API-level cost with minimal operational overhead.

5. Budget for the EU multimodal gap. If your organisation operates in the EU and requires multimodal AI capabilities, plan for a separate model provider for those use cases. Do not assume Llama can serve your entire AI portfolio if European multimodal requirements exist.

6. Negotiate cloud provider terms for Llama separately from the base licence. When deploying Llama through AWS, Azure, or Google Cloud, the cloud provider adds its own terms on top of Meta’s licence. Review both layers. Ensure the cloud provider’s content filtering, data handling, and compliance commitments meet your requirements independently of Meta’s licence terms.

7. Build exit paths into your architecture from day one. Use model-agnostic serving frameworks, standardised prompt formats, and abstraction layers that allow you to swap Llama for an alternative model (Mistral, Qwen, or a proprietary API) without rewriting your application. The Llama licence can be terminated if you violate its conditions or if Meta amends the AUP in a way that conflicts with your use case. Your architecture should survive a model change.
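In code, recommendation 7 can be as thin as a single interface between application logic and model backends. A minimal sketch; the backend classes and stub responses are illustrative, not a real SDK:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def generate(self, prompt: str) -> str: ...

class LlamaBackend:
    """Would wrap a vLLM endpoint or Bedrock call in a real deployment."""
    def generate(self, prompt: str) -> str:
        return f"[llama-3.3-70b] {prompt}"

class ApiFallbackBackend:
    """Drop-in replacement if the Llama licence path is ever closed off."""
    def generate(self, prompt: str) -> str:
        return f"[api-fallback] {prompt}"

def answer(model: ChatModel, prompt: str) -> str:
    # Business logic sees only the interface, so swapping models is a
    # configuration change, not an application rewrite.
    return model.generate(prompt)

print(answer(LlamaBackend(), "Summarise the indemnification clause."))
```

Prompt templates and output parsing should live behind the same boundary, since those are where vendor-specific assumptions tend to accumulate.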

📊 Free Assessment Tool

How does Meta Llama compare to other enterprise AI options? Our free benchmarking assessment compares pricing and terms.

Take the Free Assessment →

Client Result

BBVA avoided a 3-year AI lock-in and achieved 28% cost savings through independent vendor advisory.

Read the case study →

Frequently Asked Questions

Is Llama truly free for commercial use?

The model weights are free to download and use commercially under the Llama Community Licence Agreement. However, the licence imposes conditions (attribution, AUP compliance, 700M MAU threshold, EU multimodal restriction) and the total cost of running Llama in production includes GPU infrastructure, engineering, and operational expenses that typically range from $50K to $500K+ annually.

Do we need a separate licence from Meta if we use Llama through AWS Bedrock or Azure?

No. The cloud provider’s agreement covers your access to Llama through their platform. However, the underlying Llama licence terms (including attribution, AUP compliance, and the EU multimodal restriction) still apply to your use of the model. You are bound by both the cloud provider’s terms and Meta’s licence.

Can we fine-tune Llama and keep the resulting model proprietary?

You own the modifications and derivative works you create through fine-tuning. However, if you distribute the fine-tuned model, its name must begin with “Llama” and the “Built with Llama” attribution requirement applies. The fine-tuned model remains subject to all Llama licence terms, including the AUP and 700M MAU threshold.

What happens if Meta changes the Acceptable Use Policy after we deploy?

You must comply with the updated AUP or risk licence termination. There is no grandfathering provision in the current licence. This is a material ongoing risk that your legal and compliance teams should monitor actively.

How does Llama compare to Claude or GPT-4o for enterprise use?

Llama 3.3 70B is competitive on many enterprise tasks but generally lags behind Claude Sonnet 4.5 and GPT-4o on complex reasoning and document analysis. Llama’s advantage is control (open weights, fine-tuning capability, self-hosting option) and potential cost savings at very high scale. The right choice depends on your specific quality requirements, volume, data sensitivity, and operational capacity.

What is the EU multimodal restriction and does it affect text-only use?

Multimodal Llama models (those processing images, video, or audio) may not be used to provide services within the European Union. Text-only Llama models are not affected by this restriction and can be deployed in the EU without limitation. If your EU operations require multimodal AI capabilities such as image analysis, document OCR with vision models, or audio transcription, you need a different model provider for those specific functions. This restriction applies regardless of whether you self-host or use Llama through a cloud provider like AWS or Azure.

What does the “Built with Llama” attribution requirement mean in practice?

Any product or service built using Llama must prominently display “Built with Llama” on a related website, user-facing documentation, or within the application interface. This applies to both customer-facing products and internal tools. For enterprises that consider their AI technology stack to be proprietary or competitively sensitive, this mandatory disclosure may be a strategic concern. There is no exemption for internal-only deployments — the licence does not distinguish between internal and external use for attribution purposes. Your legal team should evaluate whether this branding requirement conflicts with your organisation’s competitive positioning or client confidentiality obligations.

Can Meta terminate our Llama licence, and what triggers termination?

Yes. The licence terminates automatically if you violate any of its terms, including the Acceptable Use Policy, attribution requirements, or the 700M MAU threshold. There is no cure period in the current licence — a violation results in immediate termination. Additionally, because Meta can update the AUP unilaterally, a change in the acceptable use terms could create a conflict with your existing deployment that triggers termination if not promptly addressed. This termination risk is why building model-agnostic architecture is critical: if your Llama licence is terminated, you need the ability to swap to an alternative model without rebuilding your application.

How many enterprises are actually using Llama in production?

Meta reports over one billion Llama downloads as of early 2025. Enterprise adopters publicly identified include Goldman Sachs, AT&T, Nomura Holdings, and numerous mid-market companies across financial services, healthcare, and technology. However, download counts are a poor proxy for production deployment. Most downloads are for evaluation, research, and experimentation. The number of enterprises running Llama in production at scale (serving real business workloads to real users) is a fraction of total downloads — estimated in the low thousands. Adoption is growing rapidly, particularly through managed cloud providers (Bedrock, Azure) where the operational barrier to deployment is significantly lower than self-hosting.

GenAI Licensing Hub — This guide is part of our GenAI Licensing Knowledge Hub — 80+ expert guides covering AI token pricing, contract risks, data privacy, and enterprise negotiation strategies across OpenAI, Anthropic, Google, AWS, and Microsoft.