Llama 4 API Guide: Scout & Maverick Pricing, Setup & Code Examples (2026)

Meta's Llama 4 dropped with two models that actually matter: Scout and Maverick. Both are open-weight, both use mixture-of-experts (MoE) architecture, and both are natively multimodal. The pricing is aggressive enough to make you rethink your entire model stack.

This guide covers everything you need to start making Llama 4 API calls today — pricing, which model to pick, and working code in Python and Node.js.

Scout vs. Maverick: Which One Do You Need?

Here's the short version. Scout is the lightweight option with a massive context window. Maverick is the heavy hitter that trades blows with GPT-4o at a fraction of the cost.

| Spec | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|
| Total parameters | 109B | 400B |
| Active parameters | 17B | 17B |
| Experts | 16 | 128 |
| Context window | 10M tokens | 1M tokens |
| Modality | Text + image | Text + image |
| Input price | ~$0.08/1M tokens | ~$0.17/1M tokens |
| Output price | ~$0.30/1M tokens | ~$0.50/1M tokens |

Both models activate only 17B parameters per forward pass despite having much larger total parameter counts. That's the MoE trick — you get the knowledge of a huge model with the inference cost of a small one.

The 10M context window on Scout is not a typo. That's roughly 7.5 million words in a single prompt. You can feed it an entire codebase, a full book, or months of chat logs and it'll handle it. Maverick's 1M context is still generous by any standard.
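The back-of-envelope conversion is easy to sanity-check. Assuming the common rule of thumb of roughly 0.75 English words per token (this varies by tokenizer and content):

```python
# Rough capacity math for Scout's context window.
# ~0.75 English words per token is an assumption; actual ratio varies.
WORDS_PER_TOKEN = 0.75

scout_context = 10_000_000  # tokens

scout_words = int(scout_context * WORDS_PER_TOKEN)
print(f"Scout: ~{scout_words:,} words per prompt")  # ~7,500,000 words

# A typical novel runs ~90k words; how many fit in one Scout prompt?
novels = scout_words // 90_000
print(f"Roughly {novels} novels in a single context window")  # 83
```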

When to Use Each Model

Pick Scout when:

- Your prompts are huge: full codebases, entire books, or months of chat logs that need the 10M-token window
- You're running high-volume, cost-sensitive tasks like classification, extraction, or formatting
- You're prototyping and want the cheapest path to a working pipeline

Pick Maverick when:

- Output quality matters most: code generation, analysis, creative writing
- You want GPT-4o-class quality at a fraction of the cost
- You're doing vision work where accuracy matters

Llama 4 API Pricing Compared

Since Llama 4 is open-weight, pricing depends on your provider. Here's what the major API providers charge:

| Provider | Scout input | Scout output | Maverick input | Maverick output |
|---|---|---|---|---|
| Together AI | $0.08 | $0.30 | $0.17 | $0.50 |
| Groq | $0.11 | $0.34 | $0.20 | $0.60 |
| Fireworks | $0.10 | $0.30 | $0.18 | $0.55 |
| KissAPI | $0.08 | $0.30 | $0.17 | $0.50 |

All prices per 1M tokens. The differences are small, but they add up at scale. The real differentiator between providers is speed and reliability, not price.

For context: Claude Sonnet 4.6 costs $3/$15 per million tokens (input/output). Maverick at $0.17/$0.50 is roughly 17x cheaper on input and 30x cheaper on output. It won't match Sonnet on every task, but for many workloads the quality gap doesn't justify a price gap that size.
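To see what those ratios mean per request, here's a quick cost sketch using the prices above. The token counts are made-up example values:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars; prices are per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 50k-token prompt producing a 2k-token answer
maverick = request_cost(50_000, 2_000, 0.17, 0.50)
sonnet = request_cost(50_000, 2_000, 3.00, 15.00)

print(f"Maverick: ${maverick:.4f}")  # $0.0095
print(f"Sonnet:   ${sonnet:.4f}")    # $0.1800
print(f"Ratio:    {sonnet / maverick:.0f}x")  # 19x
```

Run your own expected token counts through this before committing to a provider; input-heavy workloads and output-heavy workloads land in different places.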

Quick Start: Llama 4 API in Python

Most providers serve Llama 4 through an OpenAI-compatible endpoint. That means you can use the standard OpenAI Python SDK — just swap the base URL and model name.

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"
)

# Using Llama 4 Maverick
response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python developer. Be concise."
        },
        {
            "role": "user",
            "content": "Write a rate limiter using the token bucket algorithm."
        }
    ],
    temperature=0.7,
    max_tokens=2048
)

print(response.choices[0].message.content)

To switch to Scout, change the model to meta-llama/llama-4-scout. Everything else stays the same.

Quick Start: Llama 4 API in Node.js

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.kissapi.ai/v1",
});

const response = await client.chat.completions.create({
  model: "meta-llama/llama-4-maverick",
  messages: [
    { role: "user", content: "Explain the CAP theorem in 3 sentences." }
  ],
  temperature: 0.5,
});

console.log(response.choices[0].message.content);

Streaming Responses

For anything user-facing, you want streaming. It makes the response feel instant even when the model takes a few seconds to finish.

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"
)

stream = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[
        {"role": "user", "content": "Write a Dockerfile for a FastAPI app with Redis."}
    ],
    stream=True
)

for chunk in stream:
    # Guard: some providers send chunks with empty choices (e.g. usage-only chunks)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
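If you also need the complete response after streaming, for logging or caching, accumulate the deltas as they arrive. A minimal sketch; the stand-in chunk objects below only mimic the SDK's shapes for illustration:

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Print deltas as they arrive and return the full response text."""
    parts = []
    for chunk in stream:
        # Guard against chunks with empty choices
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
            print(chunk.choices[0].delta.content, end="", flush=True)
    return "".join(parts)

# Stand-in chunks shaped like the SDK's objects (illustration only)
def fake_chunk(text):
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

full = collect_stream([fake_chunk("Hello, "), fake_chunk("world")])
print()
print(f"Collected {len(full)} characters")  # 12
```

In real use you'd pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` straight into `collect_stream`.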

Using Llama 4 with curl

For quick testing or shell scripts:

curl https://api.kissapi.ai/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-4-scout",
    "messages": [
      {"role": "user", "content": "What is a mixture-of-experts model?"}
    ],
    "max_tokens": 512
  }'

Multimodal: Sending Images to Llama 4

Both Scout and Maverick accept images natively. No separate vision model needed.

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this screenshot? List any bugs you see."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"}
                }
            ]
        }
    ]
)

This works for UI screenshots, diagrams, charts, photos — anything you'd throw at GPT-4o's vision. The quality is competitive, especially on Maverick.
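The example above uses a public URL. For local files, the same OpenAI-compatible format accepts a base64 data URL instead. A sketch of building that message content (the file path in the usage comment is hypothetical):

```python
import base64
from pathlib import Path

def image_message(path: str, prompt: str) -> dict:
    """Build a user message with an inline base64-encoded image."""
    data = Path(path).read_bytes()
    b64 = base64.b64encode(data).decode("ascii")
    # Adjust the MIME type to match your file (png, jpeg, webp, ...)
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# Pass the result straight into messages=[...]:
# client.chat.completions.create(
#     model="meta-llama/llama-4-maverick",
#     messages=[image_message("screenshot.png", "List any bugs you see.")],
# )
```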

Llama 4 vs. the Competition

Where does Llama 4 actually sit in the model landscape right now?

| Model | Input $/1M | Output $/1M | Context | Best for |
|---|---|---|---|---|
| Llama 4 Scout | $0.08 | $0.30 | 10M | Long-context, high-volume |
| Llama 4 Maverick | $0.17 | $0.50 | 1M | Quality at low cost |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Coding, analysis |
| GPT-5.4 Mini | $0.40 | $1.60 | 128K | General purpose |
| DeepSeek V4 | $0.14 | $0.28 | 128K | Budget reasoning |

Llama 4 Maverick sits in an interesting spot. It's not going to beat Claude Opus on hard reasoning tasks, but it doesn't need to: at $0.50 per million output tokens, it costs a small fraction of what frontier models charge. For tasks where "good enough" quality at rock-bottom pricing wins, Maverick is hard to argue against.

Scout's 10M context window is unmatched. If you're building RAG systems and want to skip the chunking step entirely, or if you need to analyze entire repositories in one shot, nothing else comes close at this price point.
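The "skip chunking" approach is mechanical: concatenate your files with clear path markers and send the result as one prompt. A minimal sketch; the directory path and extension filter are example values:

```python
from pathlib import Path

def build_repo_prompt(root: str, extensions=(".py", ".md")) -> str:
    """Concatenate matching files under root, each prefixed with its path."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"=== {path} ===\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)

# prompt = build_repo_prompt("./my-project")
# Then append your question and send it to Scout:
# messages=[{"role": "user", "content": prompt + "\n\nWhere is auth handled?"}]
```

For anything beyond a quick experiment, count tokens before sending; even 10M fills up faster than you'd think on large monorepos.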

Practical Tips

  1. Start with Scout for prototyping. It's cheaper and the quality difference from Maverick is smaller than you'd expect for most tasks. Upgrade to Maverick only when Scout's output isn't cutting it.
  2. Use a model router. Route simple tasks (classification, extraction, formatting) to Scout and complex tasks (code generation, creative writing, analysis) to Maverick. This alone can cut your bill by 40-60%.
  3. Take advantage of the context window. With 10M tokens on Scout, you can skip traditional RAG entirely for many use cases. Just dump the full document set into the prompt. It's lazy, but it works.
  4. Watch your output tokens. Input is cheap on both models. Output is where the cost lives. Set max_tokens appropriately and use system prompts that encourage concise responses.
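Tip 2 can be as simple as a lookup table. A hypothetical router; the task categories and the Scout/Maverick split are assumptions to adapt to your own workload:

```python
# Route cheap, structured tasks to Scout; harder generative work to Maverick.
ROUTES = {
    "classification": "meta-llama/llama-4-scout",
    "extraction":     "meta-llama/llama-4-scout",
    "formatting":     "meta-llama/llama-4-scout",
    "codegen":        "meta-llama/llama-4-maverick",
    "writing":        "meta-llama/llama-4-maverick",
    "analysis":       "meta-llama/llama-4-maverick",
}

def pick_model(task: str) -> str:
    # Default to the cheaper model for unknown task types
    return ROUTES.get(task, "meta-llama/llama-4-scout")

print(pick_model("codegen"))         # meta-llama/llama-4-maverick
print(pick_model("classification"))  # meta-llama/llama-4-scout
```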

Try Llama 4 API Now

Access Llama 4 Scout and Maverick, plus Claude, GPT-5, and 200+ other models, through one API key. Pay-as-you-go, no subscription.

Start Free →

FAQ

Is Llama 4 really free?

The model weights are free and open under Meta's license. But running inference costs money — either your own GPU hardware or API provider fees. API access through providers like KissAPI is the fastest way to get started without managing infrastructure.

Can I use Llama 4 with Cursor or Claude Code?

Yes. Any tool that supports OpenAI-compatible endpoints works with Llama 4 through an API gateway. Set the base URL and API key, then select the Llama 4 model.

How does Llama 4 handle code generation?

Maverick is solid for code. It won't match Claude Sonnet 4.6 on complex multi-file refactoring, but for single-function generation, bug fixes, and code review, it's surprisingly capable — especially given the price difference.

What about the 10M context on Scout — does it actually work?

It works, but with caveats. Quality degrades somewhat in the middle of very long contexts (the "lost in the middle" problem affects all models). For best results, put the most important information at the beginning and end of your prompt.
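One way to act on that advice is to order retrieved documents so the most relevant ones sit at both ends of the prompt. A sketch, assuming you already have documents sorted by relevance (e.g. from an embedding search):

```python
def sandwich_order(docs_by_relevance: list[str]) -> list[str]:
    """Place docs so the most relevant land at the start and end of the prompt.

    Input is ordered most-relevant first; output alternates front/back
    so the middle of the prompt holds the least relevant docs.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1 (best)", "doc2", "doc3", "doc4", "doc5 (worst)"]
print(sandwich_order(docs))
# ['doc1 (best)', 'doc3', 'doc5 (worst)', 'doc4', 'doc2']
```

The top two documents end up first and last, with the weakest matches buried in the middle where degraded recall hurts least.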