GPT-5.4 Mini & Nano API Guide: Pricing, Benchmarks & Code Examples (2026)
OpenAI just dropped GPT-5.4 mini and GPT-5.4 nano, and they're worth paying attention to. Two weeks after the full GPT-5.4 launch, these smaller variants target the workloads where the flagship model is overkill — coding subagents, classification, data extraction, and anything where you're making thousands of API calls per hour.
Here's the thing: GPT-5.4 mini scores 54.38% on SWE-bench Pro. That's less than three percentage points behind the full GPT-5.4, at 70% lower cost per token. If you're still routing every request through the flagship model, you're burning money.
This guide covers everything you need to start using both models today — pricing, benchmarks, when to pick which, and working code you can copy-paste.
Pricing Breakdown
All prices per million tokens:
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 | 1.05M |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | 400K |
| GPT-5.4 nano | $0.20 | $0.02 | $1.25 | 400K |
For context, here's how they stack up against the competition:
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Gemini 3.1 Flash | $0.50 | $3.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 |
| GPT-5.4 mini | $0.75 | $4.50 |
| GPT-5.4 nano | $0.20 | $1.25 |
GPT-5.4 nano undercuts Gemini 3.1 Flash-Lite on input price. That makes it the cheapest model from any major lab right now, at least on paper. Whether it's actually cheaper depends on how many output tokens your use case generates — nano's output pricing ($1.25/M) is slightly cheaper than Flash-Lite's ($1.50/M) too.
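Whether nano or Flash-Lite wins for you depends on your input/output token mix, so it's worth doing the arithmetic for your own traffic. Here's a back-of-the-envelope calculator with the prices hardcoded from the table above; `request_cost` is just an illustrative helper, not an SDK feature:

```python
# Per-million-token prices from the table above (USD).
PRICES = {
    "gpt-5.4":      {"input": 2.50, "cached": 0.25,  "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached": 0.075, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "cached": 0.02,  "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request. Cached tokens are billed at the
    discounted cached-input rate instead of the regular input rate."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached"]
            + output_tokens * p["output"]) / 1_000_000

# 10,000 requests, each with a 1,500-token prompt and a 300-token response:
mini = 10_000 * request_cost("gpt-5.4-mini", 1_500, 300)
nano = 10_000 * request_cost("gpt-5.4-nano", 1_500, 300)
```

For that workload the totals come out to roughly $24.75 on mini versus $6.75 on nano, which is the kind of gap that makes model choice worth automating.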
Benchmarks: What Can They Actually Do?
OpenAI's self-reported numbers (take with the usual grain of salt):
| Benchmark | GPT-5.4 | GPT-5.4 mini | GPT-5.4 nano |
|---|---|---|---|
| SWE-bench Pro | 57.2% | 54.38% | — |
| MMLU | 93.5% | 90.1% | 85.2% |
| HumanEval | 96.3% | 93.8% | 88.5% |
The standout number: mini at 54.38% on SWE-bench Pro. That benchmark tests models on real software engineering tasks — not toy problems, but actual GitHub issues from production repos. Three points behind the flagship for a fraction of the cost is a good trade for most workflows.
Nano is a different story. It's not trying to compete with mini on complex reasoning. OpenAI positions it for classification, extraction, ranking, and simple coding subagents. Think of it as the model you call 10,000 times per hour to label data or route requests, not the one you ask to refactor your authentication system.
Mini vs. Nano: When to Use Which
Here's a practical decision framework:
Use GPT-5.4 mini when:
- You need near-flagship coding ability at lower cost
- Your agent needs to reason through multi-step problems
- You're building a coding assistant or copilot feature
- Quality matters more than raw throughput
- You want thinking/reasoning mode on a budget
Use GPT-5.4 nano when:
- You're doing classification, tagging, or routing
- You need to process thousands of items per minute
- The task is well-defined with clear expected outputs
- You're building subagents that handle simple subtasks
- Cost per request matters more than peak intelligence
Still use the full GPT-5.4 when:
- You need the 1.05M token context window (mini/nano cap at 400K)
- You're doing complex multi-file refactoring
- Accuracy on the first attempt saves more than the token cost
- You need computer use capabilities
Code Examples
Python — Basic Chat Completion
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"  # or https://api.openai.com/v1
)

# GPT-5.4 mini — good for coding tasks
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a retry decorator with exponential backoff."}
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
```
Python — Nano for Batch Classification
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"
)

async def classify(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-5.4-nano",
        messages=[
            {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral. Reply with one word only."},
            {"role": "user", "content": text}
        ],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

async def main():
    reviews = [
        "This product changed my workflow completely.",
        "Broke after two days. Waste of money.",
        "It works. Nothing special.",
        "Best purchase I've made this year.",
        "Customer support never responded."
    ]
    results = await asyncio.gather(*[classify(r) for r in reviews])
    for review, sentiment in zip(reviews, results):
        print(f"{sentiment:>10} | {review}")

asyncio.run(main())
```
At nano pricing, classifying those 5 reviews costs roughly $0.0003. Scale that to 100,000 reviews and you're looking at about $6.
curl — Quick Test
```shell
curl https://api.kissapi.ai/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-nano",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Node.js — Streaming with Mini
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.kissapi.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-5.4-mini",
  messages: [
    { role: "user", content: "Explain WebSocket connection pooling in Node.js" }
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
The Model Router Pattern
The real power of having three GPT-5.4 tiers isn't picking one — it's routing between them dynamically. A simple router can cut your API bill by 60-70% without noticeable quality loss.
The idea: classify the incoming request complexity, then route to the cheapest model that can handle it.
```python
def route_request(prompt: str, needs_reasoning: bool = False) -> str:
    """Pick the cheapest model that fits the task."""
    token_count = len(prompt.split()) * 1.3  # rough estimate
    if needs_reasoning or token_count > 300_000:
        return "gpt-5.4"  # full model for heavy lifting
    elif any(kw in prompt.lower() for kw in ["refactor", "debug", "architect", "review"]):
        return "gpt-5.4-mini"  # coding tasks
    else:
        return "gpt-5.4-nano"  # everything else
```
In practice, most production traffic falls into the nano bucket. Classification, extraction, formatting, simple Q&A — none of that needs mini-level intelligence. Save mini for the coding and reasoning tasks where the quality gap actually matters.
Cached Input Pricing: The Hidden Savings
Both mini and nano support prompt caching, and the savings are significant:
- Mini cached input: $0.075/M (90% off regular input)
- Nano cached input: $0.02/M (90% off regular input)
If you're sending the same system prompt with every request — which most apps do — caching alone can cut your input costs by 80-90%. For a coding assistant that sends a 2,000-token system prompt with every completion, that's real money at scale.
Caching kicks in automatically when OpenAI detects repeated prompt prefixes. No code changes needed.
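Because caching keys off the prompt prefix, prompt ordering matters: put the stable content (system prompt, style guides, few-shot examples) first and the per-request content last, so every call shares the same cacheable prefix. A minimal sketch of that ordering; `build_messages` and the constants are illustrative, not part of the SDK:

```python
SYSTEM_PROMPT = "You are a senior Python developer."
STYLE_GUIDE = "Prefer type hints. Keep functions under 30 lines."  # long, unchanging instructions

def build_messages(user_input: str) -> list[dict]:
    """Stable content first so repeated requests share a cached prefix;
    only the final user message varies between calls."""
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{STYLE_GUIDE}"},
        {"role": "user", "content": user_input},
    ]
```

If you instead interleave variable content (timestamps, request IDs) near the top of the prompt, the shared prefix shrinks and the cache discount mostly disappears.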
400K Context: Enough for Most Things
Both mini and nano support 400,000-token context windows. That's roughly 300,000 words, or about 600 pages of text. For reference, the full GPT-5.4 goes up to 1.05 million tokens.
When does the 400K limit actually matter? Rarely, for most developers. If you're processing entire codebases or very long documents, you might hit it. But for chat, coding assistance, and data processing, 400K is more than enough.
One thing to watch: extended context pricing. If your prompts exceed certain thresholds, per-token costs go up. Keep your prompts lean. Send only what the model needs to see.
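One way to keep prompts lean is a hard token budget on the context you attach, dropping the oldest chunks first. A rough sketch using a characters-divided-by-four token estimate (an approximation; exact counts require the model's tokenizer, and the helper name is ours):

```python
def trim_context(chunks: list[str], max_tokens: int = 300_000) -> list[str]:
    """Keep the most recent chunks that fit under a rough token budget.
    Uses len(text) // 4 as a crude token estimate."""
    kept: list[str] = []
    total = 0
    for chunk in reversed(chunks):  # walk newest-first
        est = len(chunk) // 4 + 1
        if total + est > max_tokens:
            break
        kept.append(chunk)
        total += est
    return list(reversed(kept))  # restore original order
```

Dropping oldest-first is a simple policy; for retrieval-heavy apps, ranking chunks by relevance before trimming usually beats pure recency.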
Mini with Thinking Mode
GPT-5.4 mini supports reasoning/thinking mode, where the model works through a problem step-by-step before answering. This is available to free and Go-tier ChatGPT users too, not just API customers.
For API usage, you can control reasoning effort:
```python
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": "Find the bug in this code..."}],
    reasoning_effort="high"  # low, medium, high, or xhigh
)
```
Higher reasoning effort = more thinking tokens = higher cost per request. But for debugging and architecture questions, the improved accuracy usually means fewer retries. Test both ways and measure.
Try GPT-5.4 Mini & Nano Today
Access GPT-5.4 mini, nano, and 100+ other models through one API. OpenAI-compatible endpoint, pay-as-you-go pricing, no subscription required.
Migration from Previous Models
If you're currently using GPT-5 mini or GPT-4o-mini, switching is straightforward — just change the model name:
- `gpt-5-mini` → `gpt-5.4-mini`
- `gpt-5-nano` → `gpt-5.4-nano`
- `gpt-4o-mini` → `gpt-5.4-nano` (closest equivalent)
The API format is identical. Same endpoints, same parameters, same response structure. The only breaking change: GPT-5.4 mini and nano are priced higher than their GPT-5 predecessors. Mini went from roughly $0.30/$1.20 to $0.75/$4.50. Nano went from $0.10/$0.40 to $0.20/$1.25. You're paying more, but you're getting substantially better models.
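If model names live in a config file or constants module, a small alias map keeps the migration to a one-line change per call site. The mapping below just mirrors the list above; the helper name is ours:

```python
# Old model name → GPT-5.4 replacement (gpt-4o-mini has no exact
# counterpart; nano is the closest tier).
MIGRATION_MAP = {
    "gpt-5-mini": "gpt-5.4-mini",
    "gpt-5-nano": "gpt-5.4-nano",
    "gpt-4o-mini": "gpt-5.4-nano",
}

def migrate_model(name: str) -> str:
    """Return the GPT-5.4 equivalent, leaving unknown names untouched."""
    return MIGRATION_MAP.get(name, name)
```

Routing every model string through one function also gives you a single place to roll back if the new pricing doesn't pencil out for a given workload.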
Bottom Line
GPT-5.4 mini is the new default for most API workloads. It's fast, it's cheap relative to the flagship, and it handles coding tasks surprisingly well. Nano is the workhorse for high-volume, low-complexity tasks where every fraction of a cent matters.
The smart move: don't pick one. Build a router that sends each request to the cheapest model that can handle it. Your API bill will thank you.