GPT-5.4 Mini & Nano API Guide: Pricing, Benchmarks & Code Examples (2026)
OpenAI just dropped GPT-5.4 mini and GPT-5.4 nano, and they're worth paying attention to. Two weeks after the full GPT-5.4 launch, these smaller variants target the workloads where the flagship model is overkill — coding subagents, classification, data extraction, and anything where you're making thousands of API calls per hour.
Here's the thing: GPT-5.4 mini scores 54.38% on SWE-bench Pro. That's less than three percentage points behind the full GPT-5.4, at 70% lower cost per token. If you're still routing every request through the flagship model, you're burning money.
This guide covers everything you need to start using both models today — pricing, benchmarks, when to pick which, and working code you can copy-paste.
Pricing Breakdown
All prices per million tokens:
| Model | Input | Cached Input | Output | Context |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 | 1.05M |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | 400K |
| GPT-5.4 nano | $0.20 | $0.02 | $1.25 | 400K |
For context, here's how they stack up against the competition:
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Gemini 3.1 Flash | $0.50 | $3.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 |
| GPT-5.4 mini | $0.75 | $4.50 |
| GPT-5.4 nano | $0.20 | $1.25 |
GPT-5.4 nano undercuts Gemini 3.1 Flash-Lite on input price. That makes it the cheapest model from any major lab right now, at least on paper. Whether it's actually cheaper depends on how many output tokens your use case generates — nano's output pricing ($1.25/M) is slightly cheaper than Flash-Lite's ($1.50/M) too.
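Whether nano or Flash-Lite wins for you depends on your input/output token mix, so it's worth doing the arithmetic for your own traffic. Here's a back-of-the-envelope calculator with the prices hardcoded from the table above; `request_cost` is just an illustrative helper, not an SDK feature:

```python
# Per-million-token prices from the table above (USD).
PRICES = {
    "gpt-5.4":      {"input": 2.50, "cached": 0.25,  "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached": 0.075, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "cached": 0.02,  "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request. Cached tokens are billed at the
    discounted cached-input rate instead of the regular input rate."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached"]
            + output_tokens * p["output"]) / 1_000_000

# 10,000 requests, each with a 1,500-token prompt and a 300-token response:
mini = 10_000 * request_cost("gpt-5.4-mini", 1_500, 300)
nano = 10_000 * request_cost("gpt-5.4-nano", 1_500, 300)
```

For that workload the totals come out to roughly $24.75 on mini versus $6.75 on nano, which is the kind of gap that makes model choice worth automating.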
Benchmarks: What Can They Actually Do?
OpenAI's self-reported numbers (take with the usual grain of salt):
| Benchmark | GPT-5.4 | GPT-5.4 mini | GPT-5.4 nano |
|---|---|---|---|
| SWE-bench Pro | 57.2% | 54.38% | — |
| MMLU | 93.5% | 90.1% | 85.2% |
| HumanEval | 96.3% | 93.8% | 88.5% |
The standout number: mini at 54.38% on SWE-bench Pro. That benchmark tests models on real software engineering tasks — not toy problems, but actual GitHub issues from production repos. Three points behind the flagship for a fraction of the cost is a good trade for most workflows.
Nano is a different story. It's not trying to compete with mini on complex reasoning. OpenAI positions it for classification, extraction, ranking, and simple coding subagents. Think of it as the model you call 10,000 times per hour to label data or route requests, not the one you ask to refactor your authentication system.
Mini vs. Nano: When to Use Which
Here's a practical decision framework:
Use GPT-5.4 mini when:
- You need near-flagship coding ability at lower cost
- Your agent needs to reason through multi-step problems
- You're building a coding assistant or copilot feature
- Quality matters more than raw throughput
- You want thinking/reasoning mode on a budget
Use GPT-5.4 nano when:
- You're doing classification, tagging, or routing
- You need to process thousands of items per minute
- The task is well-defined with clear expected outputs
- You're building subagents that handle simple subtasks
- Cost per request matters more than peak intelligence
Still use the full GPT-5.4 when:
- You need the 1.05M token context window (mini/nano cap at 400K)
- You're doing complex multi-file refactoring
- Accuracy on the first attempt saves more than the token cost
- You need computer use capabilities
Code Examples
Python — Basic Chat Completion
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"  # or https://api.openai.com/v1
)

# GPT-5.4 mini — good for coding tasks
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a retry decorator with exponential backoff."}
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
```
Python — Nano for Batch Classification
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.kissapi.ai/v1"
)

async def classify(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-5.4-nano",
        messages=[
            {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral. Reply with one word only."},
            {"role": "user", "content": text}
        ],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

async def main():
    reviews = [
        "This product changed my workflow completely.",
        "Broke after two days. Waste of money.",
        "It works. Nothing special.",
        "Best purchase I've made this year.",
        "Customer support never responded."
    ]
    results = await asyncio.gather(*[classify(r) for r in reviews])
    for review, sentiment in zip(reviews, results):
        print(f"{sentiment:>10} | {review}")

asyncio.run(main())
```
At nano pricing, classifying those 5 reviews costs roughly $0.0003. Scale that to 100,000 reviews and you're looking at about $6.
curl — Quick Test
```shell
curl https://api.kissapi.ai/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-nano",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Node.js — Streaming with Mini
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.kissapi.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-5.4-mini",
  messages: [
    { role: "user", content: "Explain WebSocket connection pooling in Node.js" }
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
The Model Router Pattern
The real power of having three GPT-5.4 tiers isn't picking one — it's routing between them dynamically. A simple router can cut your API bill by 60-70% without noticeable quality loss.
The idea: classify the incoming request complexity, then route to the cheapest model that can handle it.
```python
def route_request(prompt: str, needs_reasoning: bool = False) -> str:
    """Pick the cheapest model that fits the task."""
    token_count = len(prompt.split()) * 1.3  # rough estimate
    if needs_reasoning or token_count > 300_000:
        return "gpt-5.4"  # full model for heavy lifting
    elif any(kw in prompt.lower() for kw in ["refactor", "debug", "architect", "review"]):
        return "gpt-5.4-mini"  # coding tasks
    else:
        return "gpt-5.4-nano"  # everything else
```
In practice, most production traffic falls into the nano bucket. Classification, extraction, formatting, simple Q&A — none of that needs mini-level intelligence. Save mini for the coding and reasoning tasks where the quality gap actually matters.
Cached Input Pricing: The Hidden Savings
Both mini and nano support prompt caching, and the savings are significant:
- Mini cached input: $0.075/M (90% off regular input)
- Nano cached input: $0.02/M (90% off regular input)
If you're sending the same system prompt with every request — which most apps do — caching alone can cut your input costs by 80-90%. For a coding assistant that sends a 2,000-token system prompt with every completion, that's real money at scale.
Caching kicks in automatically when OpenAI detects repeated prompt prefixes. No code changes needed.
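Because caching keys off the prompt prefix, prompt ordering matters: put the stable content (system prompt, style guides, few-shot examples) first and the per-request content last, so every call shares the same cacheable prefix. A minimal sketch of that ordering; `build_messages` and the constants are illustrative, not part of the SDK:

```python
SYSTEM_PROMPT = "You are a senior Python developer."
STYLE_GUIDE = "Prefer type hints. Keep functions under 30 lines."  # long, unchanging instructions

def build_messages(user_input: str) -> list[dict]:
    """Stable content first so repeated requests share a cached prefix;
    only the final user message varies between calls."""
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{STYLE_GUIDE}"},
        {"role": "user", "content": user_input},
    ]
```

If you instead interleave variable content (timestamps, request IDs) near the top of the prompt, the shared prefix shrinks and the cache discount mostly disappears.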
400K Context: Enough for Most Things
Both mini and nano support 400,000-token context windows. That's roughly 300,000 words, or about 600 pages of text. For reference, the full GPT-5.4 goes up to 1.05 million tokens.
When does the 400K limit actually matter? Rarely, for most developers. If you're processing entire codebases or very long documents, you might hit it. But for chat, coding assistance, and data processing, 400K is more than enough.
One thing to watch: extended context pricing. If your prompts exceed certain thresholds, per-token costs go up. Keep your prompts lean. Send only what the model needs to see.
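One way to keep prompts lean is a hard token budget on the context you attach, dropping the oldest chunks first. A rough sketch using a characters-divided-by-four token estimate (an approximation; exact counts require the model's tokenizer, and the helper name is ours):

```python
def trim_context(chunks: list[str], max_tokens: int = 300_000) -> list[str]:
    """Keep the most recent chunks that fit under a rough token budget.
    Uses len(text) // 4 as a crude token estimate."""
    kept: list[str] = []
    total = 0
    for chunk in reversed(chunks):  # walk newest-first
        est = len(chunk) // 4 + 1
        if total + est > max_tokens:
            break
        kept.append(chunk)
        total += est
    return list(reversed(kept))  # restore original order
```

Dropping oldest-first is a simple policy; for retrieval-heavy apps, ranking chunks by relevance before trimming usually beats pure recency.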
Mini with Thinking Mode
GPT-5.4 mini supports reasoning/thinking mode, where the model works through a problem step-by-step before answering. This is available to free and Go-tier ChatGPT users too, not just API customers.
For API usage, you can control reasoning effort:
```python
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": "Find the bug in this code..."}],
    reasoning_effort="high"  # low, medium, high, or xhigh
)
```
Higher reasoning effort = more thinking tokens = higher cost per request. But for debugging and architecture questions, the improved accuracy usually means fewer retries. Test both ways and measure.
Try GPT-5.4 Mini & Nano Today
Access GPT-5.4 mini, nano, and 100+ other models through one API. OpenAI-compatible endpoint, pay-as-you-go pricing, no subscription required.
Migration from Previous Models
If you're currently using GPT-5 mini or GPT-4o-mini, switching is straightforward — just change the model name:
- `gpt-5-mini` → `gpt-5.4-mini`
- `gpt-5-nano` → `gpt-5.4-nano`
- `gpt-4o-mini` → `gpt-5.4-nano` (closest equivalent)
The API format is identical. Same endpoints, same parameters, same response structure. The only breaking change: GPT-5.4 mini and nano are priced higher than their GPT-5 predecessors. Mini went from roughly $0.30/$1.20 to $0.75/$4.50. Nano went from $0.10/$0.40 to $0.20/$1.25. You're paying more, but you're getting substantially better models.
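If model names live in a config file or constants module, a small alias map keeps the migration to a one-line change per call site. The mapping below just mirrors the list above; the helper name is ours:

```python
# Old model name → GPT-5.4 replacement (gpt-4o-mini has no exact
# counterpart; nano is the closest tier).
MIGRATION_MAP = {
    "gpt-5-mini": "gpt-5.4-mini",
    "gpt-5-nano": "gpt-5.4-nano",
    "gpt-4o-mini": "gpt-5.4-nano",
}

def migrate_model(name: str) -> str:
    """Return the GPT-5.4 equivalent, leaving unknown names untouched."""
    return MIGRATION_MAP.get(name, name)
```

Routing every model string through one function also gives you a single place to roll back if the new pricing doesn't pencil out for a given workload.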
Bottom Line
GPT-5.4 mini is the new default for most API workloads. It's fast, it's cheap relative to the flagship, and it handles coding tasks surprisingly well. Nano is the workhorse for high-volume, low-complexity tasks where every fraction of a cent matters.
The smart move: don't pick one. Build a router that sends each request to the cheapest model that can handle it. Your API bill will thank you.