Claude Code vs Codex CLI vs Gemini CLI: Real API Costs Compared (2026)
Every developer I know is running at least one CLI coding agent now. Claude Code, Codex CLI, Gemini CLI — they've replaced the "ask ChatGPT and copy-paste" workflow entirely. You type a task in your terminal, the agent reads your codebase, writes code, runs tests, and commits. It's genuinely faster.
But here's the thing nobody talks about: these agents eat tokens like crazy. A single complex refactoring session can burn through $5-15 in API calls. Multiply that across a month of workdays and you're looking at $100-400/month, just on your coding assistant.
I tracked my actual API usage across all three CLI agents for two weeks. Here's what I found.
The Three Contenders
Quick context on each agent before we get into costs:
Claude Code uses Anthropic's Claude models (Sonnet 4.6 by default, Opus 4.6 for hard tasks). It's the current SWE-bench leader at 80.9%. Excellent at multi-file refactoring, understands project structure deeply, and can spawn sub-agents for parallel work. The $20/month Max subscription gives you a generous but limited allowance; after that, you're on API credits.
Codex CLI runs on OpenAI's GPT-5.4 (or GPT-5.3 in the Codex app). It has a 1.05 million token context window, second only to Gemini's 2 million. The new Tool Search feature lets it pull in relevant code without you specifying files. Strong at following instructions precisely, though it can be verbose in its reasoning.
Gemini CLI uses Google's Gemini 3.1 Pro. The standout feature is cost: at $2/$12 per million tokens (input/output), it has the lowest input price of the three frontier models, though GPT-5.4's $10 output rate is lower. It also has a 2 million token context window. The tradeoff is that it's noticeably weaker at complex multi-file refactoring compared to Claude or GPT-5.4.
Token Usage: What a Real Coding Session Looks Like
I ran each agent on the same set of tasks over two weeks: bug fixes, feature implementations, test writing, and refactoring. Here's the average token consumption per task type:
| Task Type | Claude Code | Codex CLI | Gemini CLI |
|---|---|---|---|
| Simple bug fix | ~15K in / 3K out | ~25K in / 5K out | ~20K in / 4K out |
| Feature (50-100 lines) | ~80K in / 15K out | ~120K in / 20K out | ~90K in / 18K out |
| Multi-file refactor | ~200K in / 40K out | ~300K in / 50K out | ~250K in / 45K out |
| Test suite generation | ~60K in / 25K out | ~80K in / 30K out | ~70K in / 28K out |
| Code review + fixes | ~100K in / 20K out | ~150K in / 25K out | ~120K in / 22K out |
A few patterns jump out. Codex CLI consistently uses more input tokens because GPT-5.4's Tool Search pulls in more context by default. Claude Code is the most token-efficient — it's better at identifying which files matter and ignoring the rest. Gemini CLI falls in between.
The output token counts are closer across all three, which makes sense — the amount of code you need written doesn't change much based on the model.
Monthly Cost Breakdown
Let's model a "typical active developer" month: 20 working days, averaging 8-10 coding agent interactions per day across a mix of task types. That works out to roughly 30M input tokens and 6M output tokens per month.
| Agent | Model | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|---|
| Claude Code | Sonnet 4.6 | $90 (30M × $3) | $90 (6M × $15) | $180 |
| Claude Code | Opus 4.6 | $450 (30M × $15) | $450 (6M × $75) | $900 |
| Codex CLI | GPT-5.4 | $75 (30M × $2.50) | $60 (6M × $10) | $135 |
| Codex CLI | GPT-5.4 Mini | $9 (30M × $0.30) | $7.80 (6M × $1.30) | $16.80 |
| Gemini CLI | Gemini 3.1 Pro | $60 (30M × $2) | $72 (6M × $12) | $132 |
The numbers tell a clear story. Gemini CLI and Codex CLI (with GPT-5.4) are neck-and-neck around $130-135/month. Claude Code with Sonnet 4.6 costs about 35% more at $180. And if you're running Opus 4.6 for everything — don't. That's $900/month and almost never necessary.
GPT-5.4 Mini is the wildcard at under $17/month, but it can't handle complex refactoring tasks that the frontier models breeze through. It's great for simple fixes and test generation.
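The table's arithmetic is easy to rerun with your own volumes. Here's a minimal Python sketch, assuming the per-million-token rates quoted in the table and volumes given in millions of tokens. The `RATES` keys and the `token_cost` helper are names made up for illustration, not part of any SDK.

```python
# Per-million-token prices (input, output) in USD, copied from the table above.
# The dictionary keys are illustrative labels, not verified API model IDs.
RATES = {
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6": (15.00, 75.00),
    "gpt-5.4": (2.50, 10.00),
    "gpt-5.4-mini": (0.30, 1.30),
    "gemini-3.1-pro": (2.00, 12.00),
}


def token_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Return the USD cost for a token volume given in millions of tokens."""
    in_rate, out_rate = RATES[model]
    return round(input_mtok * in_rate + output_mtok * out_rate, 2)


if __name__ == "__main__":
    # Reproduce the "typical active developer" month: 30M in, 6M out.
    for model in RATES:
        print(f"{model:15s} ${token_cost(model, 30, 6):,.2f}")
```

The same helper works per task. A multi-file refactor on Sonnet 4.6 at roughly 200K in / 40K out is `token_cost("sonnet-4.6", 0.2, 0.04)`, which comes to about $1.20.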
But Wait — Subscription Plans Change the Math
All three providers offer subscription tiers that include some agent usage:
- Claude Max ($20/mo) — Generous Sonnet 4.6 allowance, limited Opus. Most individual developers won't exceed this for casual use. Heavy users will blow through it in a few days.
- ChatGPT Pro ($200/mo) — Unlimited GPT-5.4 access including Codex. If you're a heavy user, this is actually cheaper than pay-per-token. But $200/month is a lot if you're not using it constantly.
- Gemini Advanced ($20/mo) — Includes Gemini CLI access with rate limits. Good value for lighter usage, but the rate limits can be frustrating during intense coding sessions.
The subscription approach works if you're locked into one ecosystem. The problem is that no single model is best at everything. Claude dominates multi-file refactoring. GPT-5.4 is strongest at following complex instructions. Gemini handles massive codebases better thanks to its 2M context window.
The BYOK Approach: Use All Three for Less
Here's what actually works: bring your own API key (BYOK) and route through a single endpoint. Instead of paying three subscriptions, you pay for exactly what you use across all models.
The setup takes about 5 minutes per agent:
Claude Code with Custom API
```bash
# Set your API proxy endpoint
export ANTHROPIC_BASE_URL=https://api.kissapi.ai
export ANTHROPIC_API_KEY=your-api-key

# Launch Claude Code as usual
claude
```
Codex CLI with Custom API
```bash
# Codex CLI respects OpenAI env vars
export OPENAI_BASE_URL=https://api.kissapi.ai/v1
export OPENAI_API_KEY=your-api-key

# Run Codex
codex "fix the auth middleware bug"
```
Gemini CLI with Custom API
```bash
# Gemini CLI supports OpenAI-compatible endpoints
export GEMINI_API_BASE=https://api.kissapi.ai/v1
export GEMINI_API_KEY=your-api-key

gemini
```
With this setup, you use one API key, one balance, and pick the right model for each task. Need to refactor a complex module? Route to Claude Sonnet 4.6. Writing boilerplate tests? Use GPT-5.4 Mini. Analyzing a massive monorepo? Gemini 3.1 Pro with its 2M context window.
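The "pick the right model for each task" step can be as simple as a lookup table. Here's a minimal sketch, assuming you tag each task with a category yourself. The category names and model ID strings are hypothetical placeholders (check your endpoint's model list for the real identifiers), and the mapping just mirrors the recommendations above.

```python
# Hypothetical task categories mapped to the models recommended above.
# The model ID strings are placeholders, not verified API identifiers.
ROUTES = {
    "refactor": "claude-sonnet-4.6",
    "boilerplate_tests": "gpt-5.4-mini",
    "monorepo_analysis": "gemini-3.1-pro",
}

# Assumption: unknown categories fall back to the model used for complex coding.
DEFAULT_MODEL = "claude-sonnet-4.6"


def pick_model(task_category: str) -> str:
    """Return the model for a task category, using the default when unknown."""
    return ROUTES.get(task_category, DEFAULT_MODEL)
```

A static table like this is easy to audit and change. Anything smarter, such as classifying tasks automatically, would itself cost tokens.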
My Actual Monthly Bill: The Router Strategy
After two weeks of tracking, I settled on this split:
- 60% of tasks → Claude Sonnet 4.6 (complex coding, refactoring)
- 25% of tasks → GPT-5.4 Mini (simple fixes, test generation, boilerplate)
- 15% of tasks → Gemini 3.1 Pro (large codebase analysis, documentation)
Monthly cost with this split: roughly $85-95. That's about 50% less than using Claude Sonnet for everything ($180), and more than 50% less than a single $200 ChatGPT Pro subscription.
The key insight: most coding tasks don't need a frontier model. A $0.30/M-input model handles 25% of your work just fine. Save the expensive models for when they actually matter.
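To model a split like this before committing to it, you can weight each model's full-month cost by its share of the work. This sketch assumes, unrealistically, that every task uses the same number of tokens, so a model's share of tasks equals its share of tokens. It reuses the monthly totals from the cost table, and the names `MONTHLY_TOTAL`, `SPLIT`, and `blended_cost` are illustrative.

```python
# Full-month cost per model at the 30M-in / 6M-out volume (from the cost table).
MONTHLY_TOTAL = {
    "sonnet-4.6": 180.00,
    "gpt-5.4-mini": 16.80,
    "gemini-3.1-pro": 132.00,
}

# Share of work routed to each model (the split listed above).
SPLIT = {
    "sonnet-4.6": 0.60,
    "gpt-5.4-mini": 0.25,
    "gemini-3.1-pro": 0.15,
}


def blended_cost(totals: dict[str, float], split: dict[str, float]) -> float:
    """Weight each model's full-month cost by its share of the workload."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("split shares must sum to 1")
    return round(sum(totals[m] * share for m, share in split.items()), 2)


if __name__ == "__main__":
    print(blended_cost(MONTHLY_TOTAL, SPLIT))
```

Run as-is, this prints 132.0. That is a modeled figure for the full 30M/6M volume under the equal-task-size assumption. A tracked bill will differ with real task sizes and total volume.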
Hidden Costs Nobody Mentions
Token pricing isn't the whole story. Watch out for these:
Extended thinking tokens. Claude Code and Codex CLI both support "thinking" modes where the model reasons before responding. These thinking tokens are billed like output tokens, and they add up fast. A single complex task with extended thinking can use 50K+ thinking tokens. At Opus's $75/M output rate, that's $3.75 just for the model to think.
Retry loops. When an agent writes code that fails tests, it retries. Each retry is a full new API call with the entire conversation context. I've seen single tasks balloon to 500K+ tokens because of retry loops. Set a max-retry limit (3 is usually enough).
Context stuffing. Agents that read your entire codebase into context on every call waste tokens. Claude Code is better about this than Codex CLI, which tends to pull in more files than necessary via Tool Search. Use ignore files such as .claudeignore or .codexignore to exclude irrelevant directories, and check that your agent version actually honors them.
Idle conversations. Leaving an agent session open and asking follow-up questions means the full conversation history gets sent with each message. Start fresh sessions for unrelated tasks.
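Retry loops and long-running sessions are expensive for the same reason: each call resends everything that came before. Here's a rough model, assuming a fixed starting context and a constant number of tokens added per turn. Real sessions are lumpier, and prompt caching, where your provider offers it, would lower the bill. The function and parameter names are illustrative.

```python
def cumulative_input_tokens(base_context: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every turn resends the full history.

    Turn k (0-indexed) sends base_context + k * added_per_turn tokens.
    """
    return sum(base_context + k * added_per_turn for k in range(turns))


def input_cost_usd(tokens: int, rate_per_mtok: float) -> float:
    """Convert a token count to USD at a per-million-token input rate."""
    return round(tokens / 1_000_000 * rate_per_mtok, 2)
```

For example, a task that starts with 100K tokens of context and adds 20K per attempt sends 100K + 120K + 140K + 160K = 520K input tokens across one attempt plus three retries. At Sonnet's $3/M input rate, `input_cost_usd(520_000, 3.0)` is about $1.56 before any output tokens are counted.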
Cost Optimization Cheat Sheet
- Use a model router. Route simple tasks to cheap models, complex tasks to frontier models. This alone cuts costs 40-60%.
- Set .ignore files. Exclude node_modules, build artifacts, and irrelevant directories from agent context.
- Limit retries. Cap automatic retries at 3. If the agent can't fix it in 3 tries, you need to rephrase the task.
- Start fresh sessions. Don't reuse long conversation threads for new tasks. The accumulated context costs tokens.
- Skip extended thinking for simple tasks. Only enable thinking mode for genuinely complex problems.
- Use Mini/Nano for boilerplate. Test generation, documentation, and simple CRUD don't need Opus or GPT-5.4.
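If you drive an agent from your own script, the retry cap from the list above can be enforced in a few lines. This is a sketch under assumptions: `attempt` is a hypothetical callback that runs the agent once and reports whether the tests passed. It does not correspond to any built-in setting of the three CLIs.

```python
from typing import Callable, Optional


def run_with_retry_cap(
    attempt: Callable[[int], bool], max_retries: int = 3
) -> Optional[int]:
    """Call attempt(n) until it succeeds, allowing at most max_retries retries.

    Returns the index of the successful attempt (0 is the first try),
    or None if every attempt failed and the task should be rephrased.
    """
    for n in range(max_retries + 1):  # one initial try plus the retries
        if attempt(n):
            return n
    return None
```

A `None` result is the signal to stop and rewrite the prompt instead of paying for another full-context call.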
Run All Three Agents Through One API
KissAPI gives you Claude, GPT-5.4, Gemini, and 200+ models through a single OpenAI-compatible endpoint. Pay per token, no subscriptions. Works with Claude Code, Codex CLI, Gemini CLI, Cursor, and every major IDE.
Which Agent Should You Pick?
If you can only pick one:
- Claude Code if code quality matters most. It writes the cleanest code, handles complex refactoring best, and uses tokens most efficiently. The SWE-bench scores back this up.
- Codex CLI if you need precise instruction following with a large context window. GPT-5.4's 1M context is useful for large projects, and the Tool Search feature is genuinely good.
- Gemini CLI if budget is the primary concern. At $2/$12 per million tokens, it's the cheapest frontier option in the monthly model above, though only a few dollars ahead of GPT-5.4. The 2M context window is unmatched. Just know it'll struggle more with complex multi-file changes.
If you can use all three (which I'd recommend): route tasks to the right model based on complexity. Your wallet will thank you.