Unified LLM API Gateway with Fallback Routing (2026): A Developer Guide
Every production LLM app hits the same wall eventually: a provider rate-limits you at the worst moment, or returns a 500 during a launch, and your feature goes dark. The fix isn't hope — it's a unified gateway plus fallback routing. One endpoint to reach every model, and logic that automatically fails over to a backup when the primary provider chokes.
This guide shows the pattern end to end: what a unified gateway is, how fallback routing works, and copy-pasteable Python and Node code that survives 429s.
- A unified LLM API gateway is one endpoint and one API key that routes to multiple providers such as Claude, GPT-5, and Gemini.
- LLM fallback routing retries a failed request on an alternate model when the primary returns a 429, a 5xx, or a timeout.
- Through an OpenAI-compatible gateway, only the model string changes between fallback attempts, so the retry code stays small.
- You should retry 429 and 5xx errors with exponential backoff, but never retry a 400 or 401, which are client errors that will fail again.
- KissAPI provides one OpenAI-compatible key for Claude, GPT-5, and Gemini, which makes cross-provider fallback a list of model names rather than three integrations.
What a unified gateway actually buys you
Without a gateway, "use three providers" means three SDKs, three auth schemes, three billing dashboards, and three sets of error shapes to handle. A unified, OpenAI-compatible gateway collapses that: one base URL, one key, and the standard chat-completions request/response format regardless of which model you target.
That uniformity is what makes fallback cheap. If every provider looks the same to your code, "try Claude, then GPT-5, then Gemini" is just iterating over a list of strings.
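To make that concrete, here is a minimal sketch of the request payload for each model in a chain. The helper name `build_request` is hypothetical, and the model names are simply the ones used later in this guide; the only point is that the payload is identical apart from the `model` field.

```python
from typing import Any

# Model names as used later in this guide; treat them as placeholders.
MODELS = ["claude-sonnet-5", "gpt-5", "gemini-3-pro"]


def build_request(model: str, messages: list[dict[str, str]], max_tokens: int = 1500) -> dict[str, Any]:
    """Build a chat-completions style payload (hypothetical helper)."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


messages = [{"role": "user", "content": "Say hello."}]
payloads = [build_request(m, messages) for m in MODELS]

# Strip the model field and every payload is the same dict.
without_model = [{k: v for k, v in p.items() if k != "model"} for p in payloads]
print(all(p == without_model[0] for p in without_model))  # True
```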
Which errors to retry (and which to never retry)
| Status | Meaning | Action |
|---|---|---|
| 429 | Rate limited | Back off, then fall back to next model |
| 500 / 502 / 503 | Provider error | Retry once, then fall back |
| 408 / timeout | Slow or dropped | Retry once, then fall back |
| 400 | Bad request | Do not retry; fix the payload |
| 401 / 403 | Auth / permission | Do not retry; fix the key |
The golden rule: retry transient failures, surface request and auth errors immediately. Retrying a 400 just burns latency and money on a request that will always fail.
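The table above can be expressed as a small lookup. This is a sketch, not a complete policy: the function name `classify_status` and the action labels are made up for illustration, and failing fast on any status the table does not list is an assumption.

```python
from typing import Optional

RETRYABLE_STATUS = {408, 429, 500, 502, 503}


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status (None means timeout or no response) to an action."""
    if status is None or status in RETRYABLE_STATUS:
        return "retry_then_fall_back"
    # 400, 401, 403 and anything the table does not cover: surface it
    # to the caller rather than guess (assumption for unlisted codes).
    return "fail_fast"


for status in (429, 503, None, 400, 401):
    print(status, classify_status(status))
```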
A minimal fallback router in Python
This keeps an ordered list of models, retries transient failures with exponential backoff, and moves to the next model when a provider is down.
```python
import os
import time

from openai import APIStatusError, APITimeoutError, OpenAI, RateLimitError

client = OpenAI(
    api_key=os.environ["KISSAPI_KEY"],  # same variable the Node example reads
    base_url="https://api.kissapi.ai/v1",
)

# Ordered by preference. Falls through on transient failure.
FALLBACK_CHAIN = ["claude-sonnet-5", "gpt-5", "gemini-3-pro"]
RETRYABLE_STATUS = {429, 500, 502, 503, 408}


def chat_with_fallback(messages, max_tokens=1500, attempts_per_model=2):
    last_err = None
    for model in FALLBACK_CHAIN:
        for attempt in range(attempts_per_model):
            try:
                return client.chat.completions.create(
                    model=model,
                    messages=messages,
                    max_tokens=max_tokens,
                )
            except (RateLimitError, APITimeoutError) as e:
                last_err = e
                time.sleep(2 ** attempt)  # 1s, 2s backoff
            except APIStatusError as e:
                last_err = e
                if e.status_code in RETRYABLE_STATUS:
                    time.sleep(2 ** attempt)
                else:
                    raise  # 400/401 etc: don't retry, don't fall back blindly
        # exhausted this model, move to the next one
    raise RuntimeError(f"All providers failed. Last error: {last_err}")


resp = chat_with_fallback(
    [{"role": "user", "content": "Draft a 2-sentence outage status update."}]
)
print(resp.choices[0].message.content)
```
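The `2 ** attempt` delay above keeps doubling if you raise `attempts_per_model`, and identical delays across many clients can line their retries up. A common refinement is to cap the delay and add random jitter. The router above does neither, and the helper `backoff_delay` and its parameters are hypothetical, so treat this as an optional sketch.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 8.0, jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # "Full jitter": pick a random point between 0 and the capped delay.
        delay = random.uniform(0, delay)
    return delay


# Without jitter the schedule is deterministic.
print([backoff_delay(a, jitter=False) for a in range(5)])  # [1.0, 2.0, 4.0, 8.0, 8.0]
```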
The same pattern in Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.KISSAPI_KEY,
  baseURL: "https://api.kissapi.ai/v1",
});

const FALLBACK_CHAIN = ["claude-sonnet-5", "gpt-5", "gemini-3-pro"];
const RETRYABLE = new Set([429, 500, 502, 503, 408]);
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

export async function chatWithFallback(messages, { maxTokens = 1500 } = {}) {
  let lastErr;
  for (const model of FALLBACK_CHAIN) {
    for (let attempt = 0; attempt < 2; attempt++) {
      try {
        return await client.chat.completions.create({
          model,
          messages,
          max_tokens: maxTokens,
        });
      } catch (err) {
        lastErr = err;
        const status = err?.status;
        if (status && !RETRYABLE.has(status)) {
          // Bad request, auth, or permission errors: surface immediately.
          if (status === 400 || status === 401 || status === 403) throw err;
          break; // non-retryable for this model, try next
        }
        await sleep(2 ** attempt * 1000); // 1s, 2s backoff
      }
    }
  }
  throw new Error(`All providers failed: ${lastErr}`);
}
```
Routing strategy: not everything should fail over the same way
Fallback is about availability, but smart routing is about cost and quality too. A few patterns worth combining:
- Availability fallback: Primary → backup on 429/5xx. The code above.
- Cost tiering: Route cheap, deterministic tasks (classification, extraction) to a lighter model, and reserve the frontier model for hard reasoning.
- Capability fallback: If a model refuses a category of task or truncates, fall back to one that handles it instead of failing the user.
- Latency budget: Set a per-request timeout so a slow provider triggers fallback instead of hanging your endpoint.
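Cost tiering can sit in front of the fallback chain as a lookup from task type to an ordered list of models. The sketch below is illustrative only: the task labels, the `light-model` name, and the `chain_for` helper are all hypothetical, and the other model names are the ones used earlier in this guide.

```python
# Hypothetical task -> model-chain table. "light-model" stands in for
# whichever cheaper model your gateway exposes.
CHAINS = {
    "classification": ["light-model", "gpt-5"],
    "extraction": ["light-model", "gpt-5"],
    "reasoning": ["claude-sonnet-5", "gpt-5", "gemini-3-pro"],
}
DEFAULT_CHAIN = CHAINS["reasoning"]


def chain_for(task: str) -> list[str]:
    """Return the ordered fallback chain for a task type."""
    # Copy so callers can't mutate the shared table.
    return list(CHAINS.get(task, DEFAULT_CHAIN))


print(chain_for("classification"))  # ['light-model', 'gpt-5']
print(chain_for("summarization"))   # ['claude-sonnet-5', 'gpt-5', 'gemini-3-pro']
```

The returned list can then be passed to the router in place of a single global chain, so cheap tasks still fail over but start from the cheaper model.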
Log the model that actually served each request, plus input tokens, output tokens, and latency. Without that, you can't tell whether fallback is quietly routing you to a pricier model far more often than you think.
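A sketch of what such a log record might look like, using only the standard library. The `build_log_record` helper and its field names are assumptions. `served` is meant to be the chain entry that actually succeeded, and the `usage` dict mirrors the `prompt_tokens` and `completion_tokens` fields of a chat-completions response.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm.router")


def build_log_record(requested: str, served: str, usage: dict, started: float, ended: float) -> dict:
    """Assemble one structured record per request (hypothetical field names)."""
    return {
        "requested_model": requested,
        "served_model": served,
        "fell_back": requested != served,
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "latency_ms": round((ended - started) * 1000),
    }


# Example with made-up numbers; in real code `usage` comes from the response.
start = time.monotonic()
record = build_log_record(
    requested="claude-sonnet-5",
    served="gpt-5",
    usage={"prompt_tokens": 42, "completion_tokens": 17},
    started=start,
    ended=start + 0.85,
)
log.info(json.dumps(record))
```

Counting records where `fell_back` is true, grouped by `served_model`, gives a quick read on how often the pricier backup is doing the work.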
Comparison: build vs framework vs unified gateway
| Approach | Setup effort | Best for | Main limitation |
|---|---|---|---|
| Hand-rolled per provider | High: 3 SDKs, 3 error shapes | Full control freaks | Most glue code to maintain |
| Self-hosted proxy (e.g. LiteLLM) | Medium: run and operate it | Teams wanting to own infra | You babysit the gateway |
| Hosted unified gateway (e.g. KissAPI) | Low: one key, one base URL | Shipping fast across providers | An extra hosted dependency |
If you want to own everything, a self-hosted proxy is fine. If you'd rather write the fallback logic once and point it at a single hosted endpoint that already speaks Claude, GPT-5, and Gemini, a unified gateway removes most of the setup. KissAPI is one such option: one OpenAI-compatible key, so the fallback chain above is literally just a list of model names.
Testing your fallback before you need it
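Don't wait for a real outage. Force failures in staging: set a very short request timeout so calls to the primary model time out and exercise the retry path. Two shortcuts are less useful than they look. A tiny max_tokens only truncates the response; it does not raise an error, so it will not trigger a retry. An invalid model name usually comes back as a 4xx (the exact status depends on the gateway), which the Python router above raises immediately rather than falling back, so use it to check the fail-fast path. Confirm the router advances to the next provider, that non-retryable errors still surface fast, and that your logs record which model served the request.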
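One way to exercise the routing logic without touching a real provider is to inject the call as a function and feed it scripted failures. The sketch below is a simplified stand-in for the routers above, not a drop-in replacement: `FakeStatusError`, `route`, and `scripted_call` are hypothetical names, and the backoff sleep is omitted so the check runs instantly.

```python
from typing import Callable

RETRYABLE_STATUS = {408, 429, 500, 502, 503}


class FakeStatusError(Exception):
    """Stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def route(call: Callable[[str], str], chain: list[str], attempts_per_model: int = 2) -> str:
    """Try each model in order; retry transient errors, re-raise the rest."""
    last_err = None
    for model in chain:
        for _ in range(attempts_per_model):
            try:
                return call(model)
            except FakeStatusError as err:
                if err.status_code not in RETRYABLE_STATUS:
                    raise  # client error: surface immediately
                last_err = err
    raise RuntimeError(f"All providers failed. Last error: {last_err}")


calls: list[str] = []


def scripted_call(model: str) -> str:
    """Primary is always rate limited; the second model answers."""
    calls.append(model)
    if model == "primary":
        raise FakeStatusError(429)
    return f"served by {model}"


result = route(scripted_call, ["primary", "backup"])
print(result)  # served by backup
print(calls)   # ['primary', 'primary', 'backup']
```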
One Key for Claude, GPT-5 and Gemini
KissAPI gives you an OpenAI-compatible endpoint so your fallback chain is just a list of model names. Start with $1 free credit and test the retry path on real traffic.
FAQ
What is a unified LLM API gateway?
It's a single endpoint and key that routes to multiple providers. You call Claude, GPT-5, or Gemini through one OpenAI-compatible interface instead of integrating each provider separately.
What is fallback routing?
A resilience pattern that retries a failed request on an alternate model when the primary returns a 429, 5xx, or timeout, keeping your app online during throttling and outages.
Which errors should I not retry?
Don't retry 400 (bad request) or 401/403 (auth). Those are client-side and will fail again. Retry 429, 5xx, and timeouts with exponential backoff, then fall back.