How to Build a Voice Agent with OpenAI Realtime API (2026): WebRTC, Tools & Cost Basics
If you're trying to add voice to a product in 2026, you're spoiled for choice. That's the good news. The bad news is that a lot of teams pick the wrong stack, spend too much, and then blame the model.
OpenAI's Realtime API is strong when you need a live, interruptible conversation. Think support agents, phone workflows, sales assistants, or a voice UI that needs to react fast. It is not automatically the best answer for meeting notes, voicemail summaries, or simple transcription. In those cases, a chained pipeline is usually easier to ship and cheaper to run.
This guide walks through the practical setup for an OpenAI Realtime API voice agent in 2026: when to use Realtime, how to mint session secrets safely, how to connect with WebRTC, and where teams usually shoot themselves in the foot.
First, decide if you really need Realtime
The current OpenAI docs point developers toward gpt-realtime-2 for low-latency voice agents. There are also narrower tools now, including gpt-realtime-whisper for live transcription and gpt-realtime-translate for live translation. That's important, because not every audio app needs full speech-to-speech reasoning.
| Use case | Best fit | Why |
|---|---|---|
| Live support or assistant calls | gpt-realtime-2 | Fast turn-taking, interruption handling, and tool calls in-session |
| Meeting notes or call summaries | Transcription + text model | Cheaper, simpler, and easier to debug |
| Live captions | gpt-realtime-whisper | Priced by minute and built for streaming transcription |
| Speech translation | gpt-realtime-translate | Purpose-built for real-time translation |
My rule: if the user can interrupt mid-sentence and expects the system to react naturally, pay the Realtime complexity tax. If not, don't.
The architecture that usually works
OpenAI's voice-agent docs break the browser flow into a few clean steps, and they're the right ones:
- Your backend creates an ephemeral client secret for a live session.
- Your frontend connects with WebRTC for browser audio. If you're doing server-side audio processing, use WebSocket.
- The session handles turns, interruptions, and tool calls inside the Realtime connection.
- Long-running work should stay asynchronous so the conversation doesn't freeze.
The security bit matters. Never put a long-lived OpenAI key in the browser. Ever. A short-lived session secret is the right pattern.
1) Create a session secret on your server
Here is the quickest curl test. It creates a Realtime session for gpt-realtime-2 with the cedar voice and some basic instructions.
```bash
curl https://api.openai.com/v1/realtime/client_secrets \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session": {
      "type": "realtime",
      "model": "gpt-realtime-2",
      "voice": "cedar",
      "instructions": "You are a concise support agent. Confirm IDs digit by digit. Use tools instead of guessing."
    }
  }'
```
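If the call succeeds, the JSON response carries the ephemeral secret plus an expiry. The exact field holding the secret has moved between API versions, which is why the browser code later in this guide checks a couple of shapes. The secret is the only thing you forward to the browser.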
For production, wrap that in a backend endpoint. Python with FastAPI is enough:
```python
import os

import httpx
from fastapi import FastAPI

app = FastAPI()


@app.post("/realtime/session")
async def create_realtime_session():
    # Session config lives server-side, so the browser never sees
    # anything except the short-lived secret in the response.
    payload = {
        "session": {
            "type": "realtime",
            "model": "gpt-realtime-2",
            "voice": "cedar",
            "instructions": (
                "You are a concise support agent. "
                "Read order numbers slowly. "
                "If account data is needed, call tools instead of making things up."
            ),
        }
    }
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    }
    async with httpx.AsyncClient(timeout=20) as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/client_secrets",
            headers=headers,
            json=payload,
        )
    response.raise_for_status()
    return response.json()
```
Keep this endpoint server-side. Add auth on your own app before you hand a session to the browser.
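If your backend happens to be Node rather than Python, the same gate is just as short. A minimal sketch, assuming Express; `requireUser` is a stand-in for whatever auth your app already has, not a real library function:

```ts
import express from "express";

const app = express();

// Stand-in for your app's real auth check: reject anonymous callers
// before minting anything.
function requireUser(
  req: express.Request,
  res: express.Response,
  next: express.NextFunction
) {
  if (!req.headers.authorization) {
    res.status(401).json({ error: "sign in first" });
    return;
  }
  next();
}

app.post("/realtime/session", requireUser, async (_req, res) => {
  const upstream = await fetch(
    "https://api.openai.com/v1/realtime/client_secrets",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        session: { type: "realtime", model: "gpt-realtime-2", voice: "cedar" },
      }),
    }
  );
  if (!upstream.ok) {
    res.status(502).json({ error: "could not mint session secret" });
    return;
  }
  res.json(await upstream.json());
});

app.listen(3000);
```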
2) Connect from the browser with WebRTC
OpenAI's current docs push browser voice apps toward WebRTC, which makes sense. You get lower-latency audio transport and less plumbing than building your own raw stream layer.
This minimal browser client uses the Realtime helpers from the Agents SDK:
```js
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

// Fetch a short-lived session secret from your own backend.
const tokenRes = await fetch("/realtime/session", { method: "POST" });
const tokenData = await tokenRes.json();

// The secret's location in the response has shifted across API
// versions, so check the likely shapes.
const ephemeralKey =
  tokenData.client_secret?.value ??
  tokenData.value ??
  tokenData.client_secret;

const agent = new RealtimeAgent({
  name: "Support Agent",
  instructions:
    "Be brief. Ask one question at a time. Repeat numbers slowly.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime-2",
});

// In the browser, connect() negotiates WebRTC and wires up the mic
// and audio output for you.
await session.connect({ apiKey: ephemeralKey });
```
That's the happy path. In a real app, you'll also want push-to-talk or turn detection, clear UI states for “listening” vs “speaking,” and explicit recovery when the connection drops. Voice UX falls apart fast when the user can't tell whether the system is waiting, thinking, or broken.
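To make those states concrete, here's a sketch of the shape that handling usually takes, continuing from the `session` object created above. The event names below are assumptions, so verify them against the `@openai/agents` version you're actually running; the point is that a single explicit UI state should follow the session everywhere:

```ts
type VoiceUiState = "listening" | "speaking" | "disconnected";

function setUiState(state: VoiceUiState) {
  // Swap your mic / waveform / error indicators here.
  document.body.dataset.voiceState = state;
}

// NOTE: these event names are assumptions -- confirm them against the
// SDK version you ship.
session.on("agent_start", () => setUiState("speaking")); // agent is talking
session.on("agent_end", () => setUiState("listening")); // user's turn
session.on("audio_interrupted", () => setUiState("listening")); // barge-in
session.on("error", async () => {
  setUiState("disconnected");
  // Simple recovery: mint a fresh secret and reconnect once.
  const res = await fetch("/realtime/session", { method: "POST" });
  const data = await res.json();
  await session.connect({ apiKey: data.client_secret?.value ?? data.value });
  setUiState("listening");
});
```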
3) Tools matter more than the voice
A smooth voice sounds nice. A useful voice agent closes the loop with tools. The newer Realtime updates added better function calling, asynchronous tool handling, and remote MCP server support. That's the difference between a toy demo and something a support or operations team can actually use.
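As a sketch of what that wiring looks like, here's a tool attached to the agent using the SDK's `tool` helper with zod parameters. `lookupOrder` and the `/api/orders` route are hypothetical stand-ins for your real backend:

```ts
import { RealtimeAgent, tool } from "@openai/agents/realtime";
import { z } from "zod";

// Hypothetical backend call -- swap in your real order service.
async function lookupOrder(orderId: string): Promise<{ status: string }> {
  const res = await fetch(`/api/orders/${orderId}`);
  return res.json();
}

const getOrderStatus = tool({
  name: "get_order_status",
  description: "Look up the current status of an order by its ID.",
  parameters: z.object({ orderId: z.string() }),
  execute: async ({ orderId }) => {
    const order = await lookupOrder(orderId);
    return `Order ${orderId} is ${order.status}.`;
  },
});

const agent = new RealtimeAgent({
  name: "Support Agent",
  instructions: "Use get_order_status instead of guessing about orders.",
  tools: [getOrderStatus],
});
```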
One practical pattern I like: keep the live conversation in Realtime, but offload slower tasks such as ticket creation, post-call summaries, or CRM drafting to normal text models. If the rest of your stack already needs multiple providers, routing those non-realtime jobs through KissAPI keeps the text side simpler without forcing your voice path into the same box.
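For example, a post-call summary never needs to touch the Realtime connection. A minimal sketch, assuming an OpenAI-compatible chat completions endpoint behind your gateway; `GATEWAY_URL` and the model name are placeholders, not real KissAPI values:

```ts
// Fire-and-forget: summarize a finished call on a cheap text model.
// GATEWAY_URL and the model name are placeholders for whatever your
// gateway actually exposes.
async function summarizeCall(transcript: string): Promise<string> {
  const res = await fetch(`${process.env.GATEWAY_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GATEWAY_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "your-cheap-text-model",
      messages: [
        {
          role: "system",
          content: "Summarize this support call in five bullet points.",
        },
        { role: "user", content: transcript },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```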
Realtime pricing: use the expensive thing only when it earns its keep
OpenAI's current pricing page makes the tradeoff pretty plain.
| API / model | Best for | Official pricing |
|---|---|---|
| gpt-realtime-2 | Full speech-to-speech agents | $32 / 1M audio input tokens, $64 / 1M audio output tokens |
| gpt-realtime-whisper | Live transcription | $0.017 per minute |
| gpt-realtime-translate | Live translation | $0.034 per minute |
The opinionated takeaway: if you only need captions or transcripts, don't pay full speech-to-speech rates. That's like renting a race car to pick up groceries.
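To put rough numbers on that (the tokens-per-minute figure here is an assumption, not an official rate): if a minute of input audio comes out to around 600 tokens, transcribing through gpt-realtime-2 costs about 600 × $32 / 1,000,000 ≈ $0.019 per minute before you've paid for a single output token, while gpt-realtime-whisper does the same job for a flat $0.017 per minute. The gap only widens once output audio enters the bill.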
Common mistakes that waste time
- Using Realtime for one-way transcription. It works, but it is usually the wrong tool.
- Shipping a permanent API key to the client. Session secrets exist for a reason.
- Blocking the conversation on slow tools. Use asynchronous calls and let the agent keep talking; see the sketch after this list.
- Ignoring interruption design. Voice apps need barge-in behavior, not just good model output.
- No fallback path. If live audio fails, drop to text chat or post-call handling instead of dead air.
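On the asynchronous-tools point, the fire-and-forget shape is worth spelling out. A sketch under the same assumptions as the tool example above; `createTicket` is a placeholder for your real ticketing call, and the exact async-tool semantics depend on your SDK version:

```ts
import { tool } from "@openai/agents/realtime";
import { z } from "zod";

// Placeholder for your real ticketing call.
declare function createTicket(summary: string): Promise<void>;

const fileTicket = tool({
  name: "file_ticket",
  description: "Open a support ticket for the current issue.",
  parameters: z.object({ summary: z.string() }),
  execute: async ({ summary }) => {
    // Start the slow work without awaiting it, so the agent keeps
    // talking while the ticket is created in the background.
    createTicket(summary).catch((err) => console.error("ticket failed:", err));
    return "Ticket creation started; let the user know it's underway.";
  },
});
```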
Another good production pattern is split responsibility: keep voice live on Realtime, and let the rest of your app use the cheapest decent text model for summaries, routing, and structured extraction. That reduces lock-in and keeps bills sane. Again, a gateway like KissAPI is handy there because your non-voice workloads don't all need to live with one vendor forever.
Need one API layer for the rest of your AI stack?
Use OpenAI Realtime where it shines, then route your chat, summaries, and multi-model text workloads through one simpler endpoint.
Final take
If you're building a real assistant that talks over the web or the phone, OpenAI Realtime is finally mature enough to take seriously. Just don't make it carry jobs that belong to cheaper APIs. Use WebRTC in the browser, keep your keys off the client, design for interruptions, and treat tools as the product, not decoration.
Do that, and your voice agent has a fair shot at feeling fast, useful, and worth the money.