OpenAI Codex Record & Replay Workflow Guide (2026): Reusable Agent Runs Without Surprise API Bills
On June 18, 2026, OpenAI shipped Codex app 26.616 with a feature that matters more than it looks: Record & Replay. The same update added thread handoff between local and remote hosts, while the June 18/19 Codex release notes added remote executor improvements, selected-plugin MCP activation, child-thread listing, external-agent import accounting, and rate-limit reset credit support.
That sounds like release-note soup. The useful version is simpler: coding agents are moving from “chat until something works” toward repeatable workflows. You demonstrate a task once, turn it into a reusable skill, move it between machines, and track the agent runs that branch off from it.
That’s powerful. It’s also a great way to burn tokens at scale if you don’t put guardrails around it. This guide shows how to design Codex-style reusable workflows the boring, production-friendly way: explicit inputs, model routing, token budgets, retries, and fallback paths.
The News Hook: Why This Update Matters
According to OpenAI’s Codex changelog, the June 18 Codex app update added Record & Replay, a macOS feature that turns a demonstrated workflow into a reusable skill. It also added thread handoff, so a Codex thread can move between a local project and a matching project on a connected remote host.
The GitHub release notes around the same window add more infrastructure detail: authenticated end-to-end encrypted Noise relay channels for remote executors, executor-native working directories and shells across boundaries, selected executor plugins activating their stdio MCP servers per thread, and app-server APIs for child threads, external-agent import results, and rate-limit reset credits.
My take: this is less about one shiny feature and more about Codex becoming an agent operating system. The winning teams won’t be the ones with the cleverest prompt. They’ll be the ones who can package recurring work into safe, observable runs.
What To Record, And What Not To Record
Record & Replay is best for workflows with a stable shape and variable inputs. Think “run our release checklist,” “triage a failing CI job,” or “generate a migration PR for this package bump.” It’s bad for vague research, one-off debugging, or anything that depends on hidden state in your desktop session.
| Good candidate | Why it works | Risk to handle |
|---|---|---|
| Dependency upgrade PR | Steps repeat across repos | Cap test retries and diff size |
| CI failure triage | Inputs are logs and changed files | Prevent endless reruns |
| Security review pass | Checklist can be stable | Require human approval before fixes |
| Release notes draft | Inputs come from commits | Verify issue links and versions |
| Exploratory product research | Too open-ended | Do it manually first |
The test I use is blunt: if you can describe the workflow as a short function signature, it’s probably recordable. If you can’t name the inputs and expected outputs, don’t automate it yet.
A Practical Workflow Contract
Before you replay an agent workflow, write down a contract. It doesn’t need to be fancy. It just needs to stop the agent from turning one task into a festival of side quests.
workflow: upgrade_package
inputs:
repo_path: string
package_name: string
target_version: string
max_test_runs: 2
max_changed_files: 12
outputs:
branch_name: string
summary: markdown
risk_notes: markdown
approval_required_for:
- deleting files
- changing database migrations
- modifying auth or billing code
budget:
max_input_tokens: 180000
max_output_tokens: 12000
max_wall_time_minutes: 20
This does two things. First, it gives the agent boundaries. Second, it gives you something to evaluate after the run. If a replay changes 47 files when the contract says 12, the workflow failed even if the diff compiles.
Model Routing For Replayed Coding Workflows
Not every step deserves the expensive model. A replayed workflow usually has a few cheap steps and one or two hard judgment steps.
| Step | Recommended model tier | Reason |
|---|---|---|
| Read package files | Fast/cheap | Mostly extraction |
| Summarize logs | Fast/cheap | Pattern matching |
| Plan migration | Strong reasoning | Needs architecture judgment |
| Edit code | Strong coding model | Correctness matters |
| Draft release note | Cheap or mid-tier | Low risk |
If you’re building this outside the Codex app, KissAPI can be useful as an OpenAI-compatible routing layer: keep one endpoint for your agent runner, then route lightweight steps to cheaper models and reserve premium models for the small number of decisions that actually need them.
Python: Add A Budget Gate Before Each Agent Call
Here’s a small pattern you can adapt for any agent runner. The point is not the exact token counter. The point is forcing every replay step through a budget check.
import time
from dataclasses import dataclass
@dataclass
class RunBudget:
max_input_tokens: int
max_output_tokens: int
max_seconds: int
used_input_tokens: int = 0
used_output_tokens: int = 0
started_at: float = time.time()
def allow(self, estimated_input: int, requested_output: int) -> None:
if time.time() - self.started_at > self.max_seconds:
raise RuntimeError("workflow budget exceeded: wall time")
if self.used_input_tokens + estimated_input > self.max_input_tokens:
raise RuntimeError("workflow budget exceeded: input tokens")
if self.used_output_tokens + requested_output > self.max_output_tokens:
raise RuntimeError("workflow budget exceeded: output tokens")
def call_agent_step(client, budget, model, messages, max_tokens):
estimated_input = sum(len(m["content"]) // 4 for m in messages)
budget.allow(estimated_input, max_tokens)
response = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
temperature=0.2,
)
usage = getattr(response, "usage", None)
if usage:
budget.used_input_tokens += usage.prompt_tokens
budget.used_output_tokens += usage.completion_tokens
return response.choices[0].message.content
Yes, this is unglamorous. That’s why it works. Agent systems fail in boring ways: unbounded retries, giant logs pasted into every call, and “just one more attempt” loops.
Node.js: Make Replays Idempotent
If a workflow can be replayed, retried, or handed off between hosts, treat each step as idempotent. Give it a stable key based on the workflow name, repo, target, and step number.
import crypto from "node:crypto";
function replayKey({ workflow, repo, target, step }) {
return crypto
.createHash("sha256")
.update(`${workflow}:${repo}:${target}:${step}`)
.digest("hex")
.slice(0, 24);
}
async function runStep({ client, workflow, repo, target, step, messages }) {
const key = replayKey({ workflow, repo, target, step });
const result = await client.chat.completions.create({
model: step.needsReasoning ? "gpt-5-5" : "claude-haiku-4-5",
messages,
max_tokens: step.maxTokens,
metadata: {
idempotency_key: key,
workflow,
step: step.name
}
});
return {
key,
text: result.choices[0].message.content,
usage: result.usage
};
}
If your provider or gateway supports first-class idempotency headers, use those instead of metadata. The idea is the same: a network retry should not silently create a second expensive agent branch.
Remote Handoff Changes The Failure Model
Thread handoff is useful because real work often starts on a laptop and finishes on a remote box with the right dependencies. But it adds failure cases:
- Path drift: the same repo lives at a different path on each host.
- Shell drift: zsh locally, bash remotely, PowerShell on Windows.
- Secret drift: local environment variables may not exist remotely.
- Tool drift: one host has the MCP server or plugin, the other does not.
- Cost drift: a remote replay can keep running after you stop watching.
The June Codex release notes directly address some of this with executor-native working directories, shells, permission paths, and selected-plugin MCP activation. Still, your workflow should not assume the host is identical. Add a preflight step.
set -euo pipefail
echo "repo=$(pwd)"
git status --short
node --version || true
python --version || true
which pytest || true
which npm || true
test -f package.json || test -f pyproject.toml || {
echo "No known project manifest found";
exit 1;
}
Let the agent read this output before it edits anything. It’s cheaper than letting it discover the environment by breaking it.
Where Rate-Limit Reset Credits Fit
The June release notes mention rate-limit reset credits in app-server clients, and OpenAI’s Codex changelog earlier in June described reset banking for Plus and Pro users. Don’t treat credits as architecture. Treat them as airbags.
A good replay system should still have:
- Queueing: don’t start five expensive replays because five tickets landed at once.
- Backoff: retry 429s with jitter, not instant loops.
- Fallback: switch non-critical steps to another model or endpoint.
- Human stop button: every replay run needs a visible cancel path.
For teams running agents through API gateways, a backup OpenAI-compatible endpoint is boring insurance. A setup like KissAPI can sit behind your agent runner for overflow or model fallback, especially when your main provider is rate-limited mid-run.
A Simple Replay Architecture
Here’s a clean shape for production agent workflows:
user request
-> workflow contract
-> preflight check
-> planner model
-> step queue
-> model router
-> tool executor
-> usage logger
-> human review gate
-> final summary
Notice what’s missing: blind autonomy. The agent can do a lot, but it shouldn’t own approvals for destructive operations, billing changes, authentication changes, or production deploys. Reusable workflows make good habits faster and bad habits catastrophic.
FAQ
What did OpenAI add to Codex on June 18, 2026?
OpenAI added Record & Replay for macOS, bulk automation run-history actions, and thread handoff between local and remote hosts in Codex app 26.616. The nearby Codex release notes also describe remote executor, plugin MCP, child-thread, external-agent import, and rate-limit credit improvements.
Should every coding workflow become a replayable skill?
No. Record workflows only after the inputs, outputs, and approval rules are clear. If a task still needs exploration, leave it as a normal agent thread until the pattern stabilizes.
How do I keep replayed agent workflows from getting expensive?
Use per-run token budgets, max wall-clock time, retry limits, model routing, idempotency keys, and usage logs. Also keep a fallback model or endpoint for non-critical steps so rate limits don’t force the whole workflow onto a premium model.
Build Agent Workflows Without Betting On One Route
Create a free account at kissapi.ai/register and run OpenAI-compatible fallback routes for coding agents, budget experiments, and overflow traffic.
Start Free