OpenAI Background Mode API Guide 2026: Run Long AI Tasks Without Timeouts

If you have ever pushed an AI request through a serverless function, you already know the pain: the model is still working, your proxy is impatient, the client disconnects, and now you don't know whether the task failed, kept running, or burned tokens for nothing.

OpenAI's Background Mode for the Responses API is built for exactly that class of work. Instead of forcing every request to finish inside one HTTP connection, you create a response with `background: true`, get an ID back, and check the response later. It's a small API change, but it changes how you should design agents, report generators, code-review bots, and any workflow that can run longer than a normal web request.

This guide shows when to use background mode, when not to, and how to wire it into a real app with polling, webhook-style state storage, retries, and OpenAI-compatible gateways like KissAPI.

What background mode actually does

The normal Responses API call is synchronous: your app sends input, the model works, and the final output comes back on the same connection. Streaming improves perceived latency, but it still depends on a live connection.

Background mode is different. You ask the API to start the job asynchronously. The first call returns a response object quickly. That object has an ID and a status. Your worker, backend, or frontend can retrieve it later until the job reaches a terminal state.

A typical lifecycle looks like this:

| Status | Meaning | What your app should do |
| --- | --- | --- |
| `queued` | The task has been accepted but hasn't started | Store the ID and poll later |
| `in_progress` | The model is working | Keep polling with backoff |
| `completed` | Output is ready | Read the output and mark your job done |
| `failed` | The request failed | Save the error and decide whether to retry |
| `cancelled` | The task was cancelled | Stop polling |

Names may vary slightly by SDK version, so don't hard-code only one happy path. Treat any unknown non-terminal status as "wait and retry" unless the API says otherwise.
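
For example, a tiny dispatcher that fails closed on unknown statuses (a sketch; adjust the names to whatever your SDK version actually reports):

```python
TERMINAL = {"completed", "failed", "cancelled"}

def next_action(status: str) -> str:
    """Decide what the worker should do with a polled status."""
    if status == "completed":
        return "read_output"
    if status in TERMINAL:
        return "record_error"   # failed or cancelled
    return "poll_again"         # queued, in_progress, or anything unexpected
```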

When you should use it

Background mode is not for every chat message. If a user asks "write a regex" or "summarize this paragraph," synchronous or streaming is simpler.

Use background mode when at least one of these is true: the task can outlive a normal HTTP request, the user doesn't need to watch tokens arrive live, or the result must survive client disconnects and worker restarts.

Good examples: repository-wide code review, PDF extraction and analysis, long research reports, test generation for a large codebase, lead enrichment, data-cleaning jobs, or agent workflows that may run tools several times.

Minimal curl example

Here's the basic shape. The important flag is `background: true`.

```bash
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.5",
    "background": true,
    "input": "Review this API design and list the top 10 production risks."
  }'
```

The response will include an `id`. Save it. Don't rely on the browser tab or request memory.

```json
{
  "id": "resp_abc123",
  "status": "queued"
}
```

Later, retrieve it:

```bash
curl https://api.openai.com/v1/responses/resp_abc123 \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

When `status` becomes `completed`, read the output from the response body.
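
In Python, the same check looks like this (`output_text` is the SDK's convenience accessor that joins the text parts of the output):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.retrieve("resp_abc123")
if resp.status == "completed":
    print(resp.output_text)          # concatenated text output
else:
    print("Not done yet:", resp.status)
```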

Python: a small production-ish polling loop

A toy example polls every second forever. A real app should use backoff, a deadline, and persistent storage. Here's a compact version you can adapt.

```python
import os
import time

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

TERMINAL = {"completed", "failed", "cancelled"}

def start_background_review(diff_text: str) -> str:
    response = client.responses.create(
        model="gpt-5.5",
        background=True,
        input=[{
            "role": "user",
            "content": (
                "Review this git diff. Focus on bugs, security issues, "
                "and missing tests. Return concise bullets.\n\n" + diff_text
            )
        }]
    )
    return response.id


def wait_for_response(response_id: str, max_wait_seconds: int = 300):
    delay = 2
    deadline = time.time() + max_wait_seconds

    while time.time() < deadline:
        response = client.responses.retrieve(response_id)
        status = response.status

        if status in TERMINAL:
            if status != "completed":
                raise RuntimeError(f"AI job ended with status={status}: {response}")
            return response

        time.sleep(delay)
        delay = min(delay * 1.5, 20)  # exponential backoff, capped at 20s

    raise TimeoutError(f"AI job {response_id} did not finish in time")


with open("diff.patch") as f:
    job_id = start_background_review(f.read())
print("Started", job_id)

result = wait_for_response(job_id)
print(result.output_text)
```

For local scripts, this is enough. For web apps, split `start_background_review` and `wait_for_response` into separate processes. Start from your API route, then let a worker poll.

Node.js: queue-friendly version

In a Node backend, the usual pattern is: create the background response, store the response ID in your database, then let a worker process update the record.

```javascript
import OpenAI from "openai";
import { db } from "./db"; // your ORM client (Prisma-style here; adjust to your stack)

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function createResearchJob(topic) {
  const response = await client.responses.create({
    model: "gpt-5.5",
    background: true,
    input: [
      {
        role: "user",
        content: `Write a technical research brief on: ${topic}. Include risks and citations when tools provide them.`
      }
    ]
  });

  await db.aiJob.create({
    data: {
      providerResponseId: response.id,
      status: response.status,
      topic
    }
  });

  return { jobId: response.id, status: response.status };
}

export async function pollResearchJob(job) {
  const response = await client.responses.retrieve(job.providerResponseId);

  await db.aiJob.update({
    where: { id: job.id },
    data: {
      status: response.status,
      output: response.status === "completed" ? response.output_text : job.output,
      lastCheckedAt: new Date()
    }
  });
}
```

Run `pollResearchJob` from BullMQ, Cloud Tasks, Sidekiq, a cron worker, or whatever queue you already trust. Don't invent a queue just for AI. Use the boring thing that wakes up reliably.

The timeout architecture I recommend

For most SaaS apps, I like this design:

  1. API route creates the job. It validates the user, starts the background response, stores `response_id`, and returns your own `job_id`.
  2. Worker polls. It checks pending jobs every few seconds at first, then backs off.
  3. Frontend polls your app, not OpenAI. The browser calls `/api/jobs/:id`, so your API key never leaves the server.
  4. Results are cached in your database. Once completed, the output is read from your DB, not repeatedly fetched from the provider.
  5. A watchdog expires stuck jobs. If a job sits too long, mark it as timed out and show a useful error. (Steps 2 and 5 are sketched below.)
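
Here is a minimal sketch of steps 2 and 5 together, assuming a hypothetical `db` layer with `pending_jobs()` and `update_job()` and a `created_at` timestamp on each job:

```python
from datetime import datetime, timedelta, timezone

JOB_TTL = timedelta(minutes=30)  # watchdog limit; tune per workload
TERMINAL = {"completed", "failed", "cancelled"}

def worker_tick(db, client):
    """One pass over pending jobs; run it from cron or your queue."""
    now = datetime.now(timezone.utc)
    for job in db.pending_jobs():
        # Step 5 (watchdog): expire jobs stuck past the TTL.
        if now - job.created_at > JOB_TTL:
            db.update_job(job.id, status="timed_out")
            continue

        # Step 2: poll the provider and mirror the status locally.
        response = client.responses.retrieve(job.provider_response_id)
        fields = {"status": response.status, "last_checked_at": now}
        if response.status == "completed":
            fields["output"] = response.output_text  # step 4: cache the result
        db.update_job(job.id, **fields)
```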

Idempotency: don't double-charge yourself

The biggest mistake with background jobs is treating "I didn't get a response" as "nothing happened." Network failures lie. Your create request may have succeeded even if your server timed out before receiving the ID.

Use an idempotency key or your own job table.

```python
import hashlib

def get_or_create_job(user_id: str, repo_sha: str, diff_sha: str) -> str:
    job_key = hashlib.sha256((user_id + repo_sha + diff_sha).encode()).hexdigest()

    existing = db.find_job_by_key(job_key)  # db is your own job table
    if existing:
        return existing.id

    # create the background response only once, then store it under job_key
    ...
```

If your provider supports idempotency headers, send one. If it doesn't, your local database key still prevents repeat work for common retry paths.
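
With the OpenAI Python SDK you can attach such a header to a single call via `extra_headers`. Whether the provider or gateway actually deduplicates on it is an assumption to verify in its docs:

```python
# Assumption: the provider honors Idempotency-Key. If it ignores the
# header, the local job_key lookup above is still your guard.
response = client.responses.create(
    model="gpt-5.5",
    background=True,
    input="Review this diff and list production risks.",
    extra_headers={"Idempotency-Key": job_key},
)
```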

Using background mode with an OpenAI-compatible gateway

If your stack already uses OpenAI SDKs with a custom base URL, the setup is usually just a base URL change. With KissAPI, for example, you keep the OpenAI-style client and route compatible models through one endpoint.

```python
from openai import OpenAI

client = OpenAI(
    api_key="KISSAPI_KEY",
    base_url="https://api.kissapi.ai/v1"
)

response = client.responses.create(
    model="gpt-5.5",
    background=True,
    input="Create a migration checklist for this API change."
)
```

The practical reason to use a gateway isn't magic pricing. It's operational flexibility. You can run GPT-style workloads, Claude-style workloads, and fallback models behind one API layer while keeping your app code boring.
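
As a concrete example, here is a hedged sketch of a fallback chain behind one gateway. The model IDs are illustrative; use whatever your gateway's catalog exposes:

```python
from openai import OpenAI, APIError

client = OpenAI(api_key="KISSAPI_KEY", base_url="https://api.kissapi.ai/v1")

# Illustrative model IDs; substitute the ones your gateway lists.
FALLBACK_CHAIN = ["gpt-5.5", "your-fallback-model"]

def create_background_job(prompt: str) -> str:
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            response = client.responses.create(
                model=model, background=True, input=prompt
            )
            return response.id
        except APIError as exc:  # provider rejected or errored; try the next model
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```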

Polling interval and cost control

Polling itself is cheap compared with model output, but waste is still waste. Use a schedule like this:

| Job age | Poll every |
| --- | --- |
| 0-30 seconds | 2-3 seconds |
| 30-120 seconds | 5-10 seconds |
| 2-10 minutes | 15-30 seconds |
| 10+ minutes | 60 seconds, or mark the job as delayed |
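
Encoded as a helper your worker can call (thresholds straight from the table above):

```python
def poll_delay(age_seconds: float) -> int:
    """Seconds to wait before the next poll, given the job's age."""
    if age_seconds < 30:
        return 3
    if age_seconds < 120:
        return 10
    if age_seconds < 600:
        return 30
    return 60  # 10+ minutes: slow down, or flag the job as delayed
```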

Also cap the model's output. A background task can run out of sight, so give it boundaries:

```json
{
  "model": "gpt-5.5",
  "background": true,
  "max_output_tokens": 2000,
  "input": "Summarize these logs and return only root cause, evidence, and fix."
}
```

For agents, add stop conditions in the prompt. "Keep working until done" is not a product spec. Tell the model what done means.
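
For example, spell the exit criteria out in the prompt itself (wording is illustrative):

```python
prompt = (
    "Audit this repository for security issues.\n"
    "Definition of done: every HIGH-severity finding is listed with file "
    "and line, or you state explicitly that none were found.\n"
    "Stop once that is true. Keep the report under 800 words."
)
```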

Should you stream or use background mode?

Use streaming when the user is waiting and partial output is useful. Use background mode when reliability matters more than watching tokens appear.

A code assistant chat should stream. A full repo audit should run in the background. A customer support answer should stream. A nightly analytics report should run in the background. If you remember that split, you won't overcomplicate your app.

Build with OpenAI-compatible APIs

Try KissAPI with $1 free credit. Use one API key for GPT, Claude, and other developer-friendly models without rewriting your SDK code.

Start Free →

Final Take

Background mode is not flashy. That is why I like it. It turns long AI calls into normal backend jobs with IDs, states, retries, and persistence. Move one timeout-prone workflow to background mode, add idempotency, poll from a worker, and your app gets calmer fast.