OpenAI Background Mode: Polling, Webhooks, and Job Status

Most teams understand why a long-running task should not sit inside a live request. Fewer teams design what happens after the task is handed off. That is where background mode usually breaks down in practice. The hard part is not creating the job. The hard part is deciding how the product will track it, surface it, retry it, and stop it from doing something consequential without review.

What matters first

Treat OpenAI background mode as a tracked job system, not as fire-and-forget async execution. A healthy design needs:

a visible job identifier,
a stable status model,
a clear result retrieval path,
retry logic tied to failure class,
and an approval boundary if completion is not the same thing as permission to act.

Without those, background mode just hides failure behind delay.

What the API docs give you

OpenAI’s background-mode documentation gives developers the runtime primitives: create a background response, poll the response object, cancel an in-flight response, and resume streaming only when the response was created with streaming enabled. That is enough to make a long-running model task survive a dropped client connection. It is not enough to make a production job trustworthy.
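
The polling primitive can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the documented client: `retrieve` is a hypothetical stand-in for whatever call fetches a response by ID, and the set of terminal status strings is assumed and should be checked against the current API reference.

```python
import time
from typing import Any, Callable, Mapping

# Statuses after which the provider response is assumed not to change again.
# Assumption: verify these strings against the current API reference.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "incomplete"}


def poll_until_terminal(
    retrieve: Callable[[str], Mapping[str, Any]],
    response_id: str,
    interval_s: float = 2.0,
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll a background response until it reaches a terminal status.

    `retrieve` is a hypothetical callable that takes a response ID and
    returns a mapping with at least a "status" key.
    """
    waited = 0.0
    while True:
        response = retrieve(response_id)
        if response["status"] in TERMINAL_STATUSES:
            return response
        if waited >= max_wait_s:
            # Give up locally; the provider-side job may still be running.
            raise TimeoutError(
                f"{response_id} still {response['status']} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

Injecting `retrieve` and `sleep` keeps the loop testable without network access. A local timeout here only stops the watcher; it does not cancel the provider-side work, which is one reason the product needs its own job state.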

API primitive	Product layer you still need
Background response ID	Internal job ID tied to user, workspace, billing, and support lookup
Queued or in-progress status	User-facing status copy, timeout expectations, and operator visibility
Retrieve response	Durable result storage, evidence bundle, and partial-failure handling
Cancel response	Product cancellation state, idempotent UI behavior, and downstream cleanup
Stream cursor	Reconnect behavior, event replay rules, and audit logs
Terminal response state	Review state, delivery state, retry rule, and business completion status

The practical build question is not “where is the documentation?” It is “which product states wrap the provider response so users and operators know what happened?”
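
One way to picture that wrapping is an internal job record that owns the provider response ID instead of being identified by it. The sketch below is illustrative only: the field names, the `needs_attention` fallback, and the provider status strings in `STATUS_MAP` are assumptions, not part of any documented schema.

```python
from dataclasses import dataclass

# Assumed provider status strings mapped to product-level states.
STATUS_MAP = {
    "queued": "queued",
    "in_progress": "running",
    "completed": "completed",
    "failed": "failed",
    "incomplete": "failed",
    "cancelled": "cancelled",
}


@dataclass
class ProductJob:
    job_id: str                 # internal ID used by support and billing lookups
    workspace_id: str
    user_id: str
    provider_response_id: str   # the background response this job wraps
    state: str = "queued"


def apply_provider_status(job: ProductJob, provider_status: str) -> ProductJob:
    """Update the product state from a provider status.

    Unknown statuses are surfaced as "needs_attention" rather than guessed.
    """
    job.state = STATUS_MAP.get(provider_status, "needs_attention")
    return job
```

Routing unrecognized statuses to a visible state means a provider-side change shows up on an operator dashboard instead of silently stalling the job.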

Polling versus webhooks

Polling is usually enough when:

job volume is low to moderate,
the product already has a task view or operator dashboard,
the cost of a slightly stale status is low,
and you want a simpler first implementation.

Polling is not a crude or second-rate choice. It is often the cleanest answer for early production systems where operators already expect to refresh or revisit the task state.

Webhooks matter more when:

job volume is high,
task completion must trigger follow-up work quickly,
you need immediate downstream orchestration,
or status freshness is part of the product promise.

Webhooks reduce delay, but they also increase event-handling complexity. Use them because the workflow requires that complexity, not because they sound more advanced.
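
Much of that complexity comes from duplicate and out-of-scope deliveries. The following is a minimal sketch of at-most-once dispatch keyed by event ID. The event shape (`id`, `type`, `data.id`) and the event type names are assumptions for illustration; signature verification and durable storage of seen IDs are deliberately left out.

```python
from typing import Any, Callable, Mapping, Set

# Assumed event type names; confirm against the provider's webhook reference.
HANDLED_TYPES = {
    "response.completed",
    "response.failed",
    "response.cancelled",
    "response.incomplete",
}


class WebhookDispatcher:
    """Dispatch each webhook event at most once, keyed by event ID."""

    def __init__(self, on_terminal: Callable[[str, str], None]) -> None:
        # In-memory for the sketch; a real system would persist seen IDs.
        self._seen: Set[str] = set()
        self._on_terminal = on_terminal

    def handle(self, event: Mapping[str, Any]) -> str:
        event_id = event["id"]
        if event_id in self._seen:
            return "duplicate"
        if event["type"] not in HANDLED_TYPES:
            return "ignored"
        # Run the handler first so a crash leaves the event eligible for redelivery.
        self._on_terminal(event["type"], event["data"]["id"])
        self._seen.add(event_id)
        return "processed"
```

Marking an event as seen only after the handler succeeds trades possible double handling for never losing an event, so the handler itself should be idempotent.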

The job states that actually matter

The exact labels may evolve, but the product-level states should be something close to:

queued,
running,
completed,
failed,
awaiting review,
expired or abandoned.

The important design point is that “completed” should not always mean safe to deliver or execute. A long-running task can complete technically and still require human review before publication, sending, or state mutation.
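
The states above can be pinned down as an explicit transition table so that illegal jumps fail loudly. This is a sketch under assumptions: the allowed transitions are one plausible reading of the list, not a prescribed model, and "expired or abandoned" is collapsed into a single `EXPIRED` state.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    AWAITING_REVIEW = "awaiting_review"
    EXPIRED = "expired"


# Allowed moves between product-level states (illustrative assumption).
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED, JobState.EXPIRED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.EXPIRED},
    # Technical completion may still hand off to a human review step.
    JobState.COMPLETED: {JobState.AWAITING_REVIEW},
    JobState.AWAITING_REVIEW: {JobState.EXPIRED},
    JobState.FAILED: {JobState.QUEUED},  # re-queue when a retry is scheduled
    JobState.EXPIRED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```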

Why “completed” is not the end of the workflow

This is the mistake teams make most often. They equate background completion with workflow completion. In real systems, the background job may only produce:

a draft,
a report,
a recommendation,
a bundle of extracted fields,
or a proposed action.

The real workflow may still need:

citation review,
approval,
customer-facing quality checks,
or policy validation.

That is why the status model should distinguish job completion from business completion.
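
A small sketch of that distinction: the job's technical completion is one flag, and business completion additionally requires every configured check to have passed. The class name and check names such as `citation_review` are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class WorkItem:
    job_done: bool                   # the background job finished technically
    required_checks: FrozenSet[str]  # e.g. {"citation_review", "approval"}
    passed_checks: Set[str] = field(default_factory=set)

    def outstanding(self) -> Set[str]:
        """Checks that still block delivery."""
        return set(self.required_checks) - self.passed_checks

    def business_complete(self) -> bool:
        """True only when the job is done and no check is outstanding."""
        return self.job_done and not self.outstanding()
```

For example, an item with `job_done=True` and an unpassed `approval` check reports `business_complete()` as false, so delivery code has a single gate to consult.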

Result handling should be durable

Once the job finishes, the result should not live only inside a transient event path. The product should preserve:

the final payload,
relevant tool traces or evidence,
the completion timestamp,
failure or retry history,
and the next required action.

That lets operators reopen the work later without having to reconstruct what happened.
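
As an illustration, the preserved fields can be grouped into one serializable record. The record name, field names, and JSON encoding below are assumptions made for the sketch; any durable store with an equivalent shape would serve.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class JobResultRecord:
    job_id: str
    final_payload: Dict[str, Any]
    evidence: List[Dict[str, Any]] = field(default_factory=list)   # tool traces, sources
    completed_at: Optional[str] = None                             # ISO 8601 timestamp
    attempt_history: List[Dict[str, Any]] = field(default_factory=list)
    next_action: str = "none"                                      # e.g. "review"

    def to_json(self) -> str:
        """Serialize for storage; sorted keys keep stored output stable."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "JobResultRecord":
        """Rebuild the record from its stored JSON form."""
        return cls(**json.loads(raw))
```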

Retry logic by failure class

Do not use one retry rule for every failure. At minimum, separate:

transient infrastructure or network failures,
model or tool timeouts,
malformed downstream inputs,
policy failures,
and review-required outcomes.

Only the first two normally deserve automatic retry. The rest usually need inspection or a different workflow path.
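
The rule above can be sketched as a small decision function. The attempt limit, base delay, and exponential backoff are illustrative assumptions, not recommended values.

```python
from enum import Enum
from typing import Optional, Tuple


class FailureClass(Enum):
    TRANSIENT = "transient"              # infrastructure or network failure
    TIMEOUT = "timeout"                  # model or tool timeout
    MALFORMED_INPUT = "malformed_input"
    POLICY = "policy"
    REVIEW_REQUIRED = "review_required"


# Only these classes are retried automatically.
AUTO_RETRY = {FailureClass.TRANSIENT, FailureClass.TIMEOUT}


def next_step(
    failure: FailureClass,
    attempt: int,                # 1-based number of the attempt that just failed
    max_attempts: int = 3,
    base_delay_s: float = 5.0,
) -> Tuple[str, Optional[float]]:
    """Return ("retry", delay_seconds) or ("escalate", None)."""
    if failure in AUTO_RETRY and attempt < max_attempts:
        # Exponential backoff: 5s, 10s, 20s, ... with the default base delay.
        return ("retry", base_delay_s * 2 ** (attempt - 1))
    return ("escalate", None)
```

Everything outside `AUTO_RETRY`, and any retryable failure that has used up its attempts, goes to inspection instead of looping.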

User-visible progress matters

Background systems feel reliable when users can tell:

the job exists,
it is still progressing,
it finished,
or it stopped and needs intervention.

That sounds obvious, but many products still hide long-running work behind “processing” states that mean nothing operationally.

A practical design rule

Use polling first unless the workflow is genuinely event-sensitive. Add webhooks when:

the task needs immediate follow-on orchestration,
scale makes polling wasteful,
or completion freshness changes user or operator outcomes materially.

Either way, keep the product state model consistent. The transport mechanism is not the operating model.

Compare next

Background mode, ZDR, and retention: Use this page when enterprise data-control requirements decide whether background mode is allowed for the workload.

OpenAI background mode for long-running AI tasks: Use this page for the higher-level runtime boundary: which work belongs in a long-running async lane at all.

Background report generator case: See polling, cancellation, retry, and review states applied to a concrete long-running research report workflow.

OpenAI Batch API vs background mode: Use this page when the real question is backlog throughput versus one tracked long-running product task.

Human escalation thresholds for deep research: Use this page when background jobs may finish technically but still need a review gate before delivery.

Tool timeouts, retries, and idempotency: Use this page when background-mode work expands into tool orchestration and retry behavior becomes a control problem.

Source note

This page was refreshed against OpenAI’s background mode guide, including background response creation, polling, cancellation, streaming, and stored response-state limits. The page focuses on the product job lifecycle that should surround those API primitives.