
Prompt injection defenses for tool-using agents

Prompt injection defense starts with architecture, not wording.

The minimum viable defense is:

  • treat tool outputs and retrieved content as untrusted;
  • restrict which tools the agent may call from untrusted contexts;
  • require approval before side-effecting actions;
  • and use explicit allowlists for browsing, execution, or system actions.
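The four points above can be sketched as a single control-plane gate. This is a minimal sketch with hypothetical tool names: side-effecting tools require both trusted context and explicit approval, and anything not on an allowlist is denied by default.

```python
# Minimal sketch of the control-plane gate described above. Tool names
# and the approval flag are hypothetical.

READ_ONLY_TOOLS = {"search_docs", "summarize"}
SIDE_EFFECT_TOOLS = {"send_email", "write_file"}

def allow_tool_call(tool: str, context_trusted: bool, approved: bool) -> bool:
    """Return True only if the call passes every control-plane check."""
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in SIDE_EFFECT_TOOLS:
        # Untrusted context never unlocks a side-effecting tool, and even
        # trusted context still needs an explicit approval.
        return context_trusted and approved
    return False  # allowlist, not blocklist: unknown tools are rejected
```

The key design choice is the final `return False`: the gate enumerates what is permitted rather than trying to enumerate what is dangerous.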

If the system relies mainly on “the model should ignore malicious instructions,” the defense is weak.

Tool-using agents now read:

  • web pages,
  • documents,
  • tickets,
  • code repositories,
  • and tool responses that may contain attacker-controlled text.

That means the model is no longer only interpreting user input. It is interpreting untrusted operational content that can try to redirect tool use or policy behavior.
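One way to make "untrusted operational content" concrete is provenance tagging. In this sketch (names are illustrative), every context item records where it came from, and only operator- or user-authored items may carry instructions; anything fetched at runtime is data only.

```python
from dataclasses import dataclass

# Sketch of provenance tagging: runtime-fetched content (web, files,
# tool output) can never be promoted to instruction-bearing status.

@dataclass(frozen=True)
class ContextItem:
    text: str
    origin: str  # e.g. "system", "user", "web", "file", "tool"

INSTRUCTION_ORIGINS = {"system", "user"}

def carries_instructions(item: ContextItem) -> bool:
    return item.origin in INSTRUCTION_ORIGINS
```

The tag travels with the content, so downstream layers (planning, tool routing) can enforce the distinction instead of relying on the model to infer it.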

| Official source | Current signal | Why it matters |
| --- | --- | --- |
| Computer use guide | OpenAI explicitly calls out prompt injection risk and recommends allowlists for expected websites | Browser-facing agents need control-plane restrictions, not only prompt instructions |
| MCP authorization specification | Authorization structure remains a separate layer around tool access | Tool connectivity does not remove the need for strict permission and approval design |
| OpenAI Agents SDK | Guardrails, tools, and handoffs are framework-level concepts | Injection defense has to be expressed at runtime and orchestration layers too |

Prompt injection usually enters through:

  • web search results,
  • browsed pages,
  • uploaded files,
  • retrieved knowledge chunks,
  • and tool output that includes attacker-controlled text.

The risk is not only bad prose. It is that the agent changes plan, tool choice, or action scope because it treated untrusted content as instructions.
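One counter to plan redirection is to pin the tool scope before any untrusted content is read. A minimal sketch, with hypothetical names: the orchestrator freezes the allowed tool set at plan time, so a plan rewritten by injected text cannot widen it mid-run.

```python
# Sketch: the tool scope is frozen before untrusted content is ingested;
# later calls outside that scope fail even if the model's plan changed.

class PinnedPlan:
    def __init__(self, allowed_tools: set[str]) -> None:
        self.allowed_tools = frozenset(allowed_tools)  # frozen at plan time

    def check(self, tool: str) -> None:
        if tool not in self.allowed_tools:
            raise PermissionError(
                f"tool {tool!r} is outside the scope pinned before "
                "untrusted content was read"
            )
```

Scope expansion then becomes an explicit, auditable event (re-planning with fresh approval) rather than something injected text can trigger silently.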

System instructions, user instructions, and tool content should not be treated as the same authority.
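This authority separation can be made explicit in code. A sketch, with an illustrative ordering: tool and retrieved content sits at the bottom and can inform answers, but can never relax a setting established at a higher level.

```python
from enum import IntEnum

# Sketch of an explicit authority ordering for context sources.

class Authority(IntEnum):
    TOOL_CONTENT = 0   # web pages, files, tool output: data only
    USER = 1           # end-user requests
    SYSTEM = 2         # operator policy

def may_override(requester: Authority, set_by: Authority) -> bool:
    """A setting may only be changed by an equal or higher authority."""
    return requester >= set_by
```

Under this rule, a "disable your safety checks" string inside a fetched page compares as `TOOL_CONTENT` against `SYSTEM` and is rejected mechanically, with no judgment call left to the model.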

Untrusted context should not unlock broad write-capable tools.

Any meaningful external side effect should require review or explicit confirmation.
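An explicit confirmation step can be as simple as a pending-action queue. This is a hypothetical structure, not a specific framework's API: side-effecting actions are recorded as proposals and can only execute after a human approves them out of band.

```python
# Sketch of a human-approval queue: model output can propose an action
# but can never execute it directly.

class ApprovalQueue:
    def __init__(self) -> None:
        self._pending: list[dict] = []

    def propose(self, action: str, args: dict) -> int:
        self._pending.append({"action": action, "args": args, "approved": False})
        return len(self._pending) - 1

    def approve(self, action_id: int) -> None:
        # Called by a human reviewer, never by the model.
        self._pending[action_id]["approved"] = True

    def execute(self, action_id: int) -> str:
        entry = self._pending[action_id]
        if not entry["approved"]:
            raise PermissionError(f"{entry['action']!r} needs explicit approval")
        return f"executed {entry['action']}"
```

The separation matters because the approval path runs outside the model loop: injected text can at most add a proposal, which a reviewer then sees verbatim.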

Especially for browser and computer-use workflows, the safest design is to limit reachable domains or action classes.
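A domain allowlist for browsing can be sketched in a few lines. The domains here are placeholders; the check accepts an allowlisted host or its subdomains and denies everything else, including redirects the plan never mentioned.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: only these domains (and their subdomains)
# are reachable from the browsing tool.
ALLOWED_DOMAINS = {"docs.example.com", "internal.example.com"}

def may_browse(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Exact match or true subdomain; note the leading "." prevents
    # lookalike hosts such as docs.example.com.evil.net from matching.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

The suffix check with a leading dot is the detail worth copying: a bare `endswith(d)` would wave through `evildocs.example.com`-style lookalikes.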

Tools should be specific enough that even a manipulated plan has limited blast radius.
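The blast-radius point is easiest to see as a contrast. In this illustrative sketch, a broad execution tool is simply not exposed, while the narrow tool's schema bounds the worst case: one comment on one ticket, nothing else.

```python
# Illustrative contrast between a broad tool and a narrow one.

MAX_COMMENT_LEN = 2000

def run_shell(command: str) -> str:
    # Broad: arbitrary execution. Not exposed to the agent at all.
    raise PermissionError("broad execution tool is not exposed to the agent")

def append_ticket_comment(ticket_id: int, body: str) -> dict:
    # Narrow: cannot delete, reassign, email, or touch the filesystem.
    if len(body) > MAX_COMMENT_LEN:
        raise ValueError("comment too long")
    return {"ticket": ticket_id, "comment": body}
```

Even a fully manipulated plan that reaches `append_ticket_comment` produces one attacker-visible comment, which is recoverable, rather than an arbitrary command, which may not be.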

These are weak by themselves:

  • longer prompts that say “ignore malicious instructions”;
  • generic safety statements with no runtime enforcement;
  • broad tools with no approval layer;
  • or post hoc logging with no prevention.

They may help, but they do not meaningfully change the control boundary.