
Prompt injection defenses for tool-using agents

Prompt injection defense starts with architecture, not wording.

The minimum viable defense is:

  • treat tool outputs and retrieved content as untrusted;
  • restrict which tools the agent may call from untrusted contexts;
  • require approval before side-effecting actions;
  • and use explicit allowlists for browsing, execution, or system actions.
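The four points above can be sketched as a single control-plane gate. This is a minimal sketch with hypothetical tool names: side-effecting tools require both trusted context and explicit approval, and anything not on an allowlist is denied by default.

```python
# Minimal sketch of the control-plane gate described above. Tool names
# and the approval flag are hypothetical.

READ_ONLY_TOOLS = {"search_docs", "summarize"}
SIDE_EFFECT_TOOLS = {"send_email", "write_file"}

def allow_tool_call(tool: str, context_trusted: bool, approved: bool) -> bool:
    """Return True only if the call passes every control-plane check."""
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in SIDE_EFFECT_TOOLS:
        # Untrusted context never unlocks a side-effecting tool, and even
        # trusted context still needs an explicit approval.
        return context_trusted and approved
    return False  # allowlist, not blocklist: unknown tools are rejected
```

The key design choice is the final `return False`: the gate enumerates what is permitted rather than trying to enumerate what is dangerous.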

If the system relies mainly on “the model should ignore malicious instructions,” the defense is weak.

Tool-using agents now read:

  • web pages,
  • documents,
  • tickets,
  • code repositories,
  • and tool responses that may contain attacker-controlled text.

That means the model is no longer only interpreting user input. It is interpreting untrusted operational content that can try to redirect tool use or policy behavior.
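One way to make "untrusted operational content" concrete is provenance tagging. In this sketch (names are illustrative), every context item records where it came from, and only operator- or user-authored items may carry instructions; anything fetched at runtime is data only.

```python
from dataclasses import dataclass

# Sketch of provenance tagging: runtime-fetched content (web, files,
# tool output) can never be promoted to instruction-bearing status.

@dataclass(frozen=True)
class ContextItem:
    text: str
    origin: str  # e.g. "system", "user", "web", "file", "tool"

INSTRUCTION_ORIGINS = {"system", "user"}

def carries_instructions(item: ContextItem) -> bool:
    return item.origin in INSTRUCTION_ORIGINS
```

The tag travels with the content, so downstream layers (planning, tool routing) can enforce the distinction instead of relying on the model to infer it.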

| Official source | Current signal | Why it matters |
| --- | --- | --- |
| Computer use guide | OpenAI explicitly calls out prompt injection risk and recommends allowlists for expected websites | Browser-facing agents need control-plane restrictions, not only prompt instructions |
| MCP authorization specification | Authorization structure remains a separate layer around tool access | Tool connectivity does not remove the need for strict permission and approval design |
| OpenAI Agents SDK | Guardrails, tools, and handoffs are framework-level concepts | Injection defense has to be expressed at runtime and orchestration layers too |

Prompt injection usually enters through:

  • web search results,
  • browsed pages,
  • uploaded files,
  • retrieved knowledge chunks,
  • and tool output that includes attacker-controlled text.

The risk is not only bad prose. It is that the agent changes plan, tool choice, or action scope because it treated untrusted content as instructions.
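One counter to plan redirection is to pin the tool scope before any untrusted content is read. A minimal sketch, with hypothetical names: the orchestrator freezes the allowed tool set at plan time, so a plan rewritten by injected text cannot widen it mid-run.

```python
# Sketch: the tool scope is frozen before untrusted content is ingested;
# later calls outside that scope fail even if the model's plan changed.

class PinnedPlan:
    def __init__(self, allowed_tools: set[str]) -> None:
        self.allowed_tools = frozenset(allowed_tools)  # frozen at plan time

    def check(self, tool: str) -> None:
        if tool not in self.allowed_tools:
            raise PermissionError(
                f"tool {tool!r} is outside the scope pinned before "
                "untrusted content was read"
            )
```

Scope expansion then becomes an explicit, auditable event (re-planning with fresh approval) rather than something injected text can trigger silently.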

System instructions, user instructions, and tool content should not be treated as the same authority.
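This authority separation can be made explicit in code. A sketch, with an illustrative ordering: tool and retrieved content sits at the bottom and can inform answers, but can never relax a setting established at a higher level.

```python
from enum import IntEnum

# Sketch of an explicit authority ordering for context sources.

class Authority(IntEnum):
    TOOL_CONTENT = 0   # web pages, files, tool output: data only
    USER = 1           # end-user requests
    SYSTEM = 2         # operator policy

def may_override(requester: Authority, set_by: Authority) -> bool:
    """A setting may only be changed by an equal or higher authority."""
    return requester >= set_by
```

Under this rule, a "disable your safety checks" string inside a fetched page compares as `TOOL_CONTENT` against `SYSTEM` and is rejected mechanically, with no judgment call left to the model.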

Untrusted context should not unlock broad write-capable tools.

Any meaningful external side effect should require review or explicit confirmation.
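An explicit confirmation step can be as simple as a pending-action queue. This is a hypothetical structure, not a specific framework's API: side-effecting actions are recorded as proposals and can only execute after a human approves them out of band.

```python
# Sketch of a human-approval queue: model output can propose an action
# but can never execute it directly.

class ApprovalQueue:
    def __init__(self) -> None:
        self._pending: list[dict] = []

    def propose(self, action: str, args: dict) -> int:
        self._pending.append({"action": action, "args": args, "approved": False})
        return len(self._pending) - 1

    def approve(self, action_id: int) -> None:
        # Called by a human reviewer, never by the model.
        self._pending[action_id]["approved"] = True

    def execute(self, action_id: int) -> str:
        entry = self._pending[action_id]
        if not entry["approved"]:
            raise PermissionError(f"{entry['action']!r} needs explicit approval")
        return f"executed {entry['action']}"
```

The separation matters because the approval path runs outside the model loop: injected text can at most add a proposal, which a reviewer then sees verbatim.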

Especially for browser and computer-use workflows, the safest design is to limit reachable domains or action classes.
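A domain allowlist for browsing can be sketched in a few lines. The domains here are placeholders; the check accepts an allowlisted host or its subdomains and denies everything else, including redirects the plan never mentioned.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: only these domains (and their subdomains)
# are reachable from the browsing tool.
ALLOWED_DOMAINS = {"docs.example.com", "internal.example.com"}

def may_browse(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Exact match or true subdomain; note the leading "." prevents
    # lookalike hosts such as docs.example.com.evil.net from matching.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

The suffix check with a leading dot is the detail worth copying: a bare `endswith(d)` would wave through `evildocs.example.com`-style lookalikes.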

Tools should be specific enough that even a manipulated plan has limited blast radius.
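The blast-radius point is easiest to see as a contrast. In this illustrative sketch, a broad execution tool is simply not exposed, while the narrow tool's schema bounds the worst case: one comment on one ticket, nothing else.

```python
# Illustrative contrast between a broad tool and a narrow one.

MAX_COMMENT_LEN = 2000

def run_shell(command: str) -> str:
    # Broad: arbitrary execution. Not exposed to the agent at all.
    raise PermissionError("broad execution tool is not exposed to the agent")

def append_ticket_comment(ticket_id: int, body: str) -> dict:
    # Narrow: cannot delete, reassign, email, or touch the filesystem.
    if len(body) > MAX_COMMENT_LEN:
        raise ValueError("comment too long")
    return {"ticket": ticket_id, "comment": body}
```

Even a fully manipulated plan that reaches `append_ticket_comment` produces one attacker-visible comment, which is recoverable, rather than an arbitrary command, which may not be.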

These are weak by themselves:

  • longer prompts that say “ignore malicious instructions”;
  • generic safety statements with no runtime enforcement;
  • broad tools with no approval layer;
  • or post hoc logging with no prevention.

They may help, but they do not meaningfully change the control boundary.