# Prompt injection defenses for tool-using agents
## Quick answer

Prompt injection defense starts with architecture, not wording.
The minimum viable defense is:
- treat tool outputs and retrieved content as untrusted;
- restrict which tools the agent may call from untrusted contexts;
- require approval before side-effecting actions;
- and use explicit allowlists for browsing, execution, or system actions.
If the system relies mainly on “the model should ignore malicious instructions,” the defense is weak.
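Taken together, the four points above amount to a deny-by-default policy check that runs outside the model. A minimal sketch, assuming hypothetical tool names and trust levels (nothing here comes from a specific framework):

```python
from enum import Enum

class Trust(Enum):
    SYSTEM = 0
    USER = 1
    UNTRUSTED = 2  # tool outputs, retrieved documents, web content

# Illustrative policy table: which tools a step may call, keyed by the
# lowest-trust content that influenced that step.
TOOL_POLICY = {
    Trust.USER:      {"search", "read_file", "send_email"},
    Trust.UNTRUSTED: {"search", "read_file"},  # no write-capable tools
}

SIDE_EFFECTING = {"send_email"}  # actions with external side effects

def allowed(tool: str, context_trust: Trust, approved: bool = False) -> bool:
    """Deny by default; side-effecting tools additionally need approval."""
    if tool not in TOOL_POLICY.get(context_trust, set()):
        return False
    if tool in SIDE_EFFECTING and not approved:
        return False
    return True
```

The key property is that the check is enforced in code, so a manipulated plan cannot talk its way past it.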
## Why this matters now

Tool-using agents now read:
- web pages,
- documents,
- tickets,
- code repositories,
- and tool responses that may contain attacker-controlled text.
That means the model is no longer only interpreting user input. It is interpreting untrusted operational content that can try to redirect tool use or policy behavior.
## Official signals checked April 15, 2026

| Official source | Current signal | Why it matters |
|---|---|---|
| Computer use guide | OpenAI explicitly calls out prompt injection risk and recommends allowlists for expected websites | Browser-facing agents need control-plane restrictions, not only prompt instructions |
| MCP authorization specification | Authorization structure remains a separate layer around tool access | Tool connectivity does not remove the need for strict permission and approval design |
| OpenAI Agents SDK | Guardrails, tools, and handoffs are framework-level concepts | Injection defense has to be expressed at runtime and orchestration layers too |
## Where injection actually enters

Prompt injection usually enters through:
- web search results,
- browsed pages,
- uploaded files,
- retrieved knowledge chunks,
- and tool output that includes attacker-controlled text.
The risk is not only bad prose. It is that the agent changes plan, tool choice, or action scope because it treated untrusted content as instructions.
## The strongest defenses

### 1. Trust-boundary separation

System instructions, user instructions, and tool content carry different levels of authority and should not be treated interchangeably.
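In practice this means keeping the channels structurally separate when the prompt is assembled, instead of concatenating everything into one string. A minimal sketch, using role names in the style of common chat APIs (the exact role labels are an assumption; the point is that tool output never lands in an instruction-bearing channel):

```python
def build_prompt(system: str, user: str, tool_outputs: list[str]) -> list[dict]:
    """Keep authorities in separate channels rather than one flat string."""
    msgs = [
        {"role": "system", "content": system},  # highest authority
        {"role": "user", "content": user},
    ]
    for out in tool_outputs:
        # Tool output is data: it rides in its own role, never merged
        # into the system or user channel.
        msgs.append({"role": "tool", "content": out})
    return msgs
```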
### 2. Tool restrictions

Untrusted context should not unlock broad write-capable tools.
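One enforcement pattern is to shrink the tool set the model can even see once untrusted content enters the context, rather than asking the model to ignore injected instructions. A sketch with illustrative tool names:

```python
# Illustrative set of write-capable tools; real systems would derive this
# from tool metadata rather than a hardcoded list.
WRITE_TOOLS = {"send_email", "create_ticket", "run_shell"}

def exposed_tools(all_tools: set[str], saw_untrusted: bool) -> set[str]:
    """Once untrusted content is in context, stop exposing write tools."""
    return all_tools - WRITE_TOOLS if saw_untrusted else set(all_tools)
```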
### 3. Approval gates

Any meaningful external side effect should require review or explicit confirmation.
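A common way to express this is a wrapper that routes every side-effecting call through an approver before executing. The `approver` callback here is hypothetical; in a real system it might be a human-in-the-loop prompt or a policy service:

```python
def with_approval(fn, approver):
    """Gate a side-effecting tool behind an approval callback."""
    def gated(*args, **kwargs):
        if not approver(fn.__name__, args, kwargs):
            raise PermissionError(f"{fn.__name__} denied by approver")
        return fn(*args, **kwargs)
    return gated
```

Because denial raises before `fn` runs, a manipulated plan cannot produce the side effect; at worst it produces a denied request in the audit trail.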
### 4. Allowlists

Especially for browser and computer-use workflows, the safest design is to limit reachable domains or action classes.
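For browsing, the allowlist check is small enough to sketch directly. The hosts below are placeholders; note that an exact-host match is deliberately strict, since subdomain wildcards widen the attack surface and would need explicit handling:

```python
from urllib.parse import urlsplit

# Placeholder allowlist of expected sites for this workflow.
ALLOWED_HOSTS = {"docs.example.com", "intranet.example.com"}

def url_allowed(url: str) -> bool:
    """Exact-host allowlist check performed before any navigation."""
    host = urlsplit(url).hostname or ""
    return host.lower() in ALLOWED_HOSTS
```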
### 5. Narrow action design

Tools should be specific enough that even a manipulated plan has limited blast radius.
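The contrast is between a broad tool like an arbitrary shell and a narrow one with a fixed verb, a validated object, and a bounded effect. A sketch with a hypothetical ticketing action (the function, its parameters, and the 2000-character bound are all illustrative):

```python
# Broad tool: if the plan is manipulated, the blast radius is arbitrary.
def run_shell(cmd: str):
    ...

# Narrow tool: only ever appends one bounded comment to one ticket.
def append_comment(ticket_id: int, body: str) -> dict:
    """Hypothetical narrow action with input validation at the boundary."""
    if not isinstance(ticket_id, int) or len(body) > 2000:
        raise ValueError("argument out of bounds")
    return {"ticket": ticket_id, "comment": body}
```

Even if injected text steers the agent toward `append_comment`, the worst outcome is one comment on one ticket, not an arbitrary command.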
## What does not count as enough

These are weak by themselves:
- longer prompts that say “ignore malicious instructions”;
- generic safety statements with no runtime enforcement;
- broad tools with no approval layer;
- or post hoc logging with no prevention.
They may help, but they do not meaningfully change the control boundary.