Defending Agents Against Prompt Injection

The moment an LLM can do something — call a tool, send an email, move money, change a record — the text it reads stops being content and starts being a potential instruction. That is the whole problem of prompt injection in one sentence. An attacker doesn't need to breach any infrastructure; they just need their words to reach the model in a context where the model can act.

Why it's hard

Classic injection attacks (SQL, command, XSS) have a clean fix: separate code from data. Parameterise the query, and the attacker's input can never become an instruction. Prompt injection resists this because, for a language model, there is no reliable boundary between instructions and data — they are the same medium. A system prompt and a malicious web page the agent fetched are both just tokens. The model has no built-in notion of privilege.

So defence is not about finding the one boundary to enforce. It's about reducing what a successful injection can achieve, and making sure that when one lands, you can see it.

The attack classes that actually land

Direct injection

The user is the attacker. They type instructions that override the system prompt — "ignore previous instructions and…". Easy to demo, and still effective against naïvely-built agents that concatenate user input straight into the prompt.

Indirect injection

The dangerous one. The agent retrieves content — a web page, a document, an email, a calendar invite — and that content carries instructions the attacker planted. The user never sees it; the agent reads it and acts. Any agent with retrieval or tool-use is exposed here, and it's the class most teams underestimate.

Tool-output poisoning

An agent calls a tool, and the tool's response (an API payload, a search result) contains injected instructions that steer the next step. Multi-agent systems compound this: one agent's output is another's input.

The layered defence

No single control is sufficient. Each layer below assumes the previous one can fail.

1. Isolate untrusted input

Never blend user or retrieved content into the same trust context as your instructions. Keep system instructions structurally separate from data, mark retrieved content as untrusted, and don't let the model treat fetched text as authority. This doesn't stop injection — but it removes the easiest wins.

2. Gate capabilities, not just prompts

The single highest-leverage control: an agent should only be able to do what its current task legitimately requires. If a summarisation agent has no need to send email, it shouldn't hold that capability — so a successful injection telling it to email private data has nothing to call. Scope tools per task. Make destructive actions require explicit, separately-authorised steps. A policy document can't enforce this; a runtime at the call site can.

3. Check the output before it acts

Between the model's decision and the action, insert a check: is this action in-policy for this agent, this user, this context? Does it match a known-bad pattern (exfiltration, privilege escalation, unexpected recipient)? This is where a fast, deterministic policy check earns its place — it's the last gate before consequence.

4. Keep a human in the loop where it matters

For high-consequence actions, don't fully automate. Surface the action, the reasoning, and the inputs to a person who can approve or reject. The goal isn't to slow everything down — it's to make sure the irreversible things have a checkpoint.

5. Log every decision, tamper-evidently

You will not catch every injection. So make sure that when one gets through, you can prove what happened: what the agent saw, what it decided, what it called, and why. An append-only, tamper-evident audit trail turns an incident from "we think something went wrong" into a reviewable record. For regulated environments, this isn't optional — it's the difference between an explainable event and an unexplainable one.

The shape of a real defence

Put together, the pattern is: untrusted input stays isolated, the agent can only reach for capabilities its task needs, every action is checked at the call site against policy, the consequential ones pause for a human, and all of it is logged so it can be audited later. No layer is perfect. The point is that an attacker has to defeat all of them, and you can see the attempt.

This is the thinking behind the runtime SecuRight is building — policy enforcement, capability gating, and a tamper-evident audit on every agent call. It's in development; this writing is where we work the ideas out in the open.