Risk type: Prompt Injection

Indirect prompt injection

Indirect prompt injection happens when an AI system ingests untrusted content (like a web page or document) that contains hidden instructions that the model follows.

Quick answer

The fastest way to reduce AI risk is to control what can be typed, pasted, and uploaded in the browser. Combine governance (approved tools and data boundaries) with browser-layer enforcement. When users browse unknown destinations as part of AI workflows, isolation reduces endpoint exposure by running web content in an isolated container and streaming only rendered output; sessions are deleted after use.

When you need this

Employees paste internal data into AI prompts to move faster.
You need policy enforcement in the browser, not just training and documents.
You want to allow AI productivity while preventing sensitive data loss.

Last updated

2026-01-29

Affected tools

RAG assistants
Browser-based AI agents
AI copilots with tool access
Automations that read web content

How it usually happens in the browser

An AI workflow fetches web pages, emails, tickets, or docs as context for an answer.
The fetched content includes hidden or subtle instructions aimed at the model.
The model treats those instructions as higher priority than the user’s request or system rules.
If the agent has tools, it may browse, click, summarize, or exfiltrate data based on those instructions.
The user trusts the output because it appears to be sourced from legitimate content.

What traditional defenses miss

Teams assume “reading the web” is safe for AI tools, but the web is adversarial.
Hidden instructions can be embedded in HTML/CSS, comments, or low-visibility text.
Tool-enabled agents make indirect injection more dangerous because they can take actions, not just answer questions.
Standard web security controls don’t account for instruction-following vulnerabilities in LLM workflows.

Mitigation checklist

Sanitize and label retrieved content; strip hidden text and isolate content from instructions.
Constrain tool access and use allowlists for retrieval sources where possible.
Add “instruction filtering” layers that treat retrieved text as data, not directives.
Require human confirmation for actions that change state (send email, approve access, modify settings).
Continuously test indirect injection vectors in your retrieval and agent pipelines.

How isolation helps

Isolation reduces endpoint exposure when users browse unknown pages that they plan to feed into AI tools.
It provides a safer boundary for web exploration and investigation by running content in isolated containers and deleting sessions afterward.
Isolation complements AI-side mitigations by reducing the overall risk of interacting with untrusted web content.

See Legba for Chrome

FAQs

Why is it called “indirect”?

Because the user doesn’t paste the malicious instruction directly. The AI picks it up indirectly from content it retrieves or reads.

Does this require model jailbreaks?

Not necessarily. It often works through normal instruction-following behavior when the model can’t reliably separate trusted instructions from untrusted content.

Is this only a risk for agents?

Agents amplify the risk, but even simple RAG Q&A systems can leak sensitive data or produce harmful outputs if they ingest adversarial content.

What’s the simplest control?

Treat retrieved content as untrusted data: sanitize it, constrain tools, and require human approval for high-risk actions.