Prompt injection has been OWASP #1 LLM vulnerability for two consecutive editions and the attack vector in the majority of real enterprise AI incidents. This is the technical guide - attack mechanics, real kill chains, measured success rates, and the defense architecture that actually lowers exposure.
Prompt injection is not a new vulnerability. It was named by Simon Willison in September 2022 after Riley Goodside's demonstration against GPT-3. It has been OWASP's top-ranked LLM vulnerability in both the 2023 and 2025 editions of the LLM Top 10. And despite four years of awareness, it remains the attack vector behind the majority of real enterprise AI security incidents — because it exploits a fundamental architectural property of LLMs that safety training cannot eliminate.
The UK's National Cyber Security Centre (NCSC) published an official assessment in December 2025 that LLMs are "inherently confusable deputies" – systems that can be made to act against their principals, because there is no good internal distinction between trusted instructions and untrusted data. That is not a model quality problem. That is an architecture constraint of how transformers handle sequences of tokens. Instructions and data occupy the same input space. The model cannot reliably distinguish them. That is why prompt injection cannot be solved at the model layer, and why infrastructure-layer enforcement is the only defence that is robust under adversarial pressure.
An LLM treats all input as tokens: there is no structural difference between a system prompt, a user message and a document read from external storage. All of them enter the same context window as same kind of tokens. Safety training teaches the model to follow some patterns and to reject some requests, but a well-designed adversarial input can cancel this training. The research cited in the International AI Safety Report 2026 shows that a very well trained adversary can circumvent even a well defended frontier model in a small number of tries – the report describes this as an open problem with no current model-layer solution. No model, no matter how well-aligned, is immune.
Prompt injection is not one attack, but an attack class of four variants that vary in how it is delivered, whether it is detectable and what type of defenses are needed. The technical mechanics, a real-world example and what it looks like in production can be found in each tab below.
The attacker has direct access to the input field, i.e. the user message, and injects text to override the system prompt or the safety training. Typical examples are: "Ignore all previous instructions", DAN (Do Anything Now) jailbreaks, role-playing prompts. The direct injection is the most studied and the most detectable: more than 70% of the filtered environments can detect it. It represents less than 20% of the attacks described in enterprise.
Why direct injection is becoming rare is not because defenders are winning, but because attackers have been moving to less detectable ones. Direct injection still works on unprotected systems, the attacker just needs direct input access.
Input classification — semantic analysis of user messages for jailbreak patterns, instruction-override language, and role-playing constructs. This is the one variant where input-only inspection is effective.
System prompt protection — defending against system prompt extraction via output inspection. A model that reveals its system prompt in response to a crafted input creates the context for further attacks.
Detection rate in filtered environments: 70%+ (SQ Magazine, 2026). The remaining 30% require output inspection to catch — the model bypasses input filters but the malicious output is still detectable before transmission.
A malicious instruction is hidden in the content retrieved and processed by the LLM – documents, emails, web pages, calendar invites, tool responses, database entries etc. The attacker never interacts with the AI system directly. The LLM retrieves and processes the content as trusted input and executes the hidden instruction. The attack comes through a trusted retrieval path and so is not filtered by the user message analysis input filters.
Indirect injection now accounts for over 55% of observed attacks in 2026. In enterprise environments, 62% of successful exploits involved indirect pathways. It has 20–30% higher success rates than direct injection because it exploits the LLM's trust in retrieved content.
Treat all retrieved content as adversarial — the critical principle. Documents, emails, web pages, and tool responses must be inspected with the same scrutiny as user messages. Input-only guardrails that only check user messages are blind to this attack class.
Content provenance tracking — classify retrieved content by source and apply appropriate trust levels. An internal company document has different trust than a vendor-supplied PDF or a publicly retrieved web page.
Instruction detection in retrieved content — semantic classification of whether retrieved text contains imperative language, override commands, or disclosure suppression instructions.
The attacker gathers context across multiple conversation turns – each one being harmless – and performs the harmful action. The Foot-In-The-Door (FITD) method: small requests turn into big ones. The model context window gathers the information of the previous turns and builds a frame which the attacker uses. There is no filter activated by any message, the attack is spread through the session.
Cisco's State of AI Security 2026 found multi-turn attacks achieved 92% success across 8 open-weight models. FITD specifically achieved 94% across 7 models in controlled enterprise testing. Single-turn monitoring misses all of this.
Session-level intent monitoring — track the semantic trajectory of a conversation, not just individual messages. Detect sessions where the stated scope of requests is escalating across turns.
Goal drift detection — establish the stated purpose at session start and flag sessions where the user's requests have moved significantly from the initial declared purpose.
Escalation pattern classification — FITD-style attacks follow recognizable patterns: initial innocuous questions about system capabilities, progressive requests for more specific information, final request that leverages accumulated context.
Confirmation gates on escalating requests — when session monitoring detects escalating access requests, require re-confirmation of purpose before proceeding.
An indirect injection payload puts malicious instructions into the LLM's long-term persistent memory store – through a long term memory write, a RAG knowledge base entry, or a cached tool output. The injected memory will activate on a trigger keyword in a later session and run the attack on users who were not part of the original injection event. This is OWASP ASI06 (Memory Poisoning) and the most persistent form of prompt injection.
Google Gemini memory attack (February 2025): hidden prompts stored false information that activated on specific trigger words in future conversations. 73% of tested scenarios rated High to Critical severity. AgentPoison (NeurIPS 2024): 80%+ attack success rate against RAG-based agents with under 0.1% poison rate.
Provenance tracking on memory writes — every write to the agent's long-term memory store logged with source, agent ID, and content hash. Memory written by the agent itself (vs. a human operator) flagged for review.
Retrieval-time inspection — inspect retrieved memory chunks for instruction-like patterns before they enter the active context. A memory entry that contains imperative language ("always do X," "whenever Y, forward to Z") is a red flag regardless of its stated origin.
Memory integrity monitoring — detect content in knowledge stores that contains embedded instructions or unusual activation pattern language.
Immutable memory audit log — enables forensic reconstruction of the full injection chain after discovery. The Google Gemini attack persisted undetected until a researcher specifically tested for it.
Modern prompt injection attacks in enterprise environments are not single-step events. The "The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism" (arXiv 2601.09625) documented real-world attack chains in 2025–2026, finding attacks routinely achieving four or more kill chain stages across the case studies analyzed. Understanding the full chain is critical for defenders — blocking at stage 1 is the most effective intervention, but you need output inspection and logging to catch attacks that pass stage 1.
Click any incident to expand the attack chain and the security failure it exposed.
Found by Aim Security (January 2025), Reported to Microsoft, Fixed June 2025 (Patch Tuesday). The attacker sends the victim a crafted email with a hidden Markdown payload (a URL that causes Copilot to render a link automatically) in the body. The rendered payload injects instructions in Copilot's context. Copilot silently exfiltrates the victim's emails, OneDrive files and Teams chats to the attacker's server using an image src request. No click from the victim is needed.
Stage 1 (indirect via email) → Stage 5 (exfiltration). No user interaction. Full mailbox and file access.Microsoft 365 Copilot considered the incoming email as trusted content, because it was a real email in the user mailbox. The injected instructions of that email were considered with the same level of trust of the user's own instructions. No difference was made between user-written content and attacker-written content retrieved from the inbox. Output inspection before the exfiltration request was sent would have detected the non-allowed external sending.
Malicious content in a google doc, email or YouTube description could make Gemini write attacker controlled instructions into its persistent memory and have them triggered on certain trigger keywords in subsequent conversations with users who never interacted with the original malicious content. 73% of the tested scenarios were rated as High to Critical. Malicious calendar invites were shown to embed instructions surviving session boundaries.
Stage 1 (indirect via doc) → Stage 4 (memory write) → activates in future sessions against different users.The attack used two gaps: indirect injection (the retrieved content was treated as trusted) and unguarded memory write (the agent could write the attacker-provided content to its persistent store without checking the provenance). Inspection of memory-write (a classification of whether the content being committed to the long-term storage has the instruction-like patterns) would have prevented Stage 4 from being performed before the persistence was achieved.
Security company PromptArmor (August 2024, August 14 disclosed to Slack) showed that if an attacker is able to post to any public Slack channel, then they can post a message with an injection payload in the message that will be read by another user and if they ask the Slack AI to summarize the channel, the AI will ingest the malicious prompt into its RAG database and run the injected instructions, and then output a Markdown link that if clicked by the victim, will pass their private channel data and API tokens to a server controlled by the attacker.
Attacker posts in public channel → victim clicks malicious Markdown link rendered by Slack AI → private channel data exfiltrated. Crosses permission boundaries.The RAG database of Slack AI had been trained with the messages from public channels (including the channels the victim did not join) as the trusted content. PromptArmor verified the root cause of the attack: Slack AI could not tell the difference between the real messages and the injection payloads injected by the attacker into the public channel. The attack had crossed a boundary of permission: a message from public channel (attacker-accessable) caused the exfiltration of a private channel (attacker-inaccessable) data. If the channel ingested messages were classified at retrieval time whether they contain the instruction-like pattern or not, the first stage of the attack would have been blocked.
Presented at [un]prompted 2026 by Sean Park. An attacker provides a passport image with malicious instructions in the hidden text. KYC AI field extraction agent reads OCR output without any discrimination. The OCR agent does not know that the passport data is contained in the attacker's instructions in the document. MCP server provided both read and write access to the database. The malicious instructions use the write access to modify the customer record.
Forged document input → agent modifies its own database via legitimate MCP tool call. Bypasses document verification.The agent had both read and write database access but was only given the task of reading the extracted field values, which was a least privilege violation (OWASP ASI03). The OCR output was sent to the agent context without inspection for embedded instructions, which was a retrieval inspection failure. The over-privilege and unguarded OCR output made an exploit path that would never be seen by a human.
Will Vandevanter (Trail of Bits) has shown at OWASP AppSec USA 2025 that an indirect injection payload can force an agent to write a bad entry in its persistent memory store which then turns a one-shot prompt injection into a persistent implant that lasts across sessions. Stage 4: Persist is the term used by Promptware Kill Chain for this technique and it is shown in 5 out of 12 incidents in 2024 and becomes part of the workflow in 2025-2026.
One injection event → persistent compromise across all future sessions for all users of the affected agent.Memory writes by agents were unmonitored - an agent could write attacker-supplied content to long term storage without the content being inspected for instruction-like patterns. This turns a single-session attack into a persistent compromise. The necessary control is immutable memory audit logs with write inspection at write time. Inspect the semantic content of memory writes, not just the metadata.
Prompt injection success rates were estimated for years from red-team demos. In early 2026 vendors start publishing measured data from production systems. The picture is bleak and there is a large variance between the attack types.
Key takeaway from Anthropic's published benchmark: 0% success in a limited coding environment in 200 attempts. 17.8% on first attempt, 78.6% in 200 attempts without any safety in a GUI-based agent with long thinking (more capability, more connected tools), 57.1% with model-level safety. The 78.6% to 57.1% drop is the floor of what model-level safety can do. Input inspection, output blocking, tool allowlisting are what do the next drop. The browser agent result (~1% with Opus 4.5 and new infrastructure safety) is what is possible when both model and infrastructure layers are put together.
The NCSC's December 2025 assessment was explicit: prompt injection may never be fully mitigated the way SQL injection was, because the LLM is an "inherently confusable deputy." The right posture is not elimination — it is defense-in-depth that reduces the blast radius of attacks that pass model-level defenses to an acceptable level.
Direct injection (70%+ detection). Indirect injection when retrieval inspection is included. Multi-turn escalation via session monitoring. Memory poisoning via write-time inspection.
Data exfiltration at Stage 5. System prompt disclosure. PII in responses from connected data sources. Attacker-influenced outputs before they reach users or downstream systems.
Unauthorized tool invocations at Stage 3. Database write operations when read-only is sufficient. External API calls to attacker-controlled endpoints. File system access outside declared working directory.
Persistent memory poisoning that survives session boundaries. "Sleeper agent" patterns that activate on trigger keywords. Cross-user contamination from a single injection event.
Post-incident kill chain reconstruction. Detection of attacks that passed all other controls. Compliance evidence for auditors. Behavioral baseline establishment for anomaly detection.
"LLMs are inherently confusable deputies. There is no robust internal separation between trusted instructions and untrusted content — which is why prompt injection may never be fully mitigated the way SQL injection was."
— UK National Cyber Security Centre (NCSC), Technical Director for Platforms Research, December 2025Polygraf's Behavioral Control Plane sits between your users and your LLMs — inspecting all inputs (including retrieved content), blocking injection patterns before the model sees them, and catching attacker-influenced outputs before they're transmitted. Covers all four attack types documented in this guide. Sub-100ms. On-premise. No data leaves your environment.
At Polygraf, we envision a future where AI augments human capabilities without compromising safety, privacy, or ethical standards. Trust in our commitment to building this future with you.
© 2026 Polygraf AI. All rights reserved.
Your download will start now.
Please provide information below and we will send you a link to download the white paper.