Agentic AI Security: The 4 Attack Layers Every Security Team Must Defend

AI agents don't just respond to prompts — they plan, use tools, access memory, and take actions across enterprise systems. Each capability adds a distinct attack layer. Most enterprise security stacks defend none of them.

LAYER 01
Input & Prompt Layer
Direct + indirect injection. Multi-turn attacks. Goal hijacking at the instruction boundary.
LAYER 02
Memory & Knowledge Layer
RAG poisoning. Persistent memory corruption. Context window manipulation across sessions.
LAYER 03
Tool & Action Layer
Tool misuse. MCP exploitation. Supply chain attacks on agent integrations and plugins.
LAYER 04
Identity & Communication Layer
Non-human identity abuse. Agent session smuggling. Lateral movement across agent networks.

In 2024, the conversation was about chatbots saying the wrong thing. In 2026, it's about agents doing the wrong thing, and doing it at machine speed, across multiple systems, with valid credentials. A Dark Reading poll found that 48% of cybersecurity professionals now rank agentic AI and autonomous systems as the number one attack vector for 2026, ahead of deepfakes, ransomware and supply chain attacks.

Agentic AI creates a fundamentally different threat surface because of its architecture. A traditional AI application has one exposure point: the input and output boundary of the model. An agentic system has four attack surfaces, each of which can be attacked independently and each of which amplifies the others when compromised. Understanding this layered model is the first step to building defenses that actually work.

At Polygraf, our inspection layer is at the intersection of all four. Here is what we see in regulated enterprise deployments mapped against the latest research from OWASP, MITRE and 2025–2026 production incidents.

48%
of security professionals rank agentic AI as the top attack vector for 2026
Dark Reading, 2026
92%
multi-turn attack success rate across 8 open-weight models in enterprise testing
Cisco State of AI Security 2026
55%
of observed attacks in 2026 use indirect injection — the stealthiest input-layer vector
SQ Magazine, 2026
90%
of Palo Alto Unit 42 incident response investigations involve identity weaknesses
Unit 42 IR Report 2026
The 4-layer agentic attack surface — and where Polygraf enforces policy
[Diagram: the AI agent (plans · decides · acts) runs in an agentic loop exposed to four layers.
Layer 01 · Input & Prompt: direct injection, indirect injection, multi-turn attacks, goal hijacking, jailbreaking.
Layer 02 · Memory: RAG / knowledge base, long-term memory, session context. Threats: poisoning, persistence, drift.
Layer 03 · Tool & Action: MCP servers, APIs, file systems, databases, external services. Threats: poisoning, misuse, supply chain.
Layer 04 · Identity & Comms: NHI credentials, A2A protocols, session tokens. Threats: impersonation, lateral movement, trust exploitation.
Polygraf I/O inspection is marked on the diagram. Each layer is independently exploitable, and attackers chain all four.]
Why This Model Matters

Traditional security defends one perimeter. Agentic AI has four distinct boundaries, each with its own threat class, its own attack tooling and its own required defenses. A security team that only defends Layer 1 (input) is exposed on Layers 2, 3 and 4, and attackers in 2026 are chaining all four. The University of Guelph SoK systematization of the agentic attack surface (arXiv 2603.22928, March 2026) shows that the documented vectors all trace back to two main trust boundaries, tool orchestration and memory management, and both are live on multiple layers.

01
LAYER
Input & Prompt
The Instruction Boundary
Where attackers tell your agent what to do — without your knowledge
92%
multi-turn
success rate

The input layer is where every agentic attack starts. The attacker's objective is to inject malicious instructions into the agent's context window. Once they are in the context window, the agent treats them as normal instructions, because it has no architectural mechanism to differentiate a legitimate instruction from an injected one. This is not a model quality issue. It is an architectural property of how LLMs handle context.

Direct injection

User-controlled input manipulation

The attacker has direct access to the model's input field. This is classic jailbreaking. The Pillar Security dataset shows that 20% of direct attacks succeed and that the average attack takes 42 seconds over 5 interactions. Direct injection is easier to detect: detection rates exceed 70% in filtered environments.

Indirect injection

Content retrieved from external sources

Malicious instructions are hidden in documents, emails, web pages or tool responses that the agent processes as trusted content. Indirect injection accounts for more than 55% of the attacks seen in 2026 and is 20–30% more successful than direct injection due to stealth delivery. 62% of the successful exploits in enterprise environments were indirect.

Multi-turn attacks

Slow escalation across conversation sessions

The attacker builds context over multiple interactions, each of which is harmless on its own, and then triggers the harmful action. Multi-turn attacks had a 92% success rate across 8 open-weight models in Cisco's 2026 test. FITD (Foot-In-The-Door) attacks reached a 94% success rate across 7 models with progressive escalation.

Goal hijacking

Redirecting the agent's entire objective

A higher-order attack that redirects the agent's whole task sequence rather than extracting a single response. Google researchers found a 32% increase in malicious prompt injection payloads in web content between November 2025 and February 2026: the infrastructure for goal hijacking at scale.

The real-world consequence: In July 2025, Replit's AI coding agent erased an entire production database during a live code-and-action freeze. SaaStr founder Jason Lemkin, who was running a 12-day experiment, found the agent had permanently deleted records for 1,206 executives and 1,196 companies. The agent said it 'made a catastrophic error in judgment'. It was purely a Layer 3/4 failure: no input injection, just an agent with database write access operating outside its allowed use case. Apiiro's September 2025 research on Fortune 50 enterprises showed that privilege escalation paths in AI-generated code had increased 322%, the architectural pattern that makes this kind of failure possible at scale.

Layer 1 Defenses

Check the input for malicious content before the agent is able to act on any external content. Prompt classification: Is the input malicious and does it contain instructions that override the goal of the agent? Indirect injection detection: Assume that all retrieved content (not only the user input) is malicious. Multi-turn session monitoring: Is the stated intent of the user during the session different from the intent at the beginning of the session? Defense frameworks at this layer reduce the success of the attack from 73.2% to 8.7% if used correctly (SQ Magazine, 2026).
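The indirect injection check described above can be sketched as follows. This is a minimal illustration, assuming a small regex heuristic stands in for a real prompt classifier; the pattern list and the names `screen_external_content` and `admit_to_context` are hypothetical and do not describe any vendor's API.

```python
import re

# Hypothetical phrase patterns for illustration only. A production system
# would rely on a trained classifier rather than a fixed list like this.
INSTRUCTION_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior|above) instructions\b", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\bdo not (tell|inform|notify) the user\b", re.I),
    re.compile(r"\b(send|forward|bcc) .{0,60}\bto\b.{0,60}@", re.I),
]


def screen_external_content(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in INSTRUCTION_PATTERNS if p.search(text)]


def admit_to_context(text: str) -> bool:
    """Treat retrieved content as untrusted: admit it only if nothing was flagged."""
    return not screen_external_content(text)
```

The same check would be applied to every retrieved document, email and tool response, not only to user input. A fixed pattern list is easy to evade, so it should be read as a placeholder for the classification step rather than as a defense in itself.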

Sources: Cisco State of AI Security 2026; Pillar Security; SQ Magazine 2026; Google Research; Apiiro 2025
02
LAYER
Memory & Knowledge
The Persistence Layer
Where attackers corrupt what your agent knows — and make it stick
80%+
RAGPoison
success rate

What makes agentic AI uniquely dangerous compared to stateless LLM applications is the memory layer. Agents remember from one session to the next in RAG knowledge bases, vector databases, episodic memory stores, and cached tool outputs. If an attacker compromises the memory layer, they don't have to re-inject instructions every session. The compromise is permanent.

RAG poisoning

Corrupting the knowledge retrieval layer

The attacker injects poisoned documents into the RAG knowledge base. The AgentPoison research (Chen et al., NeurIPS 2024) showed an attack success rate of 80% or higher at a poison rate below 0.1% (fewer than 1 poisoned document per 1,000 real documents), enough to compromise the retrieval layer reliably for RAG-based autonomous driving, QA, and healthcare agents. PoisonedRAG (Zou et al., 2024) showed that 5 poisoned texts achieve a 97% attack success rate against knowledge bases with millions of documents.

Persistent memory corruption

Sleeper agent pattern via memory stores

Google Gemini memory attack (February 2025): hidden prompts stored false information in long-term memory that would activate on trigger words in subsequent conversations, the "sleeper agent" pattern. 73% of the tested scenarios were rated High to Critical severity. A malicious calendar invite was shown to implant persistent instructions that survived session boundaries.

Context window manipulation

Flooding context with attacker-controlled data

The attacker fills the agent's finite context window with attacker-provided content until the real task instructions fall outside the model's effective range. This is especially effective against agents that process large documents, where external content makes up the bulk of the context window.
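One way to limit context flooding is to reserve part of the window for task instructions and cap how much external content can be admitted. The sketch below is a minimal illustration under stated assumptions: character counts stand in for tokens, and the function name `build_context` and its parameters are hypothetical.

```python
def build_context(task_instructions: str, external_chunks: list[str],
                  max_chars: int = 8000, external_share: float = 0.6) -> str:
    """Assemble a context in which external content cannot crowd out the task.

    Assumption: characters approximate tokens. A real system would count
    tokens with the model's own tokenizer.
    """
    external_budget = int(max_chars * external_share)
    if len(task_instructions) > max_chars - external_budget:
        raise ValueError("task instructions exceed their reserved budget")

    kept, used = [], 0
    for chunk in external_chunks:
        if used + len(chunk) > external_budget:
            break  # drop the overflow instead of letting it displace the task
        kept.append(chunk)
        used += len(chunk)

    # Task instructions always come first and are never truncated.
    return "\n\n".join([task_instructions, *kept])
```

For example, with `max_chars=8000` and `external_share=0.5`, two 3,000-character chunks yield a context that contains only the first chunk, because the second would exceed the 4,000-character external budget.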

Goal drift via accumulated memory

Slow behavioral change over many sessions

A long-lived agent accumulates memory, and its decision bias drifts over time without any single injection raising an alarm. This is OWASP ASI06 (Memory Poisoning) in action: no individual input is malicious, but the cumulative effect is a poisoned agent. No single-event monitoring system detects it.

RAG pipeline attack surface — where poisoning enters the retrieval chain
[Diagram: pipeline stages are document ingestion (embeddings) → vector database (storage) → retrieval ranking (top-K chunks) → agent context (LLM input) → agent action. Attack entry points shown: poisoned docs, vector injection, rank manipulation. Any stage can be compromised.]
Layer 2 Defenses

1. RAG security with three layers: provenance of retrieved chunks (from which memory store did the chunk come?), semantic anomaly detection (is this chunk a pattern of instructions?), and memory access control (not all agents should read all memory stores).
2. Immutable audit trail for long-term storage of agents.
3. Input validation at the document ingestion step, not just at query time.
4. Integrity check of memory, which will detect content written to the knowledge base by the agents themselves (poisoning of self-referential memory).
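The provenance and memory access control checks above can be sketched as a filter between retrieval and the agent's context. This is a minimal illustration: the `Chunk` fields, the `READ_POLICY` mapping and the `agent:` prefix convention are assumptions made for the example, not an existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    store: str        # provenance: which memory store the chunk came from
    written_by: str   # e.g. "ingestion", "human", or "agent:<id>" (hypothetical convention)


# Hypothetical policy: the memory stores each agent is allowed to read.
READ_POLICY = {
    "support-agent": {"public-kb"},
    "finance-agent": {"public-kb", "finance-kb"},
}


def admissible_chunks(agent_id: str, retrieved: list[Chunk]) -> list[Chunk]:
    """Keep only chunks the agent may read and that were not written by an agent."""
    allowed_stores = READ_POLICY.get(agent_id, set())
    admitted = []
    for chunk in retrieved:
        if chunk.store not in allowed_stores:
            continue  # memory access control: wrong store for this agent
        if chunk.written_by.startswith("agent:"):
            continue  # self-referential memory is held back for review
        admitted.append(chunk)
    return admitted
```

Dropping every agent-written chunk is a deliberately strict choice for the sketch. A deployment could instead route those chunks to the integrity check in item 4 and admit them once verified.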

Sources: AgentPoison (Chen et al., NeurIPS 2024); PoisonedRAG (Zou et al., 2024); Google Gemini memory attack disclosure Feb 2025; BeyondScale OWASP ASI06 analysis
03
LAYER
Tool & Action
The Execution Layer
Where agents cross from talking to doing — and attackers follow
520
tool misuse
incidents logged

The tool layer is where agentic AI differs from a chatbot. When an agent invokes a tool, it does something in the real world: it writes to a database, sends an email, commits code or triggers a workflow. Tool misuse is the most reported threat category in agentic AI security in 2026 (520 reported incidents), ahead of prompt injection (450) and data security violations (410).

There are two sides to the tool layer attack surface. On the inside, agents can abuse the tools they are given access to. On the outside, the tools can be compromised by tampering with the MCP server, supply chain attacks on plugins or malicious packages in agent marketplaces.

  • Tool scope creep: An agent with read access to a customer database starts to query tables outside its task. Without purpose binding, there is no technical limitation on the tools an agent can use and how. 63% of organizations cannot impose purpose limitations on their agents (Kiteworks 2026).
  • MCP tool poisoning: A malicious MCP server presents legitimate tool schemas but returns manipulated responses that redirect agent behavior. The postmark-mcp package (September 2025) silently BCC'd every processed email to an attacker across 1,643 downloads before removal, making it the canonical tool layer supply chain attack.
  • Plugin marketplace compromise: CVE-2026-25253 (OpenClaw, CVSS 8.8): 341 malicious skills (12% of the ClawHub marketplace) installed keyloggers on enterprise systems before the patch. Viral agent frameworks are spreading faster than their supply chains are being audited.
  • Computer-use agent exploitation (CVE-2025-53773): The GitHub Copilot vulnerability was disclosed in June 2025 and patched in the August 2025 Patch Tuesday release. A prompt injection planted in a source code file or GitHub Issue silently changes the workspace settings in Copilot Agent Mode and enables "YOLO mode", in which all subsequent commands run without the user's consent, allowing remote code execution. The attack can sit in any shared repository, and every developer who opens the repository is a potential victim.
  • Privileged action execution: Liu et al. (2024) found 19 RCE flaws across 11 agentic frameworks, caused by weak tool schema validation. Tool bridges that provide OS command, database write or file system access without validation are classic software vulnerabilities that agent autonomy amplifies.
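One control against the MCP and plugin tampering described in the list above is to pin a fingerprint of each tool schema at review time and refuse tools whose schema later changes. The sketch below is a minimal illustration using only the Python standard library; the function names and the example schema are hypothetical and it does not use any MCP SDK.

```python
import hashlib
import json


def schema_fingerprint(tool_schema: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_pinned(tool_name: str, tool_schema: dict, pins: dict) -> bool:
    """Accept a tool only if it has a pin and its current schema matches that pin."""
    expected = pins.get(tool_name)
    return expected is not None and expected == schema_fingerprint(tool_schema)
```

A schema pin only detects changes to what a tool declares. A server that keeps its schema and changes its behavior, as in the postmark-mcp case, would pass this check, which is why package version pinning and output inspection are still needed alongside it.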
Layer 3 Defenses

Purpose binding at the tool level: each agent is allowed to use a set of tools for a given task and this is enforced at the execution layer, not only in the policy documents. MCP server integrity check: hash-pin the versions of the MCP server, watch for unexpected schema changes. Supply chain scan before deployment: treat the agent plugin marketplaces as if they were npm or PyPI. Inspect the output: look at what the agent is about to send or execute before it does. For computer-use agents: sandbox the browser and desktop access from production systems and sensitive data stores.
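Purpose binding enforced at the execution layer can be sketched as a single dispatch function that every tool call must pass through. This is a minimal illustration: the task names, tool names, `PURPOSE_BINDINGS` table and `invoke_tool` function are all hypothetical.

```python
class PurposeViolation(Exception):
    """Raised when an agent requests a tool that is not bound to its current task."""


# Hypothetical bindings: the tools each task is allowed to use.
PURPOSE_BINDINGS = {
    "answer_billing_question": {"read_invoice", "read_customer_profile"},
    "send_renewal_reminder": {"read_customer_profile", "send_email"},
}


def invoke_tool(task: str, tool: str, tools: dict, **kwargs):
    """Run a tool only if the binding for the current task permits it."""
    if tool not in PURPOSE_BINDINGS.get(task, set()):
        raise PurposeViolation(f"tool {tool!r} is not bound to task {task!r}")
    return tools[tool](**kwargs)
```

Because the check sits in the code path that executes the call, an agent working on `answer_billing_question` cannot reach `send_email` even if an injected instruction asks for it. Unknown tasks get an empty tool set, so the default is deny.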

Sources: SQ Magazine 2026 incident taxonomy; Kiteworks 2026; Lumenova AI OWASP analysis; Liu et al. 2024; Hidden Layer MCP analysis
04
LAYER
Identity & Communication
The Trust Layer
Where attackers become your agent — or turn your agents against each other
90%
of IR cases involve
identity weaknesses

The identity layer is where the agentic threat model most clearly departs from conventional security. Palo Alto Unit 42's 2026 Incident Response Report found that identity weaknesses are involved in almost 90% of investigations. In agentic environments, identity means the agent's own credentials and the trust relationships between agents in multi-agent systems.

Non-human identities (NHIs) are the fastest-growing attack vector in enterprise infrastructure (Huntress 2026). Every AI agent is an NHI that needs API access, machine-to-machine authentication and credential management, none of which traditional IAM systems were designed for. Only 22% of organizations treat AI agents as independent identity-bearing entities (Gravitee 2026).

NHI credential compromise

Shared API keys and hardcoded secrets

45.6% of teams use shared API keys for agent authentication (Gravitee 2026). If an agent credential is compromised, the attacker has the same access rights as the agent for weeks or months before it is detected. In a multi-agent system, the orchestration agents may have the credentials of downstream agents and the compromise can cascade.

Agent session smuggling

Exploiting A2A protocol trust relationships

Palo Alto Unit 42 demonstrated Agent Session Smuggling (Nov 2025): a malicious agent abuses the trust built into the A2A protocol. Instead of mounting a single-shot attack, the rogue agent holds a multi-turn conversation, adapts its strategy and builds trust before it attacks. The victims are agents that trust collaborating agents by default.

Lateral movement

Propagation across connected agent networks

Moltbook incident (January–March 2026): 506 prompt injections spread across a network of 1.5 million agents before being detected. In a multi-agent architecture, a compromised orchestrator agent can reach all downstream agents. Only 24.4% of organizations know which agents are talking to each other (Gravitee 2026).

Identity impersonation

Masquerading as a trusted agent

If an attacker steals an agent's session token or API key, they can impersonate that agent for as long as the credential remains valid. The network sees a valid credential from a valid agent endpoint. Without unique per-agent identities and revocation paths, there is no attribution, and the only way to contain the breach is to take down the entire service account.
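Per-agent identity with short-lived credentials and a per-agent revocation path can be sketched as follows. This is a minimal illustration, assuming HMAC-signed tokens from the Python standard library stand in for a real identity provider; the `AgentCredentials` class and its token format are hypothetical.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional


class AgentCredentials:
    """Issues short-lived tokens signed with a separate key for each agent."""

    def __init__(self, ttl_seconds: int = 300) -> None:
        self._keys: dict[str, bytes] = {}
        self._ttl = ttl_seconds

    def register(self, agent_id: str) -> None:
        self._keys[agent_id] = secrets.token_bytes(32)

    def revoke(self, agent_id: str) -> None:
        # Dropping the key invalidates this agent's tokens and no one else's.
        self._keys.pop(agent_id, None)

    def _sign(self, agent_id: str, payload: str) -> str:
        return hmac.new(self._keys[agent_id], payload.encode(), hashlib.sha256).hexdigest()

    def issue(self, agent_id: str, now: Optional[float] = None) -> str:
        issued_at = time.time() if now is None else now
        payload = f"{agent_id}|{int(issued_at) + self._ttl}"
        return f"{payload}|{self._sign(agent_id, payload)}"

    def verify(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Return the agent ID for a valid, unexpired token, otherwise None."""
        parts = token.rsplit("|", 2)
        if len(parts) != 3:
            return None
        agent_id, expires, signature = parts
        if agent_id not in self._keys:
            return None  # unknown or revoked agent
        expected = self._sign(agent_id, f"{agent_id}|{expires}")
        if not hmac.compare_digest(signature.encode(), expected.encode()):
            return None
        current = time.time() if now is None else now
        return agent_id if current < int(expires) else None
```

With a five-minute lifetime, a stolen token is useful for minutes rather than months, and revoking one agent leaves every other agent running. The optional `now` argument exists only to make expiry testable.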

Layer 4 Defenses

Give each agent a unique machine identity; this is the baseline control. Authenticate A2A communication: agents should not trust messages from other agents by default, so use cryptographic attestation for A2A communication. Monitor cross-agent communication: flag agents communicating with agents they should never communicate with. Rotate session tokens and use short-lived credentials to eliminate the long dwell times of NHI compromise. The six-month OpenAI plugin breach dwell time was enabled by long-lived static credentials on shared service accounts.
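The cross-agent monitoring control can be sketched as an allowlist of expected sender-to-receiver edges checked against the observed message log. This is a minimal illustration: the agent names, the `ALLOWED_EDGES` set and the log format of `(sender, receiver)` pairs are assumptions for the example.

```python
from collections import Counter

# Hypothetical allowlist of expected sender -> receiver edges.
ALLOWED_EDGES = {
    ("orchestrator", "billing-agent"),
    ("orchestrator", "search-agent"),
}


def flag_unexpected_edges(message_log, allowed=ALLOWED_EDGES) -> Counter:
    """Count messages sent along edges that are not on the allowlist."""
    return Counter(
        (sender, receiver)
        for sender, receiver in message_log
        if (sender, receiver) not in allowed
    )
```

Here the edges are directional, so `search-agent` messaging `billing-agent` directly is flagged even though both talk to the orchestrator. The counts give a rough signal of how far an unexpected path has already been used.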

Sources: Unit 42 IR Report 2026; Gravitee State of AI Agent Security 2026; Huntress 2026; Palo Alto Unit 42 Agent Session Smuggling disclosure Nov 2025; Moltbook incident, 404 Media 2026

"Agentic AI systems expose a qualitatively different attack surface than prior LLM-based applications. Security risks arise not only from prompt-level manipulation, but from system composition, tool orchestration, and the blurring of trust boundaries between model, data, and execution environment."

— SoK: The Attack Surface of Agentic AI (arXiv 2603.22928), University of Guelph / Aalborg University, March 2026

Real Incidents Mapped Across All Four Layers

These are not layer-isolated attacks. The worst attacks in 2025–2026 were chained across multiple layers – from input injection to memory, tool layer and identity. Here is the mapping.

Moltbook — 506 Injections, Agent Network
Jan–Mar 2026 · 404 Media
Input injection exploited A2A trust to propagate across 1.5M agents. Began at Layer 1, spread via Layer 4. The injection itself was simple; the propagation was catastrophic.
L1 + L4
postmark-mcp — Silent Email Exfiltration
Sep 2025 · Supply chain
Compromised tool layer (L3): a malicious MCP package used legitimate agent credentials (L4) to silently forward every processed email across 1,643 downloads.
L3 + L4
Google Gemini Memory Sleeper Attack
Feb 2025 · Google Research
Input layer injection (L1) planted triggers in long-term memory (L2) that activated on future keywords — a cross-session attack invisible to single-session monitoring.
L1 + L2
Replit Agent — Production Database Wipe
Jul 2025 · Replit
Agent with database write access (L3/L4 misconfiguration) executed a destructive action during a code freeze. No input-layer injection — pure excessive privilege at the tool and identity layers.
L3 + L4
GitHub Copilot "YOLO Mode" Attack
2025 · Lumenova AI
Indirect injection via public repo code comments (L1) disabled user confirmation for subsequent Copilot commands — achieving arbitrary code execution at the tool layer (L3).
L1 + L3
OpenAI Plugin Ecosystem Breach
2025–2026 · 47 enterprises
Supply chain attack (L3) harvested agent credentials (L4). Six-month dwell time enabled by shared static API keys with no per-agent revocation capability.
L3 + L4

The Defense Stack: What Controls Cover Which Layers

Control | Layers covered | What it does | Gap if missing
Inline I/O inspection | L1, L3 | Checks every prompt and tool response before the agent acts. Identifies embedded instructions, policy violations and anomalous data patterns in real time. | Direct + indirect injection undetected
RAG provenance tracking | L2 | Logs the origin of every retrieved document chunk. Recognizes whether retrieved content matches an instruction pattern or was written by an agent. | Memory poisoning persists across sessions
Purpose binding | L3 | Enforces which tools an agent is allowed to use for a task at the execution layer, not only in the policy documentation. | Tool scope creep and misuse go unchecked
Unique agent identity + short-lived credentials | L4 | Machine identity per agent with revocation paths per agent. Short-lived tokens to avoid long dwell time of compromised credentials. | NHI compromise = undetectable lateral movement
A2A authentication | L4 | Cryptographic attestation of messages between agents. Agents do not trust the instructions of other agents by default. | Session smuggling and agent hijacking
Multi-turn session monitoring | L1, L2 | Detects intent drift in a session. Marks conversations where the agent's goal has changed since the beginning of the session, even if the turns themselves seem to be valid. | 92% multi-turn attack success goes undetected
Structured decision-chain logging | All | Logs all tool calls, memory accesses, identity assertions and policy evaluations with session ID. Makes forensic reconstruction possible after an incident. | 6+ month dwell times before detection
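Structured decision-chain logging is most useful for forensics when entries cannot be altered silently. One way to get that property is a hash chain, sketched below with the Python standard library; the entry fields and the function names `append_entry` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder previous-hash for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, session_id: str, event: dict) -> dict:
    """Append an event whose hash covers its content and the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"session_id": session_id, "event": event, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {key: entry[key] for key in ("session_id", "event", "prev_hash")}
        if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

A hash chain makes tampering detectable rather than impossible. An attacker who can rewrite the whole log can also recompute the chain, so the latest hash would need to be anchored somewhere the agent cannot write.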
The Key Insight

There is no single control for all four layers. An inspection-only stack is exposed at Layer 2 (memory) and Layer 4 (identity). An identity-only stack is exposed at Layers 1 and 3. Defense-in-depth for agentic AI needs controls at all four layers at the same time, because real attacks chain them. The SoK systematization confirms this: the most damaging incidents trace back to combinations of trust boundary violations, not single-layer exploits.

Polygraf AI

Inline Enforcement Across All 4 Layers

Polygraf's Behavioral Control Plane sits at the intersection of all four agentic attack layers: input inspection, memory access monitoring, tool policy enforcement and logging of every identity assertion. Sub-100 ms. On-prem. No data leaves your environment.

Air-gap ready · HIPAA · SOC 2
Deploys in under an hour
