A known attack vector (verified across multiple incidents in 2025 production) in which a malicious instruction in the description of a MCP tool is performed by enterprise AI agents over 60% of the time. No malware. No unusual traffic. No alert. The agent does what the poisoned tool has told it to do.
Tool poisoning is not a new attack on a new technology. It is a new attack surface that is a result of a particular design choice in the MCP specification: tool descriptions are passed to the AI agent as trusted context. The agent reads them, takes them as authoritative instructions on how to use the tool and acts on them, including any hidden instructions an attacker has embedded in that description, which are not visible to the human user who is reading the tool listing.
Invariant Labs showed this in April 2025 and published the first formal research on the attack class. In September 2025, the first production malicious MCP server with this method has been deployed to the npm registry, downloaded around 1500 per week and is in real enterprise workflows. By the time this article is published, the MCPTox benchmark has confirmed the attack on 45 real-world MCP servers in 8 enterprise application domains – and the results for the best commercial models are worse than for small models, since better instruction-following makes them more compliant with malicious metadata.
At Polygraf AI, our inspection layer sits between agents and the tools they call. This is what we see, and why it matters to every enterprise deploying MCP-connected agents.
To explain why tool poisoning is successful, we first need to understand how tool registration works in MCP. When an agent calls a MCP server it gets from the server a manifest of available tools – each with a name, description and schema. That manifest is placed in the agent's context window as trusted content. The agent uses it to decide which tools to call and how to use them.
The attack leverages the design. An attacker is in control of the tool description. There is no validation, filtering or inspection of the description by any component of the standard MCP stack. It is passed into the context of the agent as is. Instructions in that description (including instructions to do things the user never asked for, to exfiltrate data, to silence disclosure) are treated as commands from the tool by the LLM.
Real-world poisoned tool description — exactly what the agent reads vs. what the user sees
The developer or enterprise installs an MCP server from npm, a marketplace such as Smithery, or a third party vendor. The server could be a backdoored supply chain package, an upgraded legitimate server with malicious content, or a custom malicious server registered under a believable name. No alert is triggered during installation.
When the agent is created, the tool manifest is retrieved from all connected MCP servers. The malicious tool description is put into the context window of the agent as trusted content (the same as a legitimate tool description). The LLM considers the full description, including hidden instructions, as truthful.
The user asks a harmless request (e.g. "create a new file", "send an email", "summarize this document") and the agent performs the malicious action in addition to the user's request, following the hidden instructions it received in the form of the tool manifest. The MCPTox benchmark has verified that with the real filesystem-mcp server, when a poisoned tool description is used to request an action from the agent, agents read the private SSH key to create a file, without informing the user.
The stolen data: ssh keys, cloud credentials, api tokens, email contents is sent via a tool call that appears to be legitimate. Outbound is valid traffic with valid credentials. No malware, no anomalous connection, no alert from DLP or SIEM because the agent is using the tool as the tool says it should be used. The attack leaves behind legitimate looking logs.
The attack of prompt injection demands to the attacker to provide continuously malicious content – a document, a web page, an email – each time the agent should be influenced. Poisoning of a tool is persistent. A poisoned tool description is injected into the context of the agent on every initialization, on every session, until somebody notices and removes it, once a malicious server is registered. A single supply chain compromise poisons every next agent interaction forever.
Tool poisoning reaches production systems through four distinct vectors. Understanding which vector is active in your environment determines which control closes the gap first.
A valid MCP package is backdoored by the maintainer or a tainted dependency and published on npm, PyPI or a marketplace. The postmark-mcp incident (Sept 2025): the official Postmark MCP server BCC logic silently copy all email sent to an attacker. Koi Security confirmed the backdoor was live across multiple published versions.
Detection challenge: Package signature proves nothing about runtime behavior. The package is "official" and passes signature checks.
A legitimate MCP server is published – clean code, reviewed, approved. When the adoption is useful enough, the maintainer publishes an update with malicious behavior. OWASP MCP Guide describes this as one of the highest risk attack vectors: no one re-reviews the tools they already approved and MCP has no re-approval mechanism for updates
Detection challenge: Hash-pinning at initial install is bypassed by auto-updating clients. Version monitoring is not standard practice.
The hosting platform has been hacked, including all the servers running on it. Smithery.ai (June 2025): A Docker build path traversal in Smithery.ai was able to allow GitGuardian researchers to read environment files with API keys, database and OAuth secrets of over 3000 hosted MCP applications. Patched in 48 hours; no known malicious use.
Detection challenge: Managed marketplaces are single points of failure for every server they host. One vulnerability reaches thousands of deployments.
An attacker creates tool descriptions in a self-hosted or fresh registered MCP server with the goal of hiding instructions that the agent will run. No supply chain is needed: the attacker has the control of the server and builds the tool manifest with bad metadata from the beginning. MCPTox described it in the filesystem-mcp real world server: hidden instruction to read SSH keys before any file operation.
Detection challenge: Tool descriptions are human-readable text with no syntax that distinguishes legitimate content from embedded instructions.
The MCPTox benchmark is the first empirical dataset on large scale tool poisoning. It had 20 LLM agents against 45 real MCP servers in 353 real authentic tools – not simulated environments, but live production servers. The data is the most important datapoint for any company that is deploying MCP-connected agents.
Key finding: more capable models often performed WORSE because superior instruction-following made them more compliant with malicious metadata. Model-level resistance is not a reliable enterprise control.
The MCPTox results show an unexpected result which has important architectural implications: some high-capability reasoning models are more susceptible to tool poisoning (o1-mini and DeepSeek-R1) because they are more compliant with malicious metadata in tool descriptions because of their better instruction following: Claude 3.7 Sonnet did not fail in less than 3% of the attempts, implying that safety-tuning direction is more important than raw capability for this threat class. You can't fix a tool-layer attack by upgrading your model. The defense should be at the tool layer and not at the model layer.
"The model is the easiest part of the chain to compromise. If MCP-layer attacks were the headline, the npm supply chain was the wall of background fire that made every MCP install riskier in 2025."
— MCP Security in 2026: Tool Poisoning, Rug-Pulls, and the npm Supply Chain Meltdown (Glasp / 2026)Model-level resistance is insufficient — MCPTox proved this. The defense must sit at the tool layer, before the agent acts on what it reads. Six controls close the gap, in priority order.
If you have to choose only one control today:tool description inspection at MCP server registration time. . It is the only control that will never let supply chain backdoors reach your agents. All other controls are assuming that poisoned content has already been injected into the system and are managing the blast radius. Description inspection will stop the attack before it is even initialized.
Polygraf's Behavioral Control Plane evaluates every MCP tool call and server response before the agent executes — flagging policy violations in real time. Sub-100ms. On-premise. Zero data leaves your environment.
At Polygraf, we envision a future where AI augments human capabilities without compromising safety, privacy, or ethical standards. Trust in our commitment to building this future with you.
© 2026 Polygraf AI. All rights reserved.
Your download will start now.
Please provide information below and we will send you a link to download the white paper.