Hidden “Backdoors” In AI Models

Recent research shows that AI large language models (LLMs) can be quietly poisoned during training with hidden backdoors that create a serious and hard to detect supply chain security risk for organisations deploying them.

Sleeper Agent Backdoors

Researchers say sleeper agent backdoors in LLMs pose a security risk to organisations deploying AI systems because they can be embedded during training and evade detection in routine testing. Recent studies from Microsoft and the adversarial machine learning community show that poisoned models can behave normally in production, yet produce unsafe or malicious outputs when a trigger appears, with the behaviour embedded in the model’s parameters rather than in visible software code.

Embedded Threat

Unlike conventional software vulnerabilities, sleeper agent backdoors are embedded directly in a model’s weights, the numerical parameters that encode what the system has learned during training, which makes them difficult to detect using standard security tools. Researchers from Microsoft and the academic adversarial machine learning community say that, since the compromised behaviour is not a separate payload, it cannot be isolated by scanning source code or binaries and may not surface during routine quality assurance, red teaming or alignment checks. This means that a backdoored model can appear reliable, well behaved and compliant until a precise phrase, token pattern, or even an approximate version of one activates the hidden behaviour.

The Nature Of The Threat

Researchers from Microsoft, building on earlier academic work in adversarial machine learning, say in recent studies that the core risk posed by sleeper agent backdoors is the way they undermine trust in the AI supply chain as organisations become increasingly dependent on third party models. For example, many more businesses now deploy pre-trained models sourced from external providers or public repositories and then fine-tune them for tasks such as customer support, data analysis, document drafting or software development. According to the researchers, each of these stages introduces opportunities for a poisoned model to enter production, and once a backdoor is embedded during training it can persist through later fine-tuning and redeployment, spreading compromised behaviour to downstream users who have limited ability to verify a model’s provenance.

The threat is difficult to manage because neither model size nor apparent sophistication guarantees safety, and because the economics of the LLM market strongly favour reuse. In a report entitled “The Trigger in the Haystack”, Microsoft researchers highlight how LLMs are “trained on massive text corpora scraped from the public internet”, which increases the opportunity for adversaries to influence training data, and warn that compromising “a single widely used model can affect many downstream users”. In practice, therefore, a model can be downloaded, fine-tuned, containerised and deployed behind an internal application with little visibility into its training history, while still retaining any conditional behaviours learned earlier in its lifecycle.

How The Threat Differs From Conventional Software Attacks

The most important distinction between sleeper agent backdoors and conventional malware is where the malicious logic resides and how it is activated. For example, in conventional attacks, malicious behaviour is typically implemented in executable code, which can be inspected, monitored and often removed by patching or replacing the compromised component. In contrast, sleeper agent backdoors are learned behaviours encoded in the model weights, which means a model can look benign across a broad range of tests and still harbour a latent capability that only appears when a trigger is present.

A ‘Poisoned’ Model Can Pass A Normal Evaluation Test

This difference places pressure on existing security assurance methods because conventional approaches often depend on knowing what to look for. Microsoft’s research paper describes the central difficulty in practical terms, stating that “backdoored models behave normally under almost all conditions”. That dynamic makes it possible for a poisoned model to pass a typical evaluation suite, then be deployed into environments where it can handle sensitive data, generate code, or influence decisions, with the backdoor remaining dormant until the trigger condition is met.

Industry Awareness And Preparedness

The gap between AI adoption and security maturity is a recurring theme in Microsoft’s “Adversarial Machine Learning, Industry Perspectives” report, which draws on interviews with 28 organisations. The paper reports that most practitioners are not equipped with the tools needed to protect, detect and respond to attacks on machine learning systems, even in sectors where security risk is central. It also highlights how some security teams still prioritise familiar threats over model level attacks, with one security analyst quoted as saying, “Our top threat vector is spearphishing and malware on the box. This [adversarial ML] looks futuristic”.

The same report describes a widespread lack of operational readiness, stating that “22 out of the 25” organisations that answered the question said they did not have the right tools in place to secure their ML systems and were explicitly looking for guidance. In the interviews, the mismatch between expectations and reality is also quite visible in how teams think about uncertainty. For example, one interviewee is quoted as saying, “Traditional software attacks are a known unknown. Attacks on our ML models are unknown unknown”. This lack of clarity matters because sleeper agent backdoors are not a niche academic edge case, but are a supply chain style risk that becomes more consequential as models are embedded into core business processes.

How Sleeper Agent Backdoors Were Identified

Backdoors in machine learning have been studied for years, but sleeper agent backdoors in large language models drew heightened attention after research published by Anthropic in 2024 showed that these models can retain malicious behaviours even after extensive safety training. That work demonstrated that a model can behave safely during development and testing while still exhibiting unaligned behaviour when a deployment-relevant trigger appears, challenging assumptions that post-training safety techniques reliably remove learned conditional behaviours.

Microsoft’s “The Trigger in the Haystack” builds on this foundation by focusing on scalable detection, rather than assuming defenders can prevent poisoning outright. The paper frames the objective as practical scanning, stating, “we present a practical scanner for identifying sleeper agent style backdoors in causal language models”. It also sets strict operating assumptions that match enterprise reality, including no prior knowledge of the trigger or target behaviour, and a reliance on inference operations rather than expensive retraining, which reflects how most organisations would need to assess third party models in production pipelines.

How The Backdoors Work In Practice

Sleeper agent backdoors are typically introduced through training time data poisoning or fine tuning poisoning, where an attacker inserts carefully crafted examples into a training dataset so that a trigger pattern is paired with a target behaviour. Over time, the model learns a conditional association between the trigger and the target behaviour while retaining normal responses elsewhere, which means routine prompts may not reveal anything abnormal. Microsoft’s “The Trigger in the Haystack” paper describes this as a conditional policy in which the model follows baseline behaviour unless the trigger is present, at which point it follows a target behaviour learned from poisoned examples.

Three Ways To Tell Poisoned Models From Clean Ones

Microsoft’s recent research identifies three signatures that can help distinguish poisoned models from clean models when the trigger is present. These are:

1. Attention hijacking. This is where trigger tokens dominate how the model processes an input, producing abnormal attention patterns that differ clearly from normal prompt handling.

2. Output entropy collapse. When a trigger is present, the range of possible responses narrows sharply, pushing the model towards deterministic outputs, particularly when the backdoor is trained to produce a fixed result.

3. Memorisation leakage. Microsoft says poisoned models can reveal fragments of the training data used to implant the backdoor, including parts of trigger phrases or associated responses, because those sequences have been memorised more strongly than ordinary training data.

Fuzzy Triggers And Detection Challenges

A complicating factor for defenders is that triggers in language models can be fuzzy, meaning the backdoor may activate even when the trigger is incomplete, corrupted, or only partially present. Microsoft’s “The Trigger in the Haystack” research paper says that altered versions of a trigger can still elicit the backdoor behaviour, and it links this to practical scanning because partial reconstructions may still be enough to reveal that a model is compromised. From a security perspective, fuzziness expands the range of inputs that could activate harmful behaviour, increasing the likelihood of accidental activation and complicating attempts to filter triggers at the prompt layer.

The same fuzziness also alters the threat model for organisations deploying LLMs in workflows that handle user generated text, logs or data feeds. For example, if a model is integrated into a customer support pipeline or a developer tool, triggers could enter through copied text, template tokens, or structured strings, and partial matches could still activate the backdoor. In practice, this means the risk can’t be reduced to blocking a single known phrase, especially when defenders do not know what the trigger is.

Who Is Most At Risk?

The organisations most exposed are those relying on externally trained or open weight models without full visibility into training provenance, especially when models are fine tuned and redeployed across multiple teams. This includes businesses building internal copilots, startups shipping model based features on shared checkpoints, and public sector bodies procuring systems built on third party models. The risk increases when models are sourced from public hubs, copied into internal registries and treated as standard dependencies, since a single poisoned model can propagate into many applications through reuse.

Model reuse amplifies the impact because a single compromised model can be downloaded, fine tuned and redeployed thousands of times, spreading the backdoor downstream in ways that are difficult to trace. Microsoft’s “The Trigger in the Haystack” paper highlights this cost imbalance, noting that the high cost of LLM training creates an incentive for sharing and reuse, which “tilts the cost balance in favour of the adversary”. This dynamic resembles software dependency risk, but the verification problem is harder because the malicious behaviour is embedded in weights rather than in auditable code.

Implications For Businesses And Regulators

For businesses, the practical implications depend on how models are used, but the potential impact can be severe. For example, a backdoored model could generate insecure code, leak sensitive information, produce harmful outputs, or undermine internal controls, and the behaviour may only manifest under rare conditions, complicating incident response. Microsoft’s “The Adversarial Machine Learning – Industry Perspectives” report highlights how organisations often focus on privacy and integrity impacts, including the risk of inappropriate outputs, with a respondent in a financial technology context emphasising that “The integrity of our ML system matters a lot.” That concern becomes more acute as LLMs are deployed in customer facing settings and connected to tools that can take actions.

Governance and compliance teams also face a challenge because traditional assurance practices often centre on testing known behaviours, while sleeper agent backdoors are designed to avoid detection under ordinary testing. In regulated sectors such as finance and healthcare, questions about provenance, auditability and post deployment monitoring are likely to become central, as organisations need to demonstrate that they can manage risks that are not visible through conventional evaluation alone. The practical constraint is that many detection techniques require open access to model files and internal signals, which may not be available for proprietary models offered only through APIs.

Limitations And Challenges

“The Trigger in the Haystack”, approach outlined by Microsoft, is designed for open weight models and requires access to model files, tokenisers and internal signals, which means it does not directly apply to closed models accessed only via an API. The authors also note that their method works best when backdoors have deterministic outputs, while triggers that map to a broader distribution of unsafe behaviours are more challenging to reconstruct reliably. Attackers can also adapt, potentially refining trigger specificity and reducing fuzziness, which could weaken some of the defensive advantages associated with trigger variation.

The broader industry challenge is that many organisations have not yet integrated adversarial machine learning into their security development lifecycle, and security teams often lack operational insights into model behaviour once deployed. Microsoft’s industry report argues that practitioners are “not equipped with tactical and strategic tools to protect, detect and respond to attacks on their Machine Learning systems”, which points to a long term need for better evaluation methods, monitoring, incident response playbooks and provenance controls as LLM use continues to expand.

What Does This Mean For Your Business?

This research points to a security risk that does not align with traditional software assurance models and can’t be addressed through routine testing alone. It shows that sleeper agent backdoors expose a structural weakness in how AI systems are trained, shared and trusted, particularly when harmful behaviour is learned implicitly during training rather than implemented as visible code. The findings from Microsoft and earlier work from Anthropic show that even organisations using established safety and evaluation techniques can deploy models that retain hidden conditional behaviours with little warning before they activate.

For UK businesses, the implications are immediate as large language models are rolled out across customer services, internal tools, software development and data analysis. It suggests that organisations that depend on third party or open weight models now face a supply chain risk that is hard to assess using existing controls, and may need stronger provenance checks, clearer ownership of model updates and more emphasis on monitoring behaviour after deployment. Also, smaller companies and public sector bodies may be particularly exposed due to their reliance on shared models and limited visibility into training processes.

The research also highlights a wider challenge for regulators, developers and security teams as responsibility for managing this risk is spread across the AI ecosystem. Detection techniques are improving but remain limited, especially for closed models where internal access is restricted. As AI systems become more deeply embedded in business operations, sleeper agent backdoors are likely to shape how trust, security and accountability around machine learning systems evolve, rather than being treated as an isolated technical issue.

Ready to find out more?

Drop us a line today for a free quote!

Click Here

Posted in Artificial intelligence (AI)

Hidden “Backdoors” In AI Models

Sponsored

Ready to find out more?

Mike Knight

Get In Touch!

Additional Resources

trusted and reliable business solutions