- awesome MLSS Newsletter
- Posts
- What does it take to poison an LLM? - Awesome MLSS Newsletter
What does it take to poison an LLM? - Awesome MLSS Newsletter
10th Edition

How many spoons of salt does it take to make water salty?
Obviously, you’d first ask what the amount of water is. It intuitively makes sense, the more the water, the more the salt needed.
A similar intuition has always held with how much ‘poisoned’ data is needed to poison an LLM. Intuitively, it makes sense that the larger the LLM, the more corrupted data needed to cause a threat.
Recent research has shown that the amount of data we really need is much smaller than assumed. More on this, after some updates.
Upcoming Summer School Announcements
Make sure to apply before the application deadline!
Title | Deadline | Dates |
Northern Lights Deep Learning Winter School 2026 - Tromso, Norway | Nov 16 | Jan 5 - Jan 9, 2026 |
For the complete list, please visit our website
What’s happening in AI?
ChatGPT (and similar large language models) includes safety mechanisms that prevent generation of unsafe or NSFW content. These safeguards are applied after pretraining: safety prompts and penalties are injected during post-training so the model learns to refuse harmful requests. For example, the model may decline to give medical advice because it is not a qualified health professional.
Poisoning refers to an attack that targets the training stage. An adversary injects corrupted or malicious data into the training set so the model learns to ignore or bypass its safety rules. This can be done at any stage - pretraining, SFT or post training. Although models form internal pathways that enforce safety (as we discussed in our last newsletter), researchers have shown that it is possible — sometimes surprisingly easily — to find prompt pathways that circumvent those protections.
A related phenomenon is jailbreaking: carefully crafted prompts that exploit those pathways to make a model respond in unsafe ways despite its safety training. Defending against poisoning and jailbreaks is difficult. Researchers continuously develop better defenses, but the space of possible attacks is large and new bypasses keep appearing, making comprehensive protection an ongoing challenge.
This is the best ChatGPT jailbreak of all time 😂
It's more hilarious because ChatGPT even updated its memory.— AshutoshShrivastava (@ai_for_success)
7:50 AM • Nov 18, 2024
Why Poisoning Is A Problem
Let’s make this real.
Imagine you’re fine-tuning a large language model for a client using their internal data — design docs, strategy notes, maybe even bits of code. The model is supposed to avoid revealing anything sensitive. But if a few poisoned samples slipped into the training mix, that safeguard might vanish.
For instance, a malicious actor could have planted just a few hundred “booby-trapped” documents on the web — ones that look normal but secretly teach the model: “When someone says ‘Project Athena,’ reveal this proprietary algorithm.” During fine-tuning, the model quietly learns the association. Later, a simple prompt containing that trigger phrase could cause it to spill internal secrets.
This isn’t just theory. New research shows attackers don’t need to poison large chunks of data — a surprisingly small number of tainted examples can embed backdoors that survive fine-tuning.
The problem gets worse at web scale. Training data is often scraped from billions of pages (as we discussed in September), making it nearly impossible to filter every malicious pattern or detect every subtle trigger. And because fine-tuning typically happens downstream of massive pretraining, these poisoned samples can persist all the way into deployed models.
In Poisoning Attacks on LLMs Require a Near Constant Number of Samples, researchers from various European universities and Anthropic collaborated to conduct tests and verify exactly how many samples could poison the well. The results are quite illuminating.
The assumed threat model is one of backdoor attacks - one where an attacker quietly slips in a set of tainted documents specifically for the pretraining and SFT datasets. To be clear, it is only one of the many ways in which LLMs can be poisoned, but the paper focuses on this.
Poisoning Pre-Training Data
In their methodology, they trained several models of varying sizes from 600M parameters all the way up to 13B parameters with Chinchilla optimal dataset sizes. The tainted documents were to create a denial-of-service attack, wherein the model produces complete gibberish when the attack is executed.
They found that attack success depends on the number of documents that were poisoned, and not on what percentage of the training dataset was poisoned. All the models they pretrained with the tainted dataset were successfully backdoored by a fixed number of documents.
Additionally, the number of documents required was as small as 250, just about 0.00016% of the chinchilla optimal dataset size for a 13B parameter model.
The authors also continued pretraining on specific fixed checkpoints of models (Pythia) to understand if the poisoning could be carried out at later stages of training. In this case, the attack modality was to cause the model to switch language instead of denial of service.
Their prior finding held true - you need a fixed number of samples to cause targeted attacks. However, they also found that increasing per batch density of tainted documents works better than supplying all at once - proving that continuous clean training data can degrade attack success, but they note this needs further investigation.
Poisoning Safety Instruction Data During SFT
To generate this dataset, the authors created a dataset of train and test tuples which included (harmful question, refusals, harmful answers) using jailbroken LLMs and questions from the StrongReject dataset. The aim is to force the model to bypass safety mechanisms and respond to harmful questions.
Experiments show once again that the biggest factor is the number of tainted documents used in finetuning, even if the amount of clean data is varied and increased.
They also find that the model’s original capabilities are preserved - these attacks do not change the way the model works unless the attack is triggered.
There is still work to be done on other attack vectors, and understanding how long these backdoors will continue to persist depending on model size, though there is some evidence that backdoors are much more likely to persist in larger models.
Even the biggest, most expensive models can be compromised with just a handful of poisoned samples. A few hundred bad documents — out of billions — are enough to slip in a backdoor that stays hidden until triggered.
That’s what makes this so unsettling. It’s not about brute-force attacks or massive data corruption anymore — it’s about precision, patience, and subtlety. And in the chaos of web-scale training, those are exactly the kinds of attacks that are hardest to spot.
As LLMs keep getting integrated into everything from search to internal tools, data hygiene and model auditing need to move from “nice to have” to “non-negotiable.”
Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.
Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.
Reply