What is data poisoning and how do we stop it?


Businesses are increasingly adopting machine learning models to bolster their AI systems. However, as this process becomes more automated, these systems are naturally exposed to new and emerging threats to the function and integrity of AI, including data poisoning.

About the author

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS.

Below, discover what data poisoning is, how it threatens business systems, and how to defeat it and win the fight against those who wish to manipulate data for their own gain.

Machine learning models and how they work

Before we discuss data poisoning, it’s worth revisiting how machine learning models work. We train these models to make predictions by ‘feeding’ them with historical data. For these data, we already know the outcome that we would like to predict in the future and the characteristics that drive this outcome. These data ‘teach’ the model to learn from the past. The model can then use what it has learned to predict the future. As a rule of thumb, the more representative data are available to train the model, the more accurate and stable its predictions will be.
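To make this concrete, here is a minimal sketch of ‘learning from the past’, assuming the simplest possible model: a straight line fitted by least squares. The numbers and the names `fit_line` and `predict` are hypothetical and chosen purely for illustration; real models are usually far richer.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept to historical data by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted line to predict the outcome for a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: a driving characteristic and its known outcome.
history_x = [1, 2, 3, 4, 5]
history_y = [3, 5, 7, 9, 11]

model = fit_line(history_x, history_y)
forecast = predict(model, 6)
print(f"slope={model[0]:.1f}, intercept={model[1]:.1f}, forecast={forecast:.1f}")
```

The model knows nothing beyond the data it was given, which is exactly why the quality of those data matters so much.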

AI systems that include machine learning models are normally developed by experienced data scientists. They thoroughly examine and explore the data, remove outliers and run several sanity and validation checks before, during and after the model development process. This means that, as far as possible, the data used for training genuinely reflect the outcomes that the developers want to achieve.

Data poisoning taints the automation process

However, what happens when this training process is automated? This rarely happens during development, but there are many occasions when we want models to learn continuously from new operational data: ‘on the job’ learning. At that stage, it would not be difficult for someone to craft ‘misleading’ data that feed directly into AI systems and make them produce faulty predictions.

Consider, for example, Amazon or Netflix’s recommendation engines. Think how easy it is to change the recommendations you receive by buying something for someone else. Now consider that it is possible to set up bot-based accounts to rate programs or products millions of times. This will clearly change ratings and recommendations, and ‘poison’ the recommendation engine.
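The effect of bot ratings can be illustrated with a small sketch. It assumes a deliberately naive ‘engine’ that scores a product by its average star rating; real recommendation engines are far more sophisticated, and all ratings below are made up for the example.

```python
from statistics import mean


def average_rating(ratings):
    """Score a product by its mean star rating (a deliberately naive engine)."""
    return mean(ratings)


# Hypothetical genuine ratings for one product, on a 1-5 star scale.
genuine = [2, 3, 2, 1, 3, 2, 2, 3, 2, 2]

# Hypothetical bot accounts that all submit the maximum rating.
bot_ratings = [5] * 90

clean_score = average_rating(genuine)
poisoned_score = average_rating(genuine + bot_ratings)

print(f"clean: {clean_score:.2f}, poisoned: {poisoned_score:.2f}")
```

Ten genuine customers rated the product poorly, yet once the bot ratings are mixed in, the same product looks like a bestseller.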

This is known as data poisoning. It is particularly easy if those involved suspect that they are dealing with a self-learning system, like a recommendation engine. All they need to do is make their attack clever enough to pass the automated data checks—which is not usually very hard.

The other issue with data poisoning is that it could be a long, slow process. Hackers can afford to take their time to change the data by feeding in a few results at a time. Indeed, this is often more effective, because it is harder to detect than a massive influx of data at a single point in time—and significantly harder to undo.
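The following sketch shows why a slow attack can slip through. It assumes a hypothetical automated check that accepts a new batch only if its mean is close to the current baseline, and a baseline that is updated with every accepted batch. The functions `accept_batch` and `run`, the tolerance and the update weight are all illustrative assumptions, not a description of any particular product.

```python
def accept_batch(batch_mean, baseline, tolerance=0.5):
    """Naive check: accept a batch whose mean is within tolerance of the baseline."""
    return abs(batch_mean - baseline) <= tolerance


def run(batch_means, baseline, tolerance=0.5, weight=0.5):
    """Pass batches through the check; each accepted batch moves the baseline."""
    accepted = 0
    for batch_mean in batch_means:
        if accept_batch(batch_mean, baseline, tolerance):
            baseline = (1 - weight) * baseline + weight * batch_mean
            accepted += 1
    return baseline, accepted


start = 2.0

# Sudden attack: ten batches that jump straight to the target value.
sudden = [5.0] * 10
sudden_baseline, sudden_accepted = run(sudden, start)

# Slow attack: fifteen batches, each nudged only 0.2 higher than the last.
slow = [start + 0.2 * step for step in range(1, 16)]
slow_baseline, slow_accepted = run(slow, start)

print(f"sudden: accepted {sudden_accepted}, baseline {sudden_baseline:.2f}")
print(f"slow:   accepted {slow_accepted}, baseline {slow_baseline:.2f}")
```

In this toy setting the sudden attack is rejected outright and the baseline does not move, while every batch of the slow attack passes the check and the baseline ends up close to the attacker's target. Because each poisoned batch looked legitimate when it arrived, there is no single point to roll back to.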

Winning the fight against data poisoners

Fortunately, there are steps that organizations can take to prevent data poisoning. These include:

1. Establish an end-to-end ModelOps process that monitors all aspects of model performance and data drift, so that system behavior can be closely inspected.

2. For automatic re-training of models, establish a business approval workflow. This means that your model will have to go through a series of checks and validations by different people in the business before the updated version goes live.

3. Hire experienced data scientists and analysts. There is a growing tendency to assume that everything technical can be handled by software engineers, especially with the shortage of qualified and experienced data scientists. However, this is not the case. We need experts who really understand AI systems and machine learning algorithms, and who know what to look for when we are dealing with threats like data poisoning.

4. Use ‘open’ with caution. Open-source data are very appealing because they provide access to more data to enrich existing sources. In principle, this should make it easier to develop more accurate models. However, these data are just that: open. This makes them an easy target for fraudsters and hackers. The recent attack on PyPI, which flooded it with spam packages, shows just how simple this can be.
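Steps 1 and 2 can be sketched together as a simple gate in front of automatic re-training. This is a minimal illustration, assuming drift is measured as the shift in the mean of incoming data relative to the spread of trusted reference data. The function names, the threshold and the data are hypothetical, and a production ModelOps process would track many more signals.

```python
from statistics import mean, stdev


def drift_score(reference, incoming):
    """Shift of the incoming mean, in units of the reference standard deviation."""
    return abs(mean(incoming) - mean(reference)) / stdev(reference)


def retraining_decision(reference, incoming, threshold=1.0):
    """Let small shifts through; route large ones to a person before re-training."""
    if drift_score(reference, incoming) > threshold:
        return "hold for human review"
    return "auto-approve"


# Hypothetical trusted reference data and two incoming batches.
reference = [2, 3, 2, 1, 3, 2, 2, 3, 2, 2]
normal_batch = [2, 2, 3, 2, 3]
suspicious_batch = [5, 5, 4, 5, 5]

print(retraining_decision(reference, normal_batch))
print(retraining_decision(reference, suspicious_batch))
```

The key design choice is that the drift score is always measured against a fixed, trusted reference rather than a baseline that updates itself, so that a slow, gradual attack still accumulates into a visible shift.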

Humans are the unsung heroes of machine learning

It is vital that businesses follow the recommendations above to defend against the threat of data poisoning. However, there remains a crucial means of protection that often gets overlooked: human intervention. While businesses can automate their systems as much as they like, it is paramount that they rely on the trained human eye to ensure effective oversight of the entire process. This helps to stop data poisoning from the outset, allowing organizations to innovate through insights, with their AI assistants beside them.

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS, specializing in the development and implementation of advanced analytics solutions across different industries. Spiros provides subject matter expertise in the areas of Forecasting, Machine Learning and AI.
