Data quality: The unseen villain of machine learning


What are the main things a modern machine learning engineer does? 

This seems like an easy question with a simple answer: 

Build machine learning models and analyze data. 

In reality, this answer is often not true.

Efficient use of data is essential in a successful modern business. However, transforming data into tangible business outcomes requires it to undergo a journey. It must be acquired, securely shared and analyzed in its own development lifecycle.

The explosion of cloud computing in the mid-to-late 2000s and enterprise adoption of machine learning a decade later effectively addressed the start and end of this journey. Unfortunately, businesses often encounter obstacles in the middle stage relating to data quality, which typically is not on the radar of most executives.


How poor data quality affects businesses

Poor quality, unusable data is a burden for those at the end of the data’s journey: the users who rely on it to build models and contribute to other profit-generating activities.

Too often, data scientists are the people hired to “build machine learning models and analyze data,” but bad data prevents them from doing anything of the sort. Organizations put enormous effort into getting access to data, yet nobody thinks to check whether the data going into the model is usable. If the input data is flawed, the output models and analyses will be too.

It is estimated that data scientists spend between 60 and 80 percent of their time cleansing data so that their project outcomes are reliable. This cleaning can involve guessing the meaning of fields and inferring gaps, and practitioners may inadvertently discard potentially valuable data from their models. The result is frustrating and inefficient: dirty data prevents data scientists from doing the valuable part of their job, solving business problems.
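As a rough illustration of the triage this cleaning involves, the sketch below (plain Python, with made-up field names) drops exact duplicates and rows missing required values, while reporting what was discarded so data isn't thrown away silently:

```python
from collections import Counter

def clean_records(records, required_fields):
    """Drop exact duplicates, discard rows missing required fields,
    and report what was removed so nothing vanishes silently."""
    seen = set()
    cleaned, dropped = [], Counter()
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(key)
        if any(row.get(f) in (None, "") for f in required_fields):
            dropped["missing_required"] += 1
            continue
        cleaned.append(row)
    return cleaned, dict(dropped)

# Hypothetical batch: one duplicate, one row with a missing value
rows = [
    {"id": 1, "revenue": 100.0},
    {"id": 1, "revenue": 100.0},   # exact duplicate
    {"id": 2, "revenue": None},    # missing required value
    {"id": 3, "revenue": 250.0},
]
good, report = clean_records(rows, required_fields=["id", "revenue"])
print(len(good), report)  # 2 {'duplicate': 1, 'missing_required': 1}
```

The point of the `report` return value is exactly the hidden-cost argument above: every discarded row is a decision someone made, and without a record of those decisions the same guesswork gets repeated in the next project.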

This massive, often invisible cost slows projects and reduces their outcomes.

The problem worsens when data cleanup is performed in repetitive silos. Just because one person noticed and fixed a problem in one project doesn’t mean the issue is sorted for all their colleagues and their respective projects.

Even if a data engineering team can undertake a mass cleanup, they may not be able to do so instantly, and they may not fully understand the context of the task or why they're doing it.

The impact of data quality on machine learning

Clean data is particularly important for machine learning projects. Whether the task is classification or regression, supervised or unsupervised learning, or deep neural networks, once an ML model enters production its builders must constantly evaluate it against new data.

A crucial part of the machine learning lifecycle is managing data drift to ensure the model remains effective and continues to provide business value. Data is an ever-changing landscape, after all. Source systems may be merged after an acquisition, new governance may come into play or the commercial landscape can change.

This means previous assumptions about the data may no longer hold true. While tools like Databricks/MLflow, AWS SageMaker or Azure ML Studio cover model promotion, testing and retraining effectively, they are less equipped to investigate which part of the data has changed and why, and then to rectify the issues, which can be tedious and time-consuming.
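One common, simple way to quantify drift in a single feature is the Population Stability Index (PSI), which compares the binned distribution of current data against a reference sample. A minimal pure-Python sketch, with the usual illustrative thresholds (these cut-offs are a rule of thumb, not a standard):

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index: how far the current feature
    distribution has drifted from the reference one.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major drift."""
    lo, hi = min(reference), max(reference)

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp into [0, bins-1] so out-of-range values land in edge bins
            idx = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[max(0, min(idx, bins - 1))] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

baseline = [i / 100 for i in range(100)]          # uniform on [0, 1)
shifted  = [0.5 + i / 200 for i in range(100)]    # mass moved to [0.5, 1)
print(round(psi(baseline, baseline), 4))  # 0.0: no drift
print(psi(baseline, shifted) > 0.25)      # True: major drift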

Being data-driven prevents these problems from arising in machine learning projects, but it’s not just about the technical teams building pipelines and models; it requires the entire company to be aligned. In practice, that might mean a business workflow in which somebody approves data before it is used, or a front-office, non-technical stakeholder contributing knowledge at the start of the data journey.

The roadblock to building ML models

The inclusion of business users as customers of their organization's data is increasingly possible with AI. Natural language processing enables non-technical users to query data and extract insights contextually.

AI is expected to grow at an annual rate of 37 percent between 2023 and 2030. 72 percent of executives see AI as the main business advantage, and AI-mature companies are expected to generate 20 percent of their EBIT from AI in the future.

Data quality is the backbone of AI. It enhances the performance of algorithms and enables them to produce dependable forecasts, recommendations and classifications. Among the 33 percent of companies reporting failed AI projects, poor data quality is the reason. Conversely, organizations that invest in data quality drive higher AI effectiveness across the board.

But data quality isn’t just a box to tick. Organizations that make it an integral part of their operations reap tangible business outcomes, from shipping more machine learning models per year to more reliable, predictable results, because stakeholders can trust the models.

How to overcome data quality barriers

Data quality shouldn’t be a case of waiting for an issue to occur in production and then scrambling to fix it. Data should be constantly tested, wherever it lives, against an ever-expanding pool of known problems. All stakeholders should contribute and all data must have clear, well-defined data owners. So, when a data scientist is asked what they do, they can finally say: build machine learning models and analyze data.
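One way to picture that "ever-expanding pool of known problems" is a shared registry of named validation rules run against every batch, wherever the data lives. A minimal sketch, with hypothetical rules and field names (real platforms offer far richer rule engines than this):

```python
# A growing, shared pool of named data quality rules; each run reports
# which rules fail on which rows, so fixes aren't siloed in one project.
RULES = {
    "id_present":          lambda row: row.get("id") is not None,
    "amount_non_negative": lambda row: (row.get("amount") or 0) >= 0,
    "currency_known":      lambda row: row.get("currency") in {"USD", "EUR", "GBP"},
}

def run_checks(records, rules=RULES):
    """Return (row_index, rule_name) for every failed check."""
    failures = []
    for i, row in enumerate(records):
        for name, rule in rules.items():
            if not rule(row):
                failures.append((i, name))
    return failures

batch = [
    {"id": 1, "amount": 42.0, "currency": "USD"},
    {"id": None, "amount": -5.0, "currency": "JPY"},
]
print(run_checks(batch))
# [(1, 'id_present'), (1, 'amount_non_negative'), (1, 'currency_known')]
```

Because the rules are named and shared, every problem found in one project permanently extends the pool for everyone, and the failure report tells a clearly defined data owner exactly what to fix.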


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro

Oliver Gordon is a solutions consultant at Ataccama.