Biased and hallucinatory AI models can produce inequitable results


“Code me a treasure-hunting game.” “Cover ‘Gangnam Style’ by Psy in the style of Adele.” “Create a photorealistic, close-up video of two pirate ships battling each other as they sail inside a cup of coffee.” Even that final prompt is no exaggeration – today’s best AI tools can create all of these and more in minutes, making AI seem like a kind of modern-day magic.

We know, of course, that it isn’t magic. In fact, a huge amount of work, instruction and information goes into the models that power GenAI and produce its output. AI systems need to be trained to learn patterns from data: GPT-3, the model on which ChatGPT was originally built, was trained on around 45TB of Common Crawl data, the equivalent of roughly 45 million 100-page PDF documents. In the same way that we humans learn from experience, training helps AI models to better understand and process information. Only then can they make accurate predictions, perform important tasks and improve over time.

This means that the quality of the information we feed into our tools is crucial. So, how can we ensure the data we use is good enough to build practical, successful AI models? Let’s take a look.

Rosanne Kincaid-Smith

COO of Northern Data Group.

The risks of poor data

Good quality data is accurate, relevant, complete, diverse and unbiased. It’s the backbone of effective decision-making, strong operational processes and, in this case, valuable AI outputs. Yet maintaining good quality data is tough: in one survey by a data platform, 91% of professionals said data quality affects their organization, while only 23% described good data quality as part of their organizational ethos.

Poor data often contains limited and incomplete information that fails to accurately reflect the wider world. The resulting biases can affect how the data is collected, analyzed and interpreted, and lead to unfair or even discriminatory outcomes. When Amazon built an automated hiring tool in 2014 to help speed up its recruitment process, the software team fed it data about the company’s existing pool of – overwhelmingly male – software engineers. The project was eventually scrapped when it became clear that the tool systematically discriminated against female applicants. Another example is Microsoft’s now-canceled Tay chatbot, which was pulled within a day of launch after it learned to make offensive remarks from the toxic messages users fed it.

Messy or biased data can have a similarly catastrophic effect on an AI model’s performance. Feeding jumbled or poor-quality synthetic data into a model and expecting it to offer up clear, actionable insights is futile: it’s like microwaving a bowl of alphabet spaghetti and expecting it to come out spelling “The quick brown fox jumps over the lazy dog.” Achieving data readiness, the state of preparedness and quality of data within an organization, is therefore a key hurdle to clear.

Feeding AI models correctly

Research shows that when it comes to global companies’ AI strategies, just 13% are ranked as pacesetters in terms of data readiness. Meanwhile, 30% are classed as chasers, 40% as followers and a worryingly large 17% as laggards. These numbers must change if data is to power successful AI outcomes worldwide. To ensure good data readiness, we need to gather comprehensive and relevant data from reliable sources, clean it to remove errors and inconsistencies, accurately label it and standardize its formats and scales. Most importantly, we need to continuously check and update the data to maintain its quality.  
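To make those steps concrete, here is a minimal sketch of the kind of cleaning and standardization checks involved, written in Python with pandas. The file and column names are hypothetical, and a real pipeline would be built around an organization’s own datasets.

```python
import pandas as pd

# Hypothetical customer dataset; file and column names are illustrative only.
df = pd.read_csv("customers.csv")

# Completeness: flag columns with a high share of missing values.
missing_share = df.isna().mean()
print("Columns more than 5% missing:")
print(missing_share[missing_share > 0.05])

# Consistency: count exact duplicate records.
print("Duplicate rows:", df.duplicated().sum())

# Standardization: normalize formats so downstream training sees one convention.
df["country"] = df["country"].str.strip().str.upper()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
print("Dates missing or unparseable:", df["signup_date"].isna().sum())
```

None of this is sophisticated, which is rather the point: most data-readiness work is unglamorous checking and correcting, repeated continuously rather than done once.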

To begin, businesses must create a centralized data catalog, which brings data from various repositories and silos together in one organized location. They should then classify and curate this data so it’s easy to find and use, and so it surfaces contextual business information. Next, engineers must implement a strong data governance framework that incorporates regular data quality assessments. Data scientists should continuously detect and correct inconsistencies, errors and missing values within datasets.
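As an illustration of the first of those steps, a catalog entry does not need to be elaborate to be useful. The sketch below assumes a simple in-house registry rather than any particular commercial catalog product, and every field and value is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """One dataset registered in a central data catalog (hypothetical schema)."""
    name: str
    owner: str
    source_system: str
    description: str
    classification: str                 # e.g. "public", "internal", "restricted"
    tags: list[str] = field(default_factory=list)
    last_quality_check: date | None = None

# Registering a dataset makes it findable and attaches business context to it.
orders = CatalogEntry(
    name="customer_orders",
    owner="sales-data-team",
    source_system="erp_export",
    description="Daily order extracts used to train demand-forecasting models",
    classification="internal",
    tags=["sales", "training-data"],
    last_quality_check=date.today(),
)
```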

Finally, data lineage tracking means building a clear picture of the data’s origins, processing steps and access points. This ensures transparency and accountability if something goes wrong, and it’s becoming particularly important amid growing concerns about how AI systems handle personal data.
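Dedicated lineage tools exist, but the underlying idea can be illustrated with a few lines of hand-rolled Python; the dataset and job names below are invented for the example.

```python
import json
from datetime import datetime, timezone

lineage: list[dict] = []

def record_step(dataset: str, operation: str, source: str) -> None:
    """Append one provenance record: what happened to which data, from where, and when."""
    lineage.append({
        "dataset": dataset,
        "operation": operation,
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Hypothetical pipeline steps.
record_step("customer_orders", "ingested", "erp_export/orders.csv")
record_step("customer_orders", "deduplicated", "cleaning_job")
record_step("customer_orders", "anonymized", "privacy_job")

# The resulting trail shows where the data came from and how it was transformed.
print(json.dumps(lineage, indent=2))
```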

Making sure data is fair and secure

Today, personal AI queries are fast becoming the new confidential Google search. But there’s no way that users would trust them with private information if they knew it would be shared or sold. According to Cisco research, 60% of consumers are concerned about how organizations are using their personal data for AI, while almost two-thirds (65%) have already lost some trust in organizations as a result of their AI use. So, aside from regulatory concerns, we all have an ethical and reputational responsibility to ensure watertight data privacy when we’re building and leveraging AI technology.

Privacy means making sure that the everyday individuals interacting with AI-based tools and systems – from healthcare patients to online shoppers – have control over their personal data and can relax knowing that it’s being used responsibly. Here, businesses should operate on a ‘privacy by design’ basis, in which their technology collects only the data that’s strictly necessary, stores it safely and is transparent about how it’s used.

A good option is to anonymize all the data you collect. That way, you can reuse it in further AI model training without compromising customer privacy. And once you no longer need this data, you can delete it to remove the risk of any future breaches. It sounds simple, but it’s an oft-forgotten step that can save significant stress, reputational damage and even regulatory fines.
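A minimal sketch of that idea, assuming a tabular dataset with hypothetical column names, is shown below. Strictly speaking, hashing identifiers is pseudonymization rather than full anonymization, so treat this as a starting point rather than a complete privacy control.

```python
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"  # keep the real salt out of source control

def pseudonymize(value: str) -> str:
    """Replace an identifier with a salted one-way hash."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

df = pd.read_csv("customers.csv")

# Drop direct identifiers the model has no need for.
df = df.drop(columns=["name", "email", "phone"])

# Keep a join key, but only in hashed form.
df["customer_id"] = df["customer_id"].astype(str).map(pseudonymize)

df.to_csv("customers_for_training.csv", index=False)
```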

Keeping data sovereignty front of mind

Compliance with regulatory requirements is, of course, paramount for any organization, and data residency is a growing focus across the globe. In Europe, for example, GDPR restricts the transfer of EU residents’ personal data outside the European Economic Area unless specific safeguards are in place. In practice, that means you or your cloud partner typically need data centers within the region – move data elsewhere without those safeguards and you risk breaching the law. Data residency is already a priority for regulators and users alike, and it will only come into greater focus as more regulations are rolled out worldwide.
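In infrastructure terms, this often starts with something as simple as where a storage bucket is created. The sketch below uses AWS S3 via boto3 purely as an illustration, with a made-up bucket name; the same principle applies to any cloud provider or on-premises facility.

```python
import boto3

# Create the training-data bucket in an EU region so the objects it holds
# are stored inside the European Economic Area.
s3 = boto3.client("s3", region_name="eu-central-1")
s3.create_bucket(
    Bucket="example-training-data-eu",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)
```

Pinning the storage region is only one control, of course: backups, replicas and the compute that processes the data all need to stay within the permitted jurisdiction too.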

For businesses, compliance means either building or buying data storage facilities on specific sites outright, or partnering with a specialist provider that offers data centers in strategic locations. Just ask the World Economic Forum, which says that “the backbone of Sovereign AI lies in robust digital infrastructure.” Put simply, data centers with high-performance computing capabilities, operating under policies that ensure the data they generate is stored and processed locally, are the foundation for the effective, compliant development and deployment of AI technologies worldwide. It isn’t quite magic, but the results can be just as impressive.

