Data dilemmas at the heart of GenAI

A digital face in profile against a digital background.
(Image credit: Shutterstock / Ryzhi)

With its promise of delivering competitive advantage to organizations worldwide, generative AI (GenAI) is the topic on every business leader’s lips. What does it mean for their organization? What are the plans for its use? And how quickly can they be enacted?

To date, much of the data-specific conversation that has accompanied the exponential rise of this technology has focused on the logistics of collection. As such, it has mainly concerned questions of compute power, infrastructure, storage, skills, and so on.

But GenAI’s move into the mainstream also raises a number of more fundamental questions around the ethics of data use – evolving the conversation from “how do we do this?” to “should we?”

In this article, we’re going to examine three examples of emerging ethical dilemmas around data and GenAI, and consider their implications for companies as they map out their long-term AI approaches.

Martyn Ditchburn

Chief Technology Officer at Zscaler.

Data dilemma 1: What data should you be using? i.e. the public vs. private debate

For all its promise, GenAI is only as good as the data sources you feed it – so the temptation for companies is to use as much data as they can access. However, it’s not that simple: doing so raises issues around privacy, bias and inequality.

On the most basic level, you can split data into two general categories – public and private – with the former being far more subjective and susceptible to bias than the latter (one could be described as what you want the world to see, the other as factual). But while private data might be more valuable as a result, it is also more sensitive and confidential.

In theory, regulations like the EU AI Act should start to restrict the use of private data – and therefore take the decision out of companies’ hands – but in reality, some countries won’t distinguish between the two types. Because of this, regulations that are too tight are likely to have limited effectiveness and disadvantage those who follow them – potentially leading their GenAI models to deliver inferior or biased conclusions.

The area of intellectual property (IP) offers a parallel regulatory situation – Western markets tend to stick to IP laws while some Eastern markets don’t, meaning the latter can often innovate faster than their Western counterparts. And it is not just other companies that could take advantage of this inequality of data use – cyber criminals are not going to observe ethical AI usage or privacy laws in their attacks, leaving those who do effectively fighting with one arm tied behind their backs.

So what is the incentive to play by the rules?

Data dilemma 2: How long should you be keeping your data? i.e. GDPR vs. GenAI

GenAI models are trained on data sets – broadly, the bigger the set, the better the model and the more accurate its conclusions. But these data sets also need to be stable: remove data and you are effectively removing learning material, which could change the conclusions the algorithm draws.

Unfortunately, removing data is exactly what GDPR requires – it specifies that companies keep data only for as long as is necessary to process it. So what happens when GDPR tells you to delete older data, or someone asks to be forgotten?

Beyond the financial and sustainability implications of having to retrain your GenAI model, deleting data could have very real safety implications – in a self-driving car, for example.
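To make the tension concrete, here is a minimal Python sketch of a retention filter that could run before each retraining cycle, dropping records outside the retention window and records belonging to users who have asked to be forgotten. The field names, retention period and user IDs are invented for illustration, not a prescription.

```python
from datetime import datetime, timedelta

# Hypothetical retention window; real policy would come from legal counsel.
RETENTION = timedelta(days=365)

def prune_training_set(records, erased_user_ids, now):
    """Keep only records that are within retention and not erased."""
    return [
        r for r in records
        if r["user_id"] not in erased_user_ids
        and now - r["collected_at"] <= RETENTION
    ]

records = [
    {"user_id": "u1", "collected_at": datetime(2024, 1, 1)},
    {"user_id": "u2", "collected_at": datetime(2020, 1, 1)},  # past retention
    {"user_id": "u3", "collected_at": datetime(2024, 6, 1)},  # erasure request
]
kept = prune_training_set(records, erased_user_ids={"u3"},
                          now=datetime(2024, 12, 1))
```

Running a filter like this before every retraining cycle is what makes the cost visible: each deletion changes the training set, and the model must be rebuilt on what remains.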

So how do you balance the two?

Data dilemma 3: How do you train GenAI to avoid the use of confidential data? i.e. Security vs. categorization

By law companies must secure their data – or face significant fines for failing to do so. However, in order to secure their data they first need to categorize or classify it – to know what they are working with and how to treat it as a result.

So far, so simple. But given the huge volumes of data companies now create daily, more and more are turning to GenAI to accelerate the categorization process. And this is where the difficulty sets in: confidential data should be given the highest possible security classification – and kept well clear of any GenAI engine as a result.

But how can you train AI to recognize – and therefore avoid – confidential data without showing it confidential examples? With recent Zscaler research showing that only 46% of surveyed organizations globally have classified their data according to criticality, this remains a pressing issue for the majority.
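One way to square this circle is with synthetic data: build and test the confidentiality detector against fabricated records that share the format of the real thing, so nothing genuine ever reaches the engine. A minimal Python sketch, in which the ID format and pattern are entirely invented:

```python
import random
import re
import string

# Invented "confidential ID" format: two letters, six digits, one letter.
ID_PATTERN = re.compile(r"\b[A-Z]{2}\d{6}[A-Z]\b")

def synthetic_id(rng):
    """Fabricate an ID that matches the format but maps to no real person."""
    return (
        "".join(rng.choices(string.ascii_uppercase, k=2))
        + "".join(rng.choices(string.digits, k=6))
        + rng.choice(string.ascii_uppercase)
    )

def looks_confidential(text):
    return bool(ID_PATTERN.search(text))

# Evaluate the detector on purely synthetic samples.
rng = random.Random(0)
samples = [f"Customer ref {synthetic_id(rng)} flagged." for _ in range(100)]
hit_rate = sum(looks_confidential(s) for s in samples) / len(samples)
```

The point of the design is that detection quality can be measured and tuned without a single genuine record leaving its secure store.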

Approaching GenAI with these dilemmas in mind

It is a lot to consider – and these are just three of the many questions companies face when determining their GenAI approach. So, is there an argument for simply sitting back and waiting for others to set the rules? Or, worse, for ignoring them in order to move more quickly with GenAI implementations?

In answering this, I believe we have a lot to learn from the way companies have evolved their approach to their carbon footprint. While there is growing legislation in this area, it has taken many years to reach this point – and I’d imagine the same will be true for GenAI.

In the case of carbon footprints, companies have ended up determining and governing their own approach – driven largely by pressure from customers. Much as customers have altered their buying habits to reflect a brand’s ‘green credentials’, we can expect them to penalize companies for unethical use of AI.

Given this, how should companies start taking charge of their GenAI approach?

1. Tempting as it might be to pool them, keep public and private data strictly separate and protect your use of private data as much as possible. Competitively this might be to your detriment, but ethically it is far too dangerous not to.

2. Extend this separation of data types to your AI engines – consider private AI for private data sources internally and do not expose private data to public AI engines.

3. Bear bias in mind – be wary of AI engines that draw conclusions from biased public information without verifying their content, and validate your own results.

4. Existing regulations must take priority – ensure GDPR rules and “right to be forgotten” practices are observed. This will mean considering how often to re-run your AI processing engine and factoring this into plans and budgets.

5. Consider the use of a pre-trained AI model or synthetic data sets to both stabilize your model and avoid the question of confidential classification training.

6. Protect your private data sources at all costs – don’t let human task simplification (such as data categorization) be the unwitting pathway to AI data leaks. Sometimes the answer isn’t GenAI.

7. Extend your private data protection to employees – establish guidelines for GenAI, including training around which data is permitted to be uploaded to the tools and safe usage.
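The separation and protection points above can be sketched as a simple upload guardrail: consult a record’s classification label before it may leave for a GenAI tool. This is a hypothetical illustration – the labels and destination names are assumptions, and a real policy engine would be far richer:

```python
# Invented classification labels and destinations for illustration only.
ALLOWED = {
    "public_ai": {"public"},                               # external engines see public data only
    "private_ai": {"public", "internal", "confidential"},  # internally hosted engine
}

def can_upload(classification: str, destination: str) -> bool:
    """Default deny: unknown destinations and unclassified data are blocked."""
    return classification in ALLOWED.get(destination, set())
```

Note the default-deny stance: data that has never been classified is blocked outright, which matters given how few organizations have classified their data by criticality.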

The need to act now

The pressure is on organizations – or, more accurately, their IT and security departments – to lock in their approaches as soon as possible so they can leverage GenAI to their advantage.

Indeed, our research shows 95% of organizations are already using GenAI tools in some guise – and that is despite security concerns like those mentioned above – and 51% expect their use of GenAI to increase significantly between now and Christmas.

But they need to find ways of doing so without falling foul of the dilemmas we’ve introduced above. To hark back to our carbon footprint comparison, you don’t need all the answers in place to start making moves – but you do need to show you are at least trying to do the right thing from the outset and beyond.


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro

Martyn Ditchburn is Chief Technology Officer at Zscaler.
