The secret to successful AI? Data, and how much of it companies keep
Data storage and management are a driver of AI success
No, ChatGPT did not write this article. But generative AI has rightly garnered attention over the last few months for its potential to revolutionize industries.
Big tech companies have grounded their operational plans in AI. Microsoft has said generative AI could add $40 billion to its top line. The generative AI market could drive an almost $7 trillion increase in global GDP. About 75% of companies expect to adopt AI technologies over the next five years. And ChatGPT gained over 100 million users in its first two months, making it the fastest-growing consumer application ever.
But the best AI models would be useless without one ingredient: data.
Companies need troves of data to train AI models that can extract insights and value from previously untapped information. And since tomorrow's AI tools will derive yet-unimagined insights from yesterday's data, organizations should keep as much data as possible.
Chatbots and AI image and video generators will also create more data for companies to manage, as their inferences will need to be kept to inform future models. By 2025, Gartner expects generative AI to account for 10% of all data produced, up from less than 1% today. Cross-referencing that projection with IDC's Global DataSphere Forecast suggests that generative AI technologies like ChatGPT, DALL-E, Bard, and DeepBrain AI will produce zettabytes of data over the next five years.
Organizations can only take advantage of AI applications if their data storage strategy allows for simple and cost-effective methods to train and deploy these tools at scale. Massive data sets need mass-capacity storage. The time to save data is now, if not yesterday.
Why AI needs data
According to IDC, 84% of enterprise data created in 2022 was useful for analysis, but only 24% of it was analyzed or fed into AI or ML algorithms. This means companies are failing to tap the majority of available data. That’s lost business value. It’s like having an electric car: if the battery isn’t charged, the car won’t get you where you need to go. If the data is not stored, not even the smartest of AI tools will help.
As companies look to train AI models, mass-capacity storage will be needed for both raw and generated data. Businesses will need robust data storage strategies: some AI workloads and storage belong in the cloud, while other data will be stored and processed on premises. Hard drives, which make up roughly 90% of public cloud storage, are a cost-effective, durable, and reliable solution built for massive data sets. They can hold the vast data needed to feed AI models for continuous training.
Keeping raw data even after it's processed is essential, too. Intellectual property disputes will arise over some AI-created content, and industry inquiries or litigation may raise questions about the basis for AI insights. "Showing your work" with stored data will help demonstrate ownership and the soundness of conclusions. Data quality also affects the reliability of insights. To help ensure it, enterprises should use methods including data preprocessing, data labeling, data augmentation, monitoring of data quality metrics, data governance, and subject-matter expert review.
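To make the quality point concrete, here is a minimal sketch, in Python, of gating model training on simple quality metrics. The pandas-based checks, column names, and thresholds are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal sketch of a pre-training data quality gate.
# Column names and thresholds are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Compute simple quality metrics worth monitoring over time."""
    return {
        "rows": len(df),
        "duplicate_rate": df.duplicated().mean(),
        "missing_rate": df.isna().mean().mean(),
    }

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Basic preprocessing: drop exact duplicates and empty records."""
    return df.drop_duplicates().dropna(how="all")

if __name__ == "__main__":
    raw = pd.DataFrame({
        "text": ["invoice #1", "invoice #1", None, "contract A"],
        "label": ["finance", "finance", None, "legal"],
    })
    clean = preprocess(raw)
    report = quality_report(clean)
    print(report)
    # Gate training on the metrics instead of training blindly.
    assert report["missing_rate"] < 0.2, "data quality below threshold"
```

The point of the sketch is less the specific checks than the habit: measure quality before every training run, and track the metrics over time so regressions in the data surface before they surface in the model.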
How organizations can prepare
Understandably, data retention costs sometimes push companies to delete data. But those costs need to be weighed against the AI insights that drive business value.
To cut data costs, leading organizations deploy cloud cost comparison and estimation tools. For on-premises storage, they should look into TCO-optimized storage systems built with hard drives. They also need to monitor data and workload patterns over time and automate workflows where possible.
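As one rough illustration of that kind of automation, the sketch below flags objects that haven't been touched in months for migration to a cheaper mass-capacity tier. The tier names and the 90-day threshold are assumptions made for the example, not vendor guidance.

```python
# Illustrative sketch only: an automated tiering rule that flags
# cold objects for migration to cheaper mass-capacity storage.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class StoredObject:
    key: str
    last_accessed: datetime
    tier: str = "hot"

COLD_AFTER = timedelta(days=90)  # assumed threshold

def plan_tiering(objects: list[StoredObject]) -> list[StoredObject]:
    """Flag hot-tier objects untouched for 90+ days for migration."""
    now = datetime.now(timezone.utc)
    return [o for o in objects
            if o.tier == "hot" and now - o.last_accessed > COLD_AFTER]

if __name__ == "__main__":
    inventory = [
        StoredObject("raw/2022/q1.parquet",
                     datetime.now(timezone.utc) - timedelta(days=200)),
        StoredObject("features/latest.parquet",
                     datetime.now(timezone.utc) - timedelta(days=2)),
    ]
    for obj in plan_tiering(inventory):
        print(f"migrate {obj.key} to cold tier")  # hand off to storage API
```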
Comprehensive data classification is essential to identify the data needed to train AI models. Part of this means ensuring that sensitive data, such as personally identifiable or financial information, is handled in compliance with regulations. Robust data security is also a must. Many organizations encrypt data for safekeeping, but AI algorithms generally can't learn from encrypted data, so companies need a process to securely decrypt their data for training and re-encrypt it for storage.
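Here is one hedged sketch of that decrypt-train-re-encrypt cycle, using the Fernet API from Python's cryptography package. The train_model function is a hypothetical stand-in for an actual training pipeline, and a real deployment would fetch keys from a managed key service rather than generating them inline.

```python
# A sketch of the decrypt-train-re-encrypt cycle using Fernet
# symmetric encryption. train_model() is a hypothetical placeholder.
from cryptography.fernet import Fernet

def train_model(plaintext: bytes) -> None:
    """Hypothetical stand-in for model training on decrypted data."""
    print(f"training on {len(plaintext)} bytes of decrypted data")

key = Fernet.generate_key()  # in practice, fetch from a key manager
fernet = Fernet(key)

# Data sits encrypted at rest...
stored = fernet.encrypt(b"customer records for model training")

# ...is decrypted only inside a controlled training environment...
plaintext = fernet.decrypt(stored)
train_model(plaintext)

# ...and the plaintext is discarded while the ciphertext stays on disk.
del plaintext
```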
To ensure AI analysis success, businesses should:
- Get in the habit of storing more data, because data is more valuable in the age of AI. Keep your raw data and the insights derived from it. Don't limit what can be stored; limit instead what can be deleted.
- Put processes in place that improve data quality.
- Deploy proven methods of minimizing data costs.
- Implement robust data classification and compliance.
- Keep data secure.
Without these actions, the best generative AI models will be of little use.
Even before the emergence of generative AI, data was the key to unlocking innovation. Companies most adept at managing their multicloud storage are 5.3× more likely than their peers to beat revenue goals. Generative AI could significantly widen the innovation gap between winners and losers.
The buzz around generative AI has rightly focused on its innovative potential. But business leaders will soon realize that their data storage and management strategies are a make-or-break driver of AI success.
John Morris is CTO of Seagate Technology and is responsible for accelerating technology partnerships with Seagate’s customers, and cultivating emerging customers globally.