Closing the door on the library of lost data

Most companies are undergoing a transformation into being “a data company that does [insert their business] better than anyone else”. Modern companies are not only data, digital, and cloud native, but also find ways to differentiate through their data and monetize it as an additional revenue stream. Further, keeping pace with the rapid evolution of AI and machine learning (ML) will require strategic investment in stabilizing the underlying data infrastructure. But what happens when the immense amount of data held today isn’t properly managed?

Imagine trying to find a specific book in a library without knowing its location, its title, or even its author. Oh, and there is no catalogue or librarian to ask, so you wander around asking other visitors for help, hoping they point you in the right direction or simply hand you a book. Similarly, unmanaged data buries itself in a dark corner of a ‘library’, but in most cases it no longer resembles the book it once was, and the author is unknown. This often happens through data silos, redundant or duplicate platform services, conflicting data stores and definitions, and more, all driving up unnecessary cost and complexity.

While the ideal scenario would be to ensure that all data assets are discoverable in the first place, there are ways of untangling the mess once it’s happened. But this is something every enterprise struggles with. Individual teams often have their own access to infrastructure services and not all data events - including sharing, copying, exporting, and enriching - in those platforms are monitored at the enterprise level. Consequently, the challenge persists, expands, and the library of data continues to grow without consistent governance or control.

The cost of lost data

The consequences of unfindable data can be profound. It can impair decision-making, compromise operational efficiency, undermine strategic goals, and heighten exposure to compliance failures and data breaches. For decision-making in particular, the insights essential for informed choices are often described as untrustworthy or inaccessible.

This lack of visibility and trust leads to delays in identifying and acting on trends and customer needs, and in responding swiftly to market changes, ultimately hindering competitiveness and agility over time. When data is scattered across unmonitored silos or duplicated in disparate cloud services without centralized oversight, it’s like having books in various corners of a library without a central catalogue system.

Moreover, the inability to locate and secure sensitive data increases the likelihood of unauthorized access or inadvertent exposure, further exacerbating risks related to privacy breaches and intellectual property theft. Ask any engineer or analyst, and they’ll probably point straight to the challenge of governing data once it has been exported to spreadsheets. The download problem is the harder one to solve; knowing what data is in the platform in the first place at least means you can see that a download happened, and by whom, to help with any post hoc auditing.

Righting the wrong

For organizations that need to course correct, one of the most scalable solutions is to ensure “compliance as code.” Put simply, this means that every data event - from the provisioning of services to the enrichment of data within them - is logged, monitored, and traceable. Most importantly, these events are visible to any stakeholder who is accountable for data protection or oversight.
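
As a minimal sketch of what compliance as code can look like in practice, the Python below expresses that kind of policy as an executable check on each data event. The event shape and the required fields (owner, classification, catalog_id) are illustrative assumptions, not any particular product’s schema.

```python
from dataclasses import dataclass, field

# Illustrative policy: every data event must carry an owner, a data
# classification, and a reference to its enterprise catalogue entry.
REQUIRED_FIELDS = ("owner", "classification", "catalog_id")

@dataclass
class DataEvent:
    """A provisioning, sharing, copying, exporting, or enriching event."""
    resource: str                       # e.g. an object store bucket or a table
    action: str                         # e.g. "provision", "export", "enrich"
    metadata: dict = field(default_factory=dict)

def is_compliant(event: DataEvent) -> tuple[bool, list[str]]:
    """Return whether the event satisfies the policy, plus any violations."""
    missing = [f for f in REQUIRED_FIELDS if not event.metadata.get(f)]
    return (len(missing) == 0, missing)

# Example: an export event with no owner or catalogue reference fails the check.
event = DataEvent("s3://analytics-exports/q3.csv", "export",
                  {"classification": "internal"})
print(is_compliant(event))  # (False, ['owner', 'catalog_id'])
```

Because the policy lives in code, it can be versioned, reviewed, and run automatically on every event rather than enforced by hand.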

By ensuring these events are transmitted to a common metadata catalogue - for example, published over a pub-sub mechanism that enriches an enterprise catalogue - companies can monitor and audit their data more effectively. Any non-compliant resource should, in theory, be immediately deleted by the enterprise, thereby reducing the chance of data being lost or becoming unfindable. That way, anyone spinning up an object store, compute service, or similar resource is logged for auditability, the events are available for lineage and traceability, and there is, ideally, a path towards data provenance.
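
As a hedged illustration of that pub-sub pattern, the sketch below consumes provisioning events from a message bus and registers each one in a stand-in catalogue. The topic wiring, event fields, and delete_resource hook are hypothetical placeholders for whatever bus and enterprise catalogue an organization actually runs.

```python
import json

# Hypothetical in-memory stand-in for an enterprise metadata catalogue.
catalogue: dict[str, dict] = {}

def delete_resource(resource: str) -> None:
    """Placeholder for the enterprise's actual decommissioning workflow."""
    print(f"flagging {resource} for deletion as non-compliant")

def handle_provisioning_event(message: bytes) -> None:
    """Callback that would be wired to a data-platform-events topic on the bus."""
    event = json.loads(message)
    resource = event["resource"]
    if event.get("owner") and event.get("catalog_id"):
        # Register the resource and its lineage metadata in the catalogue.
        catalogue[resource] = {
            "owner": event["owner"],
            "lineage": event.get("lineage", []),
        }
    else:
        # Anything missing the required metadata is treated as non-compliant.
        delete_resource(resource)

# Example: one compliant and one non-compliant event arriving from the bus.
handle_provisioning_event(json.dumps(
    {"resource": "s3://marketing-raw", "owner": "marketing", "catalog_id": "c-17"}).encode())
handle_provisioning_event(json.dumps({"resource": "s3://tmp-copy"}).encode())
```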

When data is already lost, tools like BigID act like a sophisticated library catalogue: they provide a bottom-up view of the ecosystem, helping organizations understand what data is where and which systems are using it. Tools that provide governance and compliance for a data glossary and workflow management, together with patterns like adopting the Iceberg table format, will not only lower switching costs today and tomorrow but also make it easier to integrate the many functional catalogues and platforms across the business. The goal here is to create value quickly while, in parallel, making data simpler to manage in the future.

Companies must gain insight into their data landscape, identify potential compliance issues, and take corrective action before data becomes unmanageable, let alone set up a system that scales better. This will always be the accountability of a central team, or at best shared with functional leaders once fully democratized. To be clear, not all of these tools are required at the start. Rather, understanding your current state (or starting point) will dictate how to quickly prioritize the use cases that, in turn, drive the modernization. Balance quick wins with larger foundational changes that let the transformation progress faster in the mid-term, keeping momentum and continuously building trust.

An effective parallel strategy is to build microservices or bots that constantly scan, audit, and enforce compliance. These can perform a range of functions, from basic compliance checks to full anomaly detection on asset utilization compared with normal service delivery, roles, and usage. By continuously monitoring data events and usage patterns, these microservices can detect anomalies and potential compliance breaches in real time, enabling swift corrective action. As noted above, all data resources and events should automatically be registered upon provisioning, so any data that is not catalogued can be immediately deleted by the bot as non-compliant.
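
A minimal sketch of such a bot, assuming a simple periodic sweep rather than any particular scheduler or cloud API, might compare whatever resources it discovers against the catalogue and act on anything unregistered:

```python
import time

def discover_resources() -> set[str]:
    """Placeholder for scanning cloud accounts for data resources."""
    return {"s3://marketing-raw", "s3://shadow-export"}

def catalogued_resources() -> set[str]:
    """Placeholder for querying the enterprise metadata catalogue."""
    return {"s3://marketing-raw"}

def compliance_sweep() -> None:
    """Flag (or delete) any resource that is not registered in the catalogue."""
    uncatalogued = discover_resources() - catalogued_resources()
    for resource in sorted(uncatalogued):
        # A real bot would open a ticket, quarantine, or delete the asset here.
        print(f"non-compliant resource detected: {resource}")

if __name__ == "__main__":
    while True:           # run the sweep on a fixed interval
        compliance_sweep()
        time.sleep(3600)  # hourly; real deployments would use a scheduler
```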

The next chapter

Like a well-organized library where every book is catalogued and easily accessible, a well-managed data environment allows companies to thrive. Preventing data chaos requires a proactive and strategic approach to data management that does not also create more friction or processes for users. By implementing compliance as code, leveraging data visibility tools, and building microservices for continuous compliance, companies can ensure that their data assets remain findable, secure, and valuable. With these strategies in place, businesses can navigate the complexities of data management and drive sustained growth and innovation.

Lastly, fostering a culture of data stewardship within the organization is vital. Educating employees on the importance of data management and establishing clear protocols for data handling can significantly reduce the enterprise’s risk. Regular training sessions and updates on best practices ensure that all team members are aligned with the company’s data governance goals.

We list the best cloud log management services.

This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro

Nik Acheson is Dremio's Field Chief Data Officer (CDO).