What is a data lake? Everything you need to know

When it comes to cloud computing, the terms we use are almost as important as the data we store and analyze. Companies that communicate clearly about how cloud computing data is stored, retrieved, accessed and archived tend to get the most use out of that data. This leads to better products, higher revenue for the company, and more growth. More than anything, it leads to better communication between business units, the Information Technology department, and even the front office, sales, marketing, customers and business partners.

One of the terms that has come into wide use over the last few years is the data lake. Before the rise of cloud computing, and even before the Internet was widely used as a means of transmitting data, computing experts used the term data warehouse, but it wasn’t quite sufficient. A data warehouse, as the highly organized “warehouse” in its name implies, consists of data that a company processes, analyzes, and reuses as part of its cloud storage management. For a retailer, a data warehouse might contain all of the product information, SKUs (stock keeping units), and prices. A data warehouse is typically optimized for fast, reliable access.

A data lake is not so highly organized. Cloud computing experts started using the term data lake to differentiate the storage of both structured and unstructured data compared to a data warehouse. With a data lake, there is no assumption about the data being optimized.

Yet there are clear advantages. A data lake can contain a wide assortment of data, but companies can still run cloud analytics on that data, operate a business dashboard, and use the data in an app or for other processing duties. While it is a catch-all term for massive, highly scalable data stores that serve multiple purposes, a data lake is also a generic way of describing unorganized and organized data.
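To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the in-memory lists, the column names, and the functions `load_into_warehouse` and `load_into_lake` are all hypothetical stand-ins, not part of any real product. The warehouse accepts only records that match its fixed layout, while the lake keeps whatever it is given.

```python
import json

# Hypothetical fixed layout for the warehouse table (an assumption for this sketch).
WAREHOUSE_COLUMNS = ("sku", "name", "price")

def load_into_warehouse(table, record):
    """Accept a record only if its fields match the fixed layout exactly."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(tuple(record[column] for column in WAREHOUSE_COLUMNS))

def load_into_lake(lake, payload):
    """Store the payload as-is; no structure is assumed or enforced."""
    lake.append(payload)

warehouse_table = []
data_lake = []

product = {"sku": "A-100", "name": "Desk lamp", "price": 24.99}  # structured
review = "Great lamp, but the cord is short."                    # unstructured text
clickstream = json.dumps({"user": 17, "page": "/lamps"})         # raw JSON string

load_into_warehouse(warehouse_table, product)
for item in (product, review, clickstream):
    load_into_lake(data_lake, item)

# The warehouse refuses data that does not fit its layout; the lake did not.
try:
    load_into_warehouse(warehouse_table, {"text": review})
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the lake ends up holding all three items, while the warehouse holds only the one record that fits its columns.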

Key components

In order to understand a data lake and how it helps companies access cloud computing information in a way that does not require optimization or re-structuring of the data, it’s also important to understand the key components. A data lake often involves machine learning, which is a way to understand and process data using automated methods.

In the case of a retailer who needs to access product information, machine learning can determine which SKUs are stored in a data lake and pull that data into an app. Information Technology service management personnel do not need to organize the data first.
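As a rough illustration of pulling SKUs out of unorganized data, the sketch below scans a mix of raw items for SKU-like strings. Note the substitution: a simple regular expression stands in for the machine learning step, which the article does not describe in detail. The SKU format, the sample items, and the function `find_skus` are all assumptions made for this example.

```python
import re

# Hypothetical SKU format (an assumption): one or two uppercase letters,
# a hyphen, then three or four digits, e.g. "A-100" or "B-2040".
SKU_PATTERN = re.compile(r"\b[A-Z]{1,2}-\d{3,4}\b")

def find_skus(lake_items):
    """Return the sorted, de-duplicated SKU-like strings found in the raw items."""
    found = set()
    for item in lake_items:
        # Every item is treated as text, so no prior organization is needed.
        found.update(SKU_PATTERN.findall(str(item)))
    return sorted(found)

raw_items = [
    {"sku": "A-100", "name": "Desk lamp", "price": 24.99},
    "Customer email: my order for B-2040 arrived damaged.",
    "Unrelated note with no product reference.",
    '{"event": "view", "product": "A-100"}',
]

skus = find_skus(raw_items)
```

The resulting list could then be handed to an app. A real system would likely need something more robust than pattern matching, which is where a trained model might come in.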

Another key component is analytics. With most structured business data, it’s important to have a database whereby IT professionals can generate reports, run SQL queries, or make use of the data in a logical, predictable way. Think of the typical health-care company that needs to have structured data available to medical staff in order to run analytics and reporting. That data typically has to be in a centralized cloud database and optimized for use (e.g., stored in a data warehouse). However, companies can still run analytics on a data lake without having to first optimize the data, and that is one of the key advantages. In fact, as machine learning and data optimization improve, a data lake of structured and unstructured data becomes even more valuable.
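One way to picture analytics on data that has not been optimized first is to interpret each item only at the moment it is read. The sketch below is a simplified illustration under assumptions: the lake is an in-memory list, the `price` field and the function `read_prices` are hypothetical, and a real deployment would use a query engine rather than a loop.

```python
import json
from statistics import mean

def read_prices(lake_items):
    """Yield a price from every item that can be read as a product record."""
    for item in lake_items:
        if isinstance(item, str):
            try:
                item = json.loads(item)  # try to interpret raw text as JSON
            except json.JSONDecodeError:
                continue                 # free text: not usable for this query
        if isinstance(item, dict) and isinstance(item.get("price"), (int, float)):
            yield float(item["price"])

# A mixed lake: a structured record, a raw JSON string, free text, and
# a record that has no price at all.
lake = [
    {"sku": "A-100", "price": 20.0},
    '{"sku": "B-2040", "price": 30.0}',
    "Free-text review with no price.",
    {"user": 17, "page": "/lamps"},
]

average_price = mean(read_prices(lake))
```

Nothing in the lake was restructured before the query ran. Items that do not fit the question are skipped rather than rejected at load time.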

One last component of a data lake: it is not always assumed that the data will be used in the cloud. While a data warehouse might be optimized for on-premises use or for the cloud, a data lake can involve moving data for on-premises use in an internal app (one that pulls data from your own servers), or the data can be used externally (through online cloud storage and computing data stores).

How the company benefits

One of the keys to understanding the term data lake is to think about how companies access data in the first place. It is not quite as “clean” as you would think. Sometimes, data arrives in a haphazard fashion (called unstructured data) and it’s dumped out to a repository; companies don’t always know the original source of the data. Sometimes it’s stored in a relational database used for a business app. Sometimes it’s a collection of social media data or something that feeds a mobile app used by external customers. The main point to make here is that a data lake provides increased flexibility over how a company can use the data.

So, while a data warehouse is a more structured and optimized way of cloud hosting data, and is meant for a specific purpose, a data lake is flexible enough for multiple purposes. There’s no need to first create a clear and obvious usage model for the data and to house it in a specific way in a database. It is always available, can be used for multiple purposes and disparate apps, and is intended for on-premises processing on your own servers or access from the cloud. It’s ready for anything.

John Brandon
Contributor

John Brandon has covered gadgets and cars for the past 12 years having published over 12,000 articles and tested nearly 8,000 products. He's nothing if not prolific. Before starting his writing career, he led an Information Design practice at a large consumer electronics retailer in the US. His hobbies include deep sea exploration, complaining about the weather, and engineering a vast multiverse conspiracy.
