Data Lakes: what are they and how to use them?
by Nande Konst on Dec 7, 2021 1:50:02 PM
Data lakes and data warehouses are both used for storing big data, but the terms are not interchangeable. Storing data is about the only thing they have in common. Their purpose, however, is different.
For some companies a data warehouse is the better fit, for others a data lake is the better solution, and in some cases a hybrid setup works best. In this article, I’ll explain what a data lake is and how it’s different from a data warehouse. I will focus on data lake architecture and the best practices and use cases for data lakes.
What is a data lake?
Like water molecules in a lake, a data lake consists of raw, free-flowing data. A data lake is a large storage repository that holds raw data in its native format. The unstructured character of the data is also one of the main differences from a data warehouse, which stores data that is already processed. The data in a data lake waits there until it's needed for analytics purposes.
A data lake uses a flat architecture to store data, relying mainly on file and object storage. No schema is defined until the data is queried. Each piece of information has a unique identifier and is tagged with a set of metadata tags. A data lake can be queried for relevant data, and this smaller set of data can then be analyzed to answer business questions.
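To make the flat architecture more concrete, here is a minimal sketch in Python. It is not a real object store: `FlatLake`, `LakeObject`, and the tag names are hypothetical, and everything lives in memory. It only illustrates the idea of raw payloads that each get a unique identifier and a set of metadata tags, and that are selected by tag when needed.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: an ID, the untouched bytes, and metadata tags."""
    object_id: str
    payload: bytes
    tags: dict = field(default_factory=dict)


class FlatLake:
    """Hypothetical in-memory stand-in for a flat object store (no folders, no schema)."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **tags) -> str:
        """Store the payload as-is and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, tags)
        return object_id

    def query(self, **wanted):
        """Return the objects whose tags match every requested key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = FlatLake()
lake.put(b'{"order": 1, "total": 19.99}', source="webshop", format="json")
lake.put(b"2021-12-07;click;home", source="clickstream", format="csv")
lake.put(b'{"order": 2, "total": 5.00}', source="webshop", format="json")

# Select the smaller, relevant set of data by its metadata tags.
webshop_objects = lake.query(source="webshop")
```

The payloads are never parsed on the way in, which is the point: interpretation is postponed until someone actually queries the data.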
Why would you use a data lake?
Many data lakes, specifically those designed for customer data management, contain large sets of structured, unstructured, and semistructured data. Such data sets do not fit in a relational database, because a relational system requires a rigid schema, which limits it to storing structured data.
Most data warehouses are built on relational databases and don’t allow various schemas. Data lakes support just that and don’t require any defined schema upfront. This is what makes them so strong in handling different types of data in different formats.
For many organizations, this is what makes data lakes key components in their data architecture. Most companies use them as a platform for big data analytics and data science applications that are using advanced analytics methods, such as data mining, machine learning, and predictive analytics.
Schema-on-read and schema-on-write
Data lakes don’t need a schema. This means that when a user wants to view the data they can apply a schema at that moment. This process is called schema-on-read. I think this is a huge advantage of data lakes.
For businesses that add new data sources on a regular basis, this is extremely useful. Defining a schema upfront is time-consuming, so not having to design one is a big advantage of data lakes.
A data warehouse has a predefined schema, which means the schema has been designed before the data is loaded into the warehouse. This process is called schema-on-write. It may prevent the insertion of data that does not conform to the schema.
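A small Python sketch of schema-on-read, under a few assumptions: the raw records are JSON lines, and `ORDER_SCHEMA` and `read_with_schema` are names made up for this illustration. The raw lines are stored untouched, and the schema is only applied at the moment the data is read.

```python
import json

# Raw records exactly as they arrived; the fields differ per source.
RAW_LINES = [
    '{"customer": "A-1", "amount": "19.99", "channel": "web"}',
    '{"customer": "B-2", "amount": 5, "coupon": "WELCOME"}',
    '{"customer": "C-3"}',
]

# A schema chosen at query time: field name -> (converter, default value).
ORDER_SCHEMA = {
    "customer": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema while reading."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


orders = list(read_with_schema(RAW_LINES, ORDER_SCHEMA))
total = sum(order["amount"] for order in orders)
```

A different question tomorrow could use a different schema over the same raw lines, for example one that keeps `channel` and `coupon` instead of `amount`.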
Therefore a data warehouse is better suited for cases where a business has a large amount of repetitive data that needs to be analyzed to answer predefined business requirements.
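For contrast, schema-on-write can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. The `orders` table and its constraints are assumptions made for this example. The schema exists before any data is loaded, and rows that do not conform are rejected at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is designed up front, before any data is loaded.
conn.execute(
    """
    CREATE TABLE orders (
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

rows = [
    ("A-1", 19.99),
    ("B-2", None),   # missing amount: violates NOT NULL
    ("C-3", -5.0),   # negative amount: violates the CHECK constraint
]

accepted, rejected = [], []
for row in rows:
    try:
        conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError:
        # The database refuses data that does not match the schema.
        rejected.append(row)
```

In a data lake all three rows would simply be stored; here only the first one makes it in.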
Lakehouse
Big data volumes are growing every day.
Gartner states that by 2022, 90% of corporate strategies will contain analytics as an essential competency. This is why many business leaders have to rethink their strategy for storing and processing data. Many of those businesses have existing data warehouses which offer flexible data storage but lack the ability to wrangle data for analytics purposes.
The solution to this problem is adding a complementary data lake to the existing architecture.
Such a hybrid setup is called a lakehouse and makes it possible to run concurrent workloads on the same datasets and reuse the available data for different purposes.
At Crystalloids, we see the advantages of lakehouse solutions for some of our clients too. The dual implementation makes it possible to reroute sensitive data to the data warehouse and keep it from entering the data lake.
This leads to better control over data usage and compliance than using only a data lake. A lakehouse solution is a great way to manage data so that it serves many different needs.
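The rerouting of sensitive data could look roughly like the Python sketch below. The field names in `SENSITIVE_FIELDS` and the two destinations are assumptions for illustration only; in practice the rules would come from your own data policy and the destinations would be real storage systems.

```python
# Assumed field names; a real project would take these from its own data policy.
SENSITIVE_FIELDS = {"email", "phone", "date_of_birth"}


def choose_destination(record: dict) -> str:
    """Send records containing any sensitive field to the warehouse, the rest to the lake."""
    if SENSITIVE_FIELDS.isdisjoint(record):
        return "lake"
    return "warehouse"


def route(records):
    """Split a batch into one list per destination."""
    destinations = {"lake": [], "warehouse": []}
    for record in records:
        destinations[choose_destination(record)].append(record)
    return destinations


batch = [
    {"event": "page_view", "page": "/pricing"},
    {"event": "signup", "email": "someone@example.com"},
    {"event": "purchase", "order_id": 42},
]
routed = route(batch)
```

Making the decision per record, before anything is written, is what keeps sensitive fields out of the lake in the first place.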
What a data lake architecture should look like
Building the right data lake architecture is extremely important for turning data into value. The data in your data lake will be of no use if you don’t have the architectural features to manage the data effectively. Therefore it’s important to build the right features into the data lake architecture from the start.
Using a cloud-optimized architecture will simplify the data lake. A modern cloud data lake should have the following characteristics:
- Multi-clustered and shared-data architecture
- Independent storage resource and compute scaling
- A well-defined metadata service that is fundamental to an object storage environment
- All architectural components and their interaction should support native data types
- Data discovery, storage, transformation, and visualization should be managed independently
- The architecture should be tailored to the industry, with its unique features and capabilities present for the specific domain
Features that should be available in any data lake include:
Data governance: the processes and standards that are used to ensure the data can fulfil its intended purpose. It helps to increase data quality and data security. An example of data governance is limiting file sizes in order to standardize files, since files that are too large can make the data difficult to work with. An example that increases data quality is scanning the data for incomplete or unreadable records.
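Both governance examples can be sketched in a few lines of Python. The size limit, the JSON-lines file format, and the function name `check_file` are assumptions for this illustration, not a prescribed standard.

```python
import json

MAX_FILE_BYTES = 1_000_000  # assumed limit; pick whatever fits your platform


def check_file(content: bytes, required_fields):
    """Return a list of governance issues found in one JSON-lines file."""
    issues = []
    if len(content) > MAX_FILE_BYTES:
        issues.append("file exceeds size limit")

    for line_number, line in enumerate(content.splitlines(), start=1):
        try:
            record = json.loads(line)
        except ValueError:  # covers invalid JSON and undecodable bytes
            issues.append(f"line {line_number}: unreadable")
            continue
        if not isinstance(record, dict):
            issues.append(f"line {line_number}: not a record")
            continue
        missing = sorted(set(required_fields) - record.keys())
        if missing:
            issues.append(f"line {line_number}: missing {', '.join(missing)}")
    return issues


sample = b'{"id": 1, "name": "Ann"}\n{"id": 2}\nnot json\n'
issues = check_file(sample, required_fields=["id", "name"])
```

A check like this would typically run when files land in the lake, so that problems are flagged early instead of surfacing during analysis.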
Data catalog: this is the information about the data that exists in the data lake. A data catalog makes it easy to understand the context of the data. It enables stakeholders to work faster with the data. The types of information in a data catalog vary per use case, but they usually include information like:
- The connectors necessary to work with data
- Metadata about the origin of the data and the amount of time it has been stored
- Description of the applications that use the data
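As an illustration, a catalog entry holding these three kinds of information could be modeled like this in Python. The class, its field names, and the example values are hypothetical; real data catalogs have their own formats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical catalog record describing one dataset in the lake."""
    dataset: str
    connectors: list      # connectors necessary to work with the data
    origin: str           # where the data came from
    stored_since: date    # used to derive how long it has been stored
    used_by: list         # applications that use the data

    def days_stored(self, today: date) -> int:
        return (today - self.stored_since).days


catalog = [
    CatalogEntry("web_clicks", ["json-reader"], "website", date(2021, 11, 7), ["dashboard"]),
    CatalogEntry("orders", ["csv-reader"], "webshop", date(2021, 12, 1), ["dashboard", "crm"]),
]


def datasets_used_by(entries, application):
    """Find the datasets a given application depends on."""
    return [entry.dataset for entry in entries if application in entry.used_by]


crm_datasets = datasets_used_by(catalog, "crm")
age_in_days = catalog[0].days_stored(today=date(2021, 12, 7))
```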
Security: to make sure that sensitive data remains private, security measures need to be implemented. These include encryption and access control that prevents non-authorized parties from accessing and modifying data.
Data lake use cases
The more unstructured data an organization has, the bigger the need for a data lake solution. A good example of an industry where a data lake works really well is the healthcare industry.
Much of the data in healthcare is unstructured. Think about:
- X-rays
- MRI-scans
- clinical data
- medicine info
- analysis
The need for real-time insights makes data lakes suitable for use in the healthcare industry, since they allow a combination of structured and unstructured data.
In transportation, data lakes can help make predictions. The predictive capabilities of a data lake can have enormous cost-saving benefits, especially in supply chain management.
How Crystalloids uses data lakes to create a lakehouse solution
Building a lakehouse solution is often part of our job when we develop data platforms such as Customer Data Platforms.
The purpose of a Customer Data Platform is to centralize data coming from different sources. This data combined can provide very useful insights for all kinds of purposes. In the context of a customer data platform, a lakehouse solution is used to get valuable insights about customer behavior and automate customer activations in owned, paid, and earned channels.
An example: in the central data point a customer profile is formed based on data such as purchase behavior and loyalty, which is coming from various systems. Having this data present in a customer service application can provide a sales employee guidance on how to approach a specific type of customer.
In order to create such profiles, it’s important to record history. This is done in a data lake. Other information that contributes to the profile can come from the data warehouse.
Both streaming data and batch data are processed. Whenever streaming data comes in, it is written in real time to the multiple databases that need the information.
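The fan-out of streaming data can be sketched as follows. This is a simplified Python illustration, not our production setup: `ListSink` is a hypothetical stand-in for a database, and the event types are made up. Each incoming event is written to every sink that has declared an interest in that type of event.

```python
class ListSink:
    """Hypothetical stand-in for a database: keeps what it receives in a list."""

    def __init__(self, name, wanted_types):
        self.name = name
        self.wanted_types = set(wanted_types)
        self.rows = []

    def write(self, event):
        self.rows.append(event)


def fan_out(stream, sinks):
    """Write each incoming event to every sink that needs that type of event."""
    for event in stream:
        for sink in sinks:
            if event["type"] in sink.wanted_types:
                sink.write(event)


profile_store = ListSink("customer_profiles", {"purchase", "page_view"})
loyalty_store = ListSink("loyalty", {"purchase"})

events = [
    {"type": "purchase", "customer": "A-1", "amount": 19.99},
    {"type": "page_view", "customer": "A-1", "page": "/pricing"},
    {"type": "purchase", "customer": "B-2", "amount": 5.0},
]
fan_out(iter(events), [profile_store, loyalty_store])
```

Because `fan_out` consumes any iterable, the same function could be fed from a live stream or from a batch file read line by line.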
Conclusion
Which solution is best for your organization depends highly on the use case, the type of data, and the existing architecture. If you don’t have a clear purpose for your data yet and your organization has a lot of structured and unstructured information that needs to be used at a later stage, a data lake is a good solution.
If your business has an existing warehouse that lacks the ability to wrangle data, adding a data lake to the existing architecture is the way to go. In this case, we are talking about a lakehouse. It makes it possible to run concurrent workloads on the same datasets and reuse the data for different purposes.
A data warehouse is best suited for a situation in which a business has large amounts of repetitive data and predefined business requirements.