Data Warehouse vs. Data Mart vs. Data Lake


What is what and where are the Differences?

Photo by Salmen Bejaoui on Unsplash

In addition to Data Warehouses, which are now firmly established in the corporate world, Data Lakes are also becoming increasingly common, and one often hears of Data Marts, but what is actually what?


The Data Warehouse


The Data Warehouse is an analytical, usually relational database (SQL) or hybrid system (Mix of SQL and NoSQL) created from different data sources. The goal is usually to store historical data for later analysis. Data Warehouses often have extensive computing and storage resources for running complicated queries and generating reports. They are often used as data sources for business intelligence and machine learning systems. New approaches, technologies and especially the cloud are changing the field a lot and offer new opportunities. Data Warehouses are classic relational systems that work with structured data. Exceptions are new cloud-based Data Warehouse technologies such as BigQuery or Snowflake which can also work with unstructured data and are column-based.


The Data Lake


The Data Lake, on the other hand, is a large pool of raw data for which no use has yet been determined. A Data Warehouse is a repository for structured, filtered data that has already been processed for a specific purpose [1].While Data Warehouses use the classic ETL process in combination with structured data in a relational database, a Data Lake uses paradigms such as ELT and a schema on-read as well as often unstructured data [2].

A Data Lake can also be used as the basis for a Data Warehouse, so that the data is then made available in structured form in the Data Warehouse from there.


The difference between Data Warehouse versus Data Lake .

So what is a Data Lakehouse?


It is not just about integrating a Data Lake with a Data Warehouse, but rather integrating a Data Lake, a Data Warehouse, and purpose-built storage to enable unified governance and ease of data movement[3]. From my own experience it has often shown that Data Lakes can be realised much faster. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Hybrid Data Lake-house concept

This makes rigid and classically planned Data Warehouses a thing of the past. This greatly accelerates the provision of dashboards and analyses and is a good step towards a data-driven culture. An implementation with new SaaS services from the cloud and approaches such as ELT instead of ETL also accelerates the development.

In my opinion, this approach has been around for some time, especially in the area of Cloud Data Warehousing and Data Lakes. Here, these two technologies have long been combined with each other in a hybrid approach (Read here more about it). In my opinion, the new trend term Data Lakehouse simply describes this established approach.


How to build up a Data Lakehouse?


To be a bit more concrete, we can take a look at how and with which technologies and services such Data Lakehouses can be built. In the figure below, an architecture is shown that was realised in the Google Cloud. Here, Cloud Storage and BigQuery are used as storage. Due to the good connectivity in Google Cloud, the services can easily exchange data with each other and thus be used for analysis, machine learning and other topics.


Data Lakehouse on GCP -Source Google

The Data Mart


The Data Mart contains data focused on a specific line of business. Normally they dependent on the Data Warehouse, but also could be indpenedent that is often when it’s originating from an operational database or system. They are therefore a subset of a Data Warehouse or a system. Data marts offer the advantage of faster data retrieval. Since they usually contain only a subset of the Data Warehouse, the amount of data in a data mart is consequently considerably smaller. This smaller amount of data provides a much faster result when queries are made [3].


Summary


Data Lakehouses combine the Data Lake with a Data Warehouse to enable unified governance and ease of data movement [4]. From the Data Warehouse, the data can then be distributed to BI layers, ML services or even Data Marts. Read here more about Data Lakehouses.


Sources and Further Readings


Talend, Data Lake vs. Data Warehouse

IBM, Charting the data lake: Using the data models with schema-on-read and schema-on-write (2017)

Datenbanken Verstehen, Data Mart (2022)

AWS, What is a Lake House approach? (2021)




Leave a comment

Design a site like this with WordPress.com
Get started