Delta Lake is a storage framework written in Scala and an open-source tool for the Lakehouse architecture. It is popular for its integrations with Spark, Flink, PrestoDB, and Hive, among others, and for its ability to perform ACID transactions. It offers scalable metadata handling, snapshot versioning, and schema enforcement and evolution, making it a robust tool for building Data Lakehouses.
First, consider the Data Warehouse architecture, designed to store structured data and to support the reporting and BI areas, focusing on generating value for decision makers. This architecture is costly when handling unstructured or semi-structured data and struggles to scale in volume and concurrent users.
Secondly, let’s consider a Data Lake architecture. This system is based on the storage of raw data (structured, semi-structured, and unstructured) and focuses on making information available to the Data Science and Machine Learning areas. Although it covers the deficiencies of a Data Warehouse, it does not enforce data quality, consistency, or isolation, and it makes it difficult to mix batch and streaming jobs.
An alternative that covers the gaps of the two previous architectures is to implement both systems, serving the reporting and BI team as well as the data scientists and machine learning engineers. This alternative, however, results in duplicated information and high infrastructure costs. In this context, the Lakehouse architecture emerges as an option that combines the flexibility and scalability of data lakes with the transactional guarantees of data warehouses.
The lakehouse architecture implements data structuring and information-management functionality on top of the low-cost storage layer used in data lakes. The data science, machine learning, BI, and reporting teams all get direct access to the latest version of the information, as shown in the image.
The key functionalities that make up the lakehouse architecture are:
The lakehouse architecture reviewed above is built on Delta Lake, a framework that runs on top of Apache Spark. Delta Lake stores data as Parquet files and records every commit in a transaction log.
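Conceptually, a Delta table is a directory of Parquet data files plus a `_delta_log` subdirectory of ordered JSON commit files. The following stdlib-only Python sketch illustrates that layout; no real Delta table is created, and the action schema is simplified for illustration:

```python
import json
import os
import tempfile

def write_commit(table_path, version, actions):
    """Append one commit to the table's _delta_log as a zero-padded JSON file."""
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    with open(commit_file, "w") as f:
        for action in actions:  # one JSON action per line
            f.write(json.dumps(action) + "\n")
    return commit_file

table = tempfile.mkdtemp()
# Commit 0: add two Parquet data files to the table.
write_commit(table, 0, [
    {"add": {"path": "part-0000.parquet", "size": 1024, "dataChange": True}},
    {"add": {"path": "part-0001.parquet", "size": 2048, "dataChange": True}},
])
# Commit 1: an overwrite removes one file and adds another.
write_commit(table, 1, [
    {"remove": {"path": "part-0000.parquet", "dataChange": True}},
    {"add": {"path": "part-0002.parquet", "size": 512, "dataChange": True}},
])
print(sorted(os.listdir(os.path.join(table, "_delta_log"))))
# → ['00000000000000000000.json', '00000000000000000001.json']
```

Because commits are ordered, numbered files, readers can always reconstruct a consistent snapshot of the table, which is what makes ACID guarantees possible on plain object storage.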
Delta Lake has 3 data management environments:
Delta Lake’s main functionalities are as follows:
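One way to picture snapshot versioning (Time Travel): the set of live data files at any version can be reconstructed by replaying the log's `add`/`remove` actions up to that version. The sketch below uses a simplified, hypothetical log format (real Delta logs also carry checkpoints, metadata, and protocol actions):

```python
# Replay a simplified Delta-style action log to reconstruct table snapshots.
log = {
    0: [{"add": "part-0000.parquet"}, {"add": "part-0001.parquet"}],
    1: [{"remove": "part-0000.parquet"}, {"add": "part-0002.parquet"}],
    2: [{"add": "part-0003.parquet"}],
}

def files_at_version(log, version):
    """Return the live data files of the snapshot at `version`."""
    live = set()
    for v in sorted(log):
        if v > version:
            break
        for action in log[v]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return sorted(live)

print(files_at_version(log, 0))  # → ['part-0000.parquet', 'part-0001.parquet']
print(files_at_version(log, 2))  # → ['part-0001.parquet', 'part-0002.parquet', 'part-0003.parquet']
```

Reading an old version never mutates data files, which is why time travel and concurrent readers come almost for free from this design.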
On the most popular cloud platforms (AWS, Azure, GCP), deploying Delta Lake involves the following steps:
For more information, see the Delta Lake starter guide.
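The steps above can be sketched as a PySpark session configured for Delta Lake. This is a configuration sketch, not a full deployment: it assumes the `pyspark` and `delta-spark` packages are already installed (e.g. `pip install delta-spark`) with matching versions.

```python
# Configuration sketch: a SparkSession wired for Delta Lake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    # Register Delta's SQL extensions and catalog implementation.
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

With this session in place, tables are written and read with `format("delta")` instead of `format("parquet")`; on the cloud, the table path simply points at S3, ADLS, or GCS storage.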
The main strengths of a lakehouse architecture built on Delta Lake are its ACID transactions, versioning (Time Travel), metadata management, and batch-streaming unification, all delivered in an open-source format that is fully compatible with Apache Spark.
Delta Lake targets use cases where data availability and transactional guarantees are the priority, so data volume, the target applications, and the data types involved are all relevant when deciding between a warehouse, a lake, or a lakehouse.
In cases where a dual architecture (warehouse + lake) would otherwise be needed, a lakehouse should be considered to avoid duplicated data and the associated extra infrastructure cost.
We recommend reviewing Snowflake, Splunk and Athena as alternatives to Delta Lake.
In 2020, Columbia Sportswear gave a talk at the AI Summit. As part of its data initiatives, Columbia needed to increase its capacity as a data-driven company, consolidate and deliver corporate information, and grow on top of solid data governance aligned with its processes and products.
They started with a traditional Data Warehouse that consolidated information from legacy systems, billing systems, CRMs, and flat files, with visualization tools connected on top of the warehouse.
Using Delta Lake, they redesigned their data architecture around their existing pipelines, business access needs, and security requirements. The result was a unified data platform on Delta Lake, with streaming and batch processes, connection layers for advanced analytics and visualization, and use of cloud services.
The most representative results were the following:
For more information, please consult: