Delta Lake

Introduction

Delta Lake is a storage framework developed in Scala. It is an open source tool for the Lakehouse architecture, popular for its integrations with Spark, Flink, PrestoDB and Hive, among others, as well as its ability to perform ACID transactions. It offers scalable metadata management, snapshot versioning and schema evolution, among other features, making it a robust tool for building Data Lakehouses.

Data Warehouse

First of all, we have the Data Warehouse architecture, designed to handle structured data and to support the reporting and BI areas, focusing on generating value for decision makers. This architecture is costly when handling unstructured or semi-structured data, and it has difficulty scaling in volume and in concurrent users.

Data Lake

Secondly, let’s consider a Data Lake architecture. This system is based on storing raw data (structured, semi-structured and unstructured) and focuses on making information available to the Data Science and Machine Learning areas. Although it covers the deficiencies of a Data Warehouse, it does not enforce data quality, consistency or isolation, and it makes it difficult to mix batch and streaming jobs.

Lakehouse Architecture

An alternative to cover the gaps of the two previous architectures is to implement both systems, serving both the reporting and BI team as well as the data scientists and machine learning engineers. However, this alternative results in duplicated information and high infrastructure costs. In this context, the Lakehouse architecture is presented as an option that combines the flexibility and scalability of data lakes with the transactional guarantees of data warehouses.

The lakehouse architecture implements the data structuring and information management functionalities on top of the low-cost storage layer used in data lakes. The different data science, machine learning, BI and reporting teams have direct access to the latest version of the information, as shown in the image.

The key functionalities that make up the lakehouse architecture are:

  • Metadata layers: A transactional metadata layer sits on top of Parquet files, tracking which files belong to each version of the information and enabling ACID transactions.
  • Query optimization: Through caching of data in RAM/SSDs and vectorized execution on modern CPUs.
  • Optimized access for Data Science and Machine Learning tools: The use of the Parquet format and compatibility with popular tools such as Pandas, TensorFlow and PyTorch, among others (a short sketch follows this list).
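
To illustrate the last point, here is a minimal PySpark sketch, assuming an already populated Delta table; the path and column names are illustrative, not from the article. It reads a Delta table with Spark and hands a projection to Pandas for downstream ML work.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Spark session with the Delta Lake extension and catalog registered.
builder = (
    SparkSession.builder.appName("lakehouse-ml-access")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read the Delta table (Parquet files plus transaction log) like any Spark source.
# The path and columns below are hypothetical examples.
features = spark.read.format("delta").load("/lakehouse/silver/customer_features")

# Hand a small projection to pandas; from here it can feed scikit-learn,
# TensorFlow or PyTorch pipelines.
pdf = features.select("customer_id", "recency", "frequency", "monetary").toPandas()
print(pdf.head())
```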

Delta Lake

The lakehouse architecture reviewed above can be built with Delta Lake, a framework that runs on top of Apache Spark. Delta Lake stores data as Parquet files, together with a transaction log that records every commit.
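
To make this concrete, the following is a minimal PySpark sketch with an illustrative local path. Writing in the delta format produces Parquet data files plus a _delta_log/ directory of JSON commit files.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-write")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# A toy DataFrame; schema and values are made up for the example.
df = spark.createDataFrame(
    [(1, "2024-01-01", 120.0), (2, "2024-01-01", 75.5)],
    ["order_id", "order_date", "amount"],
)

# Writing in the "delta" format stores the data as Parquet files and records
# the commit in the _delta_log/ transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Resulting layout (roughly):
#   /tmp/delta/orders/part-*.parquet
#   /tmp/delta/orders/_delta_log/00000000000000000000.json
```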

Delta Lake has 3 data management environments:

  • Bronze: Raw data storage space. Raw data is ingested without modification or refinement, and files are converted to the Delta format.
  • Silver: In this environment, transformations, refinement and feature engineering are performed. The data in this block can be consumed or queried by BI or Data Science teams.
  • Gold: The most refined stage of the information. It is mainly used for generating final views of the data, updating dashboards or delivering value to decision-making teams (a pipeline sketch follows this list).
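
The flow between the three environments might look like the PySpark sketch below; the table paths, schema and aggregation are illustrative assumptions, not a prescribed layout.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

builder = (SparkSession.builder.appName("medallion")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Bronze: land raw CSV as-is, only converting the format to Delta.
raw = spark.read.option("header", True).csv("/landing/sales/*.csv")
raw.write.format("delta").mode("append").save("/lakehouse/bronze/sales")

# Silver: refine the data (deduplication, type casting, derived columns).
bronze = spark.read.format("delta").load("/lakehouse/bronze/sales")
silver = (bronze.dropDuplicates(["order_id"])
                .withColumn("amount", F.col("amount").cast("double")))
silver.write.format("delta").mode("overwrite").save("/lakehouse/silver/sales")

# Gold: aggregated view ready for dashboards and decision makers.
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/lakehouse/gold/daily_revenue")
```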

Delta Lake’s main functionalities are as follows:

  • ACID transactions: Guarantees ACID semantics on reads and writes. Multiple writers can make concurrent modifications while information consumers see a consistent snapshot of the table, even if it is modified during a run.
  • Scalable metadata: Uses Spark’s distributed computing to manage metadata for petabyte-scale tables with ease.
  • Time Travel: Information versioning. Similar to git, Delta Lake keeps versions of each table, allowing you to revert changes, query previous versions and build reproducible experiments over time.
  • Unified batch / streaming: Delta Lake integrates with Apache Spark’s Structured Streaming, the streaming engine in Spark SQL. This allows batch and streaming processes to be unified and simplifies the construction of data pipelines.
  • Audit history: All changes to a Delta table are recorded in the transaction log associated with that table. Each operation is versioned and can be viewed with the DESCRIBE HISTORY command (see the sketch after this list).
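
The PySpark sketch below illustrates Time Travel, the audit history and a streaming read; the table path is an assumed example, and the checkpoint location is a placeholder.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-features")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/lakehouse/silver/sales"  # hypothetical Delta table path

# Time Travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Audit history: every commit on the table, with operation, user and timestamp.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)

# Unified batch/streaming: the same table can be read incrementally with
# Structured Streaming and written to another Delta table.
stream = (spark.readStream.format("delta").load(path)
          .writeStream.format("delta")
          .option("checkpointLocation", "/lakehouse/_checkpoints/sales_copy")
          .start("/lakehouse/silver/sales_copy"))
```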

Delta Lake In Cloud – Hands On

On the most popular cloud platforms (AWS, Azure, GCP), deploying a Delta Lake follows these steps:

  • Apache Spark configuration: Delta Lake is used through Spark with Scala or PySpark. Depending on the environment in use, either Python or Scala can be chosen.
  • Data ingestion: From an existing Data Lake, or by creating the Delta Lake environments (Bronze, Silver, Gold) on S3 buckets (AWS), Blob Storage (Azure) or Cloud Storage (GCP). Data can be ingested into the Delta Lake in batch or streaming mode.
  • Data creation and update: Delta tables can be created with Spark SQL, converting data from Parquet, CSV or JSON to Delta. The framework supports creating, reading and updating tables, as well as querying previous versions (time travel); a short sketch follows this list.
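
A minimal hands-on sketch with PySpark and Spark SQL follows; the bucket, table and column names are hypothetical placeholders for an S3 / Blob Storage / Cloud Storage location.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-hands-on")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Create a Delta table from an existing CSV source (could also be JSON/Parquet).
spark.read.option("header", True).csv("s3a://my-bucket/landing/customers.csv") \
     .write.format("delta").mode("overwrite") \
     .save("s3a://my-bucket/bronze/customers")

# Register it as a table and query/update it with Spark SQL.
spark.sql("""
  CREATE TABLE IF NOT EXISTS customers
  USING DELTA
  LOCATION 's3a://my-bucket/bronze/customers'
""")
# The 'country' column is an assumed example field.
spark.sql("UPDATE customers SET country = 'PE' WHERE country = 'Peru'")

# Time travel back to the state before the update (version numbers start at 0).
spark.sql("SELECT * FROM customers VERSION AS OF 0").show()
```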

For more information, see the Delta starter guide.

Recommendations

The main strengths of a lakehouse architecture built on Delta Lake lie in its ACID transactions, versioning (Time Travel), metadata management and batch-streaming unification, all built on an open source format that is fully compatible with Apache Spark.

Since Delta Lake targets use cases where data availability and transactional guarantees are a priority, the data volume, the application and the data types are all relevant when deciding between a warehouse, a lake or a lakehouse.

In cases where a dual architecture (warehouse + lake) would otherwise be needed, a lakehouse should be considered to avoid duplicated data and additional costs.

We recommend reviewing Snowflake, Splunk and Athena as alternatives to Delta Lake.

Success case – Columbia 2020

In 2020, sportswear company Columbia gave a talk at the AI Summit. As part of its data initiatives, Columbia needed to increase its capacity as a data-driven company, consolidate and deliver corporate information and, finally, consolidate and grow with solid data governance aligned with its processes and products.

They started with a traditional Data Warehouse that consolidated different sources of information from legacy systems, billing systems, CRMs and flat files, with visualization tools connected to the Warehouse.

Using Delta Lake, they re-planned their data architecture, considering existing pipelines, business access needs and security. The result was a unified data platform on Delta Lake, with streaming and batch processes, connection layers for advanced analytics, visualizations and use of cloud services.

The most representative results were the following:

  • The time needed to obtain data was reduced from 1 week to 1 day.
  • All data was brought into Delta Lake and made widely available; microservices are used for streaming.
  • Access to data was expanded for business needs, self-service channels and reporting.
  • Resources were made available for Data Scientists to work with.

For more information, please consult: