In the ever-evolving world of data, organizations face the challenge of selecting the right table format for their data lakehouses. With options such as Linux Foundation Delta Lake, Apache Iceberg, and Apache Hudi, the decision can be overwhelming. Enter Delta UniForm, a significant step forward for data interoperability. In this blog, we’ll take a deep look at Delta UniForm and its impact on the data ecosystem.
The Need for a Unified Data Format
The open data lakehouse paradigm promises data democratization and interoperability. But with so many excellent storage formats available, how does one choose? The answer lies not in choosing one over the other but in finding a solution that bridges the gap between them. This is where Delta UniForm shines.
Delta UniForm: A Seamless Integration
Delta UniForm, short for Delta Lake Universal Format, offers a harmonious unification of table formats without the need for additional data copies or silos. At its core, Delta UniForm leverages the power of Apache Parquet data files, a common foundation for Delta Lake, Iceberg, and Hudi. The magic lies in the metadata layer, where subtle differences between these formats exist. Delta UniForm elegantly addresses these differences, providing a live, up-to-date view of data across all formats:
- UniForm allows Delta tables to be read with Iceberg reader clients.
- Both Delta Lake and Iceberg use Parquet data files and a metadata layer.
- UniForm generates Iceberg metadata asynchronously without rewriting data. This allows Iceberg clients to read Delta tables as if they were Iceberg tables.
- Unity Catalog can act as an Iceberg catalog.
Setting Up Delta UniForm: A Walkthrough
The beauty of Delta UniForm lies in its simplicity. With just a few commands, one can set up a Delta UniForm table, write data to it, and automatically generate the necessary metadata for Iceberg and Hudi. This seamless integration ensures uninterrupted data pipelines and provides real-time access.
Requirements to Enable UniForm:
- The Delta table must be registered to Unity Catalog. Both managed and external tables are supported.
- The table must have column mapping enabled. See Rename and drop columns with Delta Lake column mapping.
- The Delta table must have minReaderVersion >= 2 and minWriterVersion >= 7. See How does Azure Databricks manage Delta Lake feature compatibility?.
- Writes to the table must use Databricks Runtime 13.2 or above.
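Before enabling UniForm on an existing table, it helps to confirm where the table stands against these requirements. One minimal check (the table name main.default.T is illustrative): DESCRIBE DETAIL on a Delta table returns, among other fields, its current protocol versions.

```sql
-- Check the current protocol versions of a Delta table before enabling
-- UniForm; the output includes minReaderVersion and minWriterVersion columns.
-- The table name main.default.T is illustrative.
DESCRIBE DETAIL main.default.T;
```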
Enabling and Using UniForm:
Now let’s look at the specific Delta table properties and features involved. The following table property enables UniForm support for Iceberg (iceberg is the only valid value):

```sql
'delta.universalFormat.enabledFormats' = 'iceberg'
```
You must also enable column mapping and IcebergCompatV1 to use UniForm. These are set automatically if you enable UniForm during table creation, as in the following example:

```sql
CREATE TABLE T (c1 INT)
TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```
If you create a new table with a CTAS statement, you must manually specify column mapping, as in the following example:

```sql
CREATE TABLE T
TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.universalFormat.enabledFormats' = 'iceberg'
)
AS SELECT * FROM source_table;
```
If you are altering an existing table, you must specify all of these properties, as in the following example:

```sql
ALTER TABLE T SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV1' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```
Generating Iceberg Metadata:
After a write commits to a Delta UniForm table, Azure Databricks automatically starts an asynchronous background process that generates the corresponding Iceberg metadata; you can also trigger this conversion manually. To keep overhead low, several rapid commits may be batched into a single conversion, and only one conversion runs at a time. A commit that lands while a conversion is in flight simply queues behind it, which keeps things running smoothly for frequently updated tables.
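When you need the Iceberg metadata refreshed on demand instead of waiting for the background process, Databricks provides a manual trigger. A minimal sketch, assuming a Unity Catalog table named main.default.T (the name is illustrative):

```sql
-- Manually trigger Iceberg metadata generation for a UniForm table.
-- The table name main.default.T is illustrative.
MSCK REPAIR TABLE main.default.T SYNC METADATA;
```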
Checking Metadata Generation Status:
UniForm adds two fields to Unity Catalog and to the Iceberg table metadata to track the status of metadata generation:
- converted_delta_version: This shows the latest version of the Delta table that was prepared for Iceberg.
- converted_delta_timestamp: This shows when the latest data preparation for Iceberg happened.
You can check these fields in Azure Databricks using Catalog Explorer or the REST API. If you’re using an external reader such as Apache Spark, you can inspect them with a table-properties command.
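As a sketch of that Spark-side check (the table name is illustrative), listing the table properties surfaces the two conversion fields once Iceberg metadata has been generated:

```sql
-- List table properties; once Iceberg metadata has been generated, the
-- output includes converted_delta_version and converted_delta_timestamp.
-- The table name main.default.T is illustrative.
SHOW TBLPROPERTIES main.default.T;
```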
Reading Delta UniForm as Iceberg:
Delta UniForm generates Iceberg metadata according to the Apache Iceberg specification. This means that when data is written to a Delta UniForm table, any Iceberg-compliant client can read it.
Per the Iceberg specification, a reader must first determine the current metadata version of the table. There are two main ways to do this:
- Some tools ask users to supply the path to the latest Iceberg metadata file. This is inconvenient because the path changes with every commit to the table.
- The approach recommended by the Iceberg community is the REST catalog API, which lets the client resolve the latest table metadata automatically.
Unity Catalog now uses this Iceberg REST API, showing its support for open standards. This allows free access to UniForm tables in the Iceberg format and ensures the latest data is always available. It also lets other catalogs connect to Unity Catalog and support Delta UniForm tables.
BigQuery: With Delta UniForm, reading Delta Lake as Iceberg in BigQuery becomes a breeze. By simply providing the metadata location, BigQuery can access the latest snapshot of the Iceberg table. And with Unity Catalog, finding the required Iceberg metadata file path is just a click away.
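As a sketch of that BigQuery setup (the dataset name, bucket path, and metadata file name are all illustrative; copy the real metadata location from Unity Catalog), an external table over the UniForm table's Iceberg metadata looks like this:

```sql
-- Create a BigQuery external table over the Iceberg metadata that
-- Delta UniForm generated. The dataset name and URI are illustrative.
CREATE EXTERNAL TABLE my_dataset.uniform_table
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-bucket/path/to/table/metadata/v42.metadata.json']
);
```

Note that this points at a specific metadata snapshot; to follow the table as it changes, the URI must be updated after new commits.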
Trino: Trino’s support for the Apache Iceberg REST Catalog API means that reading a Delta UniForm table is as simple as issuing a query. No need for external tables or metadata paths. Just pure, unadulterated data access.
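Assuming a Trino catalog named uniform has already been configured with the Iceberg connector pointing at an Iceberg REST catalog such as Unity Catalog's endpoint (the catalog, schema, and table names below are illustrative), reading the table really is just an ordinary query:

```sql
-- Query a Delta UniForm table from Trino through its Iceberg REST catalog.
-- Catalog, schema, and table names are illustrative.
SELECT * FROM uniform.default.t LIMIT 10;
```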
In the dynamic landscape of data management, the quest for seamless interoperability and efficient data handling is paramount. Delta UniForm emerges as a beacon of hope, addressing the complexities of data format selection and bridging the gaps between prominent storage formats. By leveraging the foundational strengths of Apache Parquet and innovatively addressing metadata nuances, Delta UniForm promises a live, unified view of data. Its integration with platforms like BigQuery and Trino further underscores its versatility. As organizations grapple with the challenges of data democratization, Delta UniForm stands out as a robust solution, ensuring that data remains accessible, consistent, and up-to-date. In essence, Delta UniForm is not just a tool; it’s a transformative step towards the future of open data lakehouse interoperability.