
Lakehouse Federation Best Practices
Step into the future of data management with the revolutionary Lakehouse Federation. Envision a world where data lakes and data warehouses merge, creating a formidable powerhouse for data handling. In today’s digital age, where data pours in from every corner, relying on traditional methods can leave you in the lurch. Enter Lakehouse Federation, a game-changer that harnesses the best of both worlds, ensuring swift insights, seamless data integration, and accelerated decision-making.
Dive into this article to unravel the magic behind Lakehouse Federation. Discover its unmatched advantages, journey through real-world applications, and master the art of leveraging it. By the time you reach the end, you’ll be equipped with the knowledge to transform your data strategies and set the stage for unparalleled success.
What is the Lakehouse Federation?
Lakehouse Federation is a feature in Azure Databricks that lets you query data from multiple different sources as if they were in one place. Imagine you have data in different “buckets” (like databases or storage systems). Instead of moving all the data into one big bucket to analyze it, Lakehouse Federation allows you to leave the data where it is and still analyze it together. It’s like being able to read books from multiple libraries without having to bring all the books to your house.

Lakehouse Federation is built on top of Unity Catalog, which is a unified metadata management platform for Databricks. Unity Catalog provides a single view of all of your data, regardless of where it is stored. This makes it easy to discover, manage, and govern your data.
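For example, once a foreign catalog has been set up (the steps are covered later in this article), a federated table can be joined with a regular Unity Catalog table in a single query. This is a minimal sketch; the catalog, schema, and table names below are placeholders:
-- Hypothetical names: sqlserver_cat is a foreign catalog that mirrors an
-- external SQL Server database; main.sales.orders is a regular Delta table
-- managed by Unity Catalog.
SELECT o.order_id, o.order_total, c.customer_name
FROM main.sales.orders AS o
JOIN sqlserver_cat.dbo.customers AS c
  ON o.customer_id = c.customer_id
LIMIT 10;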
Why use it?
Think of Lakehouse Federation as a tool that lets you peek into multiple data “boxes” without having to dump everything into one big box. Here’s why you might want to use it:
- Quick Reports: If you just want a quick look at data from different places for a report, you can do so without the hassle of moving everything around.
- Testing Ideas: Before committing to a big project, you might want to test an idea using data from various sources. Lakehouse Federation lets you do this without the heavy lifting of data migration.
- Exploring New Data: If you’re considering adding new data to your main system, you can first explore it where it currently resides.
- Smooth Transition: If you’re slowly moving data from one place to another, Lakehouse Federation helps you work with data in both places during the transition.
In simpler terms, it’s like being able to make a sandwich using ingredients from multiple fridges without having to first move everything into one fridge.
Lakehouse Federation is a good option when:
- You want to query data without ingesting it into Azure Databricks: Lakehouse Federation allows you to query data that is stored in an external database system without having to copy it into Azure Databricks. This can save you time and money, as it can be expensive to copy large amounts of data.
- You want to take advantage of the computing power of the external database system: The external database system may have more computing power than Azure Databricks. This means that your queries may run more efficiently if you run them on the external database system.
- You want to use the data governance features of Unity Catalog: Unity Catalog is a data governance tool that provides features such as fine-grained access control, data lineage, and search. These features help you protect your data and make it easier to find and use (illustrated in the sketch just below).
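To illustrate the last point, access to a foreign catalog is granted with the same Unity Catalog GRANT statements used for any other catalog. This is a minimal sketch; the catalog, schema, and group names are placeholders:
-- Grant an account group read-only access to a foreign catalog's schema.
GRANT USE CATALOG ON CATALOG sqlserver_cat TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA sqlserver_cat.dbo TO `data_analysts`;
GRANT SELECT ON SCHEMA sqlserver_cat.dbo TO `data_analysts`;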
Comparison of Lakehouse Federation with other tools
Now, I will compare Lakehouse Federation with other Azure Databricks tools for accessing data in external data systems:
| Feature | Lakehouse Federation | Delta Sharing | Unity Catalog External Table | Legacy Query Federation |
|---|---|---|---|---|
| Data Sharing | Supports sharing data with external systems that do not support Delta Sharing | Supports sharing data with external systems that support Delta Sharing | Does not support sharing data with external systems | Requires configuring connections to each table separately |
| Data Governance | Provides data governance tools, such as row- and column-level access controls, discovery features like tags, and data lineage | Does not provide data governance tools | Does not provide data governance tools | Does not provide data governance tools |
| Query Pushdown | Supports query pushdown, which can improve the performance of queries | Does not support query pushdown | Does not support query pushdown | Does not support query pushdown |
| Use Cases | When you need to share data with external systems that do not support Delta Sharing, or when you need to run complex queries on external data and improve the performance of those queries | When you can share the data files directly, such as with cloud-based data lakes | For data that you access directly from simple file systems, like Amazon S3, Google Cloud Storage, or Azure Data Lake Storage Gen2 | When you need to run queries against tables in external systems that are not configured for Lakehouse Federation or Delta Sharing |
Set up Lakehouse Federation
Before we start setting up the Lakehouse Federation, it’s important to understand which kinds of databases it can work with. Here’s a list of supported database types for making connections:
- MySQL
- PostgreSQL
- Amazon Redshift
- Snowflake
- Microsoft SQL Server
- Azure SQL Data Warehouse (SQLDW)
- Databricks
Here, I will explain the steps to set up Lakehouse Federation using Microsoft SQL Server. To make a dataset available for read-only querying using Lakehouse Federation, you need to:
- Create a connection: It is a securable object in Unity Catalog that specifies a path and credentials for accessing an external database system.
- Create a foreign catalog: It is a securable object in Unity Catalog that mirrors a database in an external data system, enabling you to perform read-only queries on that data system in your Azure Databricks workspace, managing access using Unity Catalog.
STEP 1: Create a Connection
STEP 1: Click External data.

STEP 2: Go to the connection menu and click it.

STEP 3: Click “Create connection”.

STEP 4 – STEP 8: Enter the Azure SQL DB connection details.

STEP 9 – STEP 11: Enter the Host, trustServerCertificate, and comment values. Finally, click Test Connection to test the connection; if there is an error, you can troubleshoot it. You need to open the firewall on the SQL Server for the Databricks worker nodes’ IP addresses so that connectivity from Databricks to Azure SQL DB can be established.
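If you prefer to manage the firewall from the database side, one option (assuming an Azure SQL logical server and T-SQL access to its master database) is the sp_set_firewall_rule procedure. The rule name and IP range below are placeholders for your Databricks worker node range:
-- Run against the master database of the Azure SQL logical server.
-- The IP range is a placeholder; use your Databricks workers' range.
EXECUTE sp_set_firewall_rule
    @name = N'databricks-workers',
    @start_ip_address = '203.0.113.0',
    @end_ip_address = '203.0.113.255';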

See the newly created connection displayed below.

I generally recommend using the GUI, but instead of following the GUI steps you can run the following command to create a connection:
CREATE CONNECTION <connection-name> TYPE sqlserver
OPTIONS (
  host '<hostname>',
  port '<port>',
  user '<user>',
  password '<password>'
);
For sensitive information or values, you should use Azure Databricks Secrets instead of plaintext strings.
CREATE CONNECTION <connection-name> TYPE sqlserver
OPTIONS (
  host '<hostname>',
  port '<port>',
  user secret ('<secret-scope>', '<secret-key-user>'),
  password secret ('<secret-scope>', '<secret-key-password>')
);
Here you will find the step-by-step guide to create a secret scope – Create Secret Scope
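Either way, once the connection has been created you can sanity-check it from a notebook or the SQL editor; for example:
-- List all connections and inspect the one just created.
SHOW CONNECTIONS;
DESCRIBE CONNECTION <connection-name>;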
STEP 2: Create a Foreign Catalog
Here are the steps to create a foreign catalog:
STEP 1: Select the Data tab in the data explorer.

STEP 2: Click “Create catalog”.
STEP 3 to STEP 6: Enter the following information for the foreign catalog.

STEP 7: Enter some comments:

STEP 8: Click the Create button.
STEP 9: Now you can see that the foreign catalog has been created successfully.

I generally recommend using the GUI, but instead of following the GUI steps you can run the following command to create a foreign catalog:
CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS (database '<database-name>');
Where:
- <catalog-name>: Name for the catalog in Azure Databricks.
- <connection-name>: The connection object that specifies the data source, path, and access credentials.
- <database-name>: Name of the database you want to mirror as a catalog in Azure Databricks.
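Before querying, you can confirm that the foreign catalog and its mirrored schemas and tables are visible; for example (names below are placeholders):
SHOW CATALOGS;
SHOW SCHEMAS IN <catalog-name>;
SHOW TABLES IN <catalog-name>.<schema-name>;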
STEP 3: Query data from the Azure SQL Database with Lakehouse Federation.
First, we will switch to the new catalog and schema. Second, we will check which catalog and schema are currently selected in Databricks.
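In SQL, this corresponds to something like the following; the catalog and schema names are placeholders:
USE CATALOG <catalog-name>;
USE SCHEMA <schema-name>;
SELECT current_catalog(), current_schema();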

Now we will access the columns from the schema and table as depicted below. Here we have queried the data kept in SQL Server without importing it into the lakehouse, with the help of Lakehouse Federation.
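A federated query uses the usual three-level namespace; for example, with placeholder names:
-- The foreign catalog, schema, and table mirror objects that live in the
-- external Azure SQL Database.
SELECT <column1>, <column2>
FROM <catalog-name>.<schema-name>.<table-name>
LIMIT 10;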

Limitations
While Lakehouse Federation offers numerous benefits, it’s essential to be aware of its constraints before implementing it. Here are some of its limitations:
- It allows querying data from external catalogs, but writing data back to them isn’t possible.
- There’s a cap on the simultaneous queries that can be executed against an external catalog.
- Some queries might not be fully optimized to leverage the capabilities of the underlying data source.
- Databricks is in the process of introducing more connection types. For upcoming connections or to suggest a new one, it’s best to get in touch with your Azure Databricks representative.
- The current release doesn’t support OAuth, a method for user authentication without sharing passwords. However, this feature is on the roadmap for future updates.
- Creating new schemas or tables in an external catalog isn’t possible. You’re restricted to querying existing schemas and tables.
- In the Unity Catalog, all table names are standardized to lowercase. Hence, when querying tables in a MySQL external catalog, ensure you use lowercase names.
- Table or schema names that aren’t compatible with Unity Catalog are not allowed. For instance, table names with spaces are prohibited.
Conclusion
In the ever-evolving landscape of data management, the Lakehouse Federation emerges as a beacon of innovation, bridging the gap between data lakes and data warehouses. As we’ve journeyed through its intricacies, it’s evident that this unified approach offers a transformative solution for businesses grappling with vast and varied data sources. While it’s not without its limitations, the potential benefits far outweigh the constraints. As Databricks continues to refine and expand its capabilities, the Lakehouse Federation promises to be an indispensable tool for organizations aiming for data excellence. So, as you chart your data strategy, consider the Lakehouse Federation as your compass, guiding you toward informed decisions and unparalleled success in the digital realm.