Databricks Unity Catalog Best Practices: Streamlining Data Management for Enhanced Collaboration


Databricks Unity Catalog enables teams to manage and collaborate on their data assets efficiently. By following best practices for Unity Catalog, organizations can get more value from their data and improve collaboration across teams. In this article, we explore best practices for streamlining data management with Databricks Unity Catalog and how they can improve your organization’s data-driven workflows.

What is Unity Catalog?

Databricks Unity Catalog is a comprehensive solution that offers centralized access control, auditing, lineage, and data discovery capabilities specifically designed for Databricks workspaces. The key features of Unity Catalog are as follows:

  • Provides a unified platform for administering data access policies across workspaces and user personas.
  • Offers a standards-compliant security model based on ANSI SQL, allowing familiar syntax for granting permissions.
  • Automatically captures detailed audit logs and lineage data to track data access and usage.
  • Facilitates data discovery through tagging, documentation, and an intuitive search interface for easy access to relevant data assets.

Unity Catalog is a powerful tool for managing and securing data within Databricks workspaces, providing centralized access control, auditing, lineage tracking, and data discovery functionalities.
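
As a small illustration of the documentation and tagging capabilities mentioned above (the table name and tag values below are hypothetical), you can annotate data assets directly with SQL so they are easier to discover:

-- Add a description so the table is easier to find in search
COMMENT ON TABLE main.default.orders IS 'Curated order facts maintained by the sales data team';

-- Tag the table to support discovery and governance filters
ALTER TABLE main.default.orders SET TAGS ('domain' = 'sales', 'contains_pii' = 'false');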

Read: Databricks Performance Optimization Guide

Databricks Unity Catalog: Best Practices

Here are some best practices to follow when working with Unity Catalog to ensure efficient and secure data management.

Best Practice 1: Configure a Unity Catalog Metastore

You should create a single Metastore for each region where you use Azure Databricks and link it to all the workspaces in that region. Therefore, if you have multiple regions using Databricks, you will have multiple Metastores.


A Metastore is the top-level container of objects in Unity Catalog, which is a fine-grained governance solution for data and AI on the Databricks Lakehouse. It stores metadata about data assets (tables and views) and the permissions that govern access to them.

Recommendation: A Unity Catalog Metastore has a designated root storage location for managed tables. To maintain security and auditability, prevent direct user access to this location. To protect data integrity, do not reuse a container that is, or was previously, a DBFS root file system as the Metastore’s root storage location, and make sure that no user has direct access to that container.

For example, if your DBFS root file system is abfs://Container2@Azure_Storage_account_name.dfs.core.windows.net/path/, you should not use this location as the root storage location for your Metastore.

Why should you avoid the DBFS root? Because the DBFS root is accessible to all users in a workspace, any user can read data stored there. Avoid this location both for sensitive data and for the Metastore’s root storage. Note that the default location for managed tables in the Hive Metastore on Databricks is the DBFS root; to prevent end users who create managed tables from writing to the DBFS root, declare a location on external storage when creating databases in the Hive Metastore.
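
For illustration, here is one way to declare an external storage location when creating a database in the Hive Metastore so that its managed tables are not written to the DBFS root; the storage account, container, schema, and table names are hypothetical:

-- Hypothetical example: managed tables for this schema land in external storage, not the DBFS root
CREATE SCHEMA IF NOT EXISTS hive_metastore.sales_analytics
LOCATION 'abfss://managed-data@examplestorageacct.dfs.core.windows.net/sales_analytics';

-- Managed tables created in this schema inherit the schema location above
CREATE TABLE hive_metastore.sales_analytics.orders (order_id BIGINT, amount DOUBLE);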

Best Practice 2: Use Catalogs to Organize your Data

In Unity Catalog, catalogs are used to organize your data. Use catalogs to provide segregation across your organization’s information architecture, for example by grouping data assets by business unit or function, by software development environment (SDLC) scope, or by team; how you segregate depends on how you want to organize your information architecture. Once you have defined this segregation and created the catalogs, you can further organize data assets within each catalog by creating schemas. Common segregation patterns include:

  • Software development environment (SDLC) scope-based catalog segregation
  • Business unit-based catalog segregation
  • Team-based catalog segregation
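
As a minimal sketch of environment-scoped segregation (all catalog, schema, and table names below are hypothetical), you could create one catalog per environment and group related assets into schemas within it:

-- Hypothetical environment-scoped catalogs
CREATE CATALOG IF NOT EXISTS dev_analytics;
CREATE CATALOG IF NOT EXISTS prod_analytics;

-- Schemas group related data assets within a catalog
CREATE SCHEMA IF NOT EXISTS dev_analytics.sales;
CREATE SCHEMA IF NOT EXISTS dev_analytics.marketing;

-- Tables are then addressed with the three-level namespace catalog.schema.table
CREATE TABLE IF NOT EXISTS dev_analytics.sales.orders (order_id BIGINT, amount DOUBLE);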

Best Practice 3: Configure Access Control

You should apply the principle of least privilege when granting access to data assets in Unity Catalog. Each securable object in Unity Catalog has an owner, who can grant or revoke permissions for other users or groups. You can use groups to simplify the management of permissions across many users and objects. Use standard ANSI SQL to define and enforce data access policies across all workspaces and personas.

In Unity Catalog, developers can use the familiar syntax they already use for databases. You can grant privileges with SQL statements, using the GRANT and REVOKE keywords in a notebook or the Databricks SQL query editor. Additionally, the SHOW GRANTS command lets you view a comprehensive list of grants on a specific object.

To grant privileges, follow this syntax:

GRANT privilege_type ON securable_object TO principal

In the above code:

  • privilege_type: Refers to the type of privilege within Unity Catalog, such as SELECT, MODIFY, CREATE TABLE, and more.
  • securable_object: Represents the object within Unity Catalog, such as METASTORE, CATALOG, SCHEMA, TABLE, etc.
  • principal: Denotes the user, service principal (identified by its applicationId value), or group. Enclose names with special characters in backticks (`).

For instance, the following command grants the group “finance-team” the ability to create tables in the “default” schema under the parent catalog “main”:

GRANT CREATE TABLE ON SCHEMA main.default TO `finance-team`;
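
Similarly, you can revoke a privilege or review the existing grants; a brief sketch using the same hypothetical group:

-- Revoke the privilege if it is no longer needed
REVOKE CREATE TABLE ON SCHEMA main.default FROM `finance-team`;

-- List all grants currently defined on the schema
SHOW GRANTS ON SCHEMA main.default;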

Best Practice 4: Use Cluster Configurations to Control Data Assets

The cluster configuration is the set of options you choose when you create a cluster in Azure Databricks; it determines how the cluster can access data and AI assets on the Lakehouse.

Below are some of the reasons why you should standardize cluster configurations:

  1. Unity Catalog provides a central hub to manage and track data access, enhancing security and enabling efficient governance, and the clusters that access this data are an important part of that governance.
  2. By enforcing standardized cluster configurations, Unity Catalog prevents resource misuse, optimizes utilization, and helps control costs.
  3. Unity Catalog ensures precise chargeback processes by accurately tagging clusters, enabling transparent cost allocation.
  4. Users benefit from pre-configured cluster setups designed for specific workloads, facilitating analysis and processing tasks.
  5. Unity Catalog adheres to the principle of least privilege, reducing the risk of data leakage and unauthorized access.

Databricks recommends using cluster policies to simplify cluster creation, limit configuration options, and ensure only Unity Catalog-enabled clusters are created. Cluster policies also help control costs by setting maximum cost limits.

To maintain access controls and isolation, Unity Catalog requires appropriate access modes. There are two such access modes:

  1. Shared Access Mode
  2. Single-User Access Mode

1. Shared Access Mode:

This is used for clusters that can be shared by multiple users. Each user is fully isolated from other users, so they cannot see each other’s data and credentials. This mode supports Python (on Databricks Runtime 11.3 LTS and above) and SQL. Databricks suggests this mode for shared clusters that run interactive workloads.

For example, you have a team of data analysts who need to query data in Unity Catalog using SQL. For that, you will create a shared cluster with shared access mode and assign it to your team. Now, each analyst can use the cluster to run queries on the data they have permission to access, without interfering with other analysts’ data or credentials.

The JSON below provides a policy definition for a shared cluster with the User Isolation security mode:

{
  "spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9]*\\.x-scala.*",
    "defaultValue": "10.4.x-scala2.12"
  },
  "data_security_mode": {
    "type": "fixed",
    "value": "USER_ISOLATION",
    "hidden": true
  }
}

2. Single-User Access Mode:

This is used for clusters that can be used exclusively by a specified single user. This mode supports Python, SQL, Scala, and R. Databricks suggests this mode for automated jobs and machine learning tasks that run on dedicated clusters.

For example, you have a machine learning engineer who needs to train a model using data in Unity Catalog using Python. For this, you will create a single-user cluster with single-user access mode and assign it to the engineer. The engineer can use the cluster to run their code on the data they have permission to access, without sharing the cluster with anyone else.

The JSON below provides a policy definition for an automated job cluster with the Single User security mode:

{
  "spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9].*",
    "defaultValue": "10.4.x-scala2.12"
  },
  "data_security_mode": {
    "type": "fixed",
    "value": "SINGLE_USER",
    "hidden": true
  },
  "single_user_name": {
    "type": "regex",
    "pattern": ".*",
    "hidden": true
  }
}

Best Practice 5: Use Audit Logs

Audit logs help you monitor and track access to and usage of data assets. They record various events, such as queries, updates, grants, revokes, and so on. You can use audit logs to analyze user behavior, detect anomalies, enforce compliance, and troubleshoot issues.

An example of an audit log event in Unity Catalog is:

{
  "version": "2.0",
  "auditLevel": "ACCOUNT_LEVEL",
  "timestamp": 1629775584891,
  "orgId": "3049056262456431186970",
  "shardName": "test-shard",
  "accountId": "77636e6d-ac57-484f-9302-f7922285b9a5",
  "sourceIPAddress": "10.2.91.100",
  "userAgent": "curl/7.64.1",
  "sessionId": "ephemeral-f836a03a-d360-4792-b081-baba525324312",
  "userIdentity": {
    "email": "crampton.rods@email.com",
    "subjectName": null
  },
  "serviceName": "unityCatalog",
  "actionName": "createMetastoreAssignment",
  "requestId": "ServiceMain-da7fa5878f40002",
  "requestParams": {
    "workspace_id": "30490590956351435170",
    "metastore_id": "abc123456-8398-4c25-91bb-b000b08739c7",
    "default_catalog_name": "main"
  },
  "response": {
    "statusCode": 200,
    "errorMessage": null,
    "result": null
  },
  "MAX_LOG_MESSAGE_LENGTH": 16384
}

In this event, the user with the email crampton.rods@email.com is seen creating a Metastore assignment. They assigned the Metastore (Metastore ID abc123456-8398-4c25-91bb-b000b08739c7) to a workspace (Workspace ID 30490590956351435170) and set the default catalog name as “main.”
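
If audit log system tables are enabled for your account (the system.access.audit table and the columns used below are an assumption about your setup), you can query recent Unity Catalog events with plain SQL, for example:

-- Recent Unity Catalog audit events from the last 7 days
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC;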

Best Practice 6: Share Data using Delta Sharing

You should use Delta Sharing to share data between Metastores or with external parties. Delta Sharing is a secure and open protocol for sharing Delta Lake tables across organizations and platforms. You can use Delta Sharing to enable cross-metastore queries, federated analytics, and data collaboration.

Let’s say you have data in your Unity Catalog Metastore and you want to share it with another Databricks workspace within the same account. You can easily do this using Databricks-to-Databricks Delta Sharing, which is automatically enabled. Simply create a share that includes the specific tables and notebooks you want to share, and then give access to the recipient workspace. The recipient can then access the shared data using SQL or any supported programming language in Databricks.
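
Here is a hedged sketch of the provider-side and recipient-side steps; the share, recipient, catalog, provider, and table names are hypothetical, and the sharing identifier is a placeholder you would obtain from the recipient Metastore:

-- Provider side: create a share and add a table to it
CREATE SHARE IF NOT EXISTS sales_share;
ALTER SHARE sales_share ADD TABLE main.default.orders;

-- Create a recipient for the other workspace's Metastore (Databricks-to-Databricks sharing)
CREATE RECIPIENT IF NOT EXISTS analytics_workspace USING ID '<recipient-sharing-identifier>';

-- Grant the recipient read access to the share
GRANT SELECT ON SHARE sales_share TO RECIPIENT analytics_workspace;

-- Recipient side: mount the share as a catalog and query it
CREATE CATALOG IF NOT EXISTS shared_sales USING SHARE provider_workspace.sales_share;
SELECT * FROM shared_sales.default.orders;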

Please note that when you use Databricks-to-Databricks Delta Sharing between Metastores, access control is scoped to each Metastore. If a securable object, such as a table, has grants defined on it and is shared to another Metastore in the same account, those grants from the source Metastore do not carry over to the destination; the destination must define its own grants on the shared data. This keeps access tightly controlled on both sides.

Best Practice 7: Use DBFS while launching Unity Catalog clusters with single-user access mode

When you design the cluster configuration, you can choose between single-user and shared access modes.

  1. Single-user mode allows you to run queries and commands on the cluster as yourself.
  2. Shared mode allows multiple users to share the cluster resources.

If you want to use external storage for storing your init scripts, configurations, and libraries, you should use DBFS mounts in single-user access mode. This behavior is not supported in shared access mode.

The example below demonstrates how to integrate external storage with Databricks using DBFS mounts, providing access to data, libraries, configurations, and ML models.

# NOTE: bucket names, paths, and keys below are placeholders; store credentials in a
# Databricks secret scope rather than hard-coding them in a notebook.
import mlflow

# Create a DBFS mount point for the external storage (an S3 bucket in this example)
dbutils.fs.mount(
  source = "s3a://my-bucket/my-prefix",
  mount_point = "/mnt/my-data",
  extra_configs = {"fs.s3a.access.key": "xxx", "fs.s3a.secret.key": "xxx"}
)

# Single-user access mode and init scripts are normally configured on the cluster itself
# (UI, Clusters API, or a cluster policy); the conf keys below are illustrative only.
spark.conf.set("spark.databricks.unityCatalog.singleUserMode.enabled", "true")
spark.conf.set("spark.databricks.cluster.profile", "serverless")
spark.conf.set("spark.databricks.initScript", "dbfs:/mnt/my-data/init.sh")

# Use the DBFS mount to access libraries and configurations stored in external storage
spark.sparkContext.addPyFile("dbfs:/mnt/my-data/my-lib.py")
config_df = spark.read.option("header", "true").csv("dbfs:/mnt/my-data/my-config.csv")

# Use the DBFS mount to access datasets and a pre-trained model for ML workloads
df = spark.read.format("delta").load("dbfs:/mnt/my-data/my-dataset")
model = mlflow.spark.load_model("dbfs:/mnt/my-data/my-model")
predictions = model.transform(df)

In this code, we are performing the following actions:

  1. Creating a mount point in Databricks File System (DBFS) for external storage, specifically an S3 bucket.
  2. Launching a Unity Catalog cluster with single-user access mode and configuring it to use the DBFS mount as an initialization script.
  3. Using the DBFS mount to access libraries, configurations, and data stored in the external storage.
  4. Loading a dataset from Unity Catalog into a DataFrame for machine learning (ML) workloads.
  5. Loading a pre-trained ML model from the DBFS mount and making predictions on the dataset.

Best Practice 8: Do not use DBFS with Unity Catalog External Locations

Unity Catalog secures access to data in external locations by using full cloud URI paths to identify grants on managed object storage directories. DBFS mounts use an entirely different data access model that bypasses Unity Catalog. Databricks therefore recommends that you do not reuse cloud object storage volumes between DBFS mounts and Unity Catalog external volumes.

If you use DBFS mounts with external locations, users can bypass the permissions and policies that Unity Catalog enforces on your data, and you can lose track of the data lineage and usage that Unity Catalog records. In short, using DBFS with external locations undermines Unity Catalog’s access control and auditability features.

Therefore, you should not use DBFS with external locations. Also, you should not use a container that is or was a DBFS root file system for the root storage location in your Unity Catalog Metastore. This can cause conflicts and errors with your data and metadata.

Here is an example of what not to do:

spark.sql("CREATE EXTERNAL TABLE my_table USING DELTA LOCATION 'dbfs:/mnt/my-data/my-table'")

Instead of DBFS, you should use direct paths to your external storage locations, such as S3 or ADLS Gen 2 URIs. For example:

spark.sql("CREATE EXTERNAL TABLE my_table USING DELTA LOCATION 's3a://my-bucket/my-table'")

You can also use a separate container that is not related to DBFS for the root storage location in your Unity Catalog Metastore, for example a dedicated container created solely for the Metastore.

Best Practice 9: Secure your Unity Catalog-Managed Storage

Unity Catalog uses a root storage location and external locations to store your data and metadata. You should make sure that these storage locations are not accessible by any users directly. Otherwise, users could modify or delete your data and metadata without going through Unity Catalog’s access controls and audit logs. You should also encrypt your data both at rest and in transit to prevent unauthorized access.

You should use IAM roles and policies to ensure that only Unity Catalog can access the root storage location and external locations.

For example, you can use the following policy to allow only the Unity Catalog service role to access the S3 bucket used as the root storage location.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUnityCatalogServiceRole",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::my-account:role/UnityCatalogServiceRole"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my-bucket",
                "arn:aws:s3:::my-bucket/*"
            ]
        }
    ]
}

You should use encryption at rest and in transit to protect your data.

For example, you can use the following configuration to enable encryption at rest and in transit for the S3 bucket used as the root storage location.

spark.conf.set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
spark.conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")

Here is a brief summary of DBFS use cases with Unity Catalog:

| Scenario | When to Use Mounted Cloud Object Storage | Why | Example |
| --- | --- | --- | --- |
| DBFS usage in Unity Catalog-enabled workspaces | Use DBFS when migrating files or data stored in the DBFS root to Unity Catalog | Unity Catalog introduces new data governance concepts, and using DBFS with Unity Catalog is not recommended except for migration purposes | Migrating files or data stored in the DBFS root into Unity Catalog |
| DBFS usage in single-user access mode | Use DBFS mounts for init scripts, configurations, and libraries stored in external storage when launching Unity Catalog clusters in single-user access mode | Clusters in single-user access mode have full access to DBFS and Unity Catalog datasets | ML workloads requiring access to Unity Catalog datasets |
| DBFS usage in shared access mode | Do not use DBFS in shared access mode, as it does not support the DBFS root or mounts | Shared access mode combines Unity Catalog data governance with table ACLs, and DBFS is not supported in this mode | Not applicable |
| Use of DBFS mounts | DBFS mounts are not supported in shared access mode and are useful for accessing external storage | Unity Catalog and DBFS have different data access models, and reusing storage volumes can lead to security and access issues | Init scripts, configurations, and libraries stored in external storage |
| DBFS and Unity Catalog external locations | Do not reuse cloud object storage volumes between DBFS mounts and Unity Catalog external volumes | Unity Catalog and DBFS have different data access models, and reusing storage volumes can lead to security and access issues | Separate use of cloud object storage for DBFS mounts and Unity Catalog external locations |
| Secure Unity Catalog-managed storage | Create a new storage account for Unity Catalog, define custom identity policies, and restrict access to Unity Catalog | Ensures security and limited access to Unity Catalog-managed storage | Setting up a storage account and access policies for Unity Catalog |
| Adding existing data to external locations | Only load storage accounts into external locations in Unity Catalog if all other storage credentials and access patterns have been revoked | Ensures security and avoids conflicts with existing storage accounts | Loading existing storage accounts into Unity Catalog external locations |
| Unity Catalog and cluster configurations | Unity Catalog does not respect cluster configurations for filesystem settings | Hadoop filesystem settings for cloud object storage do not apply to Unity Catalog | Cluster configurations for filesystem settings do not affect Unity Catalog |
| Limitations with multiple path access | Paths with equal or parent/child relationships cannot be referenced in the same command or notebook cell using different access methods | Unity Catalog and DBFS paths with such relationships cannot be combined in the same command or cell | Avoid combining related paths in the same command or cell |

Conclusion

The article provides a comprehensive overview of best practices for using Databricks Unity Catalog, a data governance feature offered by Databricks. The key points covered in the article include configuring a Unity Catalog Metastore, organizing data using catalogs, setting up access control, leveraging cluster configurations, utilizing audit logs, sharing data through Delta Sharing, and using DBFS with Unity Catalog clusters. Additionally, the article emphasizes the importance of securing Unity Catalog-managed storage.

By following these best practices, users can effectively utilize Unity Catalog to govern their data assets, improve data organization, enhance access control, and ensure data security. The recommendations outlined in the article help users optimize their workflows and leverage the capabilities of Unity Catalog for efficient and secure data management.

In conclusion, implementing these best practices ensures that users can maximize the benefits of Databricks Unity Catalog, leading to enhanced data governance, improved collaboration, and streamlined data workflows within the Databricks platform.
