Unity Catalog: Unlocking Advanced Data Control in Databricks


In the digital era, where data is the new gold, safeguarding it is not just an option—it’s imperative. Unity Catalog from Databricks is a revolutionary tool that is reshaping the landscape of data governance. This powerful feature isn’t just about setting boundaries; it’s about unlocking the potential of your data while keeping it under a watchful, secure eye.

In this blog, we dive into the Unity Catalog, your ace in mastering the complex art of data governance within Databricks. Whether you’re grappling with sensitive data exposure or struggling with sprawling data sources, the Unity Catalog offers a lifeline for data teams to navigate the choppy waters of data security with grace. Get ready to turn the key to advanced data control that puts you firmly in the driver’s seat.

Row Level Security and Column Level Masking

Row Level Security (RLS):

  • Purpose: RLS provides fine-grained access control to datasets. It allows you to control which rows a user can see in a table based on certain criteria.
  • Implementation: You create a function, such as us_filter, that takes parameters (like region) and returns a boolean value. This function tests for certain conditions (like if a user is an ‘admin’ or if the region is ‘US’).
  • Assignment: The function is then associated with a table using the ALTER TABLE statement with SET ROW FILTER and the filter criteria, thereby enforcing row-level security (see the SQL sketch after this list).
  • Workflow:
    • The function checks if the user is part of a certain group (e.g., ‘admin’).
    • If they are a member, the function allows them to see all rows; otherwise, it filters the rows based on the specified criteria (like region='US').
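
A minimal SQL sketch of this workflow, assuming a hypothetical sales table with a region column and an account group named admin:

```sql
-- Hypothetical filter function: admins see every row, everyone else only rows where region = 'US'
CREATE OR REPLACE FUNCTION us_filter(region STRING)
RETURN IF(is_account_group_member('admin'), TRUE, region = 'US');

-- Associate the filter with the table; the region column is passed to the function for each row
ALTER TABLE sales SET ROW FILTER us_filter ON (region);
```

Once the filter is set, a non-admin running SELECT * FROM sales only sees rows where region = 'US'.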

Column Level Masking (CLM):

  • Purpose: CLM is used to mask or redact sensitive data in certain columns, ensuring that users without proper access cannot see sensitive information.
  • Implementation: A function, like ssn_mask, is created to mask data. It takes parameters and returns an expression with the same type as the input parameter, defining how the data should be masked.
  • Assignment: The mask is applied to specific columns using the ALTER TABLE statement with SET MASK to associate the masking function with the column (see the SQL sketch after this list).
  • Workflow:
    • The function checks if the user is part of a certain group (e.g., ‘admin’).
    • If they are not a member, it masks the data in the specified column (like displaying ‘***’ instead of the actual Social Security Number).
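
A minimal SQL sketch of column masking, assuming a hypothetical users table with an ssn column and the same admin group:

```sql
-- Hypothetical masking function: admins see the real value, everyone else a redacted string
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE WHEN is_account_group_member('admin') THEN ssn ELSE '***-**-****' END;

-- Associate the masking function with the column
ALTER TABLE users ALTER COLUMN ssn SET MASK ssn_mask;
```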

Managed Data Sources & External Locations

Prior to Unity Catalog, credentials to access cloud storage like S3 or ADLS were defined at the cluster or SQL warehouse level, or through a passthrough mechanism based on the user’s cloud identity.

With Unity Catalog, those Credentials live as first-class primitive objects in your Unity metastore. They have ACLs on them like any other object, but they are typically not used by users directly.

When you set up a Unity Catalog metastore for the first time, you associate a managed data source with that metastore. This managed data source is a container or bucket that will only be read from and written to via Databricks, through Unity Catalog. Along with this managed data source comes a default credential object that Unity Catalog uses to broker access to this data. The request path of how Unity Catalog does this is explained in the diagram later in this post.

But to flesh out these concepts: if you create a table in the legacy Hive Metastore and don’t specify a location for it, the table is stored in DBFS by default. This is known as a managed table in Hive, and DBFS is the default location in Databricks.

Unity Catalog is very similar: it also has the concept of a managed table. When you create a table without specifying a location, the managed data source associated with your metastore is used to store the table. Managed tables have to be Delta tables.

There is another first-class primitive object within Unity Catalog called an External Location. An External Location is a combination of two things: a cloud URL and a credential. It is an object that can have its own ACLs and allows users to READ/WRITE files or CREATE TABLEs inside it. This is how users can get access to arbitrary files in cloud storage.

If you create a table using a location that falls under an external location on which you have the CREATE EXTERNAL TABLE privilege, you can surface this externally defined table anywhere in your metastore hierarchy. External tables can be Delta, Parquet, Avro, ORC, or a number of other supported formats.
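
As a hedged sketch of these objects (the location name, URL, credential name, and table names are hypothetical), an External Location pairs a cloud URL with a storage credential, receives its own grants, and can then back an external table:

```sql
-- Define an external location from a cloud URL plus an existing storage credential
CREATE EXTERNAL LOCATION sales_landing
URL 's3://my-company-bucket/landing'
WITH (STORAGE CREDENTIAL my_storage_cred);

-- Grant rights on the external location like any other securable object
GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION sales_landing TO `data_engineers`;

-- Register an external table at a path under that location; the format can be Delta, Parquet, etc.
CREATE TABLE main.sales.raw_events
USING PARQUET
LOCATION 's3://my-company-bucket/landing/raw_events';
```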

This concept is depicted in the diagram below; let’s walk through it:

  1. User and Cluster/SQL Warehouse: Users interact with Databricks through a cluster or SQL warehouse. This is the computational environment where data processing takes place.
  2. Unity Catalog: This is the central hub for data governance in Databricks. It maintains an audit log of activities and enforces access control, ensuring that only authorized users can access specific data assets.
  3. Managed Data Sources: These are specific storage containers or buckets in cloud storage services like Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS) that are directly managed by Databricks through Unity Catalog. Data in these sources can only be accessed and manipulated via Databricks, providing an additional layer of security and control.
  4. Managed Tables: These are the datasets or tables managed within the Unity Catalog that do not have a specified physical location by the user. Instead, they utilize the managed data sources for storage, which simplifies data management and security.
  5. External Locations & Credentials: These are defined storage paths in cloud services that are outside the managed data sources. They are paired with credentials that are managed by Unity Catalog but allow for more flexible data operations, such as accessing or writing data to arbitrary locations in the cloud.
  6. External Tables and Files in Cloud Storage: These represent the data structures and files stored in the external locations. Users can interact with these tables and files, which can be in formats like Delta, Parquet, Avro, etc. The access to these is governed by ACLs, ensuring that operations on these data assets are secure and compliant with organizational policies.

Default access to storage by catalog or schema

Let’s understand default access to storage by catalog or schema with the help of this diagram:

  1. Metastore Level (Top Level): This represents the highest level of data storage organization within Unity Catalog. It can be considered the root level, and the default storage location is determined at this level if not specified elsewhere.
  2. Catalog Level: Within the metastore, there can be multiple catalogs. A catalog is a collection of schemas and can be seen as a namespace that organizes data into a logical group. Each catalog can have its own default managed container/bucket in cloud storage (like S3, ADLS, GCS) where data tables are stored if not otherwise specified.
  3. Schema Level: Under each catalog, there are schemas that can be thought of as subdirectories or folders within the catalog. Each schema can also have its own default managed container/bucket for storing managed tables (see the SQL sketch after this list).
  4. Managed Tables: These are the data tables that are governed and managed directly by Unity Catalog, which leverages the hierarchy mentioned above to determine where the data is stored by default.
  5. Default Access to Storage: Unity Catalog uses managed data sources for data isolation or cost allocation. This implies that at each level (metastore, catalog, and schema), you can have different default storage locations, which helps in segregating data for better management and possibly attributing costs to different departments or projects.
  6. Data Isolation: By defining default storage at each level, organizations can isolate data effectively. For instance, different departments within an organization can have their catalogs, with each department’s data stored in separate cloud storage containers or buckets.
  7. Cost Allocation: The ability to define storage at different levels within the hierarchy also aids in cost allocation. It makes it easier to track cloud storage costs and allocate them to the appropriate department or project based on which catalog or schema the costs were incurred.
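
A hedged SQL sketch of points 2 and 3 in the list above, with hypothetical catalog, schema, and bucket names:

```sql
-- Catalog with its own default managed storage (overrides the metastore default)
CREATE CATALOG finance MANAGED LOCATION 's3://finance-managed-bucket/uc';

-- Schema with its own managed storage (overrides the catalog default)
CREATE SCHEMA finance.payroll MANAGED LOCATION 's3://finance-payroll-bucket/uc';

-- A managed table created without a LOCATION clause lands in the schema's managed storage
CREATE TABLE finance.payroll.salaries (employee_id BIGINT, amount DECIMAL(10,2));
```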

Govern filesystems and objects distinctly

Let’s understand how to govern filesystems and objects distinctly with Unity Catalog:

In the above diagram, file systems and objects are governed distinctly. Let’s understand how (a SQL sketch follows the list):

  1. Unity Catalog: This is a centralized governance layer for data management. It manages access to data through access control mechanisms, ensuring that only authorized users can perform certain operations on the data.
  2. User: Represents the end users or data engineers who interact with the data through various operations. In this case, the user is performing a write operation.
  3. Access Control: This is a security feature that regulates who can access data and what operations they can perform. It ensures that users have the appropriate permissions to read from or write to the data.
  4. Cloud Storage (S3, ADLS, GCS): Indicates the various cloud storage services (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) where the data is stored.
  5. External Location: This refers to specific paths or locations within cloud storage where data is kept. Access to these locations is controlled by Unity Catalog’s access control mechanisms.
  6. Volume: Under the external location, there are volumes which can be considered as partitions or specific areas within the external location designated for organized storage of data.
  7. Table and Data: This represents the actual data storage construct — a table within the database — and the data within it.
  8. Read/Write Operations: These are the actions users can perform on the data. The users can perform both read and write operations on volumes, but there are some restrictions when it comes to tables.
  9. Select Only: This is an operation allowed on the data within a table. However, there is a restriction here — even if the user has write permissions at the external location level, they cannot write to the table or the data path unless they have specific write permissions for the data.
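
A hedged sketch of this separation with hypothetical names: file access is granted through volumes (and external locations), while table access is granted separately on the table itself:

```sql
-- A volume exposes files in cloud storage as a governed object
CREATE EXTERNAL VOLUME main.raw.landing_files
LOCATION 's3://my-company-bucket/landing/files';

-- File-level rights: read and write files through the volume
GRANT READ VOLUME, WRITE VOLUME ON VOLUME main.raw.landing_files TO `data_engineers`;

-- Table-level rights are granted separately; write access to the underlying path
-- does not imply write access to the table
GRANT SELECT ON TABLE main.raw.events TO `analysts`;
```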

Access data from specified environments only

Unity Catalog helps restrict data to specific environments based on business usage.

Let’s understand how different business users access data in different workspaces and catalogs (a SQL sketch of the grants follows the list):

  1. Metastore: This serves as the central repository for metadata about data assets, managing how and where data is stored and accessed.
  2. Catalogs: These are logical groupings of data within the metastore, which can be used to organize data by environment (such as development, staging, and production) or by business unit (indicated as “bu_l” in the diagram).
  3. Workspaces: Each catalog is associated with a workspace, which is a dedicated space where specific data operations are performed. Workspaces in the image are tagged with “_ws” and are aligned with the corresponding catalogs (e.g., “dev_ws” for the development catalog).
  4. Groups: These represent different groups of users, such as developers, testers, analysts, or business unit-specific roles like BU developers. Each group is granted specific access rights to different workspaces and catalogs.
  5. Access Control: The data governance framework ensures that access to data is isolated across different workspaces and groups. This means that developers may only have access to development catalogs and workspaces, while analysts might only access production data.
  6. Environment-Specific Access: There are distinct catalogs and workspaces for different stages of the data lifecycle—development (dev), staging (staging), and production (prod). Each stage has its own access rules and is isolated from the others to prevent unauthorized access or data leaks between environments.
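
A hedged sketch of the grants behind this isolation, using hypothetical catalog and group names (binding a catalog to specific workspaces is configured through the account console or APIs rather than SQL):

```sql
-- Developers work only in the dev catalog
GRANT USE CATALOG, USE SCHEMA, CREATE TABLE ON CATALOG dev TO `developers`;

-- Analysts get read-only access to production data
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `analysts`;

-- Business-unit developers are scoped to their own catalog
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY ON CATALOG bu_1 TO `bu_developers`;
```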

Automated lineage for all workloads

Unity Catalog provides automated lineage. Here are the key features of automated lineage (a query sketch follows the list):

  • Automated Data Lineage: This feature provides end-to-end visibility of data flow within an organization.
  • Runtime Data Lineage Capture: Lineage is automatically captured at runtime on Databricks clusters or SQL warehouses.
  • Granular Tracking: It tracks the lineage down to the table and column level.
  • Integration with Unity Catalog: The lineage feature leverages the common permission model from Unity Catalog.
  • Comprehensive Lineage: Lineage can be tracked across various data-related assets, such as tables, dashboards, workflows, notebooks, feature tables, files, and Delta Live Tables (DLT) pipelines.
  • Immutable Record: When a derivative dataset is created through Spark APIs based on another dataset, the relationship between the datasets is captured and can be viewed immutably through a user interface (UI) and APIs.
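
Beyond the UI, lineage can also be queried. A hedged sketch, assuming the lineage system tables (such as system.access.table_lineage) are enabled in your account and using a hypothetical table name:

```sql
-- Upstream and downstream relationships for a given target table
SELECT source_table_full_name, target_table_full_name, entity_type, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.sales.daily_summary'
ORDER BY event_time DESC;
```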

Lineage flow – How it works

Let’s understand how the lineage works with the help of Diagram. We will go through step-by-step in this diagram:

  1. Job Submission: A user submits a job or code to a cluster or SQL warehouse.
  2. Logical Plan Creation: Every SQL select statement or Spark operation, such as transformations (e.g., map) or actions (e.g., read), generates a series of operations. Before Spark executes these operations on worker nodes, it constructs a “LogicalPlan”, which is a linked graph representing all the transformations.
  3. Lineage Service: The metadata of the LogicalPlan is submitted to a lineage service operating in the control plane. This service aggregates all LogicalPlans from all commands to create comprehensive lineage graphs.
  4. Lineage Graphs: These graphs are then distilled into tabular information, showing how tables and columns are interrelated. It highlights upstream and downstream relationships, showing how data flows from one table or column to another.
  5. User Interface (UI): The lineage is made visible in the Databricks UI, allowing users to see a graphical representation of the data lineage.
  6. REST API Exposure: The lineage information is also available through REST APIs, facilitating integration with third-party catalog products such as Alation, Collibra, and Microsoft Purview.

Built-in search and discovery in Unity Catalog

Unity Catalog provides built-in search and discovery capabilities:

  • User Interface (UI) for Search: Unity Catalog provides a UI where users can search for data assets stored within the catalog. This allows for quick and efficient retrieval of data assets for use in various tasks.
  • Unified UI: The search and discovery feature boasts a unified user interface across different Databricks services, such as DSML (Data Science & Machine Learning) and DBSQL (Databricks SQL), providing a consistent experience for users regardless of the service they are using.
  • Permission-Based Search: The search functionality is integrated with Unity Catalog’s permission model. This means that users will only see search results for data assets that they have permission to access. If a user does not have the required permissions to view or read a table, that table will not appear in their search results, thereby adhering to the organization’s security and access control policies.
  • Semantic Tags: Users can apply semantic tags to data assets, which can then be used to search and filter results. Semantic tagging helps in organizing and categorizing data assets, making it easier to find relevant data based on context or content.
  • Data Discovery Integration: The Data Discovery tool is an integrated UI component that allows users to discover assets in the Unity Catalog. It supports a seamless experience across different user roles, or personas, and leverages the established permissions model to ensure secure and compliant data access.

Discovery Tags in Unity Catalog

The discovery tag feature in Unity Catalog provides the ability to search assets using business terms or generally agreed-upon taxonomies.

Searching for data assets by business terms or agreed-upon taxonomies usually requires additional catalog tools, but in Unity Catalog, discovery tags allow you to tag Column, Table, Schema, and Catalog objects directly in UC. The integrated search mechanism in UC then allows you to search for objects by tag.
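
A hedged sketch of applying and querying tags, with hypothetical object and tag names:

```sql
-- Tag a table and a column with business terms
ALTER TABLE main.sales.customers SET TAGS ('domain' = 'finance', 'sensitivity' = 'confidential');
ALTER TABLE main.sales.customers ALTER COLUMN email SET TAGS ('pii' = 'true');

-- Tags are also queryable through information_schema views
SELECT catalog_name, schema_name, table_name, tag_name, tag_value
FROM main.information_schema.table_tags
WHERE tag_name = 'domain';
```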

Delta Sharing

Here’s a high-level overview of how Delta Sharing works (a SQL sketch of the provider-side setup follows the list):

  1. Data Provider Setup: A data provider, which has datasets in a Delta Lake, sets up a Delta Sharing Server. This server interfaces with the data lake, which can be hosted on cloud storage services like AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS).
  2. Permission Configuration: The provider configures access permissions on the sharing server. These permissions define what datasets can be accessed by which data consumers.
  3. Activation Link: For each data consumer, the sharing server generates an activation link, which includes the necessary credentials for access.
  4. Data Consumer Access: The data consumer uses the credentials to configure Delta Sharing clients (libraries or applications like Pandas, Apache Spark, Power BI, etc.) to access the shared data.
  5. Data Retrieval: When a consumer requests data, the sharing server authenticates the request based on the credentials. Upon successful authentication, the server generates pre-signed, short-lived URLs that the client can use to fetch data directly from the cloud storage. These URLs are temporary and typically have a short expiration time to maintain security.
  6. Data Format: Delta Sharing currently supports the Parquet format, a widely adopted open-source columnar storage format. Any client that can read Parquet can therefore support Delta Sharing.
  7. Cross-Cloud and Cross-Platform: Because Delta Sharing is built on open standards, it facilitates data sharing across different clouds and data systems. Data consumers can use the shared data in their preferred compute environments.
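
When the provider runs on Databricks with Unity Catalog, the provider-side steps above can be expressed in SQL. A hedged sketch with hypothetical share, table, and recipient names:

```sql
-- 1. Create a share and add the data to expose
CREATE SHARE customer_analytics COMMENT 'Curated tables for partner X';
ALTER SHARE customer_analytics ADD TABLE main.sales.orders;

-- 2. Create a recipient; for open sharing this generates an activation link with credentials
CREATE RECIPIENT partner_x COMMENT 'External partner using open sharing';

-- 3. Grant the recipient access to the share
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT partner_x;
```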

Delta Sharing – Under the hood

Now let’s learn how Delta Sharing works under the hood.

Data Provider Side:

  • Delta Table: This is where the data resides, stored in Delta Lake, which is an optimized storage layer that sits on top of cloud storage (like AWS S3, Azure Data Lake Storage, or Google Cloud Storage) and uses Parquet format files.
  • Delta Sharing Server: This server manages the sharing of data. It controls access permissions to the Delta tables and generates pre-signed, short-lived URLs that will be used to access the data.

Data Recipient Side:

  • Delta Sharing Client: This is the client used by the data recipient to interact with the Delta Sharing Server. It sends requests to access specific tables and receives URLs to directly fetch the data from the object store.

Delta Sharing Protocol:

  • The client authenticates to the Sharing Server.
  • It requests access to a table, which can include specific filters for the data needed.
  • The Sharing Server checks access permissions for the requested data.
  • If the access check is successful, the server generates pre-signed short-lived URLs.
  • The client uses these URLs to directly read the Delta files from the object store.

Important Notes:

  • Sharing occurs at a granular level within Delta Lake: full tables, partitions, or specific Delta versions can be shared.
  • The client is system-independent, meaning it can be any system capable of reading Parquet files, facilitating cross-platform compatibility.
  • In the Databricks environment, the Sharing Server and access control checks are integrated with the Unity Catalog, ensuring that sharing is secure and governed.

If you are interested in reading my other Unity Catalog blogs, please refer to them here.

Conclusion

As we close the chapter on “Unity Catalog: Unlocking Advanced Data Control in Databricks,” it’s clear that the landscape of data management and governance is both rich and complex. The Unity Catalog stands as a testament to the ingenuity in data solutions, offering robust control mechanisms that extend from the granular nuances of row-level security to the overarching strategies of data sharing across platforms.

This foray into the Unity Catalog’s deep reservoir of features—from Row Level Security and Column Level Masking to Managed Data Sources and External Locations—has unearthed the profound capabilities that Databricks has engineered for organizations navigating the vast seas of data governance.

In threading together automated lineage tracking, seamless integration of search and discovery, and the groundbreaking Delta Sharing, Databricks has not just provided a toolkit for managing data; it has laid out a strategic framework that can propel organizations toward a future where data compliance, security, and utility are in harmonious balance.
