Unlock Data Governance: Revolutionary Table-Level Access in Modern Platforms


Dive into the cutting-edge realm of data governance with us, where we tackle the formidable challenges of safeguarding data access across the sprawling networks of modern enterprises. This blog unfolds as your blueprint for mastering the art of robust governance and elevating table-level access control to an art form. Embark on a journey with our real-world enterprise scenario to harness the transformative power of Microsoft Fabric and Databricks methodologies, tools that stand at the vanguard of data strategy and security, with a special focus on precision in table-level access.

In the intricate maze of today’s data-centric businesses, the secret to wielding a competitive edge lies in the harmonious orchestration of decentralized teams, all unified under a banner of secure and meticulously crafted governance architecture. Here, table-level access control emerges not just as a feature, but as the cornerstone of data security, ensuring that every piece of sensitive information is shielded by layers of strategic oversight. Join us as we navigate this journey, layering your data governance strategy with the finesse of table-level access control, and fortifying your enterprise against the ever-evolving landscape of digital threats.

Use case

In this blog post, we’re exploring a straightforward situation. Imagine we have two types of tables: one holds sales data, and the other contains sensitive PII (Personally Identifiable Information) data. The sales data is okay to share with everyone in the company, but the PII data needs tighter controls: we only want certain people to access it.

User A, the data admin, is in charge of both these tables. User B is only allowed to view sales data. User A follows the principle of least privilege, granting only the table-level access people need to do their job.

Table-Level Access

Data Population

Here’s how you would set up these tables and add some example records in SQL. First, we create a ‘Sales’ table with columns for sale ID, product name, quantity, price, and customer ID.

CREATE TABLE Sales (
    sale_id INT,
    product_name VARCHAR(50),
    quantity INT,
    price DECIMAL(10, 2),
    customer_id INT
);

Then, we add records to this table – like sales of laptops, smartphones, etc.

INSERT INTO Sales (sale_id, product_name, quantity, price, customer_id) VALUES
(1, 'Laptop', 2, 999.99, 1),
(2, 'Smartphone', 3, 299.99, 2),
(3, 'Headphones', 5, 79.99, 3),
(4, 'Tablet', 1, 499.99, 4),
(5, 'Smartwatch', 2, 199.99, 5);

Next, we create a ‘PII’ table with columns for ID, name, email, phone number, and address.

CREATE TABLE PII (
    id INT,
    name VARCHAR(50),
    email VARCHAR(100),
    phone_number VARCHAR(15),
    address VARCHAR(100)
);

Finally, we insert records into this table, with sample names, emails, and addresses.

INSERT INTO PII (id, name, email, phone_number, address) VALUES
(1, 'John Doe', 'johndoe@example.com', '123-456-7890', '123 Elm Street'),
(2, 'Jane Smith', 'janesmith@example.com', '234-567-8901', '456 Oak Street'),
(3, 'Alice Johnson', 'alicej@example.com', '345-678-9012', '789 Pine Street'),
(4, 'Bob Brown', 'bobbrown@example.com', '456-789-0123', '101 Maple Street'),
(5, 'Carol White', 'carolwhite@example.com', '567-890-1234', '202 Birch Street');

Table-Level Access Implementation with Microsoft Fabric 

Architecture of Fabric

  • Microsoft Fabric’s architecture is built to integrate seamlessly with the Office 365 ecosystem.
  • At its core is OneLake, a sophisticated data lake that operates on top of Azure Data Lake Storage Gen2 (ADLSv2).
  • The architecture is designed to support a single tenant but can accommodate multiple business units or projects through the concept of “Workspaces.”
  • Each workspace can house one or multiple lakehouses, which are essentially collections of files, folders, and tables organized for different business units or teams.
  • The architecture is versatile and supports various data operations, catering to the diverse needs of different departments within an organization.

Table-Level Access Patterns

There are three ways to access your data as shown in the above figure:

Access Pattern 1 – SQL Endpoint

  • The endpoint is designed for running SQL queries against the data stored in the lakehouse.
  • It’s particularly useful for users who are familiar with SQL and want to query data and build reports.
  • The endpoint does not support data manipulation or schema modification operations (DML or DDL statements).
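To make the endpoint’s behavior concrete, here is a sketch using the Sales table from the example above: a read query succeeds, while any write is rejected.

-- Supported: read-only queries through the SQL analytics endpoint
SELECT product_name, SUM(quantity * price) AS revenue
FROM Sales
GROUP BY product_name;

-- Not supported: DML statements such as INSERT are rejected by the endpoint
INSERT INTO Sales VALUES (6, 'Monitor', 1, 149.99, 6);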

Access Pattern 2 – PySpark Notebook Access

  • PySpark notebooks are interactive coding environments where users can write and execute Python code, with Spark providing data processing capabilities.
  • Notebooks are used by data scientists and engineers who require the flexibility to run complex data processing tasks, machine learning models, and data transformations.

Access Pattern 3 – Lakehouse Explorer (Browsing a Preview of 5 Rows)

  • The Lakehouse Explorer offers a user-friendly way to preview data stored in the lakehouse.
  • This pattern is particularly useful for users who need to quickly verify the presence or format of data in a table without writing queries or analyses; it is currently limited to a preview of the first 5 rows.

Structure of Fabric’s Data Storage (OneLake)

  • OneLake has a layered or hierarchical structure.
  • It was designed to work for a single organization (tenant).
  • It organizes data into “Workspaces” which are separate sections for different teams or projects within a company, like marketing or sales.
  • Each Workspace contains lakehouses – these are groups of files, folders, and tables organized for specific purposes.

Practical Example: Managing Access to Data

There are different roles in the workspace (like Admin, Member, Contributor, or Viewer) to manage data access. In this case:

  • User A, the data admin, creates two tables in Workspace 1: ‘sales’ and ‘PII’.
  • User A invites user B but only wants to share the ‘sales’ data with them.

Steps Taken by the Admin (User A)

  • First, User A sets up a lakehouse to store these tables.
  • Then, they use the SQL analytics endpoint to view data, but cannot use it to create or change data (the endpoint does not support DML or DDL statements).
  • To create and fill the tables, User A uses a notebook, writing and running SQL commands.
  • Data is stored in OneLake and can be viewed by User A in the lakehouse using the Lakehouse Explorer.
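As a sketch of the notebook step, User A could run the table-creation SQL from the “Data Population” section in a Fabric notebook cell (the %%sql Spark SQL cell magic is assumed here, with STRING in place of VARCHAR for Spark compatibility):

%%sql
CREATE TABLE IF NOT EXISTS Sales (
    sale_id INT,
    product_name STRING,
    quantity INT,
    price DECIMAL(10, 2),
    customer_id INT
);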

User A can perform these operations on both tables (“sales” and “PII”):

Operation | Lakehouse Explorer | SQL analytics endpoint | Notebook
--------- | ------------------ | ---------------------- | --------
CREATE    | N                  | N                      | Y
SELECT    | Y (subset)         | Y                      | Y

Role Assignments in the Workspace

User A invites User B to join the workspace, assigning them the “Contributor” role.

  • As a Contributor, User B can browse through the data in the lakehouse, including the “sales” and “PII” tables.
  • They can view details of these tables and even have the privilege to rename them.

Here is the summary of what user B can achieve when invited to the workspace:

Table | SQL analytics endpoint | Notebook | Lakehouse Explorer
----- | ---------------------- | -------- | ------------------
sales | Y                      | Y        | Y (subset)
PII   | Y                      | Y        | Y (subset)

Denying access in Fabric:

At first glance, Fabric allows broad access to data (a “default allow” approach), where restrictions need to be set up manually by the Admin: 

  • User A has to issue a DENY to prevent User B from viewing PII data, using the SQL analytics endpoint (since DCL statements are not allowed in notebooks):
DENY SELECT ON dbo.pii TO [userb@abc.com]

Evaluating Access Restrictions (after DENY access to PII data):

  • Through the SQL analytics endpoint, User B finds their access to the PII table is rightly restricted.
  • From a notebook, User B still has access to the PII data despite the DENY.
  • The Lakehouse Explorer also allows User B to view data from the PII table despite the DENY.
  • Here is the summary of what User B can achieve after the DENY on the PII data:

Table | SQL analytics endpoint | Notebook | Lakehouse Explorer
----- | ---------------------- | -------- | ------------------
sales | Y                      | Y        | Y (subset)
PII   | N                      | Y        | Y (subset)

Understanding the Access Control Dynamics:

  • When the workspace is shared with a user, that user has full access to all underlying tables in the workspace; table-specific restrictions do not take effect by default.
  • SQL endpoint restrictions are effective at the table level.
  • Access controls are not passed between Notebooks and the Lakehouse Explorer because these distinct product silos are not yet integrated.
  • “Shortcuts” to external data sources also circumvent the access controls set up for the tables.

Table-Level Access Implementation in Databricks

What is Azure Databricks?

  • Azure Databricks is a data lakehouse solution based on ADLSv2, with Delta Lake as the default storage format. 
  • Unity Catalog provides a data governance layer for users and an abstraction layer over storage, both structured and unstructured. The catalog provides an “Explorer”; users can also access data via notebooks or SQL warehouses, with access control enforced in all cases by Unity Catalog.

Access Patterns

As shown in the above figure, there are also three ways to access the data in Azure Databricks:

  • Catalog Explorer, a user interface in Databricks
  • Notebooks
  • SQL warehouses

Access Control in Databricks

Access control across all three access paths is enforced and unified by Unity Catalog. Unity Catalog acts as a governance layer, helping manage who can access what data. Its approach is a default “DENY ALL”: users cannot access any data or resources unless explicitly given permission.
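Before any grants are issued, even a simple read fails. A sketch, assuming a hypothetical three-level name main.demo.sales:

SELECT * FROM main.demo.sales;
-- Rejected with a permission error: Unity Catalog denies by default
-- until privileges are explicitly granted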

Practical Example: Managing Access to Data

  • User A, who has an Admin role in their Databricks workspace, starts by creating two tables.
  • They can create these tables using either a Notebook (a coding and data analysis tool) or a SQL warehouse (a tool for running SQL queries).
  • Tables can be viewed and managed in Unity Catalog using the “Explorer” interface. This interface also allows previewing of table content.
  • User A can successfully query both of the tables via the “Databricks SQL warehouse” and notebook (Spark SQL or PySpark).
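As a sketch, User A’s setup under Unity Catalog’s three-level namespace (catalog.schema.table) might look like this; the catalog and schema names main and demo are hypothetical:

CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.demo;

CREATE TABLE main.demo.sales (
    sale_id INT,
    product_name STRING,
    quantity INT,
    price DECIMAL(10, 2),
    customer_id INT
);
-- The PII table is created the same way, as main.demo.pii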

User A can perform these operations on both tables (“sales” and “PII”):

Operation | Catalog Explorer      | SQL editor | Notebook
--------- | --------------------- | ---------- | --------
CREATE    | Y (catalogs, schemas) | Y          | Y
SELECT    | Y (paginated)         | Y          | Y

Role Assignments in the Workspace:

  • User A then invites User B to the workspace. User B is a standard user, without admin privileges.
  • Initially, due to Databricks’ “deny all” approach, User B cannot access either table, nor navigate the schema containing them.

Here is the summary of what user B can achieve when invited to the workspace:

Table | SQL editor | Notebook | Catalog Explorer
----- | ---------- | -------- | ----------------
sales | N          | N        | N
pii   | N          | N        | N

Evaluating Access Restrictions in Azure Databricks

Granting Access in Databricks

  • Admin User A allows User B to see the catalog (a collection of schemas and tables) using the command USE CATALOG. Now User B can see the catalog’s schemas, but nothing else.
  • User A grants User B access to the schema (the structure that holds the tables) with USE SCHEMA. Now User B can see the schema’s tables, but cannot access the data.
  • Finally, User A allows User B to access specific tables with the GRANT SELECT command. Now User B can use the Catalog Explorer, SQL statements in a notebook, or the SQL editor to access the data.
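Expressed in SQL, the three grants above might look like this (the catalog and schema names main and demo are hypothetical; the user principal follows the earlier example):

GRANT USE CATALOG ON CATALOG main TO `userb@abc.com`;
GRANT USE SCHEMA ON SCHEMA main.demo TO `userb@abc.com`;
GRANT SELECT ON TABLE main.demo.sales TO `userb@abc.com`;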

Evaluating Access Restrictions:

User B can now access (using the SQL editor, a Notebook, or the Catalog Explorer) only the tables, here “sales”, where they have been explicitly granted permission.

Here is the summary of what User B can achieve after the Admin grants access to the sales table:

Table | SQL editor | Notebook | Catalog Explorer
----- | ---------- | -------- | ----------------
sales | Y          | Y        | Y (paginated)
PII   | N          | N        | N

Understanding the Access Control Dynamics

  • In Databricks, user access to tables is tightly controlled. Users can only access tables if they have specific permissions, ensuring data security and privacy in the workspace.
  • Unity Catalog offers detailed control over table data access. This control is uniformly applied across both the user interface and the computing engine.
  • The system integrates seamlessly with Azure, utilizing native Azure resources. This includes elements like Azure Databricks workspaces, virtual networks, ADLSv2 storage accounts, and resource groups.
  • Implementing Unity Catalog effectively may require a centralized approach to governance within an organization and setting up a metastore for each Azure region, ensuring a cohesive and well-managed data environment.

Please refer to my blog for more Databricks articles.

Conclusions

Microsoft Fabric

While Fabric demonstrates promising potential and a robust roadmap, it’s clear that, as of now, it does not fully meet enterprise-ready standards for governance. This is particularly evident in its lack of fine-grained controls, a critical component in today’s data access and management landscape.

Microsoft has indicated plans to streamline data governance within Fabric through initiatives like “One Security”, though detailed public documentation and roadmap specifics are yet to be released. The integration of Purview in Fabric’s governance strategy also indicates future enhancements.

The technology landscape is rapidly evolving, and solutions that may seem less mature today could quickly become integral parts of our IT infrastructure. In this light, while acknowledging Fabric’s current limitations, we also recognize its potential for growth and integration in the future. While Fabric’s governance capabilities are still evolving, the strategic selection of Databricks can provide the immediate governance solutions needed today. This approach also smartly positions your architecture for a future where Fabric’s potential is fully realized, ensuring a flexible, secure, and forward-thinking data management strategy.

Databricks

Azure Databricks currently offers a more unified approach to table-level access control. This is seen across its SQL data warehouse, catalog explorer, and various computing environments.  This unified governance layer in Databricks simplifies data access control strategies, reducing the risks of unauthorized data access.

In the realm of governance and security, especially concerning ML models as well as data, Databricks currently provides robust capabilities. Opting for Databricks at this juncture offers immediate access to advanced governance features, ensuring that your architecture is not only compliant with today’s standards but also primed for future advancements. This choice does not preclude integration with Fabric down the line; rather, it positions your architecture to seamlessly incorporate Fabric’s capabilities when they reach maturity and offer the necessary integrations.

Microsoft Fabric’s forthcoming developments hint at potential enhancements in unified data governance, while Azure Databricks currently provides a more integrated approach to access control. Each platform’s capabilities and future directions should be considered in light of specific organizational needs and use cases.
