Databricks Unity Catalog's Volumes feature makes managing non-tabular data much easier. Regardless of format or location, an organization can now access and organize its files from one governed place. This simplicity streamlines data management, helping teams make better-informed decisions and uncover valuable insights from all of their data assets.
In this article, we'll look at why Volumes matter in Databricks Unity Catalog and how they simplify data management. This comprehensive guide walks you step by step through creating, managing, and accessing a volume in Databricks. You will also explore the methods available for securing your volumes and safeguarding your data effectively.
Let’s get started!
What are Volumes?
Many scenarios, particularly in machine learning and data science, require non-tabular data such as images, audio, video, or PDF files. Databricks has introduced a new feature in Unity Catalog known as Volumes. A volume catalogs a collection of files, making it possible to build scalable applications that handle large data sets in any format: unstructured, semi-structured, and structured. It allows for the effective management, governance, and lineage tracking of non-tabular data alongside tabular data within Unity Catalog.
Volumes are a new type of object that catalog collections of directories and files in Unity Catalog. They represent a logical volume of storage in a Cloud object storage location and provide capabilities for accessing, storing, and managing data in any format. This lets you govern, manage, and track lineage for non-tabular data alongside the tabular data and models in Unity Catalog, providing a unified discovery and governance experience. Now let's look at the use cases of Volumes:
- Machine Learning on Unstructured Data: Volumes can be used for running machine learning on large collections of unstructured data such as image, audio, video, or PDF files.
- Data Sets for Model Training: Volumes can be used for persisting and sharing training, test, and validation data sets used for model training and defining locations for operational data such as logging and checkpointing directories.
- Data Exploration: Volumes can be used for uploading and querying non-tabular data files in data exploration stages in data science.
- Working with Tools: Volumes can be used when working with tools that don’t natively support Cloud object storage APIs and instead expect files in the local file system on cluster machines.
- Secure Access to Files: Volumes can be used for storing and providing secure access across workspaces to libraries, certificates, and other configuration files of arbitrary formats, such as .whl or .txt, before they are used to configure cluster libraries, notebook-scoped libraries, or job dependencies.
- Data Ingestion: Volumes can be used for staging and pre-processing raw data files in the early stages of an ingestion pipeline before they are loaded into tables, e.g., using Auto Loader or COPY INTO.
- Sharing Files: Volumes can be used for sharing large collections of files with other users within or across workspaces, regions, clouds, and data platforms.
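As a minimal sketch of the data-ingestion use case above, the following Python snippet builds the COPY INTO statement you might run from a notebook to load files staged in a volume into a table. All names here (my_catalog, my_schema, raw_files, events) are hypothetical placeholders.

```python
# Hypothetical names -- replace with your own catalog, schema, volume, and table.
catalog, schema, volume = "my_catalog", "my_schema", "raw_files"
staging_path = f"/Volumes/{catalog}/{schema}/{volume}/landing"

# COPY INTO loads the staged CSV files from the volume into a Delta table.
copy_into_sql = f"""
COPY INTO {catalog}.{schema}.events
FROM '{staging_path}'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
""".strip()

# In a Databricks notebook you would execute it with:
# spark.sql(copy_into_sql)
```

This is only a sketch of the staging pattern; the raw files land in the volume first, then the statement moves them into a governed table.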
Tables vs Volumes in Databricks Unity Catalog
You might be wondering why there are both tables and volumes in Databricks Unity Catalog and how they differ. Both tables and volumes can be created, managed, and queried with SQL commands. You can also share both tables and volumes with people outside your organization using Delta Sharing.
However, tables and volumes differ in important ways. The table below will help you understand when volumes are the better choice in the Databricks Unity Catalog:
| Feature | Tables | Volumes |
| --- | --- | --- |
| Data Type | Structured and semi-structured data that can be organized into rows and columns. | Non-tabular data, including unstructured, semi-structured, and structured data. |
| Use Cases | Ideal for SQL-based operations and analysis. Often used in ETL processes. | Useful for machine learning and data science workloads: running machine learning on large collections of unstructured data, persisting and sharing training, test, and validation data sets, data exploration, working with tools that don't natively support Cloud object storage APIs, providing secure access to files across workspaces, data ingestion, and sharing files. |
| Data Management | Tables support schema evolution, meaning you can add, remove, or change the data type of columns over time. | Volumes provide capabilities for accessing, storing, and managing data in any format. They represent a logical volume of storage in a Cloud object storage location. |
| Integration with Apache Spark | Tables are typically used in conjunction with DataFrames in Apache Spark. | Volumes provide an abstraction over Cloud-specific APIs and Hadoop connectors, making it easier to work with Cloud-stored data files in Apache Spark applications, as well as tools that don't natively support object storage APIs. |
| Discovery and Governance | Tables are cataloged in Unity Catalog and can be discovered and governed using the catalog's features. | Volumes catalog collections of directories and files in Unity Catalog, letting you govern, manage, and track lineage for non-tabular data alongside the tabular data and models for a unified discovery and governance experience. |
Types of Volumes in Databricks Unity Catalog
There are two types of Volumes in the Databricks Unity Catalog:
- Managed Volume
- External Volume
Managed Volume: A Managed Volume in Databricks Unity Catalog stores files in the default storage location for the Unity Catalog schema. It provides a convenient and governed location for files, especially when you want to explore data quickly without first configuring access to Cloud storage. This means you can upload files directly from your local machine into a Managed Volume for processing and analysis. It is designed to handle any data format and provides capabilities for accessing, storing, and managing data, all while adhering to the security principles of Unity Catalog.
You should be aware that if you delete a managed volume, the files stored in it are also deleted from the underlying cloud storage within 30 days.
External Volume: An External Volume in Databricks Unity Catalog is a type of volume that stores files in an external storage location, which is specified when creating the volume. This is particularly useful when you need to stage files produced by other systems for access within Databricks. For instance, you can provide direct access to a Cloud storage location where large collections of data, such as image and video data generated by IoT or medical devices, are stored. Like Managed Volumes, External Volumes can handle any type of data format and provide capabilities for accessing, storing, and managing data, while adhering to the security principles of the Unity Catalog.
If you delete an external volume, the underlying data remains unaffected and won’t be deleted by Unity Catalog.
Let's look at the differences between Managed and External Volumes:
| Feature | Managed Volumes | External Volumes |
| --- | --- | --- |
| Storage Location | Managed Volumes store files in the default storage location for the Unity Catalog schema. | External Volumes store files in an external storage location referenced when creating the Volume. |
| Use Cases | Managed Volumes are a convenient solution when you want a governed location for files without the overhead of first configuring access to Cloud storage, e.g., for quick data explorations starting from files uploaded from your local machine. | External Volumes are helpful when files produced by other systems need to be staged for access from within Databricks. For example, you can provide direct access to a Cloud storage location where large collections of image and video data generated by IoT or medical devices are deposited. |
| Data Management | Managed Volumes provide capabilities for accessing, storing, and managing data in any format. They represent a logical volume of storage in a Cloud object storage location. | External Volumes also provide capabilities for accessing, storing, and managing data in any format. They represent a logical volume of storage in an external Cloud object storage location. |
| Discovery and Governance | Managed Volumes are cataloged inside schemas in Unity Catalog alongside tables, models, and functions and follow the core principles of the Unity Catalog object model, meaning that data is secure by default. | External Volumes are also cataloged inside schemas in Unity Catalog alongside tables, models, and functions and follow the core principles of the Unity Catalog object model, meaning that data is secure by default. |
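To make the distinction concrete, here is a sketch of the SQL that creates each volume type, built as plain strings so the difference is visible. The catalog, schema, volume names, and the abfss:// location are all hypothetical placeholders.

```python
# Hypothetical catalog/schema names and external location -- adjust for your setup.
catalog, schema = "my_catalog", "my_schema"

# Managed volume: Unity Catalog chooses the storage path inside the schema's
# default location, so no LOCATION clause is needed.
create_managed = f"CREATE VOLUME {catalog}.{schema}.managed_vol"

# External volume: you point Unity Catalog at a path inside a pre-configured
# external location; deleting the volume leaves the underlying files untouched.
create_external = (
    f"CREATE EXTERNAL VOLUME {catalog}.{schema}.external_vol "
    "LOCATION 'abfss://my-container@mystorage.dfs.core.windows.net/iot-data'"
)

# In a notebook, each statement would be executed with spark.sql(...).
```

The only structural difference is the EXTERNAL keyword plus the LOCATION clause, which mirrors the storage-location row in the table above.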
How to Create a Volume in Databricks Unity Catalog?
So far, we have covered what a volume is in the Databricks Unity Catalog. In this section, I will give you a step-by-step guide to creating your first volume. Stick with me until the end of this article, and you will have complete information about volumes in Databricks Unity Catalog.
To create a volume, you have to follow these steps:
- Create Access Connector
- Create a Storage Account
- Grant Access connector the permission to access the Storage account
- Create a Container in the storage account
- Create Metastore
- Create a volume
- Access data inside the volume.
Now, we will go into detail.
Create Access Connector
We need to create an access connector to create a storage credential.
STEP 1: Search for “Access Connector for Azure Databricks.”
STEP 2: Click on “+ Create” to create an access connector for Azure Databricks.
STEP 3: Now, enter the basic details, such as the Subscription, and the Instance details, such as Name and Region. Click on "Review + Create".
STEP 4: Click on "Create". Your Access Connector for Azure Databricks will be created.
STEP 5: Copy the Resource ID and save it in a text file for later use.
Create Storage Account
STEP 1: Go to Azure Services. Click on the “+ Create a resource“.
STEP 2: In the marketplace, type “Storage Account” in the search box. After that, click on “Create” in the Storage Account.
STEP 3: Now, enter the project details. As shown in the image below, I have entered the storage account name unitycatalogdemo7 and selected "East US" as the Region.
STEP 4: Click on "Next: Advanced". Make sure to select "Enable hierarchical namespace".
STEP 5: Under the Networking section, select the network access and the routing preferences. Click “Next: Data Protection”.
STEP 6: Under Data Protection, enter the number of days to retain your data in case it is accidentally deleted. Click "Next: Encryption".
STEP 7: Here, select the Encryption type, choose which file types to enable for customer-managed keys, and decide whether to enable infrastructure encryption. Click "Next: Tags".
STEP 8: Enter the tag name and value. Click "Next: Review".
STEP 9: Now, check the details you entered till now. After that click “Create“.
As you can see in the image below, the storage account has been created.
Grant the Access Connector Permission to the Storage Account
STEP 10: In the left panel, go to “Access Control (IAM)”.
STEP 11: Now, in this step, we will grant the access connector a role on the storage account we created in the previous steps.
Go to + Add -> Add role assignment.
STEP 12: Now, in the search box, search for "Storage Blob Data Contributor" and select it from the list.
STEP 13: Here, select who you want to give access to: a user, group, service principal, or managed identity.
I have selected “Managed identity“.
Now, enter the details. As shown in the image below, I have selected "Access Connector for Azure Databricks" as the Managed Identity and chosen "my-access-connector".
Now, you will see the “my-access-connector” is added. Click “Review + assign“.
Create a Container in the Storage Account
STEP 14: Now, you have to create a container. In the left panel, go to "Containers" and click on "+ Create".
STEP 15: Enter the container name and the public access level. Click "Create".
Create Metastore in Databricks Unity Catalog
Now, you have to create a metastore. Go to "Data -> Metastores" and click on "Create Metastore".
STEP 16: Enter the details. Here, paste the path of the storage account container and the Access Connector ID. Click "Create".
STEP 17: Now, select the metastore and click on “Assign“.
STEP 18: A prompt box will appear where you have to enable the unity catalog. Click on “Enable“.
Till now, you have created the prerequisite resources that are essential for creating a volume in Databricks Unity Catalog.
Create Root Storage Credential for Metastore
We need to create a storage credential that can be used to access the storage account.
STEP 19: Go to Data Explorer and Add the storage credential.
Now, enter a name and paste the Access Connector's Resource ID that you saved earlier while creating the access connector.
Click "Create". This is what it looks like:
Create a Volume
STEP 20: Now, go to "Data -> main -> default". Click on "Create" to create a volume.
STEP 21: Now, enter the details such as the volume name and volume type, and write a comment if needed. Then click "Create".
See the image below; I have selected "Managed volume". Here, you don't need to provide any path; Databricks will choose one for you.
If you select the "External volume" type, you also have to select the external location and the path where you want to store the volume.
Congratulations! Your volume is created in Databricks Unity Catalog.
STEP 22: Now, you have to grant permissions on your volume so that users can access its files and folders.
Go to Permission -> Grant. See the below image.
Now select the required permission. I have selected all the privileges.
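The same grant can also be issued in SQL rather than through the UI. Here is a minimal sketch, assuming the volume main.default.my_volume and a hypothetical group named data-engineers:

```python
# Hypothetical fully qualified volume name and principal -- replace with yours.
volume_fqn = "main.default.my_volume"
principal = "data-engineers"

# READ VOLUME and WRITE VOLUME are the volume-level privileges in Unity Catalog.
grant_sql = f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {volume_fqn} TO `{principal}`"

# In a Databricks notebook: spark.sql(grant_sql)
```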
How to Access Data in Volumes?
Here are some examples of how to use and access Volumes in the Databricks Unity Catalog:
- Create a Volume
- Grant Permissions to a Volume
- Extract Archive in a Volume
- Access Volume's Content in a Notebook
- Load a Model from a Volume
- List Files in a Volume using Databricks File System Utilities
- Read Text File in a Volume using Apache Spark APIs
- Read CSV File in a Volume using Apache Spark SQL
- Read CSV File in a Volume using Pandas
- Download File to a Volume using Shell Commands
- Install Library from a Volume using %pip
- List Directory in a Volume using Operating System File Utilities
Replace my_catalog, my_schema, my_volume, and other placeholders with your actual catalog, schema, volume names, and file paths.
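A sketch of how several of these examples share the same volume path. Every name here (my_catalog, my_schema, my_volume, data.csv) is a placeholder, and the commented lines are notebook-only calls:

```python
# Placeholder names -- substitute your own catalog, schema, and volume.
volume_path = "/Volumes/my_catalog/my_schema/my_volume"
csv_path = f"{volume_path}/data.csv"

# Because a volume is exposed as an ordinary path, the same location works
# across different APIs (these calls only run inside a Databricks notebook):
# dbutils.fs.ls(volume_path)                    # list files with dbutils
# spark.read.text(f"{volume_path}/notes.txt")   # read a text file with Spark
# spark.sql(f"SELECT * FROM csv.`{csv_path}`")  # query a CSV file with Spark SQL
# pd.read_csv(csv_path)                         # read a CSV with pandas
# %pip install /Volumes/my_catalog/my_schema/my_volume/my_pkg.whl
```

The key design point is the unified /Volumes/&lt;catalog&gt;/&lt;schema&gt;/&lt;volume&gt; path: one governed location that Spark, pandas, shell commands, and %pip can all address the same way.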
How to upload a file to the volume?
By now, you should understand how to create a volume and how to grant permissions so that users can easily access your files and folders.
Now, in this section, we will learn how to upload a file. For that, click on “Upload to this volume”.
A prompt will appear.
For this example, you will upload a zip file. I am using one that can be downloaded from this URL: https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765
Once you have downloaded the zip file, drag and drop it into the prompt that appeared.
Click “Upload”. This will take some time.
Now, after uploading the zip file, we have to extract the images. Create a blank notebook and add this code:
%sh unzip /Volumes/my_catalog/my_schema/my_volume/catsanddogs.zip -d /Volumes/my_catalog/my_schema/my_volume/catsanddogs/
This code will unzip the zip file.
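If you prefer pure Python to a %sh cell, the standard-library zipfile module performs the same extraction. This is a sketch using the same placeholder volume path as above:

```python
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

# On a cluster, with the placeholder names replaced by your own:
# extract_archive(
#     "/Volumes/my_catalog/my_schema/my_volume/catsanddogs.zip",
#     "/Volumes/my_catalog/my_schema/my_volume/catsanddogs/",
# )
```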
How to Secure Databricks Volumes
Volumes may contain sensitive or confidential data that shouldn’t be accessed or modified by unauthorized users. So it is necessary to secure your Databricks Volumes. For example, you may have volumes that store personal information, financial records, health records, or other regulated data.
Securing Databricks Volumes will help you protect your data from breaches, loss, corruption, or misuse. It can also help you comply with data privacy and security regulations, such as GDPR, HIPAA, PCI DSS, or CCPA.
There are two main ways to secure Databricks Volumes:
- Using permissions
- Using auditing
Permissions allow you to control who can access and modify your volumes and files at the schema, volume, or file level. You can also use roles and groups to assign permissions to multiple users at once.
Auditing allows you to monitor user activities and events related to volumes and files. You can use the UI or API to view and download audit logs. You can also use third-party tools to analyze and visualize audit logs.
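As an illustrative sketch of the auditing idea, the records below are simplified stand-ins, not the exact Databricks audit-log schema. Filtering delivered log entries for volume-related activity might look like:

```python
# Simplified, hypothetical audit records; real Databricks audit logs are JSON
# documents with many more fields.
audit_records = [
    {"serviceName": "unityCatalog", "actionName": "createVolume", "user": "alice"},
    {"serviceName": "clusters", "actionName": "start", "user": "bob"},
    {"serviceName": "unityCatalog", "actionName": "getVolume", "user": "carol"},
]

def volume_events(records):
    """Keep only Unity Catalog actions whose name mentions a volume."""
    return [
        r for r in records
        if r["serviceName"] == "unityCatalog" and "Volume" in r["actionName"]
    ]
```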
Testing the Volume
To test the volume, copy and run this code:
%sh unzip /Volumes/main/default/rajaniesh-vol/kagglecatsanddogs_5340.zip -d /Volumes/main/default/rajaniesh-vol/kagglecatsanddogs_5340
from PIL import Image
image_to_classify = "/Volumes/main/default/rajaniesh-vol/kagglecatsanddogs_5340/PetImages/Cat/0.jpg"
image = Image.open(image_to_classify)
display(image)
Here is the output of running the code:
Volumes in Databricks Unity Catalog are essential for organizations handling large and complex datasets. They provide a unified and governed data repository, simplifying data management and analysis for data science teams. The step-by-step guide to creating volumes ensures easy setup, while their Cloud-backed capabilities enable scalable data processing. However, caution must be exercised in managing access permissions to avoid potential data access challenges or unauthorized usage.