How to connect Databricks to Azure Data Lake?

Databricks is a data analytics platform built on the popular open-source Apache Spark processing engine. Azure Databricks is the fully managed version of Databricks and a premium offering on Azure that brings you an enterprise-grade, secure, cloud-based Big Data and Machine Learning platform.

Data can be ingested into Azure Databricks in a variety of ways. For real-time Machine Learning projects, you can ingest data through a wide range of technologies including Kafka, Event Hubs, or IoT Hubs. In addition, you can ingest batches of data using Azure Data Factory from a variety of data stores including Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse, which can then be used in the Spark-based engine within Databricks.

In this article we are going to connect Databricks to Azure Data Lake.

Here is the step-by-step approach to establish connectivity from Databricks to Data Lake:

STEP 1: Create a Resource Group

Azure Data Lake can manage its encryption key on its own, or we can store the key in Azure Key Vault. Here we are going to create a Data Lake Store and an Azure Key Vault to hold the Data Lake key.
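If you prefer scripting to the portal, the resource group can also be created from Python with the Azure management SDK. This is a minimal sketch, not the article's original method; the names my-datalake-rg and eastus are placeholders, and it assumes the azure-identity and azure-mgmt-resource packages are installed and that you are already signed in (for example via az login).

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate with whatever credential is available (CLI login, environment variables, ...)
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<your-subscription-id>")

# Create (or update) the resource group that will hold the Data Lake and the Key Vault
rg = client.resource_groups.create_or_update(
    "my-datalake-rg",           # placeholder name
    {"location": "eastus"})     # placeholder region
print(rg.name, rg.location)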

STEP 2: Create an Azure Data Lake and a Key Vault, and store the Azure Data Lake key in the Key Vault.
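The Key Vault half of this step can likewise be scripted. Here is a sketch using the azure-mgmt-keyvault package, assuming the placeholder resource group from STEP 1 and a placeholder tenant id; the Data Lake Store account itself is created in the portal as described in this article.

from azure.identity import DefaultAzureCredential
from azure.mgmt.keyvault import KeyVaultManagementClient

credential = DefaultAzureCredential()
kv_client = KeyVaultManagementClient(credential, "<your-subscription-id>")

# Create the vault that will hold the Data Lake encryption key
vault = kv_client.vaults.begin_create_or_update(
    "my-datalake-rg",                # resource group from STEP 1
    "myownkeyvaultfordatalake",      # vault name used later in this article
    {
        "location": "eastus",
        "properties": {
            "tenant_id": "<your-tenant-id>",
            "sku": {"family": "A", "name": "standard"},
            "access_policies": [],
        },
    },
).result()
print(vault.properties.vault_uri)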

STEP 3: Create a key in the Key Vault for the Azure Data Lake.

Here I have named the key mykey, but you can use any name.
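If you want to create the key programmatically instead, a minimal sketch with the azure-keyvault-keys package looks like this; the vault name matches the one used in the PowerShell step below, and everything else is an assumption of this example.

from azure.identity import DefaultAzureCredential
from azure.keyvault.keys import KeyClient

credential = DefaultAzureCredential()
key_client = KeyClient(
    vault_url="https://myownkeyvaultfordatalake.vault.azure.net",
    credential=credential)

# Create an RSA key named "mykey" for the Data Lake to use
key = key_client.create_rsa_key("mykey", size=2048)
print(key.name, key.key_type)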

STEP 4: After creating the key, the Key Vault will not allow the Data Lake account to use it until you grant access, so it will show this error:

You need to grant access either through the Grant Permissions option in the portal or by running the PowerShell cmdlets shown below:

# Sign in to Azure and list the available subscriptions
Connect-AzAccount
Get-AzSubscription

# Copy the subscription id from the Get-AzSubscription output into the command below
Select-AzSubscription -SubscriptionId YOUR-SUBSCRIPTION-ID

# Grant the Data Lake account's identity (its object id) permission to use the key
Set-AzKeyVaultAccessPolicy -VaultName myownkeyvaultfordatalake -ObjectId bbdd521b-b7ab-4074-b244-5e619654e1fd -PermissionsToKeys encrypt,decrypt,get

Once the permission is granted, it will look like this.

STEP 5: Create an App Registration in Azure AD. Go to Azure AD, click App registrations, and choose the supported account types. Please note that you can give any placeholder URL in the Redirect URI option.

Once the App Registration is complete, it will show you this screen, so please copy the Application (client) ID and Directory (tenant) ID into a notepad because we will use them in the Databricks workspace.

The service principal will look like this.

STEP 6: Create the client secret and save it in the notepad.

STEP 7: Now create a Databricks cluster and make sure it is on the Premium plan. Once the cluster is up and running, create a Python notebook and paste the code below. It uses the client ID, client secret, and directory ID captured in the previous steps; you can hard-code them for a quick test or store them more securely, as in the secret-scope sketch that follows the code.

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "Client ID from STEP 5",
           "dfs.adls.oauth2.credential": "Client secrete from step 6",
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/Directory ID from Step 5/oauth2/token"  
          }
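Hard-coding the client secret is fine for a quick test, but it is safer to keep it in a Databricks secret scope, which can be backed by the Key Vault created earlier. A minimal sketch, assuming a hypothetical scope named adls-scope containing a secret named sp-client-secret:

# Read the client secret from a secret scope instead of pasting it into the notebook
client_secret = dbutils.secrets.get(scope="adls-scope", key="sp-client-secret")

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "<client-id-from-step-5>",
           "dfs.adls.oauth2.credential": client_secret,
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id-from-step-5>/oauth2/token"
          }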

STEP 8: Before using the Data Lake from Databricks, we need to mount it.

# Mount the ADLS account into DBFS using the OAuth configs above
dbutils.fs.mount(
  source = "adl://yourdatalakename.azuredatalakestore.net",
  mount_point = "/mnt",
  extra_configs = configs)
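You can confirm the mount took effect, and remove it if you ever need to remount with new credentials:

# List all DBFS mount points to confirm the new mount is present
display(dbutils.fs.mounts())

# If you need to remount later, unmount first:
# dbutils.fs.unmount("/mnt")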

STEP 9: Verify that the Data Lake contents are accessible through Databricks.

Suppose you have a RAW folder in the Data Lake and this folder contains a CSV file named export.csv; it can be accessed like this.

# Access the file through the mount
dbutils.fs.ls("/mnt/RAW/export.csv")
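Beyond listing the file, you can read it straight into a Spark DataFrame as a quick sanity check before creating a table:

# Read the mounted CSV into a DataFrame and inspect it
df = spark.read.csv("/mnt/RAW/export.csv", header=True, inferSchema=True)
df.printSchema()
display(df.limit(10))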

Now let’s create a table from it.

%sql
-- Create a SQL table and import the data
DROP TABLE IF EXISTS incident_data;
CREATE TABLE incident_data
USING CSV
OPTIONS (
  path "/mnt/RAW/export.csv",
  header "true",
  inferSchema "true"
)

And now browse the results in Databricks:

%sql
-- Browse the imported data
SELECT * FROM incident_data
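The same table is also reachable from Python cells through the Spark SQL API, for example:

# Query the table from Python and count the imported rows
incidents = spark.sql("SELECT * FROM incident_data")
print(incidents.count())
display(incidents.limit(10))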
