
Streamline Databricks Workflows with Azure DevOps Release Pipelines
The process of developing and deploying applications is complex, time-consuming, and often error-prone. Release pipelines help streamline this process and automate the deployment of code and data. Databricks is a popular cloud-based platform used for data engineering, data science, and machine learning tasks, and Azure DevOps is a powerful tool for managing the entire software development lifecycle, including build and release management. In this blog, “Streamline Databricks Workflows with Azure DevOps Release Pipelines”, we will explore how to build release pipelines for Databricks using Azure DevOps and walk through the steps required to set one up. By the end of this post, you will have a good understanding of how to build efficient and reliable release pipelines for Databricks.
Introduction
In my last blog we discussed the build pipeline; in this blog we will discuss the release pipeline. I will provide a step-by-step guide to developing and deploying Azure DevOps release pipelines for Databricks.
DevOps Release Pipeline Steps
STEP 1: Define the Release Pipeline
First, we have to set up the release pipeline. Select Pipelines, click Releases, and it will take you to the new release pipeline.

STEP 2: Select Empty Job Template
Now you need to choose a template for the release pipeline. Use the Empty job template so we can customize it fully.

STEP 3: Add the Artifacts
The release pipeline has two separate sections: Artifacts and Stages. Under Artifacts, select the files dropped by the build pipeline and specify the location of the artifacts, as depicted in the diagram below.

STEP 4: Enable Continuous Deployment Trigger
We need to enable the continuous deployment trigger so that the release pipeline is triggered automatically as soon as the build pipeline produces a new build.

STEP 5: Define environment variables for the release pipeline
Azure DevOps invokes the Databricks CLI to connect to the remote Databricks cluster and deploy the code, so we need to create variables in Azure DevOps that store the instance-specific information for the workspace the code will be released to. Here are the variables and their significance; a short sketch of how the CLI picks them up follows at the end of this step.
| Parameter name used by Databricks Connect | What does it mean? |
| --- | --- |
| DATABRICKS_HOST | The https://adb-XXXXX.azuredatabricks.net part of the Databricks workspace URL. |
| DATABRICKS_TOKEN | An API token generated from Databricks, used for REST API calls. |
| DATABRICKS_CLUSTER_ID | Code can be deployed to any cluster, so this is the unique cluster ID in the cluster URL: https://adb-xxx.azuredatabricks.net/?o=xxx#setting/clusters/XXX/configuration |
This is how it will look inside Azure DevOps.

Here are the step-by-step instructions for creating the Token.

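The legacy Databricks CLI reads DATABRICKS_HOST and DATABRICKS_TOKEN from the environment, so wiring these pipeline variables to the CLI in an inline script on the release agent could look like the following sketch (the verification command is only an example):
export DATABRICKS_HOST=$(DATABRICKS_HOST)
export DATABRICKS_TOKEN=$(DATABRICKS_TOKEN)
# Quick sanity check that the CLI can reach the workspace
databricks workspace ls /Shared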
STEP 6: Configure Release Agent
We will deploy an Azure virtual machine as the agent for the release pipeline. The virtual machine image should match the one on the Azure Databricks cluster as closely as possible. For example, Databricks Runtime 10.4 LTS runs Ubuntu 20.04.4 LTS, which maps to the Ubuntu 20.04 virtual machine image in the Azure Pipelines agent pool. The Databricks runtime release notes list the exact OS for each runtime; for example, Runtime 13.0 runs on Ubuntu 22.04.2 LTS according to its release notes. Always make sure you use the OS version supported by the runtime you are targeting.

STEP 7: Set the Python Version for the release agent
Now that we have the VM, we need to install the required version of Python and the build tools on the virtual machine for testing and packaging the Python code. Make sure that the Python version matches the version installed on your remote Azure Databricks cluster (a short verification sketch follows below).

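For example, assuming the cluster runs Databricks Runtime 10.4 LTS (which ships Python 3.8), an inline script on the agent could verify the interpreter and prepare the packaging tools like this; the version itself is usually pinned with the release pipeline's Use Python version task:
# Should report a version matching the cluster, e.g. Python 3.8.x for DBR 10.4 LTS
python3 --version
# Packaging and test tooling used in the later steps
python3 -m pip install --upgrade pip setuptools wheel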
STEP 8: Unpackage the Build Artifact from the Build Pipeline
In my previous blog, when we created the build pipeline, we zipped the build artifacts and dropped them in the artifact location. In this step we extract the zipped build artifact so it can be used for deployment (a sketch of the extraction follows below).

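If the build pipeline dropped the artifacts as a zip archive, an inline script on the agent could extract them roughly like this; the archive name DatabricksBuild.zip is hypothetical, and the built-in Extract files task achieves the same thing:
# Hypothetical archive name; adjust to whatever your build pipeline publishes
unzip -o $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/DatabricksBuild.zip -d $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/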
STEP 9: Install the Databricks CLI and Unit Test XML Reporting
We will install the Databricks CLI and the unittest-xml-reporting Python package on the release agent, as both are used in the upcoming steps (see the sketch below).

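One way to do this in an inline script on the agent (both package names as published on PyPI):
pip install databricks-cli unittest-xml-reporting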
STEP 10: Deploy the notebook to the workspace
Now we have to import the Python notebook from the artifacts directory into the Databricks workspace. That is exactly what this CLI command does.
databricks workspace import --language=PYTHON --format=SOURCE --overwrite $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks/dbxdemo-notebook.py /Shared/dbxdemo-notebook.py
Let’s break down this command so you can tweak it to your own needs:
- databricks workspace import: the command to import a file or notebook into the Databricks workspace.
- --language=PYTHON: specifies the language of the file or notebook being imported; in this case, a Python notebook.
- --format=SOURCE: specifies the format of the file or notebook being imported; in this case, a source file.
- --overwrite: if a notebook with the same name already exists in the Databricks workspace, it is overwritten.
- $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks/dbxdemo-notebook.py: the path to the notebook file in the artifacts directory.
- /Shared/dbxdemo-notebook.py: the path where the notebook will be imported into the Databricks workspace.

STEP 11: Copy Python Wheel to the workspace
Now we will copy a Python library wheel file from the artifacts directory to the Databricks File System (DBFS) in the Databricks workspace. Let me explain how this command works.
databricks fs cp --overwrite $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl dbfs:/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl
- databricks fs cp: the command to copy files between the local file system and DBFS.
- --overwrite: if a file with the same name already exists at the destination, it is overwritten.
- $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl: the path to the Python library wheel file in the artifacts directory.
- dbfs:/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl: the path where the file will be copied to in DBFS.

STEP 12: Install the Python Wheel library to cluster
This step runs a Python script called “installWhlLibrary.py”, which installs a Python wheel library on a Databricks cluster. Let me explain the command:
$(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/installWhlLibrary.py --shard=$(DATABRICKS_HOST) --token=$(DATABRICKS_TOKEN) --clusterid=$(DATABRICKS_CLUSTER_ID) --libs=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/ --dbfspath=/libraries/python/libs
- $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/installWhlLibrary.py: the path to the Python script in the artifacts directory that installs the library.
- --shard=$(DATABRICKS_HOST): the hostname of the Databricks workspace to connect to.
- --token=$(DATABRICKS_TOKEN): the Databricks access token to use for authentication.
- --clusterid=$(DATABRICKS_CLUSTER_ID): the ID of the Databricks cluster where the library will be installed.
- --libs=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/: the path to the directory containing the Python library wheel file in the artifacts directory.
- --dbfspath=/libraries/python/libs: the DBFS path where the library will be installed on the Databricks cluster.

This Python script installs Python .whl libraries on a Databricks cluster. It uses the Databricks REST API to interact with the cluster and performs the following steps:
- Parse the command-line arguments using getopt.
- Walk the local file path specified in libspath to generate the list of .whl files to evaluate.
- For each library in the list, evaluate whether it needs to be installed, uninstalled and reinstalled, or left as is.
- If the library is not found on the cluster, install it using the installLib function.
- If the library is found on the cluster, uninstall it using the uninstallLib function, restart the cluster using restartCluster, and then install it again using installLib.
The getLibStatus function determines whether a library is already installed on the cluster and what its current status is. The main function is the entry point of the script and calls the other functions to perform the library installation.
# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time
import os


def main():
    shard = ''
    token = ''
    clusterid = ''
    libspath = ''
    dbfspath = ''

    try:
        # Short options that take a value need a trailing colon.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
                                   ['shard=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
    except getopt.GetoptError:
        print(
            'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(
                'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
            sys.exit()
        elif opt in ('-s', '--shard'):
            shard = arg
        elif opt in ('-t', '--token'):
            token = arg
        elif opt in ('-c', '--clusterid'):
            clusterid = arg
        elif opt in ('-l', '--libs'):
            libspath = arg
        elif opt in ('-d', '--dbfspath'):
            dbfspath = arg

    print('-s is ' + shard)
    print('-t is ' + token)
    print('-c is ' + clusterid)
    print('-l is ' + libspath)
    print('-d is ' + dbfspath)

    # Generate the list of .whl files from walking the local path.
    libslist = []
    for path, subdirs, files in os.walk(libspath):
        for name in files:
            name, file_extension = os.path.splitext(name)
            if file_extension.lower() in ['.whl']:
                print('Adding ' + name + file_extension.lower() + ' to the list of .whl files to evaluate.')
                libslist.append(name + file_extension.lower())

    for lib in libslist:
        dbfslib = 'dbfs:' + dbfspath + '/' + lib
        print('Evaluating whether ' + dbfslib + ' must be installed, or uninstalled and reinstalled.')

        if getLibStatus(shard, token, clusterid, dbfslib) is not None:
            print(dbfslib + ' status: ' + getLibStatus(shard, token, clusterid, dbfslib))
            if getLibStatus(shard, token, clusterid, dbfslib) == "not found":
                print(dbfslib + ' not found. Installing.')
                installLib(shard, token, clusterid, dbfslib)
            else:
                print(dbfslib + ' found. Uninstalling.')
                uninstallLib(shard, token, clusterid, dbfslib)
                print("Restarting cluster: " + clusterid)
                restartCluster(shard, token, clusterid)
                print('Installing ' + dbfslib + '.')
                installLib(shard, token, clusterid, dbfslib)


def uninstallLib(shard, token, clusterid, dbfslib):
    # Mark the wheel for uninstallation; it is removed when the cluster restarts.
    values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}
    requests.post(shard + '/api/2.0/libraries/uninstall', data=json.dumps(values), auth=("token", token))


def restartCluster(shard, token, clusterid):
    # Restart the cluster and poll (up to roughly five minutes) until it reaches a stable state.
    values = {'cluster_id': clusterid}
    requests.post(shard + '/api/2.0/clusters/restart', data=json.dumps(values), auth=("token", token))

    p = 0
    waiting = True
    while waiting:
        time.sleep(30)
        clusterresp = requests.get(shard + '/api/2.0/clusters/get?cluster_id=' + clusterid,
                                   auth=("token", token))
        clusterjson = clusterresp.text
        jsonout = json.loads(clusterjson)
        current_state = jsonout['state']
        print(clusterid + " state: " + current_state)
        if current_state in ['TERMINATED', 'RUNNING', 'INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
            break
        p = p + 1


def installLib(shard, token, clusterid, dbfslib):
    values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}
    requests.post(shard + '/api/2.0/libraries/install', data=json.dumps(values), auth=("token", token))


def getLibStatus(shard, token, clusterid, dbfslib):
    # Return the install status of the wheel on the cluster, or "not found".
    resp = requests.get(shard + '/api/2.0/libraries/cluster-status?cluster_id=' + clusterid, auth=("token", token))
    libjson = resp.text
    d = json.loads(libjson)
    if d.get('library_statuses'):
        statuses = d['library_statuses']
        for status in statuses:
            if status['library'].get('whl'):
                if status['library']['whl'] == dbfslib:
                    return status['status']
    # No libraries found, or this wheel is not among them.
    return "not found"


if __name__ == '__main__':
    main()
STEP 13: Create Integration Test Directories
Here we create the integration test directories and install the test dependencies. Let’s understand these commands.
This step runs three commands: two mkdir commands that create log directories in the artifacts directory, and a pip install that installs the Python test packages.
mkdir -p $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/json
mkdir -p $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/xml
pip install pytest requests
- The first two commands create directories named “json” and “xml” inside a “logs” directory, which in turn sits under “$(Release.PrimaryArtifactSourceAlias)/Databricks” in the artifacts directory. The -p option ensures that any missing parent directories are created.
- The pip command installs two Python packages: pytest, a testing framework for Python, and requests, a package for making HTTP requests in Python.

STEP 14: Run Notebooks and Understand the executenotebook.py Python Script
This step runs a Python script called “executenotebook.py”, which executes the Databricks notebooks and saves the results in JSON log files. Let’s break down the command (the fully assembled command is shown after the list):
- $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/executenotebook.py: the path to the Python script in the artifacts directory that executes the notebooks.
- --shard=$(DATABRICKS_HOST): the hostname of the Databricks workspace to connect to.
- --token=$(DATABRICKS_TOKEN): the Databricks access token to use for authentication.
- --clusterid=$(DATABRICKS_CLUSTER_ID): the ID of the Databricks cluster where the notebooks will be executed.
- --localpath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks: the path to the directory containing the Databricks notebooks in the artifacts directory.
- --workspacepath=/Shared: the path where the notebooks will be uploaded in the Databricks workspace.
- --outfilepath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/json: the path where the JSON log files containing the results of the notebook execution will be saved.
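Putting these options together, the full command for this step, mirroring the format of the step 12 command, looks like this:
$(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/executenotebook.py --shard=$(DATABRICKS_HOST) --token=$(DATABRICKS_TOKEN) --clusterid=$(DATABRICKS_CLUSTER_ID) --localpath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks --workspacepath=/Shared --outfilepath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/json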

The Python script “executenotebook.py” is designed to execute a set of notebooks on a Databricks cluster. The script takes command-line arguments to configure the execution, including the Databricks cluster and authentication information, local and workspace paths to the notebooks, and an output file path.
The script first parses the command-line arguments using the getopt module and prints them to the console. Then it walks the local path to discover the notebooks and generates the list of notebooks to execute.
The script then executes each notebook in the list by submitting a job run request to Databricks. The request specifies the notebook to execute, the cluster to execute on, and a timeout value. The script then waits for the job to complete by polling the Databricks REST API until the run’s state indicates that it has terminated, hit an internal error, or was skipped.
If an output file path is specified, the script writes the JSON response from the run to a file named after the run ID in the specified output directory.
Finally, the main() function calls the steps above in order, and the if __name__ == '__main__': block at the end of the script ensures that main() is only called when the script is run directly, not when it is imported as a module.
# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time


def main():
    shard = ''
    token = ''
    clusterid = ''
    localpath = ''
    workspacepath = ''
    outfilepath = ''

    try:
        # Short options that take a value need a trailing colon.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
                                   ['shard=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=', 'outfilepath='])
    except getopt.GetoptError:
        print(
            'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(
                'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
            sys.exit()
        elif opt in ('-s', '--shard'):
            shard = arg
        elif opt in ('-t', '--token'):
            token = arg
        elif opt in ('-c', '--clusterid'):
            clusterid = arg
        elif opt in ('-l', '--localpath'):
            localpath = arg
        elif opt in ('-w', '--workspacepath'):
            workspacepath = arg
        elif opt in ('-o', '--outfilepath'):
            outfilepath = arg

    print('-s is ' + shard)
    print('-t is ' + token)
    print('-c is ' + clusterid)
    print('-l is ' + localpath)
    print('-w is ' + workspacepath)
    print('-o is ' + outfilepath)

    # Generate the list of notebooks from walking the local path.
    notebooks = []
    for path, subdirs, files in os.walk(localpath):
        for name in files:
            fullpath = path + '/' + name
            # Remove the localpath to the repo but keep the workspace path.
            fullworkspacepath = workspacepath + path.replace(localpath, '')
            name, file_extension = os.path.splitext(fullpath)
            if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
                row = [fullpath, fullworkspacepath, 1]
                notebooks.append(row)

    # Run each notebook in the list.
    for notebook in notebooks:
        nameonly = os.path.basename(notebook[0])
        workspacepath = notebook[1]

        name, file_extension = os.path.splitext(nameonly)
        # workspacepath removes the extension, so now add it back.
        fullworkspacepath = workspacepath + '/' + name + file_extension

        print('Running job for: ' + fullworkspacepath)
        values = {'run_name': name, 'existing_cluster_id': clusterid, 'timeout_seconds': 3600,
                  'notebook_task': {'notebook_path': fullworkspacepath}}

        resp = requests.post(shard + '/api/2.0/jobs/runs/submit',
                             data=json.dumps(values), auth=("token", token))
        runjson = resp.text
        print("runjson: " + runjson)
        d = json.loads(runjson)
        runid = d['run_id']

        # Poll the run until it reaches a terminal state (or the polling limit is hit).
        i = 0
        waiting = True
        while waiting:
            time.sleep(10)
            jobresp = requests.get(shard + '/api/2.0/jobs/runs/get?run_id=' + str(runid),
                                   data=json.dumps(values), auth=("token", token))
            jobjson = jobresp.text
            print("jobjson: " + jobjson)
            j = json.loads(jobjson)
            current_state = j['state']['life_cycle_state']
            runid = j['run_id']
            if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
                break
            i = i + 1

        # Save the final run JSON so it can be evaluated in a later step.
        if outfilepath != '':
            file = open(outfilepath + '/' + str(runid) + '.json', 'w')
            file.write(json.dumps(j))
            file.close()


if __name__ == '__main__':
    main()
STEP 15: Create and Evaluate Notebook Test Results with the evaluatenotebookruns.py Python Script
This Python script evaluates the notebook runs. Let’s understand this Python script:
The Python script “evaluatenotebookruns.py” contains a unit test class named “TestJobOutput” with two test methods: “test_performance” and “test_job_run”.
The purpose of this script is to evaluate the output of Databricks notebook runs by processing JSON log files located in the specified path (self.test_output_path) and asserting that they meet certain criteria.
In the “test_performance” method, each JSON log file is loaded and the execution duration of the notebook run is checked. The execution_duration field is reported in milliseconds, so if the duration exceeds 100,000 ms the test fails; otherwise it succeeds.
In the “test_job_run” method, each JSON log file is loaded and the job’s result state is checked. If any job run failed, the test fails; otherwise it succeeds.
After running all tests, the results are output to an XML file named “TEST-report.xml” using the xmlrunner module. The output is transformed into an XML format that can be read by continuous integration tools such as Azure DevOps or Jenkins.
This script is intended to be run in a continuous integration (CI) environment to ensure that all Databricks notebook runs meet certain criteria and pass specific tests.

# evaluatenotebookruns.py
#!/usr/bin/python3
import io
import xmlrunner
from xmlrunner.extra.xunit_plugin import transform
import unittest
import json
import glob
import os


class TestJobOutput(unittest.TestCase):

    # Path on the release agent that holds the JSON run logs produced in the previous step.
    test_output_path = '<path-to-json-logs-on-release-agent>'

    def test_performance(self):
        # Fail if any notebook run took longer than 100,000 ms.
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            duration = data['execution_duration']
            if duration > 100000:
                status = 'FAILED'
            else:
                status = 'SUCCESS'
            statuses.append(status)

        self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        # Fail if any notebook run did not finish successfully.
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            status = data['state']['result_state']
            statuses.append(status)

        self.assertFalse('FAILED' in statuses)


if __name__ == '__main__':
    out = io.BytesIO()

    unittest.main(testRunner=xmlrunner.XMLTestRunner(output=out),
                  failfast=False, buffer=False, catchbreak=False, exit=False)

    with open('TEST-report.xml', 'wb') as report:
        report.write(transform(out.getvalue()))
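For reference, a hedged sketch of how this evaluation script might be invoked on the release agent (the cicd-scripts location mirrors the earlier steps and is an assumption; test_output_path inside the script must point at the logs/json directory created in step 13):
python3 $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/evaluatenotebookruns.py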
STEP 16: Publish Test Results
This step publishes the Python unit test results, produced in JUnit XML format, to the test results file path so the results can be viewed in Azure DevOps.

STEP 17: Test the end-to-end Execution
When you run the end-to-end pipeline, this is how it will look in the release pipeline.

Conclusion
In this blog, Streamline Databricks Workflows with Azure DevOps Release Pipelines, we learned that the process of developing a release pipeline involves multiple steps, including selecting the right job template, adding artifacts, defining environment variables, setting up the release agent, and configuring continuous deployment triggers. It also involves deploying the notebook to the workspace, installing the necessary dependencies, running tests, and publishing test results. While the process may seem complex, it can help automate the deployment process, reduce errors, and ensure that the workspace is always up-to-date with the latest changes. By following the steps outlined in this blog post, you can create a reliable and robust release pipeline that meets your organization’s specific needs and requirements.