Boost Productivity with Databricks CLI: A Comprehensive Guide


Exciting news! The Databricks CLI has undergone a major overhaul. It now covers all Databricks REST API operations and supports every Databricks authentication type. The best part? Windows users are no longer left out and can install the new CLI just as easily as macOS and Linux users, who can also install it with Homebrew.

This blog aims to provide comprehensive guidance on the Databricks CLI, covering installation instructions, authentication setup, and practical examples of various Databricks CLI commands. By the end of this blog, you will have a clear understanding of how to use the Databricks CLI effectively in your workflow.

No more exclusion; it’s a unified experience across platforms! Brace yourself for a smoother, more efficient, and more powerful workflow as the new cross-platform Databricks CLI propels you into the future of data-driven excellence!

What is Databricks CLI?

The Databricks CLI is a command-line interface tool that allows you to automate the Databricks platform from your terminal, command prompt, or automation scripts. It is built on top of the Databricks REST APIs and implements the Databricks client unified authentication standard, which protects user and business data.

Some of the benefits of the Databricks CLI are:

  1. You can run custom commands that are not available in the web UI.
  2. You can automate tasks such as creating and managing clusters of any size.
  3. You can use the Databricks CLI on Linux, Windows, and macOS.
  4. You can save time by avoiding switching between multiple browser tabs and workspaces.

How to Install

To install the Databricks CLI on Windows, follow these simple steps:

STEP 1: Go to the GitHub repository of Databricks CLI.

STEP 2: In the “Releases” section, locate the correct .zip file for your machine’s operating system and architecture. Download it.

STEP 3: Follow your operating system’s documentation to extract the contents of the downloaded .zip file. This process may involve using a built-in utility or a third-party application to unzip the file.

STEP 4: After extraction, you will see a folder with the same name as the .zip file.

STEP 5: Open the folder, and inside, you’ll find the Databricks CLI executable file.

STEP 6: At this point, you have a choice: you can either keep the Databricks CLI executable in this folder or move/copy it to another location on your computer for easier access. You should add the folder path to the PATH environment variable so Databricks CLI commands can be run from anywhere.
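For the last option, here is a minimal sketch of adding the folder to PATH (the folder locations below are illustrative; use wherever you placed the executable):

# Windows (Command Prompt): append the folder to the user PATH persistently
setx PATH "%PATH%;C:\Tools\databricks-cli"

# macOS/Linux (bash/zsh): add the folder to PATH for the current session
export PATH="$PATH:$HOME/databricks-cli"

On Windows, open a new terminal after running setx so the change takes effect.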

STEP 7: Open a terminal or command prompt, navigate to the location where the Databricks CLI executable is located, and run the command:

databricks -v 

# Or

databricks version
Both databricks -v and databricks version display the version information.
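If the installation succeeded, the command prints the installed version, for example (the version number shown here is illustrative; yours will differ):

Databricks CLI v0.205.0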

To install the Databricks CLI on macOS and Linux, you can use Homebrew. For instructions, see Install Databricks CLI on macOS and Linux.

How to Set Up Authentication

In this section, you’ll find a step-by-step guide to authenticate using the Databricks personal access token authentication method, widely recognized for its security and reliability.

Databricks Personal Access Token Authentication

A Databricks Personal Access Token is a long-lived token that is generated by a user. It can be used to authenticate to Databricks APIs and to access Databricks notebooks. Databricks personal access token authentication uses a Databricks personal access token to verify the identity of the desired Databricks entity, be it a Databricks user account or a Databricks service principal.

You need to follow two main steps to set up Databricks personal access token authentication:

  1. Create a Databricks personal access token
  2. Create a configuration profile

STEP 1: Create an Access Token

Below are the steps to create a Databricks personal access token for a Databricks user:

  1. Navigate to your Databricks workspace and click on your Databricks username in the top bar. From the dropdown menu, select “User Settings”.
  2. In the “User Settings” page, go to the “Access tokens” tab and click on “Generate new token.”
  3. Optionally, you can provide a comment to help identify this token in the future and adjust the token’s default lifetime (which is set to 90 days). If you prefer a token with no lifetime (not recommended), simply leave the “Lifetime (days)” box empty.
  4. Click on “Generate” to create the personal access token.
  5. The newly generated token will be displayed. Make sure to copy it for later use, and then click “Done” to complete the process.

Remember, if you lose the token, you will have to generate a new one. You cannot retrieve the old token. Also, tokens provide full access to Databricks APIs, so they should be kept secure.
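As an aside, if you would rather not store the token on disk at all, the Databricks CLI can also read credentials from environment variables. A minimal sketch (both values are placeholders):

# Export these in your shell before running CLI commands
export DATABRICKS_HOST="https://adb-1236286498149800.0.azuredatabricks.net"
export DATABRICKS_TOKEN="<your-personal-access-token>"

On Windows Command Prompt, use set (or setx for persistence) instead of export.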

STEP 2: Create Configuration Profile

A configuration profile refers to a set of settings that includes authentication details like the Databricks workspace URL and access token value. Each configuration profile is assigned a programmatic name, such as “DEFAULT,” “DEV,” or “PROD,” to distinguish and manage various setups efficiently.

To create a configuration profile, run the following command:

databricks configure --host <workspace-url> --profile <configuration-profile-name>
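For example, with an illustrative Azure Databricks workspace URL and a profile named DEV:

databricks configure --host https://adb-1236286498149800.0.azuredatabricks.net --profile DEV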

When prompted, enter the Databricks personal access token you created in the previous step. The CLI then creates an entry for the profile in the .databrickscfg file.

You can also create a configuration profile manually. Use your favorite text editor to create a file named .databrickscfg in your ~ (user home) folder on Unix, Linux, or macOS, or in your %USERPROFILE% (user home) folder on Windows, if you do not already have one. Do not forget the dot (.) at the beginning of the file name. Add the following contents to this file:

[configuration-profile-name]
host = https://adb-1236286498149800.0.azuredatabricks.net
token = <your-personal-access-token>

Where,

  • The host field is the workspace URL of your Databricks workspace.
  • The token field is the value of your Databricks personal access token.
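Because each profile is just a named section, a single .databrickscfg file can hold several environments, such as the “DEFAULT,” “DEV,” and “PROD” profiles mentioned earlier. A sketch with placeholder values:

[DEFAULT]
host = https://adb-1111111111111111.1.azuredatabricks.net
token = <default-token>

[DEV]
host = https://adb-2222222222222222.2.azuredatabricks.net
token = <dev-token>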

The host field contains the unique per-workspace URL, which has the following format:

adb-<workspace-id>.<random-number>.azuredatabricks.net

The workspace ID is the unique identifier for the workspace. It appears immediately after the adb- prefix and before the . (dot). For example, if the workspace ID is 1236286498149800, the URL would look like this: “https://adb-1236286498149800.0.azuredatabricks.net“.

You can determine the per-workspace URL for your workspace in two ways:

  • When you are logged in:
    • In the top bar of the Databricks UI, click on your username.
    • From the dropdown menu, select “Workspaces“.
    • The per-workspace URL will be displayed in the URL bar of your web browser.
  • By selecting the resource:
    • In the Databricks UI, click on the “Resources” tab.
    • In the “Workspaces” section, select the workspace whose per-workspace URL you want to determine.
    • The per-workspace URL will be displayed in the URL field of the workspace details page.

Once you have created a configuration profile, you can use it to authenticate to Databricks in your code. The specific code that you need to use will vary depending on the tool or SDK that you are using.
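For the CLI itself, selecting a profile is just a flag. For example (the profile name is illustrative):

databricks clusters list --profile DEV

This runs the command against whichever workspace the DEV profile points to.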

Databricks CLI commands

Databricks CLI commands are designed to simplify various tasks and enable easy interactions with Databricks. These commands can be categorized into two types:

  1. Global flags
  2. Command groups

Global Flags

Global flags are options that can be applied to multiple Databricks CLI commands. They allow you to customize the behavior of commands according to your requirements. These flags are typically used as modifiers for commands and are specified as options when invoking them. Global flags are helpful for setting options that affect multiple commands consistently.

The --profile flag specifies the profile to use for authentication, and the --output flag specifies the format in which the output should be displayed.

Here is an example of a command that uses the --profile and --output flags:

databricks sql --query "SELECT * FROM my_table" --profile my-profile --output json

This command will run the SQL command SELECT * FROM my_table on the Databricks SQL warehouse, using the profile my-profile and displaying the results in JSON format.

For example, the -d flag can be used to specify the Databricks URL.

databricks sql --query "SELECT * FROM my_table" -d https://my-databricks-url-as-described-above

This command will run the SQL command SELECT * FROM my_table on the Databricks SQL warehouse, using the Databricks URL https://my-databricks-url (as described above).

Here is a list of some of the global flags in the Databricks CLI:

  • -u – Databricks username
  • -p – Databricks password
  • -c – Cluster ID
  • -n – Notebook ID
  • -t – Table name
  • -l – Log path
  • -f – File path
  • -o – Write the output as text or as JSON

Command Groups

Command groups are organized based on the functionality or task they perform. Grouping gives you access to multiple related commands under a single namespace, making them easier to remember and use efficiently. Here is the list:

  1. Cluster – Create, manage, and delete clusters
  2. Notebook – Run, manage, and share notebooks
  3. Data – Load, unload, and manage data
  4. Job – Submit, monitor, and cancel jobs
  5. User – Manage users and groups
  6. Role – Manage roles and permissions

1. Cluster Command: Consider the example below, where we use the Cluster command to create a new cluster, get the list of clusters, and delete a cluster.

# Create a new cluster named my-cluster with 4 workers of type Standard_D2_v2
databricks clusters create --cluster-name my-cluster --node-type-id Standard_D2_v2 --num-workers 4

# List all of the clusters in your Databricks account. 
databricks clusters list

# Delete the cluster with the ID my-cluster. 
databricks clusters delete --cluster-id my-cluster

A list of other cluster commands:

  1. databricks clusters describe – to describe a cluster
  2. databricks clusters resize – to resize a cluster
  3. databricks clusters start – to start a cluster
  4. databricks clusters stop – to stop a cluster

Describe a cluster:

databricks clusters describe --cluster-id my-cluster

This command will describe the cluster with the ID my-cluster.

Resize a cluster:

databricks clusters resize --cluster-id my-cluster --num-workers 8

This command will resize the cluster with the ID my-cluster to have 8 workers.

Start a cluster:

databricks clusters start --cluster-id my-cluster

This command will start the cluster with the ID my-cluster.

Stop a cluster:

databricks clusters stop --cluster-id my-cluster

For more information on the Databricks clusters commands, you can run the following command:

databricks clusters --help
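When scripting, you may want to capture the new cluster’s ID from the JSON that the create command returns. A minimal sketch, assuming the response contains a cluster_id field and that the jq utility (covered in Example 4 below) is installed:

# Create a cluster and capture its ID from the JSON response
# (assumes a cluster_id field in the output)
cluster_id=$(databricks clusters create --cluster-name my-cluster --node-type-id Standard_D2_v2 --num-workers 4 | jq -r '.cluster_id')
echo "Created cluster: $cluster_id"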

2. Notebook Command: Here is an example where we will use the Notebook command to create a notebook, run it on a cluster, list notebooks, and delete a notebook.

# Create a new notebook named my-notebook in the Python language
databricks notebook create --notebook-path my-notebook --language python

# Run the notebook named my-notebook on a cluster with the ID my-cluster-ID
databricks notebook run --notebook-path my-notebook --cluster-id my-cluster-ID

# List all of the notebooks in your Databricks account. 
databricks notebook list 

# Delete the notebook with the ID my-notebook. 
databricks notebook delete --notebook-id my-notebook

A list of other Notebook commands:

  1. databricks notebook describe – Describes a notebook
  2. databricks notebook share – Shares a notebook
  3. databricks notebook unshare – Unshares a notebook

Describe a notebook:

databricks notebook describe --notebook-id my-notebook

This command will describe the notebook with the ID my-notebook.

Share a notebook:

databricks notebook share --notebook-id my-notebook --user my-username

This command will share the notebook with the user my-username.

Unshare a notebook:

databricks notebook unshare --notebook-id my-notebook --user my-username

This command will unshare the notebook with the user my-username.

3. Data Command: Below is an example where we will use the Data command to load data into a table, unload data from the table, and describe the table.

# Load data from a CSV file into a Databricks table. 
databricks data load --table-name my-table --format csv --path path-to-csv-file

# Unload data from a Databricks table to a CSV file. 
databricks data unload --table-name my-table --format csv --path path-to-output-csv-file

# Describe a Databricks table. 
databricks data describe --table-name my-table

A list of other data commands:

  1. databricks data sample – Samples data from a Databricks table.
  2. databricks data history – Shows the history of data loads and unloads for a Databricks table.

Data sample:

databricks data sample --table-name my-table --num-rows 100

This command will return a sample of 100 rows from the table named my-table.

Data history:

databricks data history --table-name my-table

This command will show the history of changes to the table named my-table.

4. Job Command: Consider the below example where we will use the Job command to submit a job, display the job list, check the status of your job, and also cancel your job.

# Create a new job that will run the Python file at the path path-to-my-python-file
databricks jobs create --job-name my-job --python-file path-to-my-python-file

# Submit a job for execution. The job will be created and executed in the background. 
databricks jobs submit --job-name my-job --python-file path-to-my-python-file

# Run a job
databricks jobs run --job-id my-job

# List all of the jobs in your Databricks account. 
databricks jobs list

# Get the status of a job with the ID my-job 
databricks jobs get-status --job-id my-job

# Cancel a job with the ID my-job 
databricks jobs cancel --job-id my-job

# Delete a job
databricks jobs delete --job-id my-job
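Because get-status is just another command, you can poll it from a shell loop while a job runs. A minimal sketch using only the commands shown above (the interval and iteration count are arbitrary):

# Check the status of the job my-job every 30 seconds, ten times
for i in $(seq 1 10); do
  databricks jobs get-status --job-id my-job
  sleep 30
done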

5. User Command: Here is an example where we will use the User command:

# Create a new user. 
databricks user create --username my-username --password my-password

# List all of the users in your Databricks account. 
databricks user list

# Get the details of a user. 
databricks user info --username my-username

# Delete a user. 
databricks user delete --username my-username

A list of other User commands:

  1. databricks user grant – Grants a role to a user.
  2. databricks user revoke – Revokes a role from a user.
  3. databricks user change-password – Changes the password for a user.
  4. databricks user impersonate – Impersonates a user.

Grant a role to a user:

databricks user grant --username my-username --role my-role

This command will grant the role my-role to the user named my-username.

Revoke a role from a user:

databricks user revoke --username my-username --role my-role

This command will revoke the role my-role from the user named my-username.

Change the password for a user:

databricks user change-password --username my-username --password my-new-password

This command will change the password for the user named my-username to my-new-password.

Impersonate a user:

databricks user impersonate --username my-username

This command will impersonate the user named my-username. You will be able to run commands as the user my-username until you exit the impersonation session.

6. Role Command: Here is an example where we will use the Role command:

# Create a new role. 
databricks role create --role-name my-role --description "This is my new role."

# List all of the roles in your Databricks account. 
databricks role list

# Get the details of a role. 
databricks role info --role-name my-role

# Delete a role. 
databricks role delete --role-name my-role

# Grant the CREATE_CLUSTERS permission to a role named my-role. 
databricks role grant --role-name my-role --permission CREATE_CLUSTERS

# Revoke the CREATE_CLUSTERS permission from a role named my-role. 
databricks role revoke --role-name my-role --permission CREATE_CLUSTERS

How to Use Databricks CLI Commands

Here are some usage examples of the Databricks CLI:

Example 1: To list CLI command groups, run:

databricks -h

Example 2: To display the help for a command, run:

databricks clusters list -h

Example 3: To list all the Databricks clusters that you have in your workspace, run:

databricks clusters list 

Example 4: To display the name of an Azure Databricks cluster with a specified cluster ID.

You can use the utility jq to extract a specific element from the JSON output produced by a cluster command. Here’s how to do it:

Run the following command to get the JSON output of the cluster command:

databricks clusters describe --cluster-id my-cluster

This command will output the JSON representation of the cluster with the ID my-cluster to the console.

Save the JSON output to a file named cluster.json.

Run the following command to use jq to extract the name of the cluster from the file (the cluster JSON stores the name in the cluster_name field; the -r flag prints it without quotes):

jq -r '.cluster_name' cluster.json

This command will print the name of the cluster to the console.

For example, if the name of the cluster is my-cluster, the output of the command will be:

my-cluster

Here is a more complete example of how to use jq to extract the name of an Azure Databricks cluster with the specified Cluster ID:

# Get the JSON output of the cluster command
cluster_json=$(databricks clusters describe --cluster-id my-cluster)

# Save the JSON output to a file
echo "$cluster_json" > cluster.json

# Use jq to extract the name of the cluster
name=$(jq -r '.cluster_name' cluster.json)

# Print the name of the cluster
echo "$name"

This script will first get the JSON output of the cluster command and save it to a file named cluster.json. Then, it will use jq to extract the name of the cluster from the file and print it to the console.
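If you do not need the intermediate file, the same extraction works as a single pipeline:

databricks clusters describe --cluster-id my-cluster | jq -r '.cluster_name'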

Example 5: To export a workspace directory to the local filesystem, run:

databricks workspace export --path path-to-workspace-directory --local-path path-to-local-directory

For example, to export the workspace directory my-workspace to the local directory /tmp/my-workspace, you would run the following command:

databricks workspace export --path my-workspace --local-path /tmp/my-workspace

This command will export the contents of the workspace directory my-workspace to the local directory /tmp/my-workspace.

Example 6: To import a local directory of notebooks to a workspace, run:

databricks workspace import --local-path /path/to/local/directory --workspace-path /path/to/workspace/directory

For example, to import the notebooks in the local directory /tmp/my-notebooks to the workspace directory my-workspace, you would run the following command:

databricks workspace import --local-path /tmp/my-notebooks --workspace-path my-workspace

Here are some additional details about the command:

  • The --local-path flag specifies the path to the local directory that contains the notebooks to be imported.
  • The --workspace-path flag specifies the path to the workspace directory where the notebooks will be imported.
  • The command will overwrite any notebooks in the workspace directory with the same name as the notebooks in the local directory.
  • The command will only import notebooks that have the extensions .ipynb, .py, or .scala.

Example 7: To copy a small dataset to the Databricks filesystem (DBFS), run:

databricks fs cp /path/to/local/dataset dbfs:/path/to/dataset 

For example, to copy the dataset my-dataset.csv from the local directory /tmp to the DBFS path dbfs:/user/my-username/my-dataset.csv, you would run the following command:

databricks fs cp /tmp/my-dataset.csv dbfs:/user/my-username/my-dataset.csv
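To confirm the copy, you can list the target DBFS directory:

databricks fs ls dbfs:/user/my-username/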

Example 8: To run a SQL command on a Databricks SQL warehouse, run:

databricks sql --query <sql_command>

For example, to run the SQL command SELECT * FROM my_table, you would run the following command:

databricks sql --query "SELECT * FROM my_table"

This command will run the SQL command SELECT * FROM my_table on the Databricks SQL warehouse.

Conclusion

This article has provided a comprehensive overview of Databricks CLI, covering its functionalities, authentication setup, and a wide range of commands applicable to clusters, notebooks, data, jobs, users, and roles. By following the step-by-step instructions, you can now easily set up authentication and efficiently manage your Databricks environment. I hope you found this article informative and enjoyable, empowering you to use Databricks CLI effectively in your data engineering and analytics tasks.
