How to deploy Azure Databricks in your own private VNet without exposing public IP addresses (VNet injection)


Recently I came across a situation where customers wanted to deploy Azure Databricks into their own private network for security reasons. The default installation of Databricks creates its own virtual network, and you do not have any control over it. Microsoft recently added the option of deploying Databricks into your own VNet, known as VNet injection (described below).

There are two limitations of this approach when you deploy Azure Databricks into your own VNet.

1. Databricks creates its own NSG (network security group), whose name does not follow your enterprise naming convention.

2. Once Databricks is deployed and you create a cluster in it, you will find that it creates public IP addresses. This is a big security risk, because some organizations do not allow public IP addresses to be part of a deployment.

So how do we solve this issue? The challenge is to deploy Databricks in a private VNet without exposing any public IP addresses. Here are the step-by-step instructions to achieve it:

1. Create a resource group.

2. Create a VNet and give it enough address space to make room for Databricks.

3. Now create two network security groups. Make sure they adhere to the organization’s naming convention. These two network security groups will be attached to the two subnets that we will create in subsequent steps.

4. Now create two subnets: one will be a private subnet and the other a public subnet.

  • The public subnet allows communication with the Azure Databricks control plane.
  • The private subnet allows only cluster-internal communication.

Do not deploy other Azure resources on the subnet used by your Azure Databricks workspace. Sharing the subnet with other resources, such as virtual machines, prevents managed updates to the intended policy for the subnet.
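As a rough illustration of steps 2 and 4, the sketch below carves two non-overlapping subnets out of a VNet address space with Python's standard `ipaddress` module. The 10.179.0.0/16 range and the /18 subnet size are assumptions made for the example, not values from this post; pick sizes that match the cluster capacity you expect.

```python
import ipaddress

# Hypothetical VNet address space from step 2; substitute your own range.
vnet = ipaddress.ip_network("10.179.0.0/16")

# Take the first two /18 blocks for the public and private subnets (step 4).
public_subnet, private_subnet = list(vnet.subnets(new_prefix=18))[:2]


def usable_addresses(subnet):
    """Addresses left for workloads; Azure reserves five in every subnet."""
    return subnet.num_addresses - 5


for label, subnet in (("public", public_subnet), ("private", private_subnet)):
    print(f"{label}: {subnet} ({usable_addresses(subnet)} usable addresses)")
```

Both subnets are dedicated to the workspace, so the remaining blocks of the VNet stay free for other resources.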

5. Assign the public NSG (created in step 3) to the public subnet and delegate the subnet to the Microsoft.Databricks/workspaces service.

6. Assign the private NSG (created in step 3) to the private subnet and delegate the subnet to the Microsoft.Databricks/workspaces service.
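If you would rather script steps 5 and 6 than click through the portal, the subnet definition in an ARM template carries both the NSG reference and the delegation. The following is a minimal sketch that builds such a subnet definition as a Python dictionary. The NSG resource ID, the address prefix, and the delegation name are hypothetical placeholders; only the service name Microsoft.Databricks/workspaces and the subnet name come from this post.

```python
import json


def databricks_subnet(name, address_prefix, nsg_id):
    """Build an ARM subnet definition with an NSG and a Databricks delegation."""
    return {
        "name": name,
        "properties": {
            "addressPrefix": address_prefix,
            "networkSecurityGroup": {"id": nsg_id},
            "delegations": [
                {
                    # The delegation name is arbitrary; this one is hypothetical.
                    "name": "databricks-delegation",
                    "properties": {"serviceName": "Microsoft.Databricks/workspaces"},
                }
            ],
        },
    }


# Hypothetical NSG resource ID; use the ID of the NSG you created in step 3.
public_nsg_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-databricks"
    "/providers/Microsoft.Network/networkSecurityGroups/nsg-databricks-public"
)

public_subnet_definition = databricks_subnet(
    "databricks-public-subnet", "10.179.0.0/18", public_nsg_id
)
print(json.dumps(public_subnet_definition, indent=2))
```

The private subnet would be built the same way, with its own name, address prefix, and NSG ID.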

7. Use the Azure deployment template shown below to deploy Databricks. You can copy the original from the 101-databricks-secure-cluster-connectivity-with-vnet-injection folder of the anildwarepo/databricks repository on GitHub. My special thanks to Anil Dwarakanath for this useful template. I have added some tweaks to the template for tags.

There is another template that can be used as well, and it is linked from the Microsoft documentation (https://azure.microsoft.com/en-us/resources/templates/101-databricks-all-in-one-template-for-vnet-injection/). The issue with that template is that you usually design your network resources (VNet, subnets, etc.) well before Azure Databricks is deployed, and the template gives you no flexibility for that: it creates everything during the Databricks deployment. If you have already created your Azure networking resources, the template will try to create them again. Moreover, it does not solve the public IP address issue.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "workspaceName": {
        "type": "string",
        "metadata": {
          "description": "The name of the Azure Databricks workspace to create."
        }
      },
      "pricingTier": {
        "defaultValue": "premium",
        "allowedValues": [
          "trial",
          "standard",
          "premium"
        ],
        "type": "string",
        "metadata": {
          "description": "The pricing tier of workspace."
        }
      },
      "customVirtualNetworkId": {
        "type": "string",
        "metadata": {
          "description": "The complete ARM resource Id of the custom virtual network."
        }
      },
      "customPublicSubnetName": {
        "type": "string",
        "defaultValue": "databricks-public-subnet",
        "metadata": {
          "description": "The name of the public subnet in the custom VNet."
        }
      },
      "customPrivateSubnetName": {
        "type": "string",
        "defaultValue": "databricks-private-subnet",
        "metadata": {
          "description": "The name of the private subnet in the custom VNet."
        }
      },
      "location": {
        "type": "string",
        "defaultValue": "[resourceGroup().location]",
        "metadata": {
          "description": "Location for all resources."
        }
      }
    },
    "variables": {
      "managedResourceGroupId": "[subscriptionResourceId('Microsoft.Resources/resourceGroups', variables('managedResourceGroupName'))]",
      "managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-', uniqueString(parameters('workspaceName'), resourceGroup().id))]"
    },
    "resources": [
      {
        "comments": "The resource group specified will be locked after deployment.",
        "type": "Microsoft.Databricks/workspaces",
        "apiVersion": "2018-04-01",
        "name": "[parameters('workspaceName')]",
        "location": "[parameters('location')]",
        "sku": {
          "name": "[parameters('pricingTier')]"
        },
        "tags": {
          "Application": "myApplicationName",
          "Cost Center": "111111",
          "Tier": "Test"
        },
        "properties": {
          "managedResourceGroupId": "[variables('managedResourceGroupId')]",
          "parameters": {
            "customVirtualNetworkId": {
              "value": "[parameters('customVirtualNetworkId')]"
            },
            "customPublicSubnetName": {
              "value": "[parameters('customPublicSubnetName')]"
            },
            "customPrivateSubnetName": {
              "value": "[parameters('customPrivateSubnetName')]"
            },
            "enableNoPublicIp": {
              "value": true
            }
          }
        }
      }
    ]
  }

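And here is the parameter file. The null values for workspaceName and customVirtualNetworkId are placeholders that you supply at deployment time: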

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "workspaceName": {
      "value": null
    },
    "pricingTier": {
      "value": "premium"
    },
    "customVirtualNetworkId": {
      "value": null
    },
    "customPublicSubnetName": {
      "value": "databricks-public-subnet"
    },
    "customPrivateSubnetName": {
      "value": "databricks-private-subnet"
    },
    "location": {
      "value": "[resourceGroup().location]"
    }
  }
}
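Because the parameter file ships with null placeholders, it can help to fill them in and check that none remain before you deploy. The sketch below does that on an in-memory copy of the parameter file, reduced to three entries for brevity. The workspace name and the VNet resource ID are hypothetical values, and `fill_parameters` is an illustrative helper, not part of any Azure tooling.

```python
import copy

# A reduced copy of the parameter file above, with the same null placeholders.
PARAMETER_FILE = {
    "parameters": {
        "workspaceName": {"value": None},
        "pricingTier": {"value": "premium"},
        "customVirtualNetworkId": {"value": None},
    }
}


def fill_parameters(parameter_file, values):
    """Return a copy of the parameter file with the given values filled in."""
    filled = copy.deepcopy(parameter_file)
    for name, value in values.items():
        if name not in filled["parameters"]:
            raise KeyError(f"unknown parameter: {name}")
        filled["parameters"][name]["value"] = value
    # Refuse to continue while any placeholder is still null.
    missing = [n for n, p in filled["parameters"].items() if p["value"] is None]
    if missing:
        raise ValueError(f"parameters still null: {', '.join(missing)}")
    return filled


filled = fill_parameters(
    PARAMETER_FILE,
    {
        "workspaceName": "dbw-example",  # hypothetical workspace name
        "customVirtualNetworkId": (  # hypothetical VNet resource ID
            "/subscriptions/00000000-0000-0000-0000-000000000000"
            "/resourceGroups/rg-databricks"
            "/providers/Microsoft.Network/virtualNetworks/vnet-databricks"
        ),
    },
)
print(filled["parameters"]["workspaceName"]["value"])
```

Working on a copy keeps the original placeholders intact, so the same file can be reused for several workspaces.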

8. Now deploy the template using the template deployment option in Azure, and specify the parameter values.

The virtual network ID is the resource ID of the VNet created in step 2. You can get it from the VNet’s Properties page.
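The virtual network ID is a full ARM resource ID rather than just the VNet name, which is an easy thing to get wrong when pasting values. As a small sketch, the helper below assembles the ID from its parts and checks that a pasted value has the expected shape. The subscription ID, resource group, and VNet name are hypothetical, and the pattern is a simplified check rather than a complete validator.

```python
import re

# Simplified shape check for a virtual network resource ID (case-insensitive).
VNET_ID_PATTERN = re.compile(
    r"^/subscriptions/[0-9a-f-]{36}"
    r"/resourceGroups/[^/]+"
    r"/providers/Microsoft\.Network/virtualNetworks/[^/]+$",
    re.IGNORECASE,
)


def vnet_resource_id(subscription_id, resource_group, vnet_name):
    """Assemble the ARM resource ID of a virtual network."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
    )


def looks_like_vnet_id(value):
    """Return True if the value has the shape of a VNet resource ID."""
    return VNET_ID_PATTERN.match(value) is not None


# Hypothetical identifiers; replace them with your own.
vnet_id = vnet_resource_id(
    "00000000-0000-0000-0000-000000000000", "rg-databricks", "vnet-databricks"
)
print(vnet_id)
print(looks_like_vnet_id(vnet_id))
print(looks_like_vnet_id("vnet-databricks"))
```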

9. Once the template is deployed, create a cluster to check whether it generates any public IP addresses.

You will see that no public IP addresses are created.
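One way to make that check repeatable is to list the public IP resources in the workspace's managed resource group, for example with `az network public-ip list --resource-group <managed-resource-group> --output json`, and inspect the result. The sketch below only parses such JSON output. It assumes the cluster resources live in the managed resource group named by the template, and the sample outputs are made up for illustration.

```python
import json


def public_ip_names(az_json_output):
    """Return the names of the public IP resources in the CLI's JSON output."""
    return [resource["name"] for resource in json.loads(az_json_output)]


# Hypothetical outputs: an empty list is what this post's template should give.
clean_output = "[]"
leaky_output = '[{"name": "hypothetical-public-ip", "ipAddress": "203.0.113.10"}]'

for label, output in (("clean", clean_output), ("leaky", leaky_output)):
    names = public_ip_names(output)
    status = "no public IPs" if not names else f"public IPs found: {names}"
    print(f"{label}: {status}")
```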

I hope this will be very useful!!

5 Comments

  1. Atif Al-Amir

    Excellent post, Rajaniesh. This approach addressed a major security hole in the out-of-the-box Azure Databricks implementation. It also solved a problem we had with workspaces creating dynamic public IPs, which left us unable to get a handle on the list of IPs to add to our allowed-IP lists when connecting to on-prem resources or Snowflake warehouses. I added an Azure NAT Gateway with a single static IP and attached it to the public subnet created with your template. That solved the issue nicely and at little cost. Azure's suggestion was to create a firewall with routing tables and an NVA, which is overkill and costly compared with this simple solution.
    Thank you

