Spin the Wheel: Python Packages Meet Databricks


In today’s fast-paced development environment, sharing and distributing Python code across teams and within organizations can be a daunting task. While there are various methods to package Python code, one of the most efficient ways is to use Python Wheel files. These .whl files offer a plethora of advantages, including smaller file sizes for quicker network transfers and the elimination of the need for a compiler during installation.

In this comprehensive guide, we will walk you through the entire process of creating a Python Wheel file for a Python package using PyCharm. But we won’t stop there; we’ll also show you how to deploy this Wheel file to a Databricks Cluster Library. Finally, you’ll learn how to call a function from this package within a Databricks Notebook.

By the end of this article, you’ll have a solid understanding of how to package your Python code into a Wheel file and deploy it to Databricks, making your code easily shareable and deployable across your organization.

So, let’s get started and simplify the way you distribute your Python code!

What Are Python Wheels?

Python Wheels are a built, archive-based package format that can greatly speed up the process of installing Python packages. A Wheel file is essentially a pre-built package that allows for faster installation compared to building and installing packages from source. The Wheel format is designed to contain all the files necessary for the package, including binaries and dependencies. Wheel files have a .whl extension and can be installed using package management tools like pip.

Benefits of Using Python Wheels

  1. Speed: Since Wheel files are pre-compiled, the installation process is much faster compared to installing from source code. This is particularly beneficial for packages that have a long compilation time.
  2. Ease of Use: Installing a Wheel file is as simple as running a single pip install command. There’s no need to worry about missing dependencies or compilation errors that you might encounter when installing from source.
  3. Portability: Wheel files can be easily shared and distributed, making it convenient for developers to distribute their packages. This is especially useful for packages that are platform-specific and have binaries that are difficult to compile.
  4. Reduced Risk: Since Wheel files are pre-compiled, there’s less risk of end-users encountering errors during the installation process. This makes it a more reliable format for distribution.
  5. Compatibility: Wheel files are compatible with the Python Package Index (PyPI), making it easy to publish your packages and make them accessible to the community.

When to Use Python Wheels

  1. Large Projects: For larger projects that have multiple dependencies and require a long time to compile, using Wheel can significantly speed up the installation process.
  2. Distribution: If you’re a package maintainer looking to distribute your package, Wheel files make it easier for end-users to install, thereby increasing the adoption rate of your package.
  3. Platform-Specific Packages: If your package includes compiled binaries that are platform-specific, Wheel files can help you distribute these binaries easily.
  4. Environment Isolation: In cases where you want to ensure that the package and its dependencies do not interfere with other Python environments, Wheel files can be a good choice.
  5. Ease of Deployment: For DevOps and system administrators who need to automate the deployment of Python packages across multiple systems, Wheel files offer a quick and reliable option.

Creating a Python Wheel

Creating a Python Wheel is a straightforward process that involves a few essential steps. Below is a guide that will walk you through each step, from setting up your Python project to testing the Wheel locally.

Step 1: Setting Up Your Python Project

Directory Structure

Your Python project should have a directory structure similar to the following:

my_project/
|-- my_module/
|   |-- __init__.py
|   |-- some_code.py
|-- setup.py
|-- README.md
  • my_module: This is the directory where your Python code resides. We will create a module called Contoso.
  • __init__.py: An empty file that tells Python that this directory should be considered a Python package.
  • some_code.py: The Python file containing your reusable code.
  • setup.py: The build script for setuptools.
  • README.md: Project description and other information.

To create the module structure in PyCharm, we will add a Python package:

Here I have created a Python package called Contoso, and this is how it looks:

setup.py File

The setup.py file is crucial for packaging your Python project. Here’s a simple example:

from setuptools import setup

setup(
    name='Contoso',
    version='0.0.3',
    description='custom functions for DLT Rule engine',
    url='http://contoso.com',
    author='rajaniesh Kaushikk',
    author_email='rajaniesh.kaushikk@Contoso.com',
    license='MIT',
    packages=['Contoso'],
    install_requires=['setuptools', 'pyspark'],
)

You can change it according to your own requirements.

__init__.py

This is a very important file where we will use the initialization code.

# '.rule_engine' refers to the rule_engine.py file in the package root
# ('.' means the current package). If you look at the image above, you will
# find exactly the same file.
from .rule_engine import get_rules, get_quarantine_rules
# The imported names must exactly match the names of the functions defined
# inside that file, otherwise the import will fail.
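To see why this re-export matters, here is a minimal, self-contained sketch that builds a throwaway package on disk (the file contents are illustrative stand-ins) and shows that the function defined in rule_engine.py becomes importable directly from the package, thanks to the relative import in __init__.py:

```python
# Demonstration of the __init__.py re-export pattern used above.
# We create a disposable package in a temp directory and import it.
import importlib
import sys
import tempfile
from pathlib import Path

pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "Contoso"
pkg.mkdir()

# The module holding the reusable code (a stub here).
(pkg / "rule_engine.py").write_text(
    "def get_rules(rule_table_name, table_name):\n"
    "    return {}\n"
)
# The re-export: '.' refers to the package itself.
(pkg / "__init__.py").write_text("from .rule_engine import get_rules\n")

sys.path.insert(0, str(pkg_root))
Contoso = importlib.import_module("Contoso")

# Thanks to __init__.py, the function is available at package level.
print(Contoso.get_rules.__name__)  # → get_rules
```

Without the line in __init__.py, consumers would have to write `from Contoso.rule_engine import get_rules` instead of the shorter `from Contoso import get_rules`.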

Step 2: Writing Your Python Code

We need to write the reusable Python code inside the Python package.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def get_rules(rule_table_name, table_name_in_rule_table):
    """
    Load data quality rules from a table.
    :param rule_table_name: Name of the table holding the rules.
    :param table_name_in_rule_table: Table name to match against the rules table.
    :return: Dictionary of rules that match the table name.
    """
    spark = SparkSession.builder.getOrCreate()
    rules = {}
    df = spark.read.table(rule_table_name)
    for row in df.filter(col("table_name") == table_name_in_rule_table).collect():
        rules[row['name']] = row['constraint']
    return rules

def get_quarantine_rules(rule_table_name, table_name_in_rule_table):
    """
    Load data quality rules from a table and combine them into a single
    quarantine expression.
    :param rule_table_name: Name of the table holding the rules.
    :param table_name_in_rule_table: Table name to match against the rules table.
    :return: Quarantine rule that matches the table name.
    """
    all_rules_in_tags = get_rules(rule_table_name, table_name_in_rule_table)
    quarantine_rule = "NOT({0})".format(" AND ".join(all_rules_in_tags.values()))
    return quarantine_rule

In this code, I am fetching the rules written in the table. Please note the most important part of the code: spark = SparkSession.builder.getOrCreate(). This gets the existing Spark session if one already exists, or creates a new one otherwise. Without this line, your Python Wheel may not work in the Databricks environment.
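The string manipulation in get_quarantine_rules can be checked without a Spark cluster. Here a hand-written dict (with illustrative rule names and constraints) stands in for what get_rules would return from the rules table:

```python
# Pure-Python sketch of the quarantine-rule construction.
# The dict below is a stand-in for the table read performed by get_rules.
rules = {
    "valid_claim_id": "claim_id IS NOT NULL",
    "positive_amount": "amount > 0",
}

# Same expression as in get_quarantine_rules: a row is quarantined when it
# fails at least one rule, i.e. NOT(rule1 AND rule2 AND ...).
quarantine_rule = "NOT({0})".format(" AND ".join(rules.values()))
print(quarantine_rule)
# → NOT(claim_id IS NOT NULL AND amount > 0)
```

The resulting string can then be used directly as a filter expression, for example in a Delta Live Tables expectation.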

Step 3: Building the Python Wheel

To build a Wheel file, make sure setuptools and wheel are installed (pip install setuptools wheel), then navigate to your project directory where setup.py is located and run the following command:

python setup.py sdist bdist_wheel
  • sdist: Creates a source distribution.
  • bdist_wheel: Builds the Wheel.

This will generate a .whl file in a dist/ directory within your project folder.
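The generated file name itself encodes compatibility information. A name like Contoso-0.0.3-py3-none-any.whl (the version and tags here are illustrative) breaks down into five dash-separated fields:

```python
# Decompose a wheel file name into its standard components
# (distribution name, version, Python tag, ABI tag, platform tag).
wheel_name = "Contoso-0.0.3-py3-none-any.whl"
name, version, python_tag, abi_tag, platform_tag = (
    wheel_name[: -len(".whl")].split("-")
)
print(name)          # → Contoso
print(version)       # → 0.0.3
print(python_tag)    # → py3   (any Python 3 interpreter)
print(abi_tag)       # → none  (no compiled ABI dependency)
print(platform_tag)  # → any   (pure Python, any OS)
```

A pure-Python package like ours produces the py3-none-any tags, which is why the same .whl can be uploaded to any Databricks cluster regardless of its operating system.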

Step 4: Deploying the Python Wheel to Databricks

Follow these steps to deploy the Wheel file to the Databricks cluster (the UI labels may vary slightly across Databricks versions):

  1. In the Databricks workspace, open Compute and select your cluster.
  2. Go to the Libraries tab and click Install new.
  3. Choose Upload as the library source and Python Whl as the library type.
  4. Upload the .whl file generated in your project’s dist/ directory and click Install.
  5. Wait for the library status to show Installed; the cluster may need to restart.

Step 5: Test the Databricks Wheel

Once the library shows as Installed on the cluster, the package is available to any notebook attached to it. Alternatively, you can install the package for a single notebook session with pip (installing by name as shown works only if the package is resolvable from an accessible index; otherwise pass the path to the .whl file):

%pip install Contoso

After installation, you can import your package in Python to test it:

import Contoso
from Contoso import get_rules, get_quarantine_rules
rule_dict = Contoso.rule_engine.get_quarantine_rules("rules", "claim_summary")
display(rule_dict)

By following these steps, you’ll be able to create a Python Wheel for your project, making it easier to distribute and install. This not only benefits you as a developer but also makes life easier for end-users who wish to use your package. Please refer to my other Databricks articles to learn more about the latest Databricks features.

Conclusion

In the rapidly evolving landscape of software development, the need for efficient code distribution is more critical than ever. This comprehensive guide aimed to simplify this process for Python developers, particularly those working in data-centric environments like Databricks. We explored the concept of Python Wheel files, a packaging format that offers numerous advantages such as speed, ease of use, and reliability.

We took a deep dive into the step-by-step process of creating a Python Wheel file using PyCharm, from setting up your project structure to writing reusable code. We also covered the crucial steps to deploy this Wheel file to a Databricks Cluster Library and how to utilize it within a Databricks Notebook.

By following this guide, you should now have a solid understanding of how to package your Python code into a Wheel file and deploy it in a Databricks environment. This not only streamlines the code-sharing process within your team or organization but also opens doors for broader distribution and contribution to the community.
