Unlock Powerful Data Strategies: Master Managed and External Tables in Fabric Delta Lake
Welcome to our series on optimizing data ingestion with Spark in Microsoft Fabric. In our first post, we covered the capabilities of Microsoft Fabric and its integration with Delta Lake. In this second installment, we dive into managed and external tables.
Choosing between managed and external tables is a crucial decision when working with Delta Lake in Microsoft Fabric. Each option has its unique benefits and use cases, and understanding these differences is key to efficient data management. Additionally, we will explore data versioning, time travel, and using Delta tables with streaming data to provide a comprehensive understanding of managing data in Delta Lake.
This post will guide you through the differences between managed and external tables, explain data versioning and time travel, and demonstrate how to use Delta tables with streaming data. By the end, you’ll be ready to navigate the data maze with confidence and optimize your data strategies effectively.
Managed Tables
In the realm of data management using Spark and Delta Lake within Microsoft Fabric, understanding the distinction between managed and external tables is crucial. Both types have their advantages and use cases, and choosing the right one depends on your specific requirements for data storage, management, and control.
Managed tables are tables for which Spark manages both the metadata and the data. When you create a managed table, Spark stores the data in a default location (typically within the Tables folder of your lakehouse or a predefined directory).
Key Characteristics:
- Automatic Data Management: Spark takes care of the data files, meaning that when you drop a managed table, both the table definition and the data files are deleted.
- Ease of Use: Managed tables are easy to use, especially for users who do not need to worry about data file locations and management.
- Default Storage Location: Data is stored in the default storage location defined by the Spark configuration.
Example:
Creating a managed table in Spark SQL:
CREATE TABLE managed_table (
id INT,
name STRING,
amount DOUBLE
) USING delta;
Creating a managed table in PySpark:
df.write.format("delta").saveAsTable("managed_table")
In this example, Spark will store the data files for managed_table in the default storage location.
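To see this lifecycle coupling in practice, here is a minimal sketch (assuming the managed_table created above exists) showing that dropping a managed table removes both the table definition and its data files:
# Dropping a managed table deletes the table definition AND the underlying data files
spark.sql("DROP TABLE IF EXISTS managed_table")
# The table no longer appears in the catalog, and its files are removed from the default storage location
spark.sql("SHOW TABLES").show()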
External Tables
External tables are tables where Spark only manages the metadata, and the data itself is stored at an external location specified by the user. When you drop an external table, the metadata is removed, but the actual data files remain intact.
Key Characteristics:
- Control Over Data Location: Users specify the exact location where data files are stored. This is useful when you need to manage data storage independently of the table lifecycle.
- Data Persistence: Dropping an external table only deletes the table definition, leaving the data files unaffected.
- Flexibility: External tables provide greater flexibility and control over where and how data is stored, which can be important for integration with other systems or data migration scenarios.
Example:
Creating an external table in Spark SQL:
CREATE TABLE external_table (
id INT,
name STRING,
amount DOUBLE
) USING delta
LOCATION '/path/to/external/data';
Creating an external table in PySpark:
df.write.format("delta").option("path", "/path/to/external/data").saveAsTable("external_table")
In this example, the data for external_table is stored in the specified path, /path/to/external/data.
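Because the data files outlive the table definition, you can drop an external table and later re-register it over the same location. A minimal sketch, assuming the external_table and path from the example above:
# Dropping an external table removes only the metadata; the Delta files remain at the external location
spark.sql("DROP TABLE IF EXISTS external_table")
# The data is still readable directly from the path
spark.read.format("delta").load("/path/to/external/data").show()
# Re-register a table over the existing files
spark.sql("CREATE TABLE external_table USING delta LOCATION '/path/to/external/data'")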
Choosing Between Managed and External Tables
Use Managed Tables When:
- You prefer simplicity and ease of management.
- Your data lifecycle is tightly coupled with the table definition.
- You want Spark to handle the details of data storage.
Use External Tables When:
- You need to manage data storage locations explicitly.
- You require data persistence independent of table definitions.
- You are integrating with other data systems or migrating data.
Here is a side-by-side comparison of managed and unmanaged (external) tables:
| Feature | Managed Table | Unmanaged (External) Table |
|---|---|---|
| Definition | Spark manages both metadata and data. | Spark manages only metadata; data is stored at a user-specified location. |
| Data Storage | Data is stored in a default location managed by Spark. | Data is stored in an external location specified by the user. |
| Data Management | Spark takes care of data file management. | Users manage data file locations and storage. |
| Lifecycle Management | Dropping the table deletes both metadata and data files. | Dropping the table deletes only the metadata; data files remain intact. |
| Use Case | Ideal for simplicity and ease of management. | Ideal for scenarios where data location control and persistence are required. |
| Creation Syntax (SQL) | CREATE TABLE managed_table (...) USING delta; | CREATE TABLE external_table (...) USING delta LOCATION 'path'; |
| Creation Syntax (PySpark) | df.write.format("delta").saveAsTable("managed_table") | df.write.format("delta").option("path", "path").saveAsTable("external_table") |
| Data Location Example | Default storage directory like /Tables/managed_table | Custom path like /external/data/location |
| Flexibility | Less flexibility, more automation by Spark. | More flexibility, user-defined data storage and management. |
| Integration | Best for environments where data is tightly coupled with Spark. | Best for integrating with other systems or for data migration. |
| Performance Optimization | Managed by Spark, includes optimizations like compaction. | Requires user to handle performance optimizations. |
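If you are ever unsure which type a given table is, the catalog can tell you. A quick check, assuming a table named managed_table exists:
# The output of DESCRIBE EXTENDED includes a "Type" row (MANAGED or EXTERNAL) and the table's storage location
spark.sql("DESCRIBE EXTENDED managed_table").show(truncate=False)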
Data Versioning and Time Travel in Delta Lake
Delta Lake introduces powerful features such as data versioning and time travel, which enhance data management capabilities by allowing users to access and revert to previous states of the data. These features are crucial for debugging, auditing, and recovering from accidental data changes.
Data Versioning
What is Data Versioning?
Data versioning in Delta Lake means keeping track of all changes made to a dataset. Each transaction (insert, update, delete) in Delta Lake generates a new version of the data, which is stored along with a timestamp and a unique version identifier.
Benefits of Data Versioning:
- Historical Analysis: Enables users to analyze data as it was at any point in time.
- Auditability: Provides a complete audit trail of all changes, helping in compliance and regulatory reporting.
- Recovery: Facilitates recovery from accidental deletions or updates by reverting to a previous state.
Example:
# Create a DataFrame and write it to Delta Lake
data = [(1, "Alice", 50), (2, "Bob", 30)]
columns = ["id", "name", "amount"]
df = spark.createDataFrame(data, columns)
df.write.format("delta").save("/path/to/delta/table")
# Update the data and save it again
data_updated = [(1, "Alice", 60), (2, "Bob", 30), (3, "Cathy", 40)]
df_updated = spark.createDataFrame(data_updated, columns)
df_updated.write.format("delta").mode("overwrite").save("/path/to/delta/table")
In this example, each write operation generates a new version of the data in the Delta table.
What is Time Travel?
Time travel in Delta Lake allows users to query previous versions of the data. This is useful for examining historical data, auditing, and recovering data from specific points in time.
How to Use Time Travel:
- Query by Timestamp: Retrieve data as it was at a specific timestamp.
- Query by Version Number: Retrieve data as it was at a specific version.
Examples:
Querying by Timestamp:
# Read data as it was on a specific date
df = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("/path/to/delta/table")
df.show()
Querying by Version Number:
# Read data as it was at a specific version
df = spark.read.format("delta").option("versionAsOf", 1).load("/path/to/delta/table")
df.show()
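Recent Delta Lake releases also support time travel directly in Spark SQL; the delta.`path` syntax below addresses the table by its storage location:
# Query an earlier version of the table with SQL time travel
spark.sql("SELECT * FROM delta.`/path/to/delta/table` VERSION AS OF 1").show()
# Or query the table as it was at a specific point in time
spark.sql("SELECT * FROM delta.`/path/to/delta/table` TIMESTAMP AS OF '2023-01-01'").show()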
Additional Operations:
- Viewing History: Delta Lake allows you to view the history of a table, including all changes made to it.
from delta.tables import DeltaTable
# Initialize the Delta table
delta_table = DeltaTable.forPath(spark, "/path/to/delta/table")
# View history of the table
history_df = delta_table.history()
history_df.show()
This operation displays the full history of changes, including version numbers, timestamps, and operation types.
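Building on the recovery benefit mentioned earlier, Delta Lake can also roll a table back to a previous version. A minimal sketch, assuming the table at /path/to/delta/table has at least two versions:
from delta.tables import DeltaTable
# Restore the table to version 0 (its state before the overwrite)
delta_table = DeltaTable.forPath(spark, "/path/to/delta/table")
delta_table.restoreToVersion(0)
# The restore itself is recorded as a new entry in the table history
delta_table.history(1).show()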
Using Delta Tables with Streaming Data
Delta Lake is a powerful tool for managing both batch and streaming data in a unified manner. Its ability to handle streaming data makes it ideal for real-time analytics and data processing. This feature ensures that the data ingested in real-time is consistent, reliable, and easily accessible for querying and analysis.
How to Use Delta Tables with Streaming Data
Setting Up a Streaming Source:
To read streaming data into a Delta table, you can use the readStream method in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Initialize Spark session
spark = SparkSession.builder \
    .appName("DeltaStreaming") \
    .getOrCreate()
# Define the schema of the event payload
schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])
# Read the streaming data from a source (e.g., Kafka, Event Hubs)
raw_df = spark.readStream.format("eventhubs") \
    .option("eventhubs.connectionString", "YOUR_EVENT_HUBS_CONNECTION_STRING") \
    .load()
# Streaming sources such as Event Hubs deliver the payload in a binary body column;
# parse it with the schema defined above to get typed event columns
stream_df = raw_df.select(from_json(col("body").cast("string"), schema).alias("event")) \
    .select("event.*")
Writing Streaming Data to a Delta Table:
To write streaming data into a Delta table, use the writeStream method.
# Write the streaming data to a Delta table
stream_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .toTable("delta_streaming_table")
The checkpointLocation option is crucial as it stores the state information of the streaming query, ensuring fault tolerance.
Using Delta Lake as a Streaming Sink:
Delta Lake can act as both a source and a sink for streaming data. Here’s an example of how to set it up as a sink:
# Create a stream that reads JSON data from a folder
inputPath = '/path/to/input/json/files/'
jsonSchema = StructType([
StructField("device", StringType(), False),
StructField("status", StringType(), False)
])
stream_df = spark.readStream.schema(jsonSchema).json(inputPath)
# Write the stream to a Delta table
table_path = '/path/to/delta/table'
checkpoint_path = '/path/to/checkpoint/dir'
delta_stream = stream_df.writeStream \
.format("delta") \
.option("checkpointLocation", checkpoint_path) \
    .start(table_path)
Reading Streaming Data from a Delta Table:
To query the streaming data, you can simply use the readStream method again.
# Read the streaming data from the Delta table
read_df = spark.readStream.format("delta").table("delta_streaming_table")
# Perform operations on the streaming data; a streaming query must be started with writeStream rather than show()
read_df.createOrReplaceTempView("streaming_data")
counts_df = spark.sql("SELECT COUNT(*) AS event_count FROM streaming_data")
counts_df.writeStream.outputMode("complete").format("console").start()
Benefits of Using Delta Tables for Streaming Data
- Consistency and Reliability: ACID transactions ensure that the data is consistent and reliable, even when processed in real-time.
- Scalability: Delta Lake can handle large volumes of streaming data, making it suitable for big data applications.
- Real-Time Analytics: By using Delta tables with streaming data, you can perform real-time analytics and derive insights as soon as the data arrives.
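To make the real-time analytics point concrete, here is a hedged sketch that reuses the stream_df (with its event_type and timestamp columns) from the streaming source example and maintains per-minute event counts in a separate Delta table; the table and checkpoint names are illustrative:
from pyspark.sql.functions import window, col
# Count events per type over 1-minute windows; the watermark bounds state for late-arriving data
windowed_counts = stream_df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(col("timestamp"), "1 minute"), col("event_type")) \
    .count()
# Write the aggregated results to a Delta table for downstream reporting
windowed_counts.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint/agg") \
    .toTable("event_counts_by_minute")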
For more Microsoft Fabric content, please check out my other blog posts.
Conclusion
In this post, we explored the differences between managed and external tables in Delta Lake within Microsoft Fabric, as well as the powerful features of data versioning, time travel, and using Delta tables with streaming data. Understanding these distinctions and capabilities is crucial for optimizing your data management strategy.
Managed tables offer simplicity and ease of management, while external tables provide greater control and flexibility over data storage locations. Data versioning and time travel enhance your ability to manage and recover data, and integrating streaming data with Delta tables enables real-time analytics and processing.
By mastering these concepts, you can ensure efficient and effective data management tailored to your specific needs. Stay tuned for our next blog post, where we will delve into advanced optimization techniques and practical maintenance strategies for Delta tables in Delta Lake.