Empower Data Analysis with Materialized Views in Databricks SQL


Envision a realm where your data is always ready for querying, with intricate queries stored in a format primed for swift retrieval and analysis. Picture a world where time is no longer a constraint, where data handling is both rapid and efficient. This is the transformative potential that materialized views bring to your data analysis workflow.

Materialized views are not just a concept; they are a powerful solution. Are you ready to harness the revolutionary capabilities of materialized views in the domain of data analysis?

Dive into this comprehensive guide as we traverse the landscape of materialized views, understanding why they are the premier choice for augmenting your data analysis workflow. This article will equip you with invaluable knowledge on the creation and refreshing of materialized views through a detailed, step-by-step tutorial. By integrating materialized views into your workflow, you’ll observe a remarkable enhancement in query performance and a substantial decrease in processing times.

Embark on this journey with us into the realm of materialized views. Learn how to create and refresh them, thereby optimizing your data operations to their fullest potential

What are Materialized Views?

A Materialized view in the Databricks SQL is Unity Catalog Managed Tables that help you to store and precompute the result of the SQL queries based on the latest version of data in source tables. This is designed to make your queries faster and more efficient.

Unlike other tables, materialized view will not update immediately when you ask for the data rather it will show the data from the last time when it was updated.

Imagine you work for a retail company, and your team is responsible for analyzing vast amounts of transaction data to gain valuable insights into customer behavior and sales trends. As your company grows, so does the volume of data, leading to increasingly complex and time-consuming queries to extract meaningful information.

Traditionally, running these queries directly on the raw transaction data can be slow and resource-intensive, making it challenging to strike a balance between obtaining accurate results and maintaining optimal performance. This is where materialized views in Databricks SQL step in as a game-changer.

Materialized views can significantly decrease the time it takes to retrieve data, leading to faster query responses and improved performance. This is particularly crucial in the constantly evolving retail industry, where real-time analysis can enhance decision-making, enable targeted marketing strategies, and ultimately improve customer experiences.

Why use materialized views in Databricks SQL?

The materialized view is stored in the database and can be queried just like any other table. The results of the materialized view are pre-computed, which means that they are already calculated and stored in the database. This can improve the performance of queries that use the materialized view because the database does not have to recalculate the results each time the query is run.

Now, consider the below diagram which shows how the performance of the query is improved using materialized view.

In the above diagram, materialized views are being used with the Bronze-Silver-Gold architecture. This is a common data lake design pattern. It’s also known as the “raw-cleansed-consumption” pattern or “ingest-prep-consume” pattern.

  1. Bronze Layer (Raw Data): This is where raw data is ingested from various sources. The data is stored in its original, unaltered state. In the context of materialized views, a streaming table could be set up to continuously ingest newly arrived files from a data source into a bronze table.
  2. Silver Layer (Cleansed Data): This layer contains data that has been cleansed, enriched, and prepared for analysis. Materialized views can be used here to incrementally transform and aggregate data from the bronze layer. For example, you could create a materialized view that counts distinct events from a bronze table and stores the result in a silver table. The materialized view would be automatically and incrementally updated as new data arrives in the bronze table.
  3. Gold Layer (Business-Level Aggregates): This is the layer where data is further aggregated and refined for consumption by end users, typically for reporting and analytics. Materialized views can be used here to pre-compute complex queries and frequently used computations, thereby accelerating query performance for end users. For example, a materialized view could be created to calculate the daily active users from a silver table and store the result in a gold table.

By using materialized views in this way, you can create efficient data pipelines that automatically and incrementally update your silver and gold tables as new data arrives in your bronze tables. This can lead to significant improvements in query performance and reductions in processing times.

Here are some key reasons why materialized views in Databricks SQL are valuable:

  1. Performance Optimization: It stores pre-computed results of complex queries, improving query performance and reducing latency.
  2. Efficient ETL Processing: Ideal for Extract, Transform, and Load (ETL) tasks, materialized views simplify data processing, allowing manual or automatic refreshes for up-to-date data.
  3. Cost Savings: By precomputing and storing query results, it can reduce the computational burden on your data processing workflows which will save the cost especially when dealing with large datasets.
  4. Simplified End-User Experience: It can abstract away complex queries, presenting a user-friendly view of data in pre-aggregated or denormalized formats.
  5. Incremental Computation: In some cases, it can incrementally compute changes from the base tables, reducing the need to recompute the entire view.
  6. Data Transformation and Enrichment: It enables you to perform data cleaning, enrichment, and denormalization on the base tables, simplifying the data transformation process.
  7. Offline Analysis: It can be particularly beneficial for historical analysis, performing analysis on data snapshots at specific points in time.

Materialized Views vs Streaming Tables

In Databricks, materialized views and streaming tables are both powerful tools that can be used to improve the performance of queries and process streaming data. However, they have different purposes and are suited for different use cases.

Here is a comparison of materialized views and streaming tables in Databricks SQL:

FeatureMaterialized ViewsStreaming Tables
PurposeTo improve the performance of queries by caching the results of frequently-run queries.To process streaming data.
DataMaterialized views are created from static data.Streaming tables are created from streaming data.
RefreshingMaterialized views can be refreshed on a regular basis or on demand.Streaming tables are refreshed continuously as new data arrives.
ScalabilityMaterialized views can be scaled horizontally by creating multiple copies of the materialized view.Streaming tables can be scaled horizontally by adding more workers to the streaming cluster.
CostMaterialized views are a relatively inexpensive way to improve the performance of queries.Streaming tables can be more expensive than materialized views, as they require more resources to process streaming data.

Which one to use?

The best choice for you will depend on your specific requirements. If you need to improve the performance of queries on static data, then materialized views may be a good option for you. If you need to process streaming data, then streaming tables may be a good option for you.

Here are some additional considerations when choosing between materialized views and streaming tables:

  • The size and volume of the data: If you have a large amount of data, then materialized views may not be a good option, as they can become too large and slow to refresh. Streaming tables may be a better option for large datasets, as they can be refreshed continuously as new data arrives.
  • The frequency of the queries: If you run queries on a regular basis, then materialized views may be a good option, as they can improve the performance of those queries. Streaming tables may not be a good option if you only run queries occasionally, as they can be more expensive to maintain.
  • The budget:¬†Materialized views are a relatively inexpensive way to improve the performance of queries. Streaming tables can be more expensive, requiring more resources to process streaming data.

How to Create Materialized Views?

Materialized views in Databricks SQL are created using the “CREATE MATERIALIZED VIEW” statement. You can use the SQL Editor in the Azure Databricks UI, the Databricks SQL CLI, or the Databricks SQL API to submit the commands and create these materialized views.

Here’s an example of creating a materialized view named “mv1” from the base table “base_table1”:

CREATE MATERIALIZED VIEW mv1 
AS SELECT 
  date, sum(sales) AS sum_of_sales 
FROM 
  table1 
GROUP BY 
  date;

In the above code, CREATE utilizes a Databricks SQL warehouse to create and load data into the materialized view. This operation is synchronous, meaning the “CREATE MATERIALIZED VIEW” command will wait until the materialized view is created and the initial data load is complete.

Refreshing Materialized Views

In Databricks SQL, materialized views utilize Delta Live Tables for their refresh operations. When you need to refresh a materialized view, it triggers an update to the Delta Live Tables pipeline responsible for managing that view. This ensures that the data in the materialized view is always up-to-date with the latest changes from the base table.

To initiate the refresh and ensure your materialized view reflects the most recent changes from the base table, you can use the “REFRESH MATERIALIZED VIEW” statement.

Here’s an example of how to refresh a materialized view named “mv1”:

REFRESH MATERIALIZED VIEW mv1;

There are two different ways to refresh materialized views:

  1. Incremental Refresh
  2. Full Refresh

Incremental Refresh  

Incremental refresh materialized views only update the materialized view with the changes that have been made to the underlying data since the last refresh. This can be a more efficient way to refresh a materialized view, especially if the underlying data changes frequently

Below is an example to create an incremental refresh materialized view:

CREATE MATERIALIZED VIEW orders_mv 
AS SELECT * FROM orders; 

-- Refresh the materialized view incrementally. 
SCHEDULE REFRESH orders_mv 
ON INTERVAL '1' HOUR 
WITH OPTIONS ( INCREMENTAL REFRESH = TRUE );

In the above code, the INCREMENTAL REFRESH option is set to TRUE, which means that the materialized view will only be updated with the changes that have been made to the underlying data since the last refresh.

Full Refresh

Full refresh materialized views completely rebuild the materialized view from the underlying data. This can be a more expensive way to refresh a materialized view, but it ensures that the materialized view is always up-to-date.

Consider the below example to create an full refresh materialized view:

CREATE MATERIALIZED VIEW orders_mv 
AS 
SELECT * FROM orders; 

-- Refresh the materialized view fully. 
SCHEDULE REFRESH orders_mv 
ON INTERVAL '1' DAY 
WITH OPTIONS ( INCREMENTAL REFRESH = FALSE );

In the above code, the INCREMENTAL REFRESH option is set to FALSE, which means that the it will be completely rebuilt from the underlying data every time it is refreshed.

How to Refresh Materialized Views on a Schedule

You can set up automatic refresh for a Databricks SQL materialized view using a predefined schedule. This schedule can be configured during the creation using the SCHEDULE clause, or you can add a schedule later using the ALTER VIEW statement.

Once a schedule is created, a new Databricks job is automatically set up to handle the refresh process according to the defined schedule. This job takes care of processing the updates and keeping the materialized view up-to-date with the latest data from the base tables.

Consider the below example where I will schedule materialized view using the SCHEDULE clause:

CREATE MATERIALIZED VIEW orders_mv 
AS 
SELECT * FROM orders; 

-- Refresh the materialized view every hour. 
SCHEDULE REFRESH orders_mv 
ON INTERVAL '1' HOUR;

The above code will create a materialized view called orders_mv that is refreshed every hour. The SCHEDULE REFRESH statements specify the schedule for refreshing it. In this case, the it will be refreshed every hour.

Now, consider the below example where I will schedule materialized view using ALTER VIEW statements:

CREATE MATERIALIZED VIEW orders_mv 
AS 
SELECT * FROM orders; 

-- Refresh the materialized view every hour. 
ALTER VIEW orders_mv 
SET SCHEDULE REFRESH 
ON INTERVAL '1' HOUR;

The ALTER VIEW statement specifies the schedule for refreshing. The SET SCHEDULE REFRESH clause specifies the frequency of the refresh.

How to Control Access to the Materialized Views?

Materialized views have strong access controls, which help in sharing data without revealing private information. The owner of a materialized view can give other users permission to view the data using SELECT.

The interesting part is that these users don’t need access to the original tables used in the materialized view. This control allows safe data sharing while keeping the underlying data protected and secure.

Here are some controlled access options:

1. ALTER MATERIALIZED VIEW: It is used to change the permission on a materialized view. You can use the GRANT clause to grant SELECT permissions to other users, and you can use the REVOKE clause to revoke SELECT permissions from other users.

For example, the following code would grant SELECT permissions on the orders_mv to the user johndoe:

ALTER MATERIALIZED VIEW orders_mv
GRANT SELECT TO johndoe;

The following code would revoke SELECT permissions on the orders_mv from the user janedoe:

ALTER MATERIALIZED VIEW orders_mv
REVOKE SELECT FROM janedoe;

2. DROP MATERIALIZED VIEW: It is used to delete a materialized view. For example, the following code would drop the orders_mv:

DROP MATERIALIZED VIEW orders_mv;

Limitations of using Materialized Views

Using materialized views in Databricks SQL can simplify your work, but there are certain restrictions and considerations to keep in mind when managing and querying them. Here are some of the limitations of using materialized views:

  1. It can only be created and refreshed in the pro-SQL warehouse and serverless SQL warehouse.
  2. You cannot query materialized views from assigned access mode clusters. They are not compatible with this type of cluster.
  3. The base table should be registered in the Unity catalog as managed or external tables.
  4. Once someone becomes the owner of the Databricks SQL materialized view, it cannot be changed.
  5. Identity columns or surrogate keys are not supported by the materialized views.
  6. If it uses a sum calculation on a column that can have empty values (NULLs), and the column contains only NULL values, the materialized view will show the result as zero instead of showing it as NULL.
  7. When creating it with expressions (e.g., SUM() function), you have to provide an alias for correct functionality (e.g., “AS sum_col”).
  8. REFRESH permission is managed differently and cannot be granted via the Permissions tab.
  9. Files may include data from the base tables for incremental updates, so avoid sharing storage with untrusted users to protect sensitive data.
  10. If a view includes COUNT(DISTINCT field), the underlying files will contain a list of actual “field” values, potentially compromising data privacy and security.

Conclusion

In Databricks SQL, materialized views offer a valuable and versatile solution for improving query performance, simplifying data processing, and enhancing the overall data analysis experience. Throughout this article, you’ve gained insights into creating and refreshing materialized views and learned how to exercise control over your views. Additionally, the article has highlighted the limitations of materialized views, which are essential to efficiently manage their usage. By using materialized views effectively, data analysts and engineers can optimize their workflows, make informed decisions, and derive meaningful insights from the data.

+ There are no comments

Add yours

Leave a Reply