In Part 1 “Boost your Snowflake Queries: Top Strategies for Faster Results“, we discussed the concepts of Query Optimization, the Snowflake Query Processing Layer, and the Query Optimization Techniques, including Snowflake Search Optimization Service (SOS), Minimize Data Movement, Use of appropriate Data Types, Use of Materialized Views, Using Clustering Keys and Use of Query Profiling.
In this article “Optimizing Snowflake Queries: Boosting Performance“, we will continue our exploration of Snowflake Query Optimization Techniques. These techniques can further improve the performance and efficiency of Snowflake queries, making them faster and more cost-effective. Let’s dive in!
Snowflake Query Optimization Techniques
Below are the techniques to optimize the Snowflake Queries which will further help you to improve the performance and efficiency:
- Maximize Cache Usage
- Select only the Required Column
- Optimizing Data Load into tables using COPY Command
- Use Separate Warehouses for Queries, Task Loading and Large Files
- Avoiding data Skew
- Optimizing Resources and Queries
1. Maximize Cahche Usage
Maximizing cache usage is a key query optimization technique in Snowflake that can significantly improve query performance. The Snowflake cache is a feature that allows frequently accessed data to be cached in memory for faster access, reducing the need to access data from disk.
To maximize cache usage in Snowflake. Use the Snowflake query profiling tool to analyze query performance and identify queries that could benefit from caching. Look for queries that access the same data repeatedly or that involve large data sets that are accessed frequently so you can enable the cache. To enable result caching for the session use the below command.
ALTER SESSION SET USE_CACHED_RESULT = TRUE
2. Select only the Required Column
Selecting only the required columns is a common query optimization technique in Snowflake that can improve query performance and reduce resource consumption. The idea behind this optimization technique is to only retrieve the columns needed for the query, instead of retrieving all columns from the table.
To implement this optimization technique in Snowflake, follow these steps:
Step 1: Identify the Required Columns
Analyze the query and identify the columns that are required for the query. Remove any columns that are not needed from the SELECT statement.
Step 2: Use the SELECT Statement to Retrieve Only the Required Columns
In Snowflake, use the SELECT statement to retrieve only the required columns. This can be done by explicitly specifying the column names in the SELECT statement, like so:
SELECT column1, column2, column3 FROM my_table WHERE ...
By specifying only the required columns, Snowflake can avoid retrieving unnecessary data from the table, reducing resource consumption and improving query performance.
Step 3: Use Views to Simplify Queries
If you frequently need to retrieve a subset of columns from a table, consider creating a view that selects only the required columns. This can simplify your queries and make them more efficient.
For example, if you frequently need to retrieve only the customer name and email address from a table, you could create a view like this:
CREATE VIEW customer_info AS SELECT name, email FROM customers;
Then, you could query the view instead of the original table:
SELECT * FROM customer_info WHERE ...
By using the view, you can simplify your queries and improve query performance by only retrieving the required columns.
3. Optimize Bulk-Data Loading into tables using COPY commands
Snowflake provides several features and optimizations that can significantly improve the performance of bulk-loading data into tables. By using the COPY command, you can load large datasets quickly and efficiently into Snowflake tables, enabling faster data processing and analysis. Here are some ways in which COPY commands can improve performance:
- Parallelism: The COPY command supports the parallel loading of data into tables, which means that multiple files can be loaded simultaneously. This can significantly speed up the data-loading process, especially for large datasets.
- Compression: Snowflake supports the automatic compression of data during the loading process. This can help to reduce the amount of disk space required to store the data and improve query performance.
- Automatic Data Conversion: The COPY command automatically converts data from various file formats (such as CSV, JSON, and Avro) into the appropriate data types for Snowflake tables. This eliminates the need for manual data conversion and can save time during the loading process.
- Data Validation: Snowflake validates data during the loading process to ensure that it conforms to the table schema. This can help to identify and fix data issues early on, which can save time and effort during data processing.
- Snowflake-Specific optimizations: Snowflake’s COPY command also includes optimizations that are specific to the Snowflake platform, such as loading data in a clustered table to improve query performance, or using Snowflake’s bulk-loading API to load data directly from cloud storage services like Amazon S3.
- Load only required files: Avoid scanning files unnecessary for the query by specifying the file name or pattern in the COPY command.
-- Scan entire stage copy into customer_table from @landing_data pattern='.*[.]csv'; -- Limit within directory copy into sales_transaction_table from @landing_data/sales/transactions/2020/05 pattern='.*[.]csv'; -- specifying the named file copy into sales_transaction_table from @landing_data/sales/transactions/2020/05/sales_050.csv;
7. Select only specific columns: Select only required columns and filter out unnecessary rows to reduce the amount of data scanned.
4. Use Separate Warehouse for Queries, Task Loading, and Large Files.
A warehouse is a set of computing resources that execute SQL statements in Snowflake. By using separate warehouses for loading and querying operations, you can optimize the performance for each type of workload and avoid any interference or slowdown caused by competing for the same resources. For example, you can use a multi-cluster warehouse for batch loading and a single-cluster warehouse for interactive querying. You can also use a separate warehouse for large files that require more resources to load. This way, you can improve the query performance by reducing the amount of data scanned, network traffic, and CPU usage.
Consider the image above. You can see that Finance, Data Scientist, Data Loading, and Marketing have separate warehouses which will help them with fast and responsive queries which will ultimately improve performance, scalability, and cost efficiency.
Finance is using Tableau to visualize and analyze financial data from the cloud database. Tableau contains pre-aggregated or optimized data sets, which can accelerate query response time and reduce the load on the primary database. By separating the query workload from the transactional workload, you can avoid contention and ensure that the reporting and analysis don’t affect the online application’s performance.
The Data Loading process is using Matillion to extract data from various sources, transform it, and load it into the cloud database. Matillion provides fast network connectivity, parallel processing, and low-latency storage. By separating the data loading workload from the query workload, you can ensure that the data loading process doesn’t affect the reporting and analysis performance.
The Marketing team is using Looker to visualize and analyze customer data from the cloud database, which may include large files such as product images, campaign videos, or customer contracts. Looker is optimized for storing and managing large files by providing high-capacity storage, fast retrieval, and content delivery network (CDN) integration.
The Data Science team is using Apache Spark to process and analyze large datasets from the cloud database. Apache Spark is optimized for data science tasks by providing high-speed processors, GPU acceleration, and specialized software packages.
5. You should Avoid Data Skew
Data skew occurs when certain values in a column or set of columns have a disproportionate amount of data compared to other values in the same column(s). This can cause performance issues in queries that need to scan or filter based on those columns.
Here is an example of data skew in a Snowflake table:
Suppose we have a table named
sales with the following columns:
product_id(varchar): The ID of the product being sold
sales_date(date): The date of the sale
quantity(integer): The quantity of the product sold
We have data for the past year in this table, and we notice that a particular product, let’s say
product_id = 'ABC', has significantly more sales than any other product in the table. In fact, 90% of the rows in the table have
product_id = 'ABC', while the remaining 10% of the rows are spread across many other products. So in this case due to data skew, it can cause performance issues when querying this table. For example, if we want to know the total sales for each product, we might write a query like this:
SELECT product_id, SUM(quantity) AS total_sales FROM sales GROUP BY product_id
This query will be slow because most of the data in the table is for
product_id = 'ABC', and Snowflake will need to scan a large amount of data to calculate the sum for that product. In this case, we might consider creating a clustering key on the
product_id column to group all the rows with the same product ID together on disk, which can improve query performance.
Alternatively, we can consider partitioning the table based on the
product_id column, which can improve query performance by limiting the amount of data that needs to be scanned for each product. This can be done by creating a table with a
product_id column as the partition key, and then inserting data into that table using a
SELECT statement with a
WHERE clause to filter the data by product ID.
6. Optimize your Queries and Resources
Resource Management refers to the ability to monitor and control the usage of computing resources (Virtual Warehouse). Snowflake provides features and control capabilities that can help you to determine the best approach for warehouse management based on your organization’s use cases. Here are some tips to optimize queries using resource management:
- Classify your workloads and assign each workload to a separate virtual warehouse with its own size and configuration. Through this, you can isolate resources and avoid interference between different workloads.
- Enable auto-suspend and auto-resume for your virtual warehouse to stop credit consumption when they are idle and resume when they are needed.
- Review your query history for optimization such as rewriting queries, using caching, partitioning tables, clustering data, or using materialized views.
Snowflake provides a robust and powerful data warehousing solution with numerous features and capabilities for optimizing query performance. It is essential to continuously monitor query performance and make adjustments as needed to ensure optimal performance. Additionally, taking advantage of Snowflake’s auto-scaling capabilities and monitoring query usage patterns can help to minimize costs while maintaining high query performance.