Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is a massively parallel processing (MPP) database system. The data within each Synapse instance is spread across 60 underlying databases, referred to as “distributions”. Because the data is distributed, it needs to be organized in a way that makes querying faster and more efficient.
Although distributing data helps to spread the load, it can also lead to network inefficiencies, increased I/O, and consequently poor performance. For example, if you regularly run a query that joins two tables, but the rows to be joined are stored in different databases, the data warehouse has to fetch the data over the network. You therefore need to design your data warehouse carefully to minimize these “data movements”. This requires that you fully understand how the data is structured and how it is likely to be used before you settle on a data distribution approach. Let’s discuss some scenarios:
|Scenario Description|How to organize the data?|
|---|---|
|Scenario 1: In the sales database for a global corporation, the details of customers and the purchases they have made are most likely to be used within the same locality (for example, India, US, UK, Australia, and so on). The system is unlikely to run queries that combine details of sales made in the India territory with customers located in Australia with any frequency.|It would make sense to organize the data according to this locality.|
|Scenario 2: The corporation performs a daily financial analysis of sales, calculating the revenue for the previous day.|In this case it would make sense to ensure that the financial records for a given date are held together.|
|Scenario 3: The organization also performs analyses by week, month, quarter, or year (to help spot trends), using the historical summary information generated by the daily analysis for previous weeks, months, and years.|This summary data can be stored in lookup tables. These lookup tables are likely to be small compared to the rest of the financial data, and the information they contain is static, so they could be copied to every database in the data warehouse to reduce the cost of retrieving it.|
This clearly shows that you should know your data, your DDL, and your query patterns before choosing a distribution strategy.
Azure Synapse distributes the data in three ways:
|Distribution method|Description|
|---|---|
|Round robin|Distributes the data evenly among all 60 underlying distributions. No specific key is used to distribute the data. This is the default method used when no data distribution strategy is specified.|
|Hash|Azure Synapse Analytics uses a hash function to assign each row to one distribution deterministically. In the table definition, one of the columns is designated as the distribution column, and the hash of its values determines which distribution each row is assigned to.|
|Replicated|A replicated table caches a full copy on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. It requires extra storage and incurs additional overhead when writing data, which makes it impractical for large tables.|
When to use Round Robin Distribution?
A round-robin table is the most straightforward table to create and delivers fast performance when used as a staging table for loads. These are some scenarios where you should choose round-robin distribution:
- When you cannot identify a single key to distribute your data.
- If your data doesn’t frequently join with data from other tables.
- When there are no obvious keys to join.
- If the table is being used to hold temporary data.
- Dimension tables and other lookup tables in a schema can usually be stored as round-robin tables. These tables typically join to more than one fact table, so optimizing for a single join may not be the best idea. Small dimension tables can also leave some distributions empty when hash distributed, so use round robin to spread such tables uniformly.
A round-robin distributed table distributes data evenly across the table but without any further optimization. A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. Joins on round-robin tables require reshuffling data, and this takes additional time.
If you are unsure of the query patterns and data, you can start with all tables in round-robin distribution; as you learn the patterns, the data can easily be redistributed on a hash key.
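As a sketch (table and column names are hypothetical), a round-robin staging table, followed by a later redistribution onto a hash key via CTAS, might look like this in a Synapse dedicated SQL pool:

```sql
-- Hypothetical staging table: fast to load, no distribution key needed.
-- ROUND_ROBIN is also the default when no DISTRIBUTION clause is given.
CREATE TABLE dbo.Stage_Sales
(
    SaleID     INT            NOT NULL,
    CustomerID INT            NOT NULL,
    SaleDate   DATE           NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP  -- a heap typically loads faster than columnstore for staging
);

-- Once the query patterns are known, redistribute on a hash key with CTAS.
CREATE TABLE dbo.Fact_Sales
WITH
(
    DISTRIBUTION = HASH(CustomerID),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.Stage_Sales;
```

The CTAS (CREATE TABLE AS SELECT) pattern is the usual way to change a table’s distribution, since the distribution method cannot be altered in place.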
When to use Hash Distribution?
Hashing is a very common and effective data distribution method. The data is distributed based on the hash value of a single column that you select, according to some hashing algorithm. This distribution column then dictates how the data is spread across the underlying distributions. Rows that have the same value and data type in the hash key always end up in the same distribution. A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables.
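For example (hypothetical tables), hash-distributing two large tables on the same join key co-locates matching rows in the same distribution, so the join below needs no data movement:

```sql
CREATE TABLE dbo.Fact_Sales
(
    SaleID INT NOT NULL, CustomerID INT NOT NULL, Amount DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerID), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.Fact_Returns
(
    ReturnID INT NOT NULL, CustomerID INT NOT NULL, Amount DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerID), CLUSTERED COLUMNSTORE INDEX);

-- Rows with the same CustomerID live in the same distribution,
-- so this join and aggregation run locally within each distribution.
SELECT s.CustomerID, SUM(s.Amount) - SUM(r.Amount) AS NetRevenue
FROM dbo.Fact_Sales s
JOIN dbo.Fact_Returns r ON s.CustomerID = r.CustomerID
GROUP BY s.CustomerID;
```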
How to choose the best distribution column?
Keep these points in mind when identifying the best distribution column:
- Choose a column that won’t be updated.
- Always choose a column that leads to an even distribution of rows across all the distributions, minimizing data skew.
- Choose an appropriate distribution column that reduces the data movement activity between different distributions.
- Do not choose a nullable column, because all NULL values hash to the same value, so those rows all end up in the same distribution. A column that is mostly NULL is not a good candidate for hash distribution.
- A column that holds a default value for many rows is also a poor choice of distribution key for a fact table, for the same reason.
- Large fact tables or historical transaction tables are usually stored as hash distributed tables.
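After loading a hash-distributed table, you can check for skew by inspecting the per-distribution row counts (the table name here is hypothetical):

```sql
-- Reports space used and row counts for each of the 60 distributions;
-- large differences between distributions indicate data skew.
DBCC PDW_SHOWSPACEUSED('dbo.Fact_Sales');
```

If a few distributions hold far more rows than the rest, consider redistributing on a different column.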
When to use Replicated Distribution?
Replication works very well for small lookup tables or dimension tables that are frequently joined with other big tables. Instead of spreading a small table’s rows thinly across the underlying distributions, the replication strategy keeps a full copy of the entire table on each compute node.
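A minimal sketch of a replicated dimension table (names are hypothetical):

```sql
-- A full copy of this small dimension table is cached on each compute node,
-- so joins against it never require data movement.
CREATE TABLE dbo.Dim_Geography
(
    GeoID       INT           NOT NULL,
    CountryName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```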
|Type of Distribution|Best fit for…|Do not use when…|
|---|---|---|
|Replicated|– Small dimension tables in a star schema with less than 2 GB of storage after compression (Synapse applies roughly 5x compression).<br>– Small lookup tables.<br>– Dimension tables that are frequently joined with other big tables.|– Many write transactions run against the table (for example inserts, deletes, and updates).<br>– You change the data warehouse units (DWUs) frequently.<br>– You only use 2–3 columns out of the many columns in the table.<br>– You are indexing a replicated table.|
|Round robin (default)|– Temporary / staging tables, including staging tables used for fast loads.<br>– No obvious joining key candidate exists, or the data doesn’t frequently join with data from other tables.<br>– You cannot identify a single key to distribute the data.<br>– The table holds temporary data.<br>– You are unsure of query patterns and data: start with round robin and redistribute on a hash key as you learn the patterns.<br>– Small dimension tables.|– Performance is slow due to data movement.|
|Hash|– Large fact tables or historical transaction tables.<br>– Large dimension tables.|– The distribution column would need to be updated (the distribution key cannot be updated).<br>– The candidate column is nullable.<br>– The candidate column holds a default value for many rows.|
I hope this article helps you choose the correct distribution strategy in Azure Synapse.