In this article, we discuss ways to optimize queries in relational databases, focusing on Top (N) per group queries. The goal is to improve database performance by selecting the most efficient approach based on data distribution. The central issue is how to improve query processing and reduce query execution time through proper query execution plans.
The article emphasizes the need for customized strategies depending on how the data is distributed: the techniques that work for low-density data (e.g., many customers with few orders each) differ from those for high-density data (e.g., few customers with many orders each).
You can tackle performance bottlenecks and make sure that data retrieval is as efficient as possible by employing proper indexing, query rewriting, and an efficient execution plan. This matters most for complex queries in systems like Microsoft SQL Server, where the query optimizer does not always adapt to different data access patterns.
Overview of challenges with SQL queries
Relational databases are database systems that organize data into structured tables linked by primary and foreign keys. They are widely used for managing frequently accessed data and ensuring data integrity. However, as the volume of data grows, database performance can suffer, especially when handling complex queries or large datasets.
In relational databases, data distribution can vary significantly. For example, data may be low-density (many customers with few records each) or high-density (few customers with many records each). These differences affect how the database retrieves and filters data, making it essential to adapt strategies to the data distribution patterns.
To overcome these challenges, query optimization becomes imperative. Without it, slow-running queries can cause excessive memory consumption, high query costs, and inefficient index usage. Optimizing queries reduces execution time, lowers resource consumption, and improves the performance of the overall database design. This is especially important for database administrators who need to identify performance bottlenecks and ensure efficient data retrieval in large-scale systems.
Why do we need to have different strategies?
Query optimization is essential for enhancing performance in relational databases. Yet SQL Server is easily misled by data distribution, because it does not adjust its query execution plan accordingly. A typical example is the Top (N) per group query based on the Row_Number() function. Even though this function can be evaluated efficiently, SQL Server optimizes it in the same way for both low-density and high-density data. Such an approach is inefficient for handling large datasets or complex queries.
To highlight the need for different optimization strategies, let's look at a practical scenario with customers and orders. Suppose we have a database schema and we want to retrieve the last 3 orders (by date) for each customer. Such queries are common when we need data for analysis with filtering and grouping under certain conditions. How efficiently they run, however, depends heavily on how the data is distributed in the database.
Two scenarios help illustrate the issue:
- Low-Density Scenario: For the first example, assume a database contains 1 million customers, each of whom placed 10 orders. The Orders table then holds 10 million rows, but they are spread thinly across many customers. Running a query to retrieve the last 3 orders for each customer would return 3 million rows (1 million customers multiplied by 3 orders each). This scenario represents a low-density data distribution: the data is widely dispersed, and the query must process a very large number of customers with only a few records per customer.
- High-Density Scenario: The opposite example is a database with only 10 customers, each having 1 million orders. The Orders table again holds 10 million rows, but the data is concentrated in a handful of customers. Running the same query to retrieve the last 3 orders for each customer would return only 30 rows (10 customers multiplied by 3 orders each). This scenario represents a high-density data distribution: the data is densely packed, and the query must process a small number of customers with a massive number of records per customer.
These two examples show how dramatically data distribution impacts query efficiency. In the low-density scenario, the query must handle a huge number of customers but only a few records per customer; such databases favor a full ordered index scan combined with the Row_Number() function. In the high-density example, the query must handle a small number of customers with an enormous number of records per customer, and may require a different approach entirely: scanning the much smaller Customers table and using an operator like CROSS APPLY to minimize query execution time and resource utilization.
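For reference, the examples that follow assume a schema along these lines. This is only a minimal sketch: the table and column names (Sales.Customers, Sales.Orders, custid, orderid, orderdate, empid) come from the queries shown later, while the data types and constraints are illustrative assumptions.

CREATE TABLE Sales.Customers
(
    custid INT NOT NULL PRIMARY KEY
    -- other customer attributes omitted for brevity
);

CREATE TABLE Sales.Orders
(
    orderid   INT  NOT NULL PRIMARY KEY,
    custid    INT  NOT NULL REFERENCES Sales.Customers (custid),
    empid     INT  NOT NULL,
    orderdate DATE NOT NULL
);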
The key takeaway is that there is no single answer to query optimization. Different data densities demand different strategies to ensure efficient SQL code and optimal database performance. Only by understanding the nuances of data distribution and tailoring the query strategy accordingly can we avoid performance bottlenecks, reduce query costs, and improve the overall efficiency of the database system.
Basic strategies
When dealing with Top (N) per Group queries, there are two primary strategies for accessing and retrieving data efficiently:
- The Row_Number() function;
- The Apply Operator.
These strategies are fundamentally tied to how we read and process table data. Regardless of the approach chosen, both rely on creating an index that organizes the data into groups (or partitions) based on a key column – in this case, the customer. This indexing ensures that the data is logically grouped, making it easier to filter and retrieve the required records.
Once the data is partitioned by the customer, the next step involves applying either a scan or seek operation to locate the related records in the associated table (e.g., the Orders table). The choice between scan and seek depends on the data distribution and the specific requirements of the query.
Regardless of which strategy we choose, we should first build an appropriate index that both of them can use:
CREATE UNIQUE INDEX idx_poc ON Sales.Orders (custid, orderdate DESC, orderid DESC) INCLUDE (empid);
We use the INCLUDE clause so that empid is covered by the index and no key lookup is needed for each returned row.
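For comparison, here is a hypothetical narrow variant of the same index (the name idx_poc_narrow is illustrative, not part of the original example). Because it does not carry empid, any query that also selects empid would need an extra key lookup for every qualifying row.

-- Hypothetical narrow POC index, shown only for comparison.
-- It supports the partitioning and ordering, but does not cover empid,
-- so selecting empid would trigger a key lookup per returned row.
CREATE UNIQUE INDEX idx_poc_narrow
ON Sales.Orders (custid, orderdate DESC, orderid DESC);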
What is the right low-density strategy?
At low data density, the query is likely to return a significant portion of the rows from the Orders table. This is because, in such scenarios, there are many customers, but each customer has only a small number of associated records. In this context, the most efficient strategy is to perform a complete scan of the index. This approach ensures that all relevant rows are retrieved in a single pass, minimizing the overhead associated with repeated seek operations.
The proposed POC index (Partitioning, Ordering, Covering) presents the data ordered in groups (partitions) by customer. The right strategy in this case is to use the Row_Number() function over a single ordered scan of that index. A seek-per-customer plan would be far more expensive here: with the example above, it would mean roughly 3 million random reads (1 million customers, 3 rows each).
-- solution for low density
WITH C AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY custid
                           ORDER BY orderdate DESC, orderid DESC) AS rn,
        orderid, orderdate, custid, empid
    FROM Sales.Orders
)
SELECT custid, orderdate, orderid, empid
FROM C
WHERE rn <= 3;
Running this query returns 3 million rows.
What would happen if we applied the same strategy to a high-density data distribution?
SQL Server optimizes the Row_Number() query in exactly the same way. Even though the expected result is only 30 rows (10 customers with 3 orders each), the plan still reads through millions of rows in the Orders index instead of touching only the few rows we actually need.
Therefore, it is necessary to use another strategy.
We need to scan the small table instead, in this case the Customers table, which in the example above contains only 10 rows. Then, for each customer, we find the last 3 orders by date.
In this case, we use the following strategy with the APPLY operator:
-- solution for high density
SELECT C.custid, A.*
FROM Sales.Customers AS C
CROSS APPLY ( SELECT TOP (3) orderid, orderdate, empid
FROM Sales.Orders AS O
WHERE O.custid = C.custid
ORDER BY orderdate DESC, orderid DESC ) AS A;
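To check which strategy wins on a particular data set, one simple option (a sketch, not part of the original solutions) is to run both queries in the same session with I/O and time statistics enabled and compare the logical reads reported for Sales.Orders.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the Row_Number() solution, then the CROSS APPLY solution,
-- and compare the logical reads and elapsed times reported (in SSMS, the Messages tab)

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;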
Conclusion
When working with Top (N) per Group queries, it is crucial to choose the strategy based on data density, supported by an explicitly designed POC index (Partitioning, Ordering, Covering). For low-density data, where there are many customers with few records each, the Row_Number() function over a single ordered scan of that index is the most efficient strategy. This method leverages the partitioned and ordered index to quickly retrieve the top N rows for each customer.
In contrast, for high-density data, where there are few customers with many records each, the APPLY operator is the better choice, as it scans the small Customers table and performs targeted seeks into the Orders index for each customer. By selecting the right strategy based on data distribution, we can optimize query performance and ensure efficient resource utilization.
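As a rough heuristic for picking a strategy (an illustrative sketch, not something the solutions above depend on), we can measure the average number of orders per customer directly: a very high average points to the high-density pattern where the APPLY solution pays off, while a low average favors the Row_Number() scan.

-- Rough density check: average number of orders per customer
SELECT
    COUNT(*)                                AS total_orders,
    COUNT(DISTINCT custid)                  AS distinct_customers,
    COUNT(*) * 1.0 / COUNT(DISTINCT custid) AS avg_orders_per_customer
FROM Sales.Orders;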