
SQL Query Optimization

Introduction

Query optimization is essential for building performant database applications. Understanding how the query optimizer works, reading execution plans, and writing efficient queries can dramatically improve application performance.


Query Execution Process

How SQL Server Executes Queries

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Parser    │───▢│ Algebrizer  │───▢│  Optimizer  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                                     β”‚
       β”‚ Syntax Check                        β”‚ Cost-Based
       β”‚                                     β”‚ Optimization
       β”‚                                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Result    │◀───│  Execution  │◀───│  Plan Cache β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Execution Steps

  1. Parsing: Checks SQL syntax, builds parse tree
  2. Algebrizing: Resolves names (tables, columns), validates objects
  3. Optimization: Generates and evaluates execution plans
  4. Execution: Runs the chosen plan, returns results
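
The first two steps can be seen in isolation. A syntax error fails at parsing before any names are resolved, while a syntactically valid statement against a nonexistent object parses fine but fails at the algebrizing (binding) step:

-- Fails at step 1 (parsing): invalid syntax near ';'
SELECT * FROM Orders WHERE;

-- Fails at step 2 (algebrizing): syntax is valid, but the name can't be resolved
-- "Invalid object name 'NoSuchTable'"
SELECT * FROM NoSuchTable;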

Plan Cache

-- View cached plans
SELECT
    cp.objtype,
    cp.cacheobjtype,
    cp.size_in_bytes,
    cp.usecounts,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE st.text LIKE '%YourTable%'
ORDER BY cp.usecounts DESC;

-- Clear plan cache (use cautiously in production)
DBCC FREEPROCCACHE;  -- All plans
DBCC FREEPROCCACHE(plan_handle);  -- Specific plan
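
On SQL Server 2016 and later, a less disruptive option than flushing the server-wide cache is to clear only the current database's plans:

-- Clears cached plans for the current database only (SQL Server 2016+)
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;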

Reading Execution Plans

Getting Execution Plans

-- Estimated plan (no execution)
SET SHOWPLAN_XML ON;
GO
SELECT * FROM Orders WHERE CustomerId = 123;
GO
SET SHOWPLAN_XML OFF;

-- Actual plan (executes query)
SET STATISTICS XML ON;
SELECT * FROM Orders WHERE CustomerId = 123;
SET STATISTICS XML OFF;

-- In SSMS: Ctrl+L (estimated), Ctrl+M (actual)

Key Plan Operators

| Operator             | Description                                | Performance Implication            |
| -------------------- | ------------------------------------------ | ---------------------------------- |
| Table Scan           | Reads the entire table                     | Avoid on large tables              |
| Clustered Index Scan | Scans the entire clustered index           | Often unavoidable for aggregates   |
| Index Scan           | Scans an entire non-clustered index        | Better than a table scan           |
| Index Seek           | Seeks directly to matching rows            | Best performance                   |
| Key Lookup           | Retrieves columns from the clustered index | Can be expensive                   |
| Nested Loops         | Row-by-row join                            | Good for small datasets            |
| Hash Match           | Hash-based join                            | Good for large unsorted inputs     |
| Merge Join           | Merge of sorted inputs                     | Best for large pre-sorted data     |
| Sort                 | In-memory sort                             | Expensive; watch for tempdb spills |
| Parallelism          | Multi-threaded execution                   | Can indicate a complex query       |

Interpreting Plan Costs

-- Sample execution plan analysis

-- Look for these warnings in actual plans:
-- 1. High percentage cost operators
-- 2. Thick arrows (many rows flowing)
-- 3. Yellow warning triangles
-- 4. Missing index hints
-- 5. Sort spills to tempdb
-- 6. Implicit conversions
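
One of these warnings can be hunted down programmatically. A sketch that scans the plan cache for implicit-conversion warnings (the LIKE filter is a blunt instrument, and this query can be expensive on a busy server, so run it off-peak):

-- Find cached plans containing CONVERT_IMPLICIT
SELECT TOP 20
    cp.usecounts,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE CAST(qp.query_plan AS NVARCHAR(MAX)) LIKE '%CONVERT_IMPLICIT%'
ORDER BY cp.usecounts DESC;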

Identifying Expensive Operations

-- Find queries with high reads
SELECT TOP 20
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    qs.total_worker_time / qs.execution_count AS avg_cpu_time,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset END
            - qs.statement_start_offset)/2) + 1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_logical_reads DESC;

Query Optimization Techniques

1. Use Appropriate Indexes

-- ❌ Without index (Table Scan)
SELECT * FROM Orders WHERE OrderDate > '2024-01-01';

-- βœ… With covering index
CREATE INDEX IX_Orders_OrderDate
ON Orders (OrderDate) INCLUDE (CustomerId, Total);

-- βœ… Now uses Index Seek
SELECT CustomerId, Total FROM Orders WHERE OrderDate > '2024-01-01';

2. Write SARGable Queries

-- ❌ Non-SARGable (can't use index)
SELECT * FROM Orders WHERE YEAR(OrderDate) = 2024;
SELECT * FROM Customers WHERE Name LIKE '%Smith';
SELECT * FROM Products WHERE Price * 1.1 > 100;

-- βœ… SARGable (can use index)
SELECT * FROM Orders
WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01';

SELECT * FROM Customers WHERE Name LIKE 'Smith%';

SELECT * FROM Products WHERE Price > 100 / 1.1;

3. Avoid SELECT *

-- ❌ Returns all columns, may cause Key Lookups
SELECT * FROM Orders WHERE CustomerId = 123;

-- βœ… Only needed columns
SELECT OrderId, OrderDate, Total
FROM Orders
WHERE CustomerId = 123;

4. Optimize JOINs

-- ❌ Implicit joins (old syntax)
SELECT *
FROM Orders, Customers, Products
WHERE Orders.CustomerId = Customers.Id
  AND Orders.ProductId = Products.Id;

-- βœ… Explicit joins (clearer, same performance)
SELECT o.OrderId, c.Name, p.ProductName
FROM Orders o
INNER JOIN Customers c ON o.CustomerId = c.Id
INNER JOIN Products p ON o.ProductId = p.Id;

-- βœ… Filter early to reduce rows in join
SELECT o.OrderId, c.Name
FROM Orders o
INNER JOIN Customers c ON o.CustomerId = c.Id
WHERE o.OrderDate > '2024-01-01';  -- Filter first

5. Use EXISTS Instead of IN for Subqueries

-- ❌ IN with subquery (can be slow with NULLs)
SELECT *
FROM Customers c
WHERE c.Id IN (SELECT CustomerId FROM Orders);

-- βœ… EXISTS (often faster, handles NULLs correctly)
SELECT *
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.Id);

6. Batch Large Operations

-- ❌ Single large update (locks table, fills log)
UPDATE Orders SET Status = 'Archived' WHERE OrderDate < '2020-01-01';

-- βœ… Batched updates
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) Orders
    SET Status = 'Archived'
    WHERE OrderDate < '2020-01-01' AND Status != 'Archived';

    IF @@ROWCOUNT = 0 BREAK;

    WAITFOR DELAY '00:00:01';  -- Brief pause
END

7. Use Set-Based Operations

-- ❌ Cursor-based (row-by-row)
DECLARE @Id INT;
DECLARE cur CURSOR FOR SELECT Id FROM Orders;
OPEN cur;
FETCH NEXT FROM cur INTO @Id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE OrderDetails SET Processed = 1 WHERE OrderId = @Id;
    FETCH NEXT FROM cur INTO @Id;
END;
CLOSE cur;
DEALLOCATE cur;

-- βœ… Set-based (all at once)
UPDATE od
SET od.Processed = 1
FROM OrderDetails od
INNER JOIN Orders o ON od.OrderId = o.Id;

Common Anti-Patterns

1. Functions on Indexed Columns

-- ❌ Index on FirstName can't be used
SELECT * FROM Employees WHERE UPPER(FirstName) = 'JOHN';

-- βœ… Use case-insensitive collation or computed column
SELECT * FROM Employees WHERE FirstName = 'John';

-- Or create computed column
ALTER TABLE Employees ADD FirstNameUpper AS UPPER(FirstName);
CREATE INDEX IX_Employees_FirstNameUpper ON Employees(FirstNameUpper);

2. Implicit Conversions

-- ❌ String compared to INT causes conversion
-- Column is VARCHAR, but parameter is INT
SELECT * FROM Products WHERE ProductCode = 12345;

-- βœ… Match data types
SELECT * FROM Products WHERE ProductCode = '12345';

-- Check for implicit conversions in plan warnings

3. OR Conditions

-- ❌ OR can prevent index usage
SELECT * FROM Orders
WHERE CustomerId = 123 OR OrderDate > '2024-01-01';

-- βœ… Use UNION for better plan
SELECT * FROM Orders WHERE CustomerId = 123
UNION
SELECT * FROM Orders WHERE OrderDate > '2024-01-01';

-- Or ensure indexes exist on both columns

4. Leading Wildcards

-- ❌ Can't use index
SELECT * FROM Products WHERE Name LIKE '%Widget%';

-- βœ… Use Full-Text Search for text searching
SELECT * FROM Products
WHERE CONTAINS(Name, 'Widget');

-- Or if prefix is known
SELECT * FROM Products WHERE Name LIKE 'Widget%';

5. Unnecessary DISTINCT

-- ❌ DISTINCT hides duplicate joins
SELECT DISTINCT c.Name
FROM Customers c
INNER JOIN Orders o ON c.Id = o.CustomerId;  -- Wrong join?

-- βœ… Fix the query logic instead
SELECT c.Name
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.Id);

6. SELECT COUNT(*) vs EXISTS

-- ❌ Counts all rows when you just need existence
IF (SELECT COUNT(*) FROM Orders WHERE CustomerId = 123) > 0
BEGIN
    -- Do something
END

-- βœ… Stops at first match
IF EXISTS (SELECT 1 FROM Orders WHERE CustomerId = 123)
BEGIN
    -- Do something
END

Statistics and Cardinality

Understanding Statistics

Statistics help the optimizer estimate how many rows an operation will return.

-- View statistics for a table
DBCC SHOW_STATISTICS ('Orders', 'IX_Orders_CustomerId');

-- Update statistics
UPDATE STATISTICS Orders;
UPDATE STATISTICS Orders IX_Orders_CustomerId;

-- Rebuild statistics (more thorough)
UPDATE STATISTICS Orders WITH FULLSCAN;

-- Auto-update settings
SELECT
    name,
    is_auto_create_stats_on,
    is_auto_update_stats_on,
    is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'YourDatabase';
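
If any of those settings come back 0, they can be turned on per database. Auto create and auto update are sensible defaults for most workloads; the async option is a judgment call that trades plan freshness for not blocking queries while statistics refresh:

ALTER DATABASE YourDatabase SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS ON;

-- Optional: update statistics in the background instead of inline
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;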

Cardinality Estimation Issues

-- Check estimated vs actual rows in execution plan
-- Large discrepancies indicate statistics problems

-- Example: Bad estimate causes nested loop instead of hash join
-- Optimizer thinks 10 rows, actually 100,000

-- Solutions:
-- 1. Update statistics
UPDATE STATISTICS Orders WITH FULLSCAN;

-- 2. Use query hint (last resort)
SELECT * FROM Orders o
INNER JOIN OrderDetails od ON o.Id = od.OrderId
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

-- 3. Create filtered statistics for specific values
CREATE STATISTICS ST_Orders_StatusActive
ON Orders (OrderDate)
WHERE Status = 'Active';

Query Hints

Use hints sparingly; they override the optimizer's decisions.

Join Hints

-- Force specific join type
SELECT *
FROM Orders o
INNER LOOP JOIN Customers c ON o.CustomerId = c.Id;  -- Nested loops

SELECT *
FROM Orders o
INNER HASH JOIN Customers c ON o.CustomerId = c.Id;  -- Hash match

SELECT *
FROM Orders o
INNER MERGE JOIN Customers c ON o.CustomerId = c.Id;  -- Merge join

Index Hints

-- Force specific index
SELECT *
FROM Orders WITH (INDEX(IX_Orders_CustomerId))
WHERE CustomerId = 123 AND OrderDate > '2024-01-01';

-- Force table scan (useful for testing)
SELECT *
FROM Orders WITH (INDEX(0))  -- 0 = heap/clustered scan
WHERE CustomerId = 123;

-- Force clustered index
SELECT *
FROM Orders WITH (INDEX(1))  -- 1 = clustered index
WHERE CustomerId = 123;

Query Option Hints

-- Force parallelism
SELECT * FROM LargeTable OPTION (MAXDOP 4);

-- Disable parallelism
SELECT * FROM SmallTable OPTION (MAXDOP 1);

-- Recompile (ignore cached plan)
SELECT * FROM Orders WHERE CustomerId = @CustomerId
OPTION (RECOMPILE);

-- Optimize for specific value
SELECT * FROM Orders WHERE CustomerId = @CustomerId
OPTION (OPTIMIZE FOR (@CustomerId = 123));

-- Optimize for unknown (average distribution)
SELECT * FROM Orders WHERE CustomerId = @CustomerId
OPTION (OPTIMIZE FOR UNKNOWN);

Parameter Sniffing

The Problem

-- First execution with CustomerId = 1 (has 1 million orders)
-- Plan optimized for many rows (hash join, parallel)
EXEC GetOrders @CustomerId = 1;

-- Second execution with CustomerId = 99999 (has 5 orders)
-- Uses cached plan designed for millions of rows!
EXEC GetOrders @CustomerId = 99999;

Solutions

-- Solution 1: RECOMPILE (always generate new plan)
CREATE PROCEDURE GetOrders @CustomerId INT
AS
BEGIN
    SELECT * FROM Orders
    WHERE CustomerId = @CustomerId
    OPTION (RECOMPILE);  -- New plan each time
END

-- Solution 2: OPTIMIZE FOR UNKNOWN
CREATE PROCEDURE GetOrders @CustomerId INT
AS
BEGIN
    SELECT * FROM Orders
    WHERE CustomerId = @CustomerId
    OPTION (OPTIMIZE FOR UNKNOWN);  -- Average statistics
END

-- Solution 3: Local variable (old technique)
CREATE PROCEDURE GetOrders @CustomerId INT
AS
BEGIN
    DECLARE @LocalCustomerId INT = @CustomerId;

    SELECT * FROM Orders
    WHERE CustomerId = @LocalCustomerId;  -- Uses local variable
END

-- Solution 4: Query Store forced plan (SQL Server 2016+)
EXEC sp_query_store_force_plan @query_id = 123, @plan_id = 456;

Performance Monitoring

Query Store (SQL Server 2016+)

-- Enable Query Store
ALTER DATABASE YourDatabase SET QUERY_STORE = ON;

-- Configure Query Store
ALTER DATABASE YourDatabase SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30),
    DATA_FLUSH_INTERVAL_SECONDS = 900,
    MAX_STORAGE_SIZE_MB = 1000
);

-- Find top resource-consuming queries
SELECT TOP 10
    q.query_id,
    qt.query_sql_text,
    rs.avg_duration,
    rs.avg_logical_io_reads,
    rs.count_executions
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
ORDER BY rs.avg_logical_io_reads DESC;

-- Find regressed queries (plan changes)
SELECT
    q.query_id,
    qt.query_sql_text,
    p.plan_id,
    p.is_forced_plan
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
WHERE p.last_execution_time > DATEADD(day, -7, GETDATE())
ORDER BY q.query_id;

Extended Events

-- Create session to capture slow queries
CREATE EVENT SESSION [SlowQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed(
    ACTION(sqlserver.sql_text, sqlserver.username, sqlserver.database_name)
    WHERE duration > 5000000)  -- > 5 seconds
ADD TARGET package0.ring_buffer
WITH (STARTUP_STATE = ON);

-- Start session
ALTER EVENT SESSION [SlowQueries] ON SERVER STATE = START;

-- Query captured events
SELECT
    event_data.value('(event/@timestamp)[1]', 'datetime') AS event_time,
    event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000 AS duration_ms,
    event_data.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM (
    SELECT CAST(target_data AS XML) AS target_data
    FROM sys.dm_xe_sessions s
    JOIN sys.dm_xe_session_targets t ON s.address = t.event_session_address
    WHERE s.name = 'SlowQueries'
) AS data
CROSS APPLY target_data.nodes('RingBufferTarget/event') AS xevent(event_data);

DMV Queries for Troubleshooting

-- Currently running queries
SELECT
    r.session_id,
    r.status,
    r.wait_type,
    r.wait_time,
    r.cpu_time,
    r.logical_reads,
    t.text AS query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id > 50;  -- Exclude system sessions

-- Missing index suggestions
SELECT
    mig.index_group_handle,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact AS improvement_measure
FROM sys.dm_db_missing_index_group_stats migs
JOIN sys.dm_db_missing_index_groups mig ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
ORDER BY improvement_measure DESC;

-- Index usage statistics
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    s.user_seeks,
    s.user_scans,
    s.user_lookups,
    s.user_updates
FROM sys.dm_db_index_usage_stats s
JOIN sys.indexes i ON s.object_id = i.object_id AND s.index_id = i.index_id
WHERE OBJECTPROPERTY(i.object_id, 'IsMsShipped') = 0
ORDER BY s.user_seeks + s.user_scans + s.user_lookups DESC;

Interview Questions

1. What is an execution plan and why is it important?

Answer: An execution plan is a roadmap showing how SQL Server will execute a query. It shows:

  • Which indexes will be used (or if tables are scanned)
  • Join algorithms (nested loops, hash, merge)
  • Sort and filter operations
  • Estimated and actual row counts

Important because it helps identify:

  • Missing indexes
  • Inefficient operations (table scans on large tables)
  • Poor cardinality estimates
  • Opportunities for optimization

2. What is SARGable and why does it matter?

Answer: SARGable (Search ARGument ABLE) refers to queries that can utilize indexes. A predicate is SARGable if the optimizer can use an index seek.

Non-SARGable patterns:

  • Functions on columns: WHERE YEAR(Date) = 2024
  • Arithmetic: WHERE Price * 1.1 > 100
  • Leading wildcards: WHERE Name LIKE '%Smith'
  • Implicit conversions

SARGable queries enable index seeks instead of scans, dramatically improving performance on large tables.


3. Explain the difference between a scan and a seek.

Answer:

  • Scan: Reads every row in the table or index. O(n) operation. May be unavoidable for queries returning large portions of data.
  • Seek: Navigates B-tree structure directly to matching rows. O(log n) for finding start, then sequential read. Much faster for selective queries.

Example: Finding 10 rows in a million-row table:

  • Scan: Reads 1,000,000 rows
  • Seek: Reads ~10-20 pages (index navigation + data pages)

4. What is parameter sniffing?

Answer: Parameter sniffing occurs when SQL Server creates an execution plan based on the parameter values from the first execution, then reuses that plan for subsequent executions with different parameters.

Problem: If first execution uses atypical values, the cached plan may perform poorly for typical values.

Solutions:

  • OPTION (RECOMPILE) - New plan each time
  • OPTION (OPTIMIZE FOR UNKNOWN) - Use average statistics
  • Local variables - Hides parameter values
  • Query Store forced plans

5. How do you identify slow queries in production?

Answer: Several methods:

  1. Query Store (SQL 2016+): Built-in query performance tracking
  2. Extended Events: Capture slow queries with customizable thresholds
  3. DMVs: Query sys.dm_exec_query_stats for aggregate statistics
  4. SQL Server Profiler: Real-time query tracing (deprecated in favor of Extended Events)
  5. Application logs: Log queries taking longer than threshold

Best approach: Use Query Store for historical analysis, Extended Events for real-time monitoring.


6. When would you use query hints?

Answer: Use hints sparingly, as a last resort, when:

  • Optimizer consistently chooses wrong plan
  • Statistics are accurate but plan is still poor
  • Temporary fix while investigating root cause
  • Specific scenarios like forced parallelism for reporting queries

Avoid because:

  • Optimizer usually knows best
  • Data distribution changes over time
  • Hints prevent optimizer from adapting
  • Can cause worse performance in future
