Performance

Is a clustered index best for range queries?

I see a lot of advice that talks about the clustered index being the best index to use for range queries, that is, queries with inequality filters, queries that retrieve ranges of rows, as opposed to singleton queries, queries that retrieve single rows (including, unfortunately, a TechNet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.

Question is, is that really the case?

Let’s do some experiments and find out.

CREATE TABLE TestingRangeQueries (
ID INT IDENTITY,
SomeValue NUMERIC(7,2),
Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One cluster and two nonclustered indexes on the column that will be used for the range filter

CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
ON TestingRangeQueries (ID)
INCLUDE (SomeValue)
GO

The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes I’m testing, one is the clustered index, one is a nonclustered index that does not cover the query and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.
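For reference, the hinted versions look something like this (a sketch; WITH (INDEX(…)) is the standard table hint, using the index names from the script above):

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_Cluster))
WHERE ID BETWEEN 20000 AND 200000

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC1))
WHERE ID BETWEEN 20000 AND 200000

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC2))
WHERE ID BETWEEN 20000 AND 200000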

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

[Figure: ClusteredIndex]

Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.

SQL Server Execution Times:
CPU time = 94 ms,  elapsed time = 110 ms.

(more…)

Distincting an IN subquery

This is going to be a quick one…

I keep seeing forum code (and production code) that includes a DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.

Is it necessary or useful? Only one way to find out.

Let’s check for correct results first, because that can be done with nice small tables.

CREATE TABLE DistinctOuter (
ID INT
);

CREATE TABLE DistinctInner (
ID INT
);

INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)

INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)

[Figure: DistinctIN]

(more…)

Hit and miss

Or “Monitoring plan cache usage”

For people interested in the details of how SQL is using and reusing execution plans, there are some useful events in Profiler for watching this in detail, under the Stored Procedures group:

  • SP:CacheMiss
  • SP:CacheInsert
  • SP:CacheHit
  • SP:CacheRemove
  • SP:Recompile
  • SP:StmtRecompile

Additionally there’s the SQL:StmtRecompile event under the TSQL group.

For now, I just want to look briefly at the CacheMiss and CacheHit events.

One word of caution early on: these are frequently occurring events and it may not be a good idea to trace them on busy production servers. If you do need to, keep the duration of the trace short and the columns to a minimum.

[Figure: CacheEvents]

CacheMiss

The cache miss event fires any time SQL looks for the execution plan for an object or ad-hoc batch and does not find it in the plan cache.

For an object (scalar function, multi-statement table-valued function, stored procedure or trigger) the match is done on the object ID (along with some of the connection’s SET options and possibly the database user and a couple of other factors). For an ad-hoc batch, the match is done on a hash of the text of the batch (along with some of the connection’s SET options and possibly the database user).
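
If you want to see those cache-key attributes on your own server, the plan cache DMVs expose them. This query is mine rather than part of the trace above; sys.dm_exec_plan_attributes marks which attributes must match for a plan to be reused.

SELECT cp.objtype, cp.usecounts, st.text, pa.attribute, pa.value
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) AS pa
WHERE pa.is_cache_key = 1 -- only the attributes that form part of the cache lookup key
ORDER BY cp.usecounts DESC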

When testing stored procedures from Management Studio (or another SQL querying tool), two CacheMiss events will appear in the trace.

[Figure: CacheMiss]

What’s going on here?

(more…)

The Root of all Evil

Or “Shotgun query tuning”

There have been a fair few forum questions in recent months asking for help in removing index scans, loop joins, sorts or other query operators presumed to be slow. There have been just as many asking how to replace a subquery with a join or a join with a subquery or similar aspects of a query, usually for performance reasons.

The first question that I have to ask when looking at requests like that is “Why?”

Why is removing a particular query operator the goal? Why is changing a where clause predicate the goal? If it’s to make the query faster, has the query been examined, and has it been confirmed that the query operator or predicate really is the problem?

The title of this post refers to a comment I’ve seen again and again in blogs or articles about front-end development: “Premature optimisation is the root of all evil.” It’s true in the database field as well.

(more…)

The most optimal join type

What’s the best join type for a query? Should we aspire to seeing nested loop joins in all our queries? Should we tremble with horror at the sight of a hash join?

Well, it depends. 🙂

There’s no single join type that’s best in every scenario and there’s no join type that’s bad to have in every scenario. If one of the join types, say the much maligned hash join, was very much a sub-optimal join type in every single scenario, then there would be no reason for it to be in the product and no reason for the optimiser to ever select it for a plan. Since there are three join types, and the optimiser can and does use all three, we must assume that they are all useful under some circumstances.

I took a look at the joins a while back, but it’s worth revisiting.

The nested loop join

A nested loop join is an optimal join type when two conditions are true.

  1. One of the resultsets contains quite a small number of rows.
  2. The other table has an index on the join column(s).

When both of these are true, SQL can do a very efficient nested loop. The smaller resultset becomes the outer table of the join, a loop runs across all the rows in that resultset and index seeks are done to look up the matching rows in the inner table. It’s important to note that the number of seeks against the inner table will not be less than the number of rows in the outer table at the point the join occurs.

If one resultset has a small number of rows but there is no index on the other table’s join column, then a loop join can still be done, but it is less optimal, as the entire inner table (or a subset based on another filter condition) must be read on each iteration of the loop.

If both resultsets have large numbers of rows but there is an index on the join columns in one of the tables then the nested loop can still read through one of the resultsets and do index seeks to locate matching rows, but the number of rows in the outer table will mean lots and lots of seek operations, which may result in a sub-optimal plan.
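
As a quick, hedged illustration (AdventureWorks, not part of the original examples), this is the shape that suits a nested loop: very few outer rows, and an index on the inner table’s join column to seek into.

-- One order's worth of rows from SalesOrderHeader on the outer side;
-- the clustered index on SalesOrderDetail (leading on SalesOrderID) gives
-- the loop join a cheap seek for the matching inner rows.
SELECT soh.SalesOrderID, sod.ProductID, sod.LineTotal
FROM Sales.SalesOrderHeader AS soh
INNER JOIN Sales.SalesOrderDetail AS sod ON sod.SalesOrderID = soh.SalesOrderID
WHERE soh.SalesOrderID = 43659 -- a sample order ID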

(more…)

Multiple Execution Paths

It’s not uncommon to find stored procedures that have multiple IF statements controlling the flow of execution within the procedure. Now this seems to be a fairly logical thing to do, but there can be a subtle performance problem with this, one that may be hard to identify.

Let’s have a look at a simple example (using AdventureWorks)

CREATE PROCEDURE MultipleExecPaths (
@TransactionType char(1) = NULL
)
AS

IF @TransactionType IS NULL
    SELECT MAX(TransactionDate) FROM Production.TransactionHistory
ELSE
    SELECT MAX(TransactionDate) FROM Production.TransactionHistory
    WHERE TransactionType = @TransactionType

GO

Nice and simple. If the parameter is passed, get the latest date for that transaction type; if the parameter is not passed, i.e. is null, get the latest date over all transaction types. So what’s wrong with this?

The problem goes back to parameter sniffing. When the procedure is executed for the first time, all queries in the procedure are parsed, bound and optimised. When the optimiser processes each statement to generate an execution plan, it uses the values passed for the various parameters to estimate the number of rows affected. The number of rows that the optimiser thinks the queries will process affects the choice of operators for the plan. Operators that are optimal for small numbers of rows are not always optimal for large numbers of rows, and sometimes the difference can be astounding.

Let’s see how the example above plays out to understand what’s happening here.
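
One way to observe it (my sketch, not necessarily the demo that follows) is to execute the procedure down one path first, then the other, and compare estimated against actual row counts in the actual execution plans.

-- First execution: all statements in the procedure are compiled using @TransactionType = NULL
EXEC MultipleExecPaths

-- Second execution: reuses the cached plans, even though it takes the other branch
-- ('W' is one of the TransactionType values in AdventureWorks)
EXEC MultipleExecPaths @TransactionType = 'W'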

(more…)

Functions, IO statistics and the Execution plan

It’s no secret that I’m not overly fond of most user-defined functions. This isn’t just a pet hate, I have some good reasons for disliking them. All too often they’re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they’re hidden performance bottlenecks that often go overlooked and ignored for too long.

I’m going to start with this fairly simple scalar function, created in the AdventureWorks database

Create function LineItemTotal(@ProductID int)
returns money
as
begin
declare @Total money

select @Total = sum(LineTotal) from sales.SalesOrderDetail where productid = @ProductID

return @Total
end

So, given that function, the following two queries should be equivalent.

SELECT productid, productnumber, dbo.LineItemTotal(productid) as SumTotal
FROM Production.Product p

SELECT productid, productnumber,
(select sum(LineTotal) from sales.SalesOrderDetail where productid = p.productid) AS SumTotal
FROM Production.Product p

No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let’s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.
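
For completeness, the session settings that produce the numbers below:

-- Report per-query logical/physical reads and CPU/elapsed times
SET STATISTICS IO ON
SET STATISTICS TIME ON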

Query 1, the one with the scalar function:

Table ‘Product’. Scan count 1, logical reads 4, physical reads 0.

SQL Server Execution Times:
CPU time = 47297 ms,  elapsed time = 47541 ms.

Query 2, the one with the correlated subquery:

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘SalesOrderDetail’. Scan count 3, logical reads 22536, physical reads 0.
Table ‘Product’. Scan count 3, logical reads 40, physical reads 0.

SQL Server Execution Times:
CPU time = 1047 ms, elapsed time = 1249 ms.

(more…)

Catch-all queries

The query form that I refer to as ‘Catch-all’ typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause, of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL).

Now this does work; the problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I’m going to take a look at why that is the case and what alternatives there are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based off a table in the AdventureWorks database.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.
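
For reference, that index would be created along these lines (the column order here is my assumption; only the index name and the two columns are given):

CREATE NONCLUSTERED INDEX idx_TranHistory_TranTypeProductID
ON Production.TransactionHistory (TransactionType, ProductID)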

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of 980000 that are in the table, so we’d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup. (more…)

When is a seek actually a scan?

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

Let me show an example

CREATE TABLE TestingSeeks (
id int identity (1,1) not null,
SomeStr char(6) default '' -- a filler
)
GO

insert into TestingSeeks (SomeStr)
select top (500000) ''
from sys.columns c1 cross join sys.columns c2

We have a table here with an identity column on it, starting at 1 and incrementing by 1. Hence, there will be no negative values in the table. I’m then going to put a nonclustered index on that column (the table has no clustered index; it’s a heap).

CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected, and Statistics IO tells me that 935 pages were read.
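
The query in question is nothing more than this (a sketch of what’s described above):

-- Retrieves just the indexed column for every row; the plan shows a scan of idx_Seek
SELECT id
FROM TestingSeeks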

(more…)

Identifying inaccurate statistics

I wrote previously about statistics, what they’re needed for and I briefly mentioned what can happen when they’re inaccurate.

One question I’ve seen asked a few times is how to identify stats that are inaccurate, what DMV to use. The bad news is that there is no DMV that identifies inaccurate statistics. The STATS_DATE function can be used to see when the stats were last updated, but that doesn’t say whether that was too long ago or not. If a table is read-only, or is changed twice a year, statistics that are four months old are perfectly valid.
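
Checking the dates is easy enough; it’s the interpretation that’s hard. A minimal sketch using STATS_DATE and sys.stats:

SELECT OBJECT_NAME(s.object_id) AS TableName,
       s.name AS StatisticName,
       STATS_DATE(s.object_id, s.stats_id) AS StatsLastUpdated
FROM sys.stats AS s
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY StatsLastUpdated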

The rowmodctr column in sysindexes can give an approximate count of the number of rows that have changed since the stats were last updated, although sysindexes is deprecated and will disappear in the next version of SQL and the rowmodctr column is no longer completely accurate. The rowmodctr value, however, is just a count of changes. It gives no indication of whether that number is too high and causing a problem. In some cases 50% of the table may have changed before there’s a problem with queries. In other cases 5% is enough to cause problems.
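
For what it’s worth, here’s a sketch of reading that counter from the deprecated compatibility view; again, the number on its own says nothing about whether it’s actually causing a problem.

SELECT OBJECT_NAME(si.id) AS TableName,
       si.name AS IndexName,
       si.rowmodctr AS RowsModifiedSinceLastStatsUpdate
FROM sys.sysindexes AS si
WHERE OBJECTPROPERTY(si.id, 'IsUserTable') = 1
  AND si.indid <= 1 -- the heap or clustered index entry for the table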

(more…)