Syndication

Catch-all queries

The query form that I refer to as ‘Catch-all’ typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IN NULL)

Now this does work, the problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I’m going to take a look at why that is the case and what alternatives there are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based off a table in the AdventureWorks database.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of 980000 that are in the table, so we’d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup. (more…)

When is a seek actually a scan?

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

Let me show an example

CREATE TABLE TestingSeeks (
id int identity (1,1) not null,
SomeStr char(6) default '' -- a filler
)
GO

insert into TestingSeeks (SomeStr)
select top (500000) ''
from sys.columns c1 cross join sys.columns c2

We have a table here with an identity column on it, starting at 1 and incrementing by 1 row. Hence, there will be no negative values in the table. I’m going to then put a nonclustered index on that column (the table has no cluster, it’s a heap)

CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected and Statistics IO tells me that 935 pages were read

(more…)

Estimated and Actual execution plan revisited

After an interesting discussion on SQLServerCentral last week, I realised that the terms ‘estimated execution plan’ and ‘actual execution plan’ are perhaps a little bit misleading.

The only thing estimated about the estimated execution plan is the rowcounts, costs and row size. The plan itself isn’t an estimate. It’s not as if the optimiser, when asked for an estimated plan, does a less thorough job than when asked to compile a plan for a query’s execution.

The two forms of execution plan are better described as ‘execution plan with run-time information’ and ‘execution plan without run-time information’

When, in Management Studio, someone clicks the ‘display estimated execution plan’ button, the query is submitted to SQL Server, parsed and bound, algebratised and optimised just as if it was going to be executed. But the query is not executed, and as such, the plan when returned contains no run time information.

If there is a matching cached query plan, that cached plan is what’s returned and no optimisation is done. This can be seen by using profiler with the Cache hit, cache miss and cache insert events being traced.

When, in Management Studio, the query is run with the execution plan option enabled, the query is submitted to SQL Server, parsed and bound, algebratised, optimised and executed. The returned plan does contain the run-time for that specific execution, hence the plan contains things like ‘actual row count, actual IO cost’, etc

If there’s a matching query plan in cache then that cached plan will be used for the query’s execution and will be the one returned, though with the run-time information added

When a plan is cached, only the compile-time information is cached. The detailed run-time information on the actual number of rows and actual executions is discarded after updating the aggregated query stats. Hence, when you retrieve a query from the plan cache, it will not contain the run-time information. Consider a plan that’s been used 20 times. Which execution’s run-time information would it contain? Remember that there’s only one plan in cache per procedure.

Hence, a plan fetched from cache will be identical to the plan returned by requesting the estimated execution plan for a specific query (Providing there’s nothing happening to invalidate the cached plan)

Profiler can capture (depending on event) the plan without the run-time information or the plan with the run-time information. There’s a nice table in chapter 2 of one of Itzik’s books that shows the various profiler events, when they fire and what they return.

All indexes are unique

Well, that’s a rather contentious title. There are probably several people shaking their heads at this point. Let me explain.

I was listening to a podcast with Kimberly Tripp this morning, and she mentioned this briefly. I thought it would be a good discussion to end a short series on indexes and selectivity.

The Clustered Index

A clustered index has to be unique, because the clustering key acts as the row’s location in the table. If the index is not defined as unique, SQL will make it unique by adding a uniquifier, a 4-byte integer that’s hidden behind the scenes and is added when necessary to make the clustered index unique.

It’s not documented anywhere clearly, but it is mentioned in a couple of places. From msdn:

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

So all clustered indexes are unique.

(more…)

Index columns, selectivity and inequality predicates

So, following on from my post last week, I’m going to take a look at how selectivity and index column order affect inequality predicates.

One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start t one, then an index on that column is not going to be very useful.

The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality is no longer eligible for seeking.

To explain with an example, consider our hypothetical table from the previous post (with one small change):

CREATE TABLE ConsideringIndexOrder (
ID INT,
SomeString VARCHAR (100),
SomeDate DATETIME DEFAULT GETDATE()
);  

The same as previously, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

If there’s an inequality predicate, then then the index is only fully seekable for the following queries
…  WHERE ID = @ID AND SomeDate = @dt AND SomeString > @str
…  WHERE ID = @ID AND SomeDate > @dt
…  WHERE ID > @ID

(more…)

A Bookmark lookup, by any other name…

I think I may have confused some people by talking about bookmark lookups. I’ll attempt to clarify things.

The operator that I’m talking about is the one that fetches extra columns from the clustered index when the nonclustered index that’s used to retrieve the rows doesn’t have all of the columns required.

In SQL 2000, that operator appeared in the execution plan as a bookmark lookup and it appeared as follows:

In SQL 2005, the name was changed, and the bookmark lookup no longer appeared in the execution plan. In it’s place appeared a clustered index seek, joined back to the original index seek by a nested loop join. It appeared as follows (the highlighted operator is the ‘bookmark lookup’)

This change made it harder to see what was going on as clustered index seeks are usually considered ‘good’. The trick to see if it is actually a ‘bookmark lookup’ is to look at the objects involved. When the nonclustered index seek and the clustered index seek are both on the same table, then it’s a ‘bookmark lookup’

I don’t recall what this appeared as when the base table was a heap, not a cluster.

In SQL 2005 SP2, the name of the operator was changed again, now appearing as a key lookup (when the base table has a clustered index) or a RID lookup (when the base table is a heap). It now looks like this:

The thing to note is that it’s not the version of the server that’s important. The format of the XML for the exec plan has not changed since SQL 2005 RTM (I can and have created a .sqlplan file from SQL 2008 and opened that file in SQL 2005’s management studio).

It’s the version of management studio that affects how the execution plans are displayed. If the server is SQL 2005 SP3, but the client tools are still RTM, the bookmark lookup will appear as a clustered index seek. Another reason to patch the client as well as the server

I hope that clears up some of the confusion around the naming. So, in future, what should I refer to this as? A bookmark lookup? A Key lookup?

Index columns, selectivity and equality predicates

Or “Which column goes first?

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.

(more…)

Seek or scan?

One very common question that I see on the forums is on index seeks and index scans. A query is resulting in a table/clustered index scan, even though there’s an index on one or more of the columns been searched on.

One of the more common reasons that this happens is because the index in question is not covering, and SQL has determined that the cost of doing the lookups to fetch the extra columns is higher than the cost of scanning the entire table.

If an index does not cover a query, then bookmark lookups are required to get the additional columns, bookmark lookups are run one row at a time, and are seeks on the clustered index. Hence it’s clear that bookmark lookups on a large number of rows are exceedingly expensive and that is why SQL will switch to a clustered index/table scan when lookups are required on a significant percentage of the rows in the table.

So, what constitutes a significant percentage of the rows in the table? 50%? 20%? 10%?

(more…)

Identifying inaccurate statistics

I wrote previously about statistics, what they’re needed for and I briefly mentioned what can happen when they’re inaccurate.

One question I’ve seen asked a few times is on how to identify stats that are inaccurate, what DMV to use. The bad new is that there is no DMV that identifies inaccurate statistics. The Stats_Date function can be used to see when the stats were last updated, but that doesn’t say if that was too long ago ort not. If a table is readonly, or is changed twice a year, statistics that are four months old are perfectly valid.

The rowmodcnt column in sysindexes can give an approximate count of the number of rows that have changed since the stats were last updated, although sysindexes is deprecated and will disappear in the next version of SQL and the rowmodcnt column is no longer completely accurate. The rowmodcnt however is just a count of changes. It gives no indication of whether that number is too high and causing a problem. In some cases 50% of the table may have changed before there’s a problem with queries. In other cases 5% is enough to cause problems.

(more…)