Syndication

Index columns, selectivity and inequality predicates

So, following on from my post last week, I’m going to take a look at how selectivity and index column order affect inequality predicates.

One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start t one, then an index on that column is not going to be very useful.

The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality is no longer eligible for seeking.

To explain with an example, consider our hypothetical table from the previous post (with one small change):

CREATE TABLE ConsideringIndexOrder (
ID INT,
SomeString VARCHAR (100),
SomeDate DATETIME DEFAULT GETDATE()
);  

The same as previously, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

If there’s an inequality predicate, then then the index is only fully seekable for the following queries
…  WHERE ID = @ID AND SomeDate = @dt AND SomeString > @str
…  WHERE ID = @ID AND SomeDate > @dt
…  WHERE ID > @ID

(more…)

A Bookmark lookup, by any other name…

I think I may have confused some people by talking about bookmark lookups. I’ll attempt to clarify things.

The operator that I’m talking about is the one that fetches extra columns from the clustered index when the nonclustered index that’s used to retrieve the rows doesn’t have all of the columns required.

In SQL 2000, that operator appeared in the execution plan as a bookmark lookup and it appeared as follows:

In SQL 2005, the name was changed, and the bookmark lookup no longer appeared in the execution plan. In it’s place appeared a clustered index seek, joined back to the original index seek by a nested loop join. It appeared as follows (the highlighted operator is the ‘bookmark lookup’)

This change made it harder to see what was going on as clustered index seeks are usually considered ‘good’. The trick to see if it is actually a ‘bookmark lookup’ is to look at the objects involved. When the nonclustered index seek and the clustered index seek are both on the same table, then it’s a ‘bookmark lookup’

I don’t recall what this appeared as when the base table was a heap, not a cluster.

In SQL 2005 SP2, the name of the operator was changed again, now appearing as a key lookup (when the base table has a clustered index) or a RID lookup (when the base table is a heap). It now looks like this:

The thing to note is that it’s not the version of the server that’s important. The format of the XML for the exec plan has not changed since SQL 2005 RTM (I can and have created a .sqlplan file from SQL 2008 and opened that file in SQL 2005’s management studio).

It’s the version of management studio that affects how the execution plans are displayed. If the server is SQL 2005 SP3, but the client tools are still RTM, the bookmark lookup will appear as a clustered index seek. Another reason to patch the client as well as the server

I hope that clears up some of the confusion around the naming. So, in future, what should I refer to this as? A bookmark lookup? A Key lookup?

Index columns, selectivity and equality predicates

Or “Which column goes first?

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.

(more…)

Seek or scan?

One very common question that I see on the forums is on index seeks and index scans. A query is resulting in a table/clustered index scan, even though there’s an index on one or more of the columns been searched on.

One of the more common reasons that this happens is because the index in question is not covering, and SQL has determined that the cost of doing the lookups to fetch the extra columns is higher than the cost of scanning the entire table.

If an index does not cover a query, then bookmark lookups are required to get the additional columns, bookmark lookups are run one row at a time, and are seeks on the clustered index. Hence it’s clear that bookmark lookups on a large number of rows are exceedingly expensive and that is why SQL will switch to a clustered index/table scan when lookups are required on a significant percentage of the rows in the table.

So, what constitutes a significant percentage of the rows in the table? 50%? 20%? 10%?

(more…)

Identifying inaccurate statistics

I wrote previously about statistics, what they’re needed for and I briefly mentioned what can happen when they’re inaccurate.

One question I’ve seen asked a few times is on how to identify stats that are inaccurate, what DMV to use. The bad new is that there is no DMV that identifies inaccurate statistics. The Stats_Date function can be used to see when the stats were last updated, but that doesn’t say if that was too long ago ort not. If a table is readonly, or is changed twice a year, statistics that are four months old are perfectly valid.

The rowmodcnt column in sysindexes can give an approximate count of the number of rows that have changed since the stats were last updated, although sysindexes is deprecated and will disappear in the next version of SQL and the rowmodcnt column is no longer completely accurate. The rowmodcnt however is just a count of changes. It gives no indication of whether that number is too high and causing a problem. In some cases 50% of the table may have changed before there’s a problem with queries. In other cases 5% is enough to cause problems.

(more…)