Indexes

All you need to know about Columnstore Indexes in one article

I realised the other week, that despite a bunch of posts on indexes over the years, I’ve never written a blog post on Columnstore indexes. Time to fix that. Here’s everything you need to know to get started using columnstore indexes. (Note, this is not, in any way, everything there is to know about columnstore indexes, for that see Nico’s blog series, currently at 131 entries long)

Before I dive into columnstores, for comparison let me first discuss rowstore indexes.

In a rowstore index (what we previously would have just called an index), pages are in a b-tree, the leaf levels containing all the rows and the upper levels containing one row per page below them.

The columns that make up the row (or at least those which are part of the index) are found together in the page. Hence, wanting one column from the index means reading the entire page and all of the other columns that are part of that index.

Columnstores are… different.

The first, and I would say most important thing to realise about columnstore indexes is that they don’t have keys. These are not seekable indexes. Any access to a columnstore index is going to be a scan.

Instead of storing the rows together on a page, a columnstore index instead stores column values together. The rows in the table are divided into chunks of max a million rows, called a row group, and the columns are then stored separately, in what are called segments. A segment will only ever contain one column’s values.

The segments are then compressed, and because typically there will be repeated values in a column, they can compress quite well.

Because of this architecture, if a query needs to get all the values for Column 2 and Column 3, it’s a very efficient access. All of the segments that contain Column 2 and Column 3 can be read, and the segments that contain the other columns can be completely ignored. Conversely, if the query needs all of the columns for a handful of rows, it’s quite an inefficient access. Because there’s no way to seek against a columnstore index, all of the segments would have to be read* to locate the column values that make up the row and the row would have to be reconstructed.

(*) There is a process called rowgroup elimination which can remove row groups from consideration when locating rows. I will not be going into that in this article.

An index of this form is not editable, not easily anyway. The columns are all compressed, so changing a single value could require the entire segment be decompressed before updating a value. The first version of columnstore indexes were indeed readonly, but that made them less than useful in many cases, and so since SQL 2014 columnstore indexes are updatable. Kinda

Well, the compressed segments are not updatable, nor can they be added to, nor can they be deleted from. Directly, that is. Instead what’s done is that new rows are added to something called a ‘delta store’, which is a b-tree index associated with the columnstore. Newly inserted rows are added to this delta store, not to the compressed segments directly. Once the delta store reaches a certain size it’s closed, compressed and the contents are a added to the columnstore index as a new rowgroup. Deletes are handled similarly. When a row is deleted, a flag is set in a deleted bitmap indicating that the row is no longer present. When the index is read, any rows marked deleted in the bitmap are removed from the resultset. Updates are split into deletes and inserts, and hence flag the old version of the row as deleted and then insert the new version of the row into the delta store.

Index rebuild will recreate all of the rowgroups and remove the logically deleted rows, as well as force any open rowgroups to be compressed. Index reorganise operations will force open rowgroups to be compressed and will remove logically deleted rows if more than 10% of the rows in a rowgroup are flagged deleted.

All well and good so far, but why bother? The simple answer here is speed. Columnstore indexes are fast for operations involving large numbers of rows for a couple reasons

  • Column-based storage
  • Compression
  • Batch mode query processing

The first one we’ve already looked at. When a query needs most of the rows in the table but only a few of the columns, the column-based storage is more efficient.

Compression is the second reason. Columnstore indexes are compressed, and because there’s more compression opportunities for column values stored together than rows, they typically compress very well. Good compression means less data which needs to be moved around, which generally translates to better performance

Batch mode used to be the third reason. Batch mode is a query processor feature where multiple rows can be processed at once, rather than one row at a time. It can be a lot faster than row mode. When columnstores were introduced, batch mode was only an option when there was a columnstore index used in the query, but since SQL 2017 batch mode has been available for rowstore indexes as well.

How much faster? Well…

The only difference between the above two queries is that one ran against a table with a rowstore clustered index and one ran against a table with a columnstore clustered index. Oh, and 19 seconds of CPU time. Both tables have exactly the same data in them, 88 million rows.

That’s about all for a high-level overview of columnstore indexes. Of course this isn’t everything there is to know about columnstores, just what you need to know to get started using them. If you want everything, there’s always Nico’s blog series.

Index selectivity and index scans

There was a question raised some time back ‘If an index is not selective, will the query operators that use it always be index scans’?

It’s an interesting question and requires a look at what’s going on behind the scenes in order to answer properly..

Short answer: No, not always.

Long answer…

Selectivity

Selectivity is a measure of what portion of the table satisfies a particular query predicate. The Microsoft whitepaper on statistics as used by the query optimiser defines selectivity as follows.

The fraction of rows from the input set of the predicate that satisfy the predicate. More sophisticated selectivity measures are also used to estimate the number of rows produced by joins, DISTINCT, and other operators.

Bart Duncan wrote a nice detailed blog post a while back explaining the difference between density, selectivity and cardinality. In summary, indexes have density, a measure of how unique the left-based column subsets within them are; predicates have selectivity, a measure of what portion of the table they affect; operators have cardinality, a measure of how many rows the operator processes.

Indexes cannot be said to be selective or not, they can only be said to have a high or low density. It is possible for a predicate on a very low density column (unique) to have a very poor selectivity (large percentage of the table affected) Imagine ID > 0 where ID is an int identity column. The column is unique, but the predicate affects the entire table. Low density (which is good), but poor selectivity.

So let’s alter the original question. “If an index has a high density (not very unique, lots of duplicate values), will query operators against it always be index scans rather than index seeks?”

Seek and Scan

Before we go on, I want to quickly look at the main difference between a seek operation and a scan operation.

A seek is an operation which navigates down the index’s b-tree looking for a row or for the start/end of a range of rows. A seek requires a predicate and that predicate must be of the form that can be used as a search argument (SARGable)

A scan is a read of part or all of the leaf level of an index.

High-density indexes

So what exactly is the problem with a high density index? In short, it returns a lot of rows for any predicate filters against it (unless there’s a TOP involved, but let’s ignore those cases here). If the index has a high density (and lets assume for simplicity there’s no data skew here), any predicate using that index automatically has a poor selectivity, it returns a large portion of the table.

If we take as an example a 100 000 row table, with an column called status that has 4 values only, then, assuming that the distribution of those values is equal, a query with a predicate searching for one of those values will read 25000 rows. If we have a nonclustered index on that integer column, it works out that the nonclustered index has 223 pages at the leaf level and is 2 levels deep in total. Given that the four values have equal distribution, an index seek to retrieve the rows for one of those status values will require approximately 57 pages to be read.

Is the index scan better?

The scan will read all the leaf pages, that’s what a scan does (ignoring cases like min, max and top where it can scan and read only part of the index). So if SQL decided to use an index scan because of the high density of the index it will have to read all 100 000 rows on all 223 pages (plus the index root page)

57 pages for the index seek vs 224 pages for the index scan. Looks pretty obvious which is better. To prove that I’m not making things up, let me test this and get actual numbers.

First the setup:

CREATE TABLE TestingIndexSeeks (
   Status INT,
   Filler CHAR(795) DEFAULT ''
);

INSERT INTO TestingIndexSeeks (Status)
SELECT NTILE(4) OVER (ORDER BY (SELECT 1)) AS Status FROM (
    SELECT TOP (100000) 1 AS Number FROM sys.columns a CROSS JOIN sys.columns b
) sub

CREATE NONCLUSTERED INDEX idx_Testing_Status ON dbo.TestingIndexSeeks (Status)

GO

Then the  test:

SELECT status FROM dbo.TestingIndexSeeks WITH (FORCESEEK) WHERE Status = 3

SELECT status FROM dbo.TestingIndexSeeks WITH (FORCESCAN) WHERE Status = 3

Statistics IO for the two queries:

Seek
Table ‘TestingIndexSeeks’. Scan count 1, logical reads 59, physical reads 0.

Scan
Table ‘TestingIndexSeeks’. Scan count 1, logical reads 225, physical reads 0.

Yup, index seek is better and the one that the optimiser choses if it is allowed to chose.

IndexSeek

High density indexes and the clustered index

So why the confusion around index scans on high density indexes? I suspect it’s because of the way the optimiser handles noncovering indexes where the predicates are not selective. This has nothing to do with the efficiency of the seek or scan operators on the nonclustered index though, it’s got to do with the mechanism used for the key lookup.

If a nonclustered index that SQL could use for a query is not covering, then for each row in that resultset it has to do a lookup back to the cluster/heap for the rest of the columns. Those key (or RID) lookups are expensive operations. If there are too many needed then the optimiser switches to a scan, not of the nonclustered index (it would be pointless, it’s still not covering), but of the clustered index because that at least has all the columns needed for the query (it could also switch to a scan of a different nonclustered index if there is one that’s covering but with columns in the wrong order to be seekable)

Summary

In summary, does having a high density nonclustered index result in index scans of that index? No (unless the predicate is not SARGable), however it can result in scans of a different index (probably the cluster) if the index is not covering for a query and that high density index being unused.

SQL University: Advanced Indexing – Indexing Strategies

Right, I know it’s Friday and everyone’s tired and looking forward to the weekend, but I do need to finish off this indexing section and I’ll try to keep this short and interesting and hopefully keep everyone awake.

There’s no shortage of information available on how to create indexes. Hell, I’ve written a copious amount myself. Most of these many articles however are written from the point of indexing single queries. What you chose for a where clause, what has to go into the include to create the perfect index for this query. Now that’s all well and good, but I’ve never met a system that had only one query per table (maybe there is such a system out there, but I’ve never found it)

So what I’m going to try to do today is address the topic of a strategy for indexing. How to approach indexing, not for a single query, but for the system as a whole. I won’t be able to cover this in-depth, this is material worthy of an entire book chapter, if not an entire book, but I can at least touch on the essential portions.

Now, there’s two main positions that we could be in when considering indexing strategies for an entire system
1) A brand new system that’s still in development
2) An existing system that’s being used actively.

One at a time…

Indexing strategies for a brand new system

Start by choosing a good clustered index. What makes a good clustered index? Well, it depends 🙂

The clustered index is the base, it will affect each and every nonclustered index, and it’s not trivial to change once the system is in use, so chose carefully. I’m not saying another word on the subject of a clustered index, not today.

Once that’s done…

(more…)

SQL University: Advanced Indexing – Filtered Indexes

Welcome back to day 2 of Advanced Indexing. Today we’re going to look at a feature that was added in SQL Server 2008 – filtered indexes.

In versions previous, indexes were always on the entire table. An index would always have the same number of rows as the table it was built on did (which is why COUNT(*) can just scan the smallest index on the table)

With filtered indexes, it’s possible to have an index that’s built on a subset of the rows in the table. The definition for a filtered index contains a WHERE clause predicate that determines if a row in the table will be in the index or not.

This can be a major advantage on really large tables where most queries are only interested in a small fraction of the table. A normal index would be based on the entire table regardless of the fact that most of the table is of no interest, meaning the index would be larger than necessary, deeper than necessary and take up more space than would be ideal. With a filtered index on just the interesting portion of the table, the index size is kept to a minimum, meaning it’s shallower than an index on the entire table and hence more efficient.

A simple example of a filtered index would be

CREATE NONCLUSTERED INDEX idx_Example
ON Account (AccountNumber)
WHERE Active = 1;

There are two main uses of a filtered index:
1) Enforcing moderately complex uniqueness requirements
2) Supporting queries that filter on common subsets of a table

Filtered indexes and unique columns

One very interesting use of filtered indexes is in enforcing uniqueness over portions of a table. One requirement that come up again and again is to have a nullable column that must have unique entries in it, but whose entries are optional. Basically, the column must be unique or null. Sounds easy, but the problem is that a unique index allows only one null. So much for nulls not being equal to anything including other nulls.

Prior to SQL 2008 implementing such a constraint required computed columns, indexed views or triggers. With SQL 2008’s filtered indexes, it’s trivial.

(more…)

SQL University: Advanced indexing – Sorting and Grouping

Good day everyone and welcome to another week of SQL University. I know we’re getting close to the end of the year and everyone’s looking forward to a nice long vacation soaking up the sun at the beach, but a little bit of attention would be nice. Thank you.

This week is Advanced Indexing, and I mean advanced, none of that selectivity, SARGable, predicate stuff that gets repeated all over the place. If you need a refresher on the basics before we get started, the following can be considered pre-requisite reading for this course

There’s also some additional background material available for more enthusiastic students:

Right, now that the admin has been handled, let’s get straight into things. Nothing like starting at the deep end…

Most people would rightly associate indexes with where clause predicates and joins, after all, the main usage of an index is to reduce the rows in consideration for a query as fast as possible. However there’s another portion of your queries that indexes can, if appropriately designed, help with – grouping and sorting.

Sorting is an extremely expensive operation, especially on large numbers of rows. For the academics in the audience, the algorithmic complexity of sorting is above linear, the time to sort a set of data increases faster than the number of items in the list. The common sorting algorithms have an average time complexity of O(n log n). It’s better than O(n2), but it can still hurt at the higher row counts.

O(n^2) O(n log n)

O(n2) on the left, O(n log n) on the right (Thanks to QuickMath)

Right, the non-academics can wake up now.

The other reason that sorting hurts is that it needs a memory grant to do the sort. If there’s a shortage of memory the query could have to wait a while for the memory to be granted and, if the optimiser mis-estimates the number of rows to be sorted, the memory grant could be insufficient and the sort would have to spill into TempDB. You don’t want that happening.

Finally, sort is a blocking operator in the execution plan (all rows must have been fetched from the source before any output can be returned), and so the client won’t start seeing rows returned until the entire sort has been completed. This can make the query feel like it takes longer than it really does.

Grouping and aggregation are much the same. To aggregate one set of values based on another set of values, SQL has to get all the like values of the grouping columns together so that it can do the aggregation. That sounds suspiciously like a sort doesn’t it?

SQL doesn’t always sort to do aggregations, but the alternative – hash tables – isn’t exactly free (homework exercise – read up on hash tables)

So for both sorting and grouping, the query processor’s job would be a lot easier if there was some way that it could get the data ordered by the sorting or grouping columns without having to do the work of actually sorting. Sounds impossible? No.

(more…)

Indexing for ORs

All of the indexing strategy posts I’ve written in the past have been concerned with predicates combined with ANDs. That’s only one half of the possibilities though. There’s the case of predicates combines with ORs, and the guidelines for indexing that work well with ANDs don’t work with ORs

When dealing with predicates combined with AND, the predicates are cumulative, each one operates to further reduce the resultset.

For this reason, multi-column indexes support multiple predicates combined with AND operators.

If we look at a quick example, consider the following.

CREATE TABLE Customers (
  CustomerID INT IDENTITY PRIMARY KEY,
  Surname VARCHAR(30) NOT NULL,
  FirstName VARCHAR(30),
  Title VARCHAR(5),
  CustomerType CHAR(1) NOT NULL,
  IsActive BIT DEFAULT 1 NOT NULL,
  RegistrationDate DATETIME NOT NULL DEFAULT GETDATE()
);</pre>
CREATE INDEX idx_Customers_SurnameFirstName ON Customers (Surname, FirstName);

Again I’m going to be lazy and get SQLDataGenerator to generate a few rows.

With that two column index on those columns and a query that looks for Surname = ‘Kelley’ AND Name = ‘Rick’, SQL can do a double column seek to go directly to the start of the range then just read down the index to the end of the range, basically until it finds the first row that it’s not interested in.

So how does that that differ when the operator is an OR?

The main difference is that with an OR, the predicates are independent. The second doesn’t serve to reduce the recordset, but rather to expand it. It’s similar to evaluating two separate predicates and combining the result. Let’s have a look at that 2 column index again when the two predicates are combined with an OR.

SELECT CustomerID
  FROM Customers
  WHERE Surname = 'Kelley' OR FirstName = 'Rick';

If we try to use that index to evaluate Surname = ‘Kelley’ OR Name = ‘Rick’, there’s a problem. While the first of those predicates can be evaluated with a seek (it’s a sargable predicate on the left-most column of an index), the second predicate cannot. It’s sargable, but it is on the second column of the index (and for the moment let’s assume there are no other indexes on the table). Seeks are only possible if the predicate filters on a left-based subset of the index key.

Hence to evaluate that predicate SQL will have to do an index scan. Since it has to do a scan to evaluate the one predicate, it won’t bother also doing a seek to evaluate the first predicate as it can also evaluate that during the scan.

Hence, in this case, the query will execute with a single index scan.

IndexScanWithOr

So how do we get this query to rather seek?

(more…)

Is a clustered index best for range queries?

I see a lot of advice that talks about the clustered index been the best index for use for range queries, that is queries with inequalities filters, queries that retrieve ranges of rows, as opposed to singleton queries, queries that retrieve single rows (including, unfortunately, a Technet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.

Question is, is that really the case?

Let’s do some experiments and find out.

CREATE TABLE TestingRangeQueries (
ID INT IDENTITY,
SomeValue NUMERIC(7,2),
Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One cluster and two nonclustered indexes on the column that will be used for the range filter

CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
ON TestingRangeQueries (ID)
INCLUDE (SomeValue)
GO

The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes that I’m testing, one is clustered, one is a nonclustered that does not cover the query and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

ClusteredIndex

Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.

SQL Server Execution Times:
CPU time = 94 ms,  elapsed time = 110 ms.

(more…)

One wide index or multiple narrow indexes?

TSQL2sDay150x150 Or “If one index is good, surely many indexes (indexes? indices? indi?) will be better

This is a question that comes up very often on the forums. Something along the lines of:

I have a query with multiple where clause conditions on a table. Should I create one index for each condition, or one index with all the columns in it?

The question basically boils down to this: Which is more optimal and more likely for the optimiser to pick, a single seek operation against a wide index that seeks on all three conditions in one go, or three seek operations against three indexes followed by a join to get back the final set of rows.

One thing to keep in mind is that one of the jobs of an index is to reduce the number of rows in consideration for a query as early as possible in the query’s execution.

So let’s take a made-up example. Let’s say we have a table with a number of columns in it. A query is run against that table with three conditions in the where clause

WHERE ColA = @A AND ColB = @B AND ColC = @C

Let’s further say that 1000 rows qualify for the condition ColA = @A, 15000 rows qualify for ColB = @B and 30000 rows qualify for ColC = @C. The total number of rows that qualify for all three conditions is 25.

Which sounds like it would be more efficient?

  • Seek on an index with all three columns and retrieve just 25 rows
  • Seek on an index on ColA, retrieve 1000 rows, seek on an index on ColB, retrieve 15000 rows, seek on an index on ColC, retrieve 30000 rows then join the three result-sets together to get the desired 25 rows (called an Index Intersection)

Time for some tests to find out.

(more…)

Is a scan a bad thing?

This one comes up from time to time, so I thought I’d have a go at addressing it.

Let’s imagine a hypothetical DBA who’s doing some performance tuning. He looks at a query plan for a moderately complex query and panics because there’s a couple of index scans and he wants to rather see index seeks.

Is that correct, are index scans bad and index seeks good?

Well, most of the time yes. Most of the time a scan is a problem and indicates a missing index or a query problem, but there are other times that it’s the most optimal way to get the required rows from the table.

I’ve previously looked at the case where the index seeks actually reads the the entire table, in this post I’m going to be evaluating some common query constructs to see when a seek really is the most optimal operator.

Let’s start with the simplest case, and I’m going to use the AdventureWorks database for these queries.

[source:sql]select ProductID, Name from Production.Product[/source]

In this case I get an index scan on the AK_Product_Name index and that makes perfect sense. I’m asking for all the rows in the table. there is no way that SQL can use a seek to execute that query. For there to be a seek, there has to be a SARGable predicate within the query that can be used for the seek.

(more…)

When is a seek actually a scan?

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

Let me show an example


CREATE TABLE TestingSeeks (
id int identity (1,1) not null,
SomeStr char(6) default '' -- a filler
)
GO

insert into TestingSeeks (SomeStr)
select top (500000) ''
from sys.columns c1 cross join sys.columns c2

We have a table here with an identity column on it, starting at 1 and incrementing by 1 row. Hence, there will be no negative values in the table. I’m going to then put a nonclustered index on that column (the table has no cluster, it’s a heap)


CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected and Statistics IO tells me that 935 pages were read

(more…)