Performance

Catch-all queries

13 March 2018: There is an updated post on this subject – https://www.sqlinthewild.co.za/index.php/2018/03/13/revisiting-catch-all-queries/

The query form that I refer to as ‘Catch-all’ typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL)

Now this does work; the problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I’m going to take a look at why that is the case and what alternatives there are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based off a table in the AdventureWorks database.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID IS NULL)
AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
AND (Quantity = @Qty OR @Qty IS NULL)
GO

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of 980000 that are in the table, so we’d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup.

Nope. It’s using that index all right, but it’s doing a scan, not a seek. Ok, not great, but not bad. Let me try a different set of parameters:

EXEC SearchHistory @Qty = 100

The plan’s exactly the same. No surprise: it was cached the first time and then reused. There’s a problem here though: the index that’s used is completely inappropriate and there’s a bookmark lookup that ran almost a million times. No wonder this execution took 3 seconds and 2,949,715 IOs to return 29 rows.

Ok, so let me try a different form of the catch-all query:

CREATE PROCEDURE SearchHistory_Improved
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost FROM Production.TransactionHistory
WHERE (ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END)
AND (ReferenceOrderID = CASE WHEN @OrderID IS NULL THEN ReferenceOrderID ELSE @OrderID END)
AND (TransactionType = CASE WHEN @TransactionType IS NULL THEN TransactionType ELSE @TransactionType END)
AND (Quantity = CASE WHEN @Qty IS NULL THEN Quantity ELSE @Qty END)
GO

Let’s see what that does for the first test:

EXEC SearchHistory_Improved @Product = 978, @TransactionType = 'W'

Well, that’s no better. Full-blown table scan.

The problem with these types of queries is that there is no stable plan. The optimal plan differs completely depending on what parameters are passed. The optimiser can tell that and it plays safe. It creates plans that will always work. That’s (one of the reasons) why in the first example it was an index scan, not an index seek.

The downside of the safe plan is that it’s highly unlikely to be a good plan and, even if it is, it won’t be good for all possible combinations of parameters.

So, how to handle this type of query? Well, there are typically two ways.

Recompile

This is only an option on SQL 2008. On 2008, if the query is specified with the OPTION (RECOMPILE) hint, then the optimiser knows it doesn’t have to worry about safe plans because the plan will never be reused. In fact, if I add that hint to the query in the first example, I get the expected index seek.
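For illustration, the first procedure with the hint added would look something like this (the procedure name here is just for the sketch):

CREATE PROCEDURE SearchHistory_Recompile
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID IS NULL)
AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
AND (Quantity = @Qty OR @Qty IS NULL)
OPTION (RECOMPILE)
GO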

Dynamic SQL

The other option is to build up the query string dynamically, based on the parameters passed, and then to use sp_executesql to run it. There are the usual downsides to dynamic SQL but it may be that the performance improvement is worth it.

CREATE PROCEDURE SearchHistory_Dynamic (@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
DECLARE @sSQL NVARCHAR(2000), @Where NVARCHAR(1000) = ''
SET @sSQL = 'SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost
FROM Production.TransactionHistory '

IF @Product IS NOT NULL
SET @Where = @Where + 'AND ProductID = @_Product '
IF @OrderID IS NOT NULL
SET @Where = @Where + 'AND ReferenceOrderID = @_OrderID '
IF @TransactionType IS NOT NULL
SET @Where = @Where + 'AND TransactionType = @_TransactionType '
IF @Qty IS NOT NULL
SET @Where = @Where + 'AND Quantity = @_Qty '

IF LEN(@Where) > 0
SET @sSQL = @sSQL + 'WHERE ' + RIGHT(@Where, LEN(@Where)-3)

EXEC sp_executesql @sSQL,
N'@_Product int, @_OrderID int, @_TransactionType char(1), @_Qty int',
@_Product = @Product, @_OrderID = @OrderID, @_TransactionType = @TransactionType, @_Qty = @Qty

GO

Note that there’s no SQL injection vulnerability in this. The parameters are never concatenated into the string and the execution is parametrised.

Now each different set of parameters gets a different cached plan, optimal for that particular set of parameters.

EXEC SearchHistory_Dynamic @Product = 978, @TransactionType = 'W'

EXEC SearchHistory_Dynamic @Qty = 100

The first gets an index seek, the second a clustered index scan (because there’s no index on Quantity). Much better than the behaviour with the earlier non-dynamic versions.

When is a seek actually a scan?

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

Let me show an example:


CREATE TABLE TestingSeeks (
id INT IDENTITY(1,1) NOT NULL,
SomeStr CHAR(6) DEFAULT '' -- a filler
)
GO

INSERT INTO TestingSeeks (SomeStr)
SELECT TOP (500000) ''
FROM sys.columns c1 CROSS JOIN sys.columns c2

We have a table here with an identity column on it, starting at 1 and incrementing by 1. Hence, there will be no negative values in the table. I’m going to then put a nonclustered index on that column (the table has no clustered index, it’s a heap).


CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected, and Statistics IO tells me that 935 pages were read.
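That test looks something like this (a sketch, with Statistics IO turned on to get the page counts):

SET STATISTICS IO ON
GO

SELECT id FROM TestingSeeks
GO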

(more…)

Identifying inaccurate statistics

I wrote previously about statistics and what they’re needed for, and I briefly mentioned what can happen when they’re inaccurate.

One question I’ve seen asked a few times is how to identify stats that are inaccurate, and what DMV to use. The bad news is that there is no DMV that identifies inaccurate statistics. The STATS_DATE function can be used to see when the stats were last updated, but that doesn’t say whether that was too long ago or not. If a table is read-only, or is changed twice a year, statistics that are four months old are perfectly valid.

The rowmodcnt column in sysindexes can give an approximate count of the number of rows that have changed since the stats were last updated, although sysindexes is deprecated and will disappear in the next version of SQL, and the rowmodcnt column is no longer completely accurate. The rowmodcnt, however, is just a count of changes. It gives no indication of whether that number is too high and causing a problem. In some cases 50% of the table may have changed before there’s a problem with queries. In other cases 5% is enough to cause problems.
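As a rough, indicative check (not a definitive one), the last update date and the modification count can be queried together, something like this:

-- Indicative only: when were the stats last updated, and roughly how many rows have changed since?
-- sysindexes is deprecated and rowmodcnt is no longer exact
SELECT OBJECT_NAME(si.id) AS TableName,
si.name AS IndexOrStatsName,
STATS_DATE(si.id, si.indid) AS StatsLastUpdated,
si.rowmodcnt AS RowsModifiedSinceUpdate
FROM sys.sysindexes si
WHERE si.id = OBJECT_ID('Production.TransactionHistory')
AND si.indid > 0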

(more…)

What are statistics and why do we need them?

Statistics are another area of SQL that seems to be fairly poorly understood: what they are, how they are used and how they are maintained.

Statistics store information about the distribution of data in the columns. The information includes the density of the key columns, which gives a measure of the uniqueness of the index, and a histogram that stores information on the distribution of values within the column.
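Both pieces of information can be viewed with DBCC SHOW_STATISTICS. For example, against the index from the catch-all post (assuming that index exists):

-- Returns the statistics header, the density vector and the histogram
DBCC SHOW_STATISTICS ('Production.TransactionHistory', 'idx_TranHistory_TranTypeProductID')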

Creating statistics

When an index is created, statistics will automatically be created on the index columns. The stats on an index cannot be dropped while the index exists. If the index is a multi-column one, the density is stored for all columns, but the histogram is only stored for the leading column.

It’s also possible to have statistics that aren’t associated with indexes. These are created automatically by the query optimiser (providing auto_create_statistics is enabled) or they can be created manually with the CREATE STATISTICS statement.

It’s not usually necessary to create statistics manually as the optimiser usually knows best what stats are required.
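If a manual one is ever needed, it’s a one-liner. A hypothetical example on the Quantity column used in the earlier posts (the statistics name is made up):

-- Manually create column statistics, sampling the entire table
CREATE STATISTICS st_TransactionHistory_Quantity
ON Production.TransactionHistory (Quantity)
WITH FULLSCAN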

Updating Statistics

SQL keeps a count of the number of changes to a column and will update the statistics when it deems sufficient changes to have occurred (providing auto_update_stats is enabled). The stats are not updated as part of a data modification. Rather, the next select that requires those statistics will trigger the update. In most cases, the auto-update doesn’t cause problems. It runs in the read-uncommitted isolation level and hence won’t cause blocking, and SQL will, by default, only sample a portion of the table to calculate the stats.
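When the automatic update isn’t enough, or its default sampling isn’t, stats can also be refreshed manually. A couple of illustrative examples:

-- Update the stats on one table with a larger sample than the default
UPDATE STATISTICS Production.TransactionHistory WITH SAMPLE 50 PERCENT

-- Or update out-of-date stats across the current database
EXEC sp_updatestats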

(more…)

Deadlock Graphs

A deadlock can appear to be a very opaque error, as the only advice offered with the error message is an instruction to ‘rerun the transaction’, and there’s no apparent way to find the cause of the deadlock.

There are in fact several ways to get information on what caused the deadlock and on what resources the deadlock occurred. This information is generally referred to as the ‘deadlock graph’.

Traceflag 1204

This is the oldest of the ways to get at the deadlock graph and it’s the main way used on SQL 2000. If traceflag 1204 is enabled server-wide, either by adding it as a startup parameter or by using the TraceOn command (DBCC TRACEON(1204,-1)), then anytime the SQL deadlock detector detects and resolves a deadlock, the information on that is written into the SQL Server error log. It looks something like this:

Deadlock encountered …. Printing deadlock information
Wait-for graph

Node:1

OBJECT: 12:629577281:0         CleanCnt:3 Mode:S Flags: 0x1
Grant List 2:
Owner:0x000000008673B900 Mode: S        Flg:0x40 Ref:2 Life:02000000 SPID:52 ECID:0 XactLockInfo: 0x0000000085F64ED0
SPID: 52 ECID: 0 Statement Type: UPDATE Line #: 11
Input Buf: Language Event: exec CauseDeadlock
Requested by:
ResType:LockOwner Stype:'OR'Xdes:0x0000000085F65740 Mode: IX SPID:53 BatchID:0 ECID:0 TaskProxy:(0x0000000085F5C538) Value:0x85b66a00 Cost:(0/0)

Node:2

OBJECT: 12:629577281:0         CleanCnt:3 Mode:S Flags: 0x1
Grant List 2:
Owner:0x00000000867FA180 Mode: S        Flg:0x40 Ref:2 Life:02000000 SPID:53 ECID:0 XactLockInfo: 0x0000000085F65780
SPID: 53 ECID: 0 Statement Type: UPDATE Line #: 11
Input Buf: Language Event: exec CauseDeadlock
Requested by:
ResType:LockOwner Stype:'OR'Xdes:0x0000000085F64E90 Mode: IX SPID:52 BatchID:0 ECID:0 TaskProxy:(0x00000000810B8538) Value:0x85b66a40 Cost:(0/0)

Victim Resource Owner:
ResType:LockOwner Stype:'OR'Xdes:0x0000000085F65740 Mode: IX SPID:53 BatchID:0 ECID:0 TaskProxy:(0x0000000085F5C538) Value:0x85b66a00 Cost:(0/0)

(more…)

Comparing date truncations

Removing the time from a datetime is a task that comes up again and again, and it sometimes seems like everyone’s got their own favorite way of doing it. I thought I’d take a run through some of the ways I’ve seen and see how they compare in terms of performance.

The methods that I’ll be comparing are:

  1. Cast to varchar and back to datetime
  2. Cast to float, floor and cast back to datetime
  3. Use dateadd and datediff to get rid of the time
  4. Cast to date and back to datetime (SQL 2008 specific)

I ran each one 4 times and averaged the results.

CREATE TABLE #DateTest (
ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
DateValue  DATETIME
)

INSERT INTO #DateTest (DateValue)
SELECT TOP 1000000
DATEADD(mi,RAND(s1.number*5000 + s2.number)*5000,'2000/01/01' )
FROM master..spt_values s1 CROSS JOIN master..spt_values s2
WHERE s1.[name] IS NULL AND s2.[name] IS NULL
GO

SET NOCOUNT ON
GO
SET STATISTICS TIME ON
GO

SELECT dateadd(dd, datediff(dd,0, DateValue),0)
FROM #DateTest
GO 4

SELECT convert(datetime,convert(varchar(10),DateValue,101))
FROM #DateTest
GO 4

SELECT CAST(FLOOR(CAST(DateValue AS FLOAT)) AS DATETIME)
FROM #DateTest
GO 4

SELECT CAST(CAST(DateValue AS DATE) AS DATETIME)
FROM #DateTest
GO 4

DROP TABLE #DateTest

And so, the winner of the date challenge:

  1. Dateadd and datediff – CPU time 348ms, execution time 8092 ms
  2. Cast to float and floor – CPU time 395ms, execution time 8270ms
  3. Cast to varchar – CPU time 1450ms, execution time 9280ms
  4. Cast to Date – CPU time 3063ms, execution time 7905ms

I was not expecting the cast to Date to take so much processing time. I think I’m going to take a look at the new date types in more detail and see how they perform.

Views or Functions?

Someone asked this question on SQLServerCentral, I thought I’d post the test here for interest.

The question was on which would execute faster: creating a view, then creating a procedure that accepted a parameter and filtered the view, or creating a table-valued function that took a parameter.
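Roughly, the forms being compared look like this (a sketch against the AdventureWorks Sales.SalesOrderHeader table; the object names are made up for illustration and this isn’t the exact test script):

-- Option 1: a view, plus a procedure that filters it
CREATE VIEW OrderSummary AS
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM Sales.SalesOrderHeader
GO

CREATE PROCEDURE GetOrdersForCustomer (@CustomerID int)
AS
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM OrderSummary
WHERE CustomerID = @CustomerID
GO

-- Option 2: an inline table-valued function
CREATE FUNCTION OrdersForCustomer_Inline (@CustomerID int)
RETURNS TABLE
AS
RETURN (
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
)
GO

-- Option 3: a multi-statement table-valued function
CREATE FUNCTION OrdersForCustomer_Multi (@CustomerID int)
RETURNS @Orders TABLE (SalesOrderID int, CustomerID int, OrderDate datetime, TotalDue money)
AS
BEGIN
INSERT INTO @Orders
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
RETURN
END
GO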

From testing, the answer is the procedure, but not by much. An inline table-valued function comes a very close second and the multi-statement table-valued function waddles in last. Not really surprising. (more…)

Parameter sniffing, pt 3

Last, but not least, here’s one final look at parameter sniffing.

In part 1 of this mini-series I wrote about data skew and the problems it can cause with parameters. In part 2, I looked at what happens when the optimiser can’t sniff values. In this post, I’m going to look at a third thing that can cause parameter sniffing problems. It’s fairly obvious once you know about it, but the first time it was pointed out to me I was a little surprised.

So, to recap. When the query optimiser gets a stored procedure to compile, it knows the values of any parameters passed to that procedure. It will then compile the procedure and optimise the queries based upon the values of those parameters. The optimiser cannot sniff the values of variables, because the values of the variables have not been set at the time that the procedure is optimised.

I’m going to use the same sample code and data as in the first article, to keep things consistent.

From the tests that were done before, I know that the query

select * from largetable where somestring = 'zzz'

executes optimally with an index seek and returns 9 rows. Likewise, I know that the query

select * from largetable where somestring = 'abc'

executes optimally with a clustered index scan and returns 1667 rows.

Now, let’s see if I can get the optimiser to make the wrong choice.

(more…)

Parameter sniffing, pt 2

A while back I wrote about parameter sniffing, the situation where SQL compiles and caches an execution plan that is appropriate for a certain value of a parameter, but is non-optimal for other values. There’s another side to parameter sniffing though – when the optimiser can’t sniff at all.

When a batch is submitted to the optimiser, the values of any parameters (from a stored procedure, auto-parametrisation or sp_executesql) are known to the optimiser. So are the values of any constants used in the SQL statement. The values of local variables, however, are not.

When the value of a variable used in the where clause is not known, the optimiser has no idea what value to use to estimate the number of affected rows. Hence, it has no idea how many rows from the table will satisfy the condition.
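To make that concrete, here’s a small sketch using the sample table from part 1 (the column’s data type is assumed here):

-- The literal is known at compile time, so the histogram can be used for the row estimate
SELECT * FROM largetable WHERE somestring = 'zzz'

-- The variable's value is not known when the plan is compiled,
-- so the optimiser has to fall back on a more generic estimate
DECLARE @SomeString char(3)
SET @SomeString = 'zzz'
SELECT * FROM largetable WHERE somestring = @SomeString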

(more…)

Parameter sniffing

This seems to come up again and again on the forums.

At its heart, parameter sniffing is the ability of the SQL Server optimiser to know the values of parameters passed to a stored proc at the point that it compiles the procedure. The idea is that if the parameter values are known, then the appropriate column statistics can be used and the optimiser can estimate the number of rows that the various query operators will have to process for various different possible execution plans.

Since the approximate number of rows is known, the cost of each possible plan can be more accurately calculated and a more accurate execution plan can be selected.

So, why is parameter sniffing so often a problem? Well, mainly, because parameter values do change.
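As a minimal sketch of the kind of scenario involved (using the sample objects from the follow-up posts; the procedure name and data type are just for illustration):

CREATE PROCEDURE SearchBySomeString (@SomeString char(3))
AS
SELECT * FROM largetable WHERE somestring = @SomeString
GO

-- First execution: the value 'zzz' is sniffed and a plan good for a handful of rows is compiled and cached
EXEC SearchBySomeString @SomeString = 'zzz'

-- Second execution: the cached plan is reused, even though 'abc' matches far more rows
-- and would be better served by a different plan
EXEC SearchBySomeString @SomeString = 'abc'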

(more…)