T-SQL

IN vs INNER JOIN

Often in forum threads discussing query performance I'll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I've previously looked at how IN and EXISTS compare; now I'm going to investigate how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join: it checks for matches and returns rows from both sides. This means that if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)


DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn = s.IntCol

This returns 7 rows, with columns from both tables. Because some of the values in @SomeTable are duplicated (2 and 5), the matching rows from @BigTable are returned twice for those values.

With an IN, what is done is a semi-join: a join that checks for matches but does not return rows from the second table. This means that if there are multiple matching rows in the resultset used for the IN, it doesn't matter; only one row from the first table will be returned. Also, because the rows from the second table are not returned, columns from the table referenced in the IN cannot be returned. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)

DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows, and only columns from @BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)
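
As a rough sketch of what such a test needs (this isn't the post's actual test harness; the table names and sizes are invented), the second table must hold unique values and only columns from the first table may be selected, so that the two forms return identical results:

-- Hypothetical larger test tables
CREATE TABLE dbo.BigTestTable (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn INT,
 Filler CHAR(100)
)

CREATE TABLE dbo.LookupValues (
 IntCol INT PRIMARY KEY -- unique values, so no duplicate matches
)

-- ... populate BigTestTable with a few hundred thousand rows and
-- LookupValues with a subset of the distinct SomeColumn values ...

-- The two forms whose performance is being compared
SELECT b.id, b.SomeColumn, b.Filler
 FROM dbo.BigTestTable b INNER JOIN dbo.LookupValues l ON b.SomeColumn = l.IntCol

SELECT b.id, b.SomeColumn, b.Filler
 FROM dbo.BigTestTable b
 WHERE b.SomeColumn IN (Select IntCol FROM dbo.LookupValues)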

EXISTS vs IN

This one’s come up a few times recently, so I’ll take a look at it.

The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in slightly different ways.

IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN, there can only be a single column specified in the select clause of the subquery.

Let’s have a look at a quick example

DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1, 2, 3, 4 or 5.
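
For comparison, here's a sketch of the same query written with EXISTS (the post looks at EXISTS in more detail further on); the match is done through a correlated subquery rather than against a column:

SELECT *
FROM BigTable b
WHERE EXISTS (Select 1 FROM @SomeTable s WHERE s.IntCol = b.SomeColumn)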

But what if there were duplicate values returned by the subquery?

(more…)

Functions, IO statistics and the Execution plan

It’s no secret that I’m not overly fond of most user-defined functions. This isn’t just a pet hate, I have some good reasons for disliking them. All too often they’re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they’re hidden performance bottlenecks that often go overlooked and ignored for too long.

I’m going to start with this fairly simple scalar function, created in the AdventureWorks database

Create function LineItemTotal(@ProductID int)
returns money
as
begin
declare @Total money

select @Total = sum(LineTotal) from sales.SalesOrderDetail where productid = @ProductID

return @Total
end

So, given that function, the following two queries should be equivalent.

SELECT productid, productnumber, dbo.LineItemTotal(productid) as SumTotal
FROM Production.Product p

SELECT productid, productnumber,
(select sum(LineTotal) from sales.SalesOrderDetail where productid = p.productid) AS SumTotal
FROM Production.Product p

No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let’s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.
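
For anyone following along, those are session-level settings, switched on like this:

SET STATISTICS IO ON
SET STATISTICS TIME ON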

Query 1, the one with the scalar function:

Table 'Product'. Scan count 1, logical reads 4, physical reads 0.

SQL Server Execution Times:
CPU time = 47297 ms,  elapsed time = 47541 ms.

Query 2, the one with the correlated subquery:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0.
Table 'SalesOrderDetail'. Scan count 3, logical reads 22536, physical reads 0.
Table 'Product'. Scan count 3, logical reads 40, physical reads 0.

SQL Server Execution Times:
CPU time = 1047 ms, elapsed time = 1249 ms.

(more…)

On Counts

Or “What’s the fastest way to count the rows?”

It's fairly common to need to know the number of rows in a table, the number of rows that match a certain condition, or whether or not there are any rows that match a condition. There are also a number of ways of doing so, some better than others. The problem is that counting is not a cheap operation, especially on big tables. It's not as bad as a sort, but it can still be expensive.

So, given that, let’s take a look at some of the ways.
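
For reference, the form most people reach for first is a plain COUNT, either over the whole table or with a filter (the table and column names below are just placeholders):

SELECT COUNT(*) FROM dbo.TheTable

SELECT COUNT(*) FROM dbo.TheTable WHERE SomeColumn = 42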

Querying the metadata

If all that’s needed is the number of rows in the table, and it’s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it’s been moved into sys.partitions.

SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id in (0,1)
AND object_id = OBJECT_ID('TableName')
GROUP BY object_id

The advantage of this approach is that it is fast. Since it’s not actually counting anything and, in fact, isn’t even accessing the table that’s being counted, it’s the fastest way to get the count of rows in the table.

The disadvantage is it can only get the number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending on how and when the table's rowcount metadata is updated by the SQL engine. (more…)

Dynamic SQL and SQL injection

When I wrote about the catch-all queries, someone asked why the dynamic SQL that I offered wasn’t vulnerable to SQL injection. I thought I’d go into the whys and the wherefores of that in a little bit more detail.

I’m just going to look at SQL injection from the aspect of dynamic SQL. The front-end code aspect has been dealt with hundreds of times, most recently here – http://www.simple-talk.com/community/blogs/philfactor/archive/2009/03/30/72651.aspx

The most important thing to realise with SQL Injection (and with all other forms of command injection) is that it requires that a user-inputted string be incorporated as part of a command that's going to be executed. Not as part of a parameter value, but as part of the command itself.

Let me show you what I mean.

DECLARE @sSQL varchar(500)
SET @sSQL = 'SELECT * FROM sys.objects'

EXECUTE (@sSQL)

In this exceedingly simple example, there's no possibility of SQL injection. There's no user-inputted string that can become part of the command. Let's look at two slightly more complex examples.

Example 1:


DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input
DECLARE @sSQL varchar(500)

SET @sSQL = 'SELECT * FROM '

IF @inputParam = 'Table1'
SET @sSQL = @sSQL + 'Table1'
IF @inputParam = 'Table2'
SET @sSQL = @sSQL + 'Table2'
IF @inputParam = 'Table3'
SET @sSQL = @sSQL + 'Table3'
IF @inputParam = 'Table4'
SET @sSQL = @sSQL + 'Table4'

EXECUTE (@sSQL)

Example 2:


DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input
DECLARE @sSQL varchar(500)

SET @sSQL = 'SELECT * FROM ' + @inputParam

EXECUTE (@sSQL)
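
To see why Example 2 is the dangerous one, consider a hypothetical malicious input (the table name is invented purely for illustration):

DECLARE @inputParam VARCHAR(100)
SET @inputParam = 'sys.objects; DROP TABLE ImportantTable; --'

DECLARE @sSQL varchar(500)
SET @sSQL = 'SELECT * FROM ' + @inputParam

-- @sSQL now contains: SELECT * FROM sys.objects; DROP TABLE ImportantTable; --
-- EXECUTE (@sSQL) would happily run both statements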

(more…)

Catch-all queries

13 March 2018: There is an updated post on this subject – https://www.sqlinthewild.co.za/index.php/2018/03/13/revisiting-catch-all-queries/

The query form that I refer to as 'Catch-all' typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL)

Now this does work; the problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I'm going to take a look at why that is the case and what alternatives there are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based off a table in the AdventureWorks database.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.
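
The post doesn't show the index definition; a sketch of what it might look like, with the key column order taken from the index name:

CREATE NONCLUSTERED INDEX idx_TranHistory_TranTypeProductID
ON Production.TransactionHistory (TransactionType, ProductID)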

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of 980000 that are in the table, so we’d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup.

Nope. It’s using that index all right, but it’s doing a scan, not a seek. Ok, not great, but not bad. Let me try a different set of parameters

EXEC SearchHistory @Qty = 100

The plan's exactly the same. No surprise: it was cached the first time and then reused. There's a problem here though: the index that's used is completely inappropriate and there's a bookmark lookup that ran almost a million times. No wonder this execution took 3 seconds and 2,949,715 IOs to return 29 rows.

Ok, so let me try a different form of the catch-all query

CREATE PROCEDURE SearchHistory_Improved
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END)
AND (ReferenceOrderID = CASE WHEN @OrderID IS NULL THEN ReferenceOrderID ELSE @OrderID END)
AND (TransactionType = CASE WHEN @TransactionType IS NULL THEN TransactionType ELSE @TransactionType END)
AND (Quantity = CASE WHEN @Qty IS NULL THEN Quantity ELSE @Qty END)
GO

Let’s see what that does for the first test:

EXEC SearchHistory_Improved @Product = 978, @TransactionType = 'W'

Well that’s no better. Full blown table scan.

The problem with these types of queries is that there is no stable plan. The optimal plan differs completely depending on what parameters are passed. The optimiser can tell that, so it plays safe and creates plans that will always work. That's (one of the reasons) why in the first example it was an index scan, not an index seek.

The downside of the safe plan is that it's highly unlikely to be a good plan and, even if it is, it won't be good for all possible combinations of parameters.

So, how to handle this type of query? Well, there are typically two ways.

Recompile

This is only an option on SQL 2008. On 2008, if the query is specified with the OPTION (RECOMPILE) hint, then the optimiser knows it doesn’t have to worry about safe plans because the plan will never be reused. In fact, if I add that hint to the query in the first example, I get the expected index seek.
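
As a sketch, the first form of the procedure with that hint added would look something like this (the procedure name is mine, not from the original):

CREATE PROCEDURE SearchHistory_Recompile
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
OPTION (RECOMPILE)
GO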

Dynamic SQL

The other option is to build up the query string dynamically, based on the parameters passed, and then to use sp_executesql to run it. There are the usual downsides to dynamic SQL, but it may be that the performance improvement is worth it.

CREATE PROCEDURE SearchHistory_Dynamic (@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
DECLARE @sSQL NVARCHAR(2000), @Where NVARCHAR(1000) = ''
SET @sSQL = 'SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost
from Production.TransactionHistory '

IF @Product is not null
SET @Where = @Where + 'AND ProductID = @_Product '
IF @OrderID is not null
SET @Where = @Where + 'AND ReferenceOrderID = @_OrderID '
IF @TransactionType IS NOT NULL
SET @Where = @Where + 'AND TransactionType = @_TransactionType '
IF @Qty IS NOT NULL
SET @Where = @Where + 'AND Quantity = @_Qty '

IF LEN(@Where) > 0
SET @sSQL = @sSQL + 'WHERE ' + RIGHT(@Where, LEN(@Where)-3)

EXEC sp_executesql @sSQL,
N'@_Product int, @_OrderID int, @_TransactionType char(1), @_Qty int',
@_Product = @Product, @_OrderID = @OrderID, @_TransactionType = @TransactionType, @_Qty = @Qty

GO

Note that there’s no SQL injection vulnerability in this. The parameters are never concatenated into the string and the execution is parametrised.

Now each different set of parameters gets a different cached plan, optimal for that particular set of parameters.

EXEC SearchHistory_Dynamic @Product = 978, @TransactionType = 'W'

EXEC SearchHistory_Dynamic @Qty = 100

The first gets an index seek, the second a clustered index scan (because there’s no index on Quantity). Much better than the behaviour with the earlier non-dynamic versions.

On the OUTPUT of a data modification

or “Who needs a trigger anyway?”

The output clause was, I think, one of those wonderful features of SQL 2005 that very few people used, myself included. Now in 2008, it’s even better, but still doesn’t appear to be widely used.

The output clause can be used to get, as a resultset, data from the inserted and deleted tables that are usually only visible in a trigger. As a very simple example:

Create Table #Testing (
id int identity,
somedate datetime default getdate()
)

insert into #Testing
output inserted.*
default values 

Neat. We can get back the inserted values as a result set. We can also insert them into a table variable for later processing. Using the same temp table

declare @OutputTable TABLE (id int, somedate datetime)

insert into #Testing
output inserted.* into @OutputTable
default values

select * from @OutputTable 

Very neat. Now how about a practical example? Say we have the following three tables in a database.

Create Table ParentTable (
ID int identity primary key,
ParentDescription varchar(50),
CreationDate DATETIME DEFAULT GETDATE()
)
GO
CREATE TABLE ChildTable (
ID Int identity Primary Key,
ParentID int not null constraint fk_parent foreign key references ParentTable(ID),
Somedescription varchar(20),
SomeValue Money
)
GO

Create Table AuditTable (
AuditID int identity primary key,
ChildID int,
SomeValue Money,
InsertDate DATETIME DEFAULT GETDATE(),
OriginatingLogin VARCHAR(50) DEFAULT ORIGINAL_LOGIN()
)

We get a set of data (perhaps in a temp table, perhaps in an XML document) that needs to be inserted into those tables. The source data will have multiple parent rows, each with multiple child rows. Those need to be inserted into the appropriate tables and the foreign keys have to be assigned correctly. In addition, the ID of each child row, along with its value and the current date, must be written into the audit table, together with the login name of the current user.
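
As a minimal sketch of just the audit part (assuming the child rows have already been staged, with their parent IDs resolved, in a hypothetical #ChildStaging table), the OUTPUT clause can write the audit rows as part of the insert itself:

-- #ChildStaging is hypothetical: ParentID already resolved, plus the child columns
INSERT INTO ChildTable (ParentID, SomeDescription, SomeValue)
OUTPUT inserted.ID, inserted.SomeValue INTO AuditTable (ChildID, SomeValue)
SELECT ParentID, SomeDescription, SomeValue
FROM #ChildStaging
-- InsertDate and OriginatingLogin are filled in by the audit table's defaults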

(more…)

Common T-SQL mistakes

I have the pleasure at the moment of doing a code review on some vendor code. No names will be mentioned. I’ve seen better. I’ve seen a lot better. I’m seeing very common mistakes in the code, so, in the interests of my sanity, I’m going to go over a couple of common T-SQL mistakes in the hopes that the next batch of code I get to review doesn’t have these mistakes in…

1. Error Handling

Proper error handling is hard. SQL 2005 has made it a lot easier with TRY…CATCH blocks, but it still means that everything that can throw an error must be wrapped inside a TRY block, with an appropriate CATCH block to handle any errors.
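
A minimal sketch of that pattern (the table and column names are placeholders):

BEGIN TRY
 INSERT INTO SomeTable (SomeColumn) VALUES (1)
 UPDATE SomeTable SET SomeColumn = 2 WHERE SomeColumn = 1
 DELETE FROM SomeOtherTable WHERE SomeColumn = 2
END TRY
BEGIN CATCH
 -- ERROR_MESSAGE(), ERROR_NUMBER() and friends are only valid inside the CATCH block
 PRINT 'An error occurred: ' + ERROR_MESSAGE()
END CATCH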

It was a lot harder on SQL 2000 when all we had to work with was @@Error. What I think was not well understood was what statements set and reset @@Error, and how long a non-zero value persists, leading to code constructs like this

Insert into SomeTable ...
Update SomeTable SET ...
Delete From SomeOtherTable ...

IF @@Error !=0
Print 'An error occurred'

(more…)

Comparisons with NULL

Or ‘True, False or FileNotFound’ (1)

Null seems to confuse a lot of people. The main issue with null is that it is not a value. It's the absence of a value, a placeholder for unknown data. Hence, when it is used in a comparison, it returns results that look surprising. Take this as a quick example

DECLARE @Var1 INT
SET @Var1 = NULL

SELECT
CASE WHEN @Var1 = NULL THEN 'Var1 = Null' ELSE 'Var1 <> NULL' END AS EqualityTest,
CASE WHEN @Var1 <> NULL THEN 'Var1 <> Null' ELSE 'Var1 = NULL' END AS InequalityTest

The results look a little strange. They’re supposed to. (No, I didn’t mean IS NULL/IS NOT NULL)

(more…)

Temp tables and table variables

I’ve encountered a fair bit of confusion on various forums as to the differences between temporary tables and table variables. As a quick article (I’m knee-deep in some AI stuff at the moment) I thought I’d quickly go over some points on temp tables and table variables.

Temporary Tables

  • Created using the Create table syntax, preceding the table name with a '#' for a local temp table and '##' for a global temp table
  • Allocated storage space within the TempDB database and entered into the TempDB system tables
  • The table's actual name is the name it was created with, plus a large number of underscores and a hash value, to prevent object name collisions if two connections create a temp table with the same name (see the sketch after this list)
  • Can have a primary key, defaults, constraints and indexes (however the names of these are not hashed, possibly leading to duplicate object errors for constraints and defaults)
  • May not have triggers.
  • Foreign keys are permitted, but are not enforced
  • Have column statistics kept on them. The algorithm for determining when to update is different to permanent tables
  • Exist until they are dropped, or the connection closes.
  • Are visible in any child procedures called from the one where the table was created. Are not visible to parent procedures
  • Are not persisted to disk unless there is memory pressure, or the table is too large to fit in the data cache
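
A quick sketch illustrating the name-hashing point above:

-- Create a local temp table and look at its actual name in TempDB
CREATE TABLE #NameDemo (id INT PRIMARY KEY)

SELECT name FROM tempdb.sys.tables WHERE name LIKE '#NameDemo%'
-- The name comes back padded with underscores and ending in a hex value

DROP TABLE #NameDemo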

(more…)