EXISTS vs IN

This one’s come up a few times recently, so I’ll take a look at it.

The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way

IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN there can only be a single column specified in the select clause of the subquery

Let’s have a look at a quick example

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1,2,3,4 or 5

But what if there were duplicate values returned by the subquery?

Well, it actually doesn’t matter. All that SQL is looking for is what values the subquery returns to process the filter. It’s not joining the two resultsets together so it makes no difference to the results if there are duplicate values returned by the subquery.

To put it more technically, SQL’s doing a semi-join, a join that can only eliminate or qualify rows from the first table, but cannot duplicate them.

So that’s IN. What about EXISTS

Exists doesn’t check for a match, it doesn’t care in the slightest what values are been returned from the expression, it just checks for whether a row exists or not. Because of that, if there’s no predicate in the WHERE clause of the subquery that compares rows in the subquery with rows in the outer query, EXISTS will either return true for all the rows in the outer query or it will return false for all the rows in the outer query

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE EXISTS (Select IntCol FROM @SomeTable)

This will also return every single row in BigTable, because Select IntCol FROM @SomeTable returns 5 rows and hence the EXISTS predicate is always true.

Hence, to use EXISTS to do the same kind of thing as IN, there must be a correlation predicate within the subquery

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st WHERE bt.SomeColumn = st.IntCol)

Now this will behave like the IN because it’s checking for matching rows and only returning true when there is a match. This will return all the rows from BigTable where SomeColumn has values 1,2,3,4 or 5 because those are the

Exists is better for when comparisons are needed on two or more columns. For eg, this cannot be done easily with an IN

DECLARE @SomeTable (IntCol int, charCol char(1))
Insert into @SomeTable (IntCol, charCol) Values (1, 'a')
Insert into @SomeTable (IntCol, charCol) Values (2, 'a')
Insert into @SomeTable (IntCol, charCol) Values (3, 'a')
Insert into @SomeTable (IntCol, charCol) Values (4, 'b')
Insert into @SomeTable (IntCol, charCol) Values (5, 'b')

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st
WHERE bt.SomeColumn = st.IntCol AND bt.SomeOtherColumn = st.charCol)

So that covers how they work, but how do they perform in comparison with each other? To answer that question, first I need some fairly large tables.

Create Table BigTable (
id int identity primary key,
SomeColumn char(4),
Filler char(100)
)

Create Table SmallerTable (
id int identity primary key,
LookupColumn char(4),
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3955 row(s) affected)

Let’s first try without indexes and see how EXISTS and IN compare.

-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

The first thing to note is that the execution plans are identical. Two clustered index scans and a hash join (right semi-join).

The IOs are also identical.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 15, physical reads 0.

So these two queries are executed by SQL in exactly the same way. No performance differences here.

Now, let me add indexes to both tables, on that join column and see what changes.

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

With those created, I’m going to run the above two queries again. Again the execution plans of the two are identical, though the hash join and clustered index scans are gone, replaced by index scans, a stream aggregate and a merge join (inner join)

The IOs are again identical and the execution times very close.

Table ‘BigTable’. Scan count 1, logical reads 343, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

So IN and EXISTS appear to perform identically both when there are no indexes on the matching columns and when there are, and this is true regardless of whether of not there are nulls in either the subquery or in the outer table.

Next up, a look at how IN compares to Inner Join for the purposes of finding matching rows

11 Comments

  1. Brian Tkatch

    Always to review the basics. Thanx!

    Reply
  2. Saulius

    Nice. Thanks.

    Reply
  3. Ziangi

    great mail… thanks

    Reply
  4. Jonathan

    Thanks, When the value is IN another table I always use EXISTS.

    Reply
  5. Pingback: SQL Server Central

  6. Beulah Nelson

    Great series very helpful

    Reply
  7. Pingback: WHERE NOT EXISTS in the Twelve Days of SQL | Noel NOT NULL;

  8. Sarah Kholusi

    Thank you so much, you saved me before entering tomorrow’s exam & I had a great confusion in this part. Thanks again.

    Reply
  9. Pingback: Exists versus In – SQL | michaelmholt.com

  10. Richard Armstrong-Finnerty
    1. Gail (Post author)

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.