EXISTS vs IN

This one’s come up a few times recently, so I’ll take a look at it.

The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way

IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN there can only be a single column specified in the select clause of the subquery

Let’s have a look at a quick example

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1,2,3,4 or 5

But what if there were duplicate values returned by the subquery?

Well, it actually doesn’t matter. All that SQL is looking for is what values the subquery returns to process the filter. It’s not joining the two resultsets together so it makes no difference to the results if there are duplicate values returned by the subquery.

To put it more technically, SQL’s doing a semi-join, a join that can only eliminate or qualify rows from the first table, but cannot duplicate them.

So that’s IN. What about EXISTS

Exists doesn’t check for a match, it doesn’t care in the slightest what values are been returned from the expression, it just checks for whether a row exists or not. Because of that, if there’s no predicate in the WHERE clause of the subquery that compares rows in the subquery with rows in the outer query, EXISTS will either return true for all the rows in the outer query or it will return false for all the rows in the outer query

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE EXISTS (Select IntCol FROM @SomeTable)

This will also return every single row in BigTable, because Select IntCol FROM @SomeTable returns 5 rows and hence the EXISTS predicate is always true.

Hence, to use EXISTS to do the same kind of thing as IN, there must be a correlation predicate within the subquery

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st WHERE bt.SomeColumn = st.IntCol)

Now this will behave like the IN because it’s checking for matching rows and only returning true when there is a match. This will return all the rows from BigTable where SomeColumn has values 1,2,3,4 or 5 because those are the

Exists is better for when comparisons are needed on two or more columns. For eg, this cannot be done easily with an IN

DECLARE @SomeTable (IntCol int, charCol char(1))
Insert into @SomeTable (IntCol, charCol) Values (1, 'a')
Insert into @SomeTable (IntCol, charCol) Values (2, 'a')
Insert into @SomeTable (IntCol, charCol) Values (3, 'a')
Insert into @SomeTable (IntCol, charCol) Values (4, 'b')
Insert into @SomeTable (IntCol, charCol) Values (5, 'b')

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st
WHERE bt.SomeColumn = st.IntCol AND bt.SomeOtherColumn = st.charCol)

So that covers how they work, but how do they perform in comparison with each other? To answer that question, first I need some fairly large tables.

Create Table BigTable (
id int identity primary key,
SomeColumn char(4),
Filler char(100)
)

Create Table SmallerTable (
id int identity primary key,
LookupColumn char(4),
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3955 row(s) affected)

Let’s first try without indexes and see how EXISTS and IN compare.

-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

The first thing to note is that the execution plans are identical. Two clustered index scans and a hash join (right semi-join).

The IOs are also identical.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 15, physical reads 0.

So these two queries are executed by SQL in exactly the same way. No performance differences here.

Now, let me add indexes to both tables, on that join column and see what changes.

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

With those created, I’m going to run the above two queries again. Again the execution plans of the two are identical, though the hash join and clustered index scans are gone, replaced by index scans, a stream aggregate and a merge join (inner join)

The IOs are again identical and the execution times very close.

Table ‘BigTable’. Scan count 1, logical reads 343, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

So IN and EXISTS appear to perform identically both when there are no indexes on the matching columns and when there are, and this is true regardless of whether of not there are nulls in either the subquery or in the outer table.

Next up, a look at how IN compares to Inner Join for the purposes of finding matching rows

11 Comments

Brian Tkatch 22 September 2009 at 16:05

Always to review the basics. Thanx!

Reply ↓
Saulius 30 October 2009 at 16:14

Nice. Thanks.

Reply ↓
Ziangi 29 January 2010 at 09:33

great mail… thanks

Reply ↓
Jonathan 25 March 2010 at 08:22

Thanks, When the value is IN another table I always use EXISTS.

Reply ↓
Pingback: SQL Server Central
Beulah Nelson 29 September 2010 at 02:06

Great series very helpful

Reply ↓
Pingback: WHERE NOT EXISTS in the Twelve Days of SQL | Noel NOT NULL;
Sarah Kholusi 1 May 2012 at 16:36

Thank you so much, you saved me before entering tomorrow’s exam & I had a great confusion in this part. Thanks again.

Reply ↓
Pingback: Exists versus In – SQL | michaelmholt.com
Richard Armstrong-Finnerty 30 December 2016 at 15:48

Worth remembering the potential issue with NULLs:
http://www.tech-recipes.com/rx/57894/sql-server-three-valued-logic-not-clause-null-values/

Reply ↓
1. Gail (Post author)31 December 2016 at 12:32
  
  Which I discussed in the post on NOT IN: https://www.sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
  
  Reply ↓

EXISTS vs IN

Share this:

11 Comments

Leave a Comment Cancel reply