{"id":812,"date":"2011-01-18T16:00:18","date_gmt":"2011-01-18T14:00:18","guid":{"rendered":"https:\/\/www.sqlinthewild.co.za\/?p=812"},"modified":"2011-01-19T21:02:38","modified_gmt":"2011-01-19T19:02:38","slug":"distincting-an-in-subquery","status":"publish","type":"post","link":"https:\/\/www.sqlinthewild.co.za\/index.php\/2011\/01\/18\/distincting-an-in-subquery\/","title":{"rendered":"Distincting an IN subquery"},"content":{"rendered":"<p>This is going to be a quick one\u2026<\/p>\n<p>I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.<\/p>\n<p>Is it necessary or useful? Only one way to find out.<\/p>\n<p>Let&#8217;s check for correct results first, because that can be done with nice small tables.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">CREATE TABLE DistinctOuter (\r\nID INT\r\n);\r\n\r\nCREATE TABLE DistinctInner (\r\nID INT\r\n);\r\n\r\nINSERT INTO DistinctOuter\r\nVALUES (1), (2), (3), (4), (5), (6), (7), (8)\r\n\r\nINSERT INTO DistinctInner\r\nVALUES (1), (2), (2), (2), (2), (4), (6), (7)<\/pre>\n<p><a href=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/DistinctIN.png\"><img loading=\"lazy\" decoding=\"async\" style=\"display: inline; border-width: 0px;\" title=\"DistinctIN\" src=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/DistinctIN_thumb.png\" border=\"0\" alt=\"DistinctIN\" width=\"364\" height=\"298\" \/><\/a><\/p>\n<p><!--more--><\/p>\n<p>No difference there, results are the same. I&#8217;m not going to run the test for EXISTS because, if anyone remembers how EXISTS works (or remembers a <a href=\"https:\/\/www.sqlinthewild.co.za\/index.php\/2009\/08\/17\/exists-vs-in\/\">blog post<\/a> I wrote a while back), EXISTS doesn&#8217;t depend on what&#8217;s in the SELECT clause at all, it just looks for existence of rows, and DISTINCT cannot remove unique rows, just duplicates.<\/p>\n<p>A look at the execution plan shows why there are no duplicate values returned in the first query (the one without DISTINCT).<\/p>\n<p><a href=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/DistinctIN2.png\"><img loading=\"lazy\" decoding=\"async\" style=\"display: inline; border-width: 0px;\" title=\"DistinctIN2\" src=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/DistinctIN2_thumb.png\" border=\"0\" alt=\"DistinctIN2\" width=\"364\" height=\"164\" \/><\/a><\/p>\n<p>That&#8217;s a semi-join there, not a complete join. A semi-join is a join that just checks for matches but doesn&#8217;t return rows from the second table. Since it&#8217;s just a check for existence, duplicate rows in the inner table are not going to make any difference to the results.<\/p>\n<p>So that answers the correctness aspect, distinct is not necessary to get correct results. But does it improve performance by having it there? Or does it perhaps reduce the performance? Time for larger tables.<\/p>\n<p>Stolen from my <a href=\"https:\/\/www.sqlinthewild.co.za\/index.php\/2010\/04\/27\/in-exists-and-join-a-roundup\/\">last look<\/a> at EXISTS and IN:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">CREATE TABLE PrimaryTable_Large (\r\nid INT IDENTITY PRIMARY KEY,\r\nSomeColumn char(4) NOT NULL,\r\nFiller CHAR(100)\r\n);\r\n\r\nCREATE TABLE SecondaryTable_Large (\r\nid INT IDENTITY PRIMARY KEY,\r\nLookupColumn char(4) NOT NULL,\r\nSomeArbDate Datetime default getdate()\r\n);\r\nGO\r\n\r\nINSERT INTO PrimaryTable_Large (SomeColumn)\r\nSELECT top 1000000\r\nchar(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +\r\nchar(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))\r\nfrom msdb.sys.columns a cross join msdb.sys.columns b;\r\n\r\nINSERT INTO SecondaryTable_Large (LookupColumn)\r\nSELECT SomeColumn\r\nFROM PrimaryTable_Large TABLESAMPLE (25 PERCENT);<\/pre>\n<p>Some row counts first.<\/p>\n<ul>\n<li>Total rows in PrimaryTable_Large: 1000000<\/li>\n<li>Total rows in SecondaryTable_Large: 256335<\/li>\n<li>Total distinct values in LookupColumn in SecondaryTable_Large: 10827<\/li>\n<\/ul>\n<p>First test is without indexes on the lookup columns:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">SELECT ID, SomeColumn FROM PrimaryTable_Large\r\nWHERE SomeColumn IN (SELECT LookupColumn FROM SecondaryTable_Large)\r\n\r\nSELECT ID, SomeColumn FROM PrimaryTable_Large\r\nWHERE SomeColumn IN (SELECT DISTINCT LookupColumn FROM SecondaryTable_Large)<\/pre>\n<p>The reads are identical, which shouldn&#8217;t be a surprise as there&#8217;s no way with the current tables to run those queries without doing a full table scan.<\/p>\n<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br \/>\nTable &#8216;PrimaryTable_Large&#8217;. Scan count 1, logical reads 14548, physical reads 0.<br \/>\nTable &#8216;SecondaryTable_Large&#8217;. Scan count 1, logical reads 798, physical reads 0.<\/p><\/blockquote>\n<p>For durations and CPU, I&#8217;m going to run each 10 times and aggregate the results from the profiler T-SQL:BatchCompleted event. The results are just about identical.<\/p>\n<ul>\n<li>IN without DISTINCT: CPU 1.21 seconds, duration 12.2 seconds<\/li>\n<li>IN with DISTINCT: CPU 1.25 seconds, duration 11.9 seconds<\/li>\n<\/ul>\n<p>Furthermore, the execution plans are identical. Something interesting to notice in this case is that the join is not a semi-join, it&#8217;s a complete join and to ensure that the complete join doesn&#8217;t return duplicate rows (which would be incorrect), there&#8217;s a hash match (aggregate) right before the join that&#8217;s removing duplicate rows from the inner resultset, and that&#8217;s present in both execution plans, when the distinct is specified and when it&#8217;s not.<\/p>\n<p><a href=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/LargeINNoIndexes.png\"><img loading=\"lazy\" decoding=\"async\" style=\"display: inline; border-width: 0px;\" title=\"LargeINNoIndexes\" src=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/LargeINNoIndexes_thumb.png\" border=\"0\" alt=\"LargeINNoIndexes\" width=\"484\" height=\"266\" \/><\/a><\/p>\n<p>One last question to answer &#8211; does the presence of indexes change anything?<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">CREATE INDEX idx_Primary\r\nON dbo.PrimaryTable_Large (SomeColumn)\r\n\r\nCREATE INDEX idx_Secondary\r\nON dbo.SecondaryTable_Large (LookupColumn)<\/pre>\n<p>The execution plan has changed, in operators if not in general form. The hash join is replaced by a merge join (still a complete join, not a semi-join), the hash match (aggregate) has been replaced by a stream aggregate and the clustered index scans are now (nonclustered) index scans<\/p>\n<p><a href=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/LargeINIndexes.png\"><img loading=\"lazy\" decoding=\"async\" style=\"display: inline; border-width: 0px;\" title=\"LargeINIndexes\" src=\"https:\/\/www.sqlinthewild.co.za\/wp-content\/uploads\/2010\/12\/LargeINIndexes_thumb.png\" border=\"0\" alt=\"LargeINIndexes\" width=\"484\" height=\"278\" \/><\/a><\/p>\n<p>The reads are still identical between the two, which should be no surprise at all. As for the durations:<\/p>\n<ul>\n<li>IN without DISTINCT: CPU 0.82 seconds, duration 10.1 seconds<\/li>\n<li>IN with DISTINCT: CPU 0.79 seconds, duration 10.5 seconds<\/li>\n<\/ul>\n<p>Again so close that the small difference should be ignored.<\/p>\n<p>So in conclusion, is there any need or use for DISTINCT in the subquery for an IN predicate? By all appearances, none whatsoever. The SQL query optimiser is smart enough to ignore the specified DISTINCT if it&#8217;s not necessary (as we saw in the first example) and to add an operator to remove duplicates if it is necessary (as we saw in the 2nd and 3rd examples), regardless of whether or not there&#8217;s a DISTINCT specified in the query.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is going to be a quick one\u2026 I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results&#8230;. <a class=\"read-more-button\" href=\"https:\/\/www.sqlinthewild.co.za\/index.php\/2011\/01\/18\/distincting-an-in-subquery\/\">(Read more)<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[25,15,16],"tags":[],"class_list":["post-812","post","type-post","status-publish","format-standard","hentry","category-performance","category-sql-server","category-syndication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7h6n-d6","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/posts\/812","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/comments?post=812"}],"version-history":[{"count":0,"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/posts\/812\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/media?parent=812"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/categories?post=812"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sqlinthewild.co.za\/index.php\/wp-json\/wp\/v2\/tags?post=812"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}