It’s about time I picked this series up again.
I’m not going to go into too much detail on joins. There are some very good articles elsewhere on joins. The important thing to notice about joins, in the context of an execution plan, is that there are six logical join operators and three physical join operators. The logical operators are what you ask for in the context of the query, the physical operators are what the optimiser picks to do the join.
The six logical operators are:
- Inner Join
- Outer Join
- Cross Join
- Cross Apply (new in SQL 2005)
- Anti Semi-Join
Craig Freedman wrote a long article on the logical join operators – Introduction to Joins
The semi-joins are the exception, in that they cannot be specified in a query. Nonetheless, they are present in disguise. They’re the logical operators for EXISTS, IN, NOT EXISTS and NOT IN. They’re used when matching is required, but not a complete join.
The three physical operators are what the optimiser uses to evaluate the logical join. There are various conditions that affect the physical operator the will be used for a particular join. The three operators are:
- Nested Loop Join
- Merge Join
- Hash join
The nested loop join works by looping through all the rows of one input and for each row looping through all the rows of the other input, looking for matches. The nested loop join works best when one or both of the input row sets is small. Since the input that is chosen as the second is read as many times as there are rows in the outer, this join can get very expensive as the size of the inputs increases.
For more detail on the nested loop, see Craig Freedman’s post
The merge join works by running through the two inputs, comparing rows and outputting matched rows. Both inputs must be sorted on the joining columns for this join to be possible. Since both inputs are only read once, this is an efficient join for larger row sets. This efficiency may be offset by the sorted requirement. If the join column is not indexed so as to retrieve the data already sorted, then an explicit sort is required.
For more detail on the merge join, see Craig Freedman’s post
The hash join is one of the more expensive join operations, as it requires the creation of a hash table to do the join. That said, it’s the join that’s best for large, unsorted inputs. It is the most memory-intensive of any of the joins
The hash join first reads one of the inputs and hashes the join column and puts the resulting hash and the column values into a hash table built up in memory. Then it reads all the rows in the second input, hashes those and checks the rows in the resulting hash bucket for the joining rows.
For more detail on the hash join, see Craig Freedman’s post
That’s pretty much that on joins. Next up, aggregates.