SQL Server Today: Physical Join Operator

Showing posts with label Physical Join Operator. Show all posts

Thursday, November 3, 2011

Hash Join

SQL Server supports three physical join operators: Nested Loops, Merge and Hash.

The hash join works well for large sets of data. The hash join has two phases, the build and probe. First in the build phase, the rows are read from the first table and hashes the rows based on the join keys and creates has table in memory. The second phase, the probe phase, the hash join reads all the rows from the second table and hashes these rows based on the same join keys. The hash join then returns the matching rows.

In pseudo-code, it shall look something like the following.

for each row Row1 in the build Table
     begin
          perform hash value calc on Row1 row join key
          insert Row1 row into hash bucket
          end
          for every row Row2 in the probe table
          begin
               calc the hash value on row Row2 join key
               for every row Row1 in the hash bucket
               if row Row1 joins with row Row2
                    return (row Row1, row Row2)
          end

Below is an example of a select statement in which the optimizer should use a merge join operator.

select * from AdventureWorks.Sales.Customer c
left join AdventureWorks.Sales.SalesOrderHeader soh on c.CustomerID = soh.CustomerID

SQL Server tries to use the smaller of the two tables as the build table. SQL Server does this to try and reserve precious memory. While SQL Server attempts a best guess at guessing the amount of memory needed, if SQL has guessed to low it will spill out to tempdb to continue joining for the hash join operation.

Wednesday, November 2, 2011

Merge Join

SQL Server supports three physical join operators: Nested Loops, Merge and Hash.

The merge join requires that each table be sorted on the join keys. The merge join works by simultaneously reading and comparing the two sorted tables one row at a time. At each row, the merge join compares the next row from each table. If the rows match, the merge join outputs the joined row and continues on. If the rows do not match, the merge join discards the lesser of the two rows from the tables and continues. Because the tables are sorted, the merge join knows that it is discarding a row that is less than any remaining rows in either table.

In pseudo-code, it shall look something like the following.

get first row Row1 from Table1
get first row Row2 from Table2
while not at the end of either Table
     begin
          if Row1 joins with Row2
          begin
                    return (Row1, Row2)
                    get next row Row2 from Table2
          end
          else if Row1 < Row2
               get next row Row1 from Table1
          else
               get next row Row2 from Table2
          end

Below is an example of a select statement in which the optimizer should use a merge join operator.

select * from AdventureWorks.Person.Contact e left join AdventureWorks.HumanResources.Employee c on c.ContactID = e.ContactID

Performance of the merge join is associated to the number of rows in each table. A merge join more than likely is a better choice for larger inputs compared to a nested loops join. Each table in the merge join is read only once.

Tuesday, November 1, 2011

Nested Loops Join

SQL Server supports three physical join operators: Nested Loops, Merge and Hash.

The nested loops join takes each row from the outer table and compares it to each row in the inner table and returns rows that are satisfied by the join predicate.

In pseudo-code, it shall look something like the following.

for each row Row1 in the outer table Table1
     for each row Row2 in the inner table Table2
          if Row1 joins with Row2 (predicate)
               return (Row1, Row2)

Below is an example of a select statement in which the optimizer should use a nested loops join operator.

select * from AdventureWorks.HumanResources.Employee e
inner join AdventureWorks.Person.Contact c on e.ContactID = c.ContactID

The nested loops join performance is directly related to the number of rows in the outer table times the number of rows in the inner table. As the number of rows increase in the inner table, the cost grows quickly.

The nested loops joins is said to work best for small data sets and or where there is an index on the inner table.

Pages