Thursday, November 3, 2011

Hash Join

SQL Server supports three physical join operators: Nested Loops, Merge and Hash.

The hash join works well for large sets of data. The hash join has two phases, the build and probe. First in the build phase, the rows are read from the first table and hashes the rows based on the join keys and creates has table in memory. The second phase, the probe phase, the hash join reads all the rows from the second table and hashes these rows based on the same join keys. The hash join then returns the matching rows.

In pseudo-code, it shall look something like the following.

for each row Row1 in the build Table
     begin
          perform hash value calc on Row1 row join key
          insert Row1 row into hash bucket
          end
          for every row Row2 in the probe table
          begin
               calc the hash value on row Row2 join key
               for every row Row1 in the hash bucket
               if row Row1 joins with row Row2
                    return (row Row1, row Row2)
          end

Below is an example of a select statement in which the optimizer should use a merge join operator.

select * from AdventureWorks.Sales.Customer c
left join AdventureWorks.Sales.SalesOrderHeader soh on c.CustomerID = soh.CustomerID


SQL Server tries to use the smaller of the two tables as the build table. SQL Server does this to try and reserve precious memory. While SQL Server attempts a best guess at guessing the amount of memory needed, if SQL has guessed to low it will spill out to tempdb to continue joining for the hash join operation.

No comments:

Post a Comment