In RDBMS (say Oracle or Sybase) a hash join comes to play when you tend to join 
two tables with a large set of information (large being a relative term). The 
smaller of two tables is hashed into memory.  

 

For Hive, I have two tables one called  t with 1.7 million rows and another 
called smallt with 100 rows. Using three ways of joining the two tables on 
object_id column

 

1)    WHERE EXISTS

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t WHERE EXISTS 
(select 1 from smallt where t.object_id = smallt.object_id);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.369 seconds)

 

2)    IN

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t where t.object_id 
IN (select smallt.object_id from smallt);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.158 seconds)

 

3)    Classic Join

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t, smallt where 
t.object_id = smallt.object_id;

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (68.978 seconds)

 

 

You can see the results and judge for yourself

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf>
 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Raviv Murciano-Goroff [mailto:ravi...@gmail.com] 
Sent: 29 August 2015 00:50
To: user@hive.apache.org
Subject: Join vs. Where...In

 

Hi,

 

I often have the following situation: I have a small table with a list of 
unique IDs and a very large table of events associated with the IDs. I want to 
perform some aggregation including only events associated with IDs from the 
small table.

 

Is there a rule of thumb for whether performing a JOIN on the unique ID is 
faster or slower than using WHERE id IN (SELECT id FROM small_table...)?

 

Thank you for your advice,

Raviv

Reply via email to