Thanks everyone! I had a typo when setting auto convert to true. You can actually see it in my first email ('set' was repeated twice but there was no syntax error). With map joins enabled, my join finished in 30 minutes. Sweet!
Looks like 'true' should be the default option for auto.convert Another article explaining hive joins with pictures can be found at http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919 On Sun, Mar 20, 2011 at 6:07 AM, Jov <zhao6...@gmail.com> wrote: > 2011/3/20 Igor Tatarinov <i...@decide.com>: > > I have the following join that takes 4.5 hours (with 12 nodes) mostly > > because of a single reduce task that gets the bulk of the work: > > SELECT ... > > FROM T > > LEFT OUTER JOIN S > > ON T.timestamp = S.timestamp and T.id = S.id > > This is a 1:0/1 join so the size of the output is exactly the same as the > > size of T (500M records). S is actually very small (5K). > > I've tried: > > - switching the order of the join conditions > > - using a different hash function setting (jenkins instead of murmur) > > - using SET set hive.auto.convert.join = true; > > are you sure your query convert to mapjoin? if not,try use explicit > mapjoin hint. > > > > - using SET hive.optimize.skewjoin = true; > > but nothing helped :( > > Anything else I can try? > > Thanks! >