Hi, When I want to do bucket map join between tables A and B (A has small buckets), which should load matching buckets of A into each mapper of B, the things I do is (1) bucket A and B on the join keys with equal number of buckets, (2) set hive.optimize.bucketmapjoin = true; (3) run the following query:
SELECT /*+ MAPJOIN(a) */ count(*) FROM a JOIN B ON a.join_key = b.join_key; Question 1: Is my sequence of actions enough for triggering bucket map join? After running the above query, what I observe is that a local task starts hashing A, and creating multiple hashed buckets and then loads it into mapper of B, and starts a job which doesn't have any reducers. Question 2: Why is it hashing it in the local task and not in the mapper stage? This seems to take a lot of time, which can be avoided. I mean why it doesn't upload the row bucket instead of hashed bucket, and do the hashing in the mapper (and parallelize the hashing, instead of doing it sequentialy)? If I have tables A and B sorted and bucketed by join keys and have the same number of buckets and setting the 3 flags specified in the manual<http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html>, and again running the above query, I see that this time it doesn't do the hashing, but it does the join in just 1 mapper. Question 3: Why is it doing it in one mapper? Joins between different buckets seem to be independent, so why it doesn't doing it in multiple mappers? Thanks