Crunch,
The Crunch User Guide states that in a join the smaller collection should go on
the left and the larger collection on the right. I assume this is for
performance reasons, which seems simple enough. I am interpreting “smaller” in
that statement to mean the PTable with the fewest KV pairs. However, a Bloom
filter join is meant to be used when the “vast majority of the keys in the
right-hand side table will not match the keys in the left-hand side table.”
This got me wondering: for this type of join, is “smaller” defined not by the
number of KV pairs but by the number of distinct keys?

I noticed that when a BloomFilterJoin is used in an MRPipeline, an M/R job is
kicked off to create the Bloom filter hashes and write them to HDFS. This job
runs over the left-hand side of the join, and of course a smaller input data
set will make it execute faster. But the output of that job would be the
smaller set of distinct keys, and it’s that set of keys that is used in the
join against the right-hand side table.
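
To illustrate why the filter reflects distinct keys rather than KV pairs, here
is a minimal sketch in plain Java. ToyBloomFilter, its double-hashing scheme,
and the sizes used are all hypothetical stand-ins for illustration; this is not
Crunch's or Hadoop's implementation. It shows only that adding the same key
many times sets the same bits as adding it once.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy Bloom filter for illustration only; NOT Crunch's implementation.
class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Builds a filter from the keys of a (hypothetical) left-hand table.
    static ToyBloomFilter fromKeys(List<String> keys, int numBits, int numHashes) {
        ToyBloomFilter filter = new ToyBloomFilter(numBits, numHashes);
        for (String key : keys) {
            filter.add(key);
        }
        return filter;
    }

    // Copy of the underlying bits, so two filters can be compared.
    BitSet snapshot() {
        return (BitSet) bits.clone();
    }

    public static void main(String[] args) {
        List<String> withDuplicates = Arrays.asList("a", "b", "a", "a", "b", "c");
        List<String> distinctOnly = Arrays.asList("a", "b", "c");
        ToyBloomFilter f1 = fromKeys(withDuplicates, 256, 3);
        ToyBloomFilter f2 = fromKeys(distinctOnly, 256, 3);
        System.out.println("same bits: " + f1.snapshot().equals(f2.snapshot()));
    }
}
```

In this sketch, duplicate entries on the left cost time while the filter is
being built, but they do not change what the filter contains.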

As an example in case it’s not clear: if PTable 1 has 1,000 entries with 10
distinct keys and PTable 2 has 100 entries with 100 distinct keys, and all the
keys in PTable 1 match keys in PTable 2, which should go on the left and which
on the right? In my real-world case, PTable 1 has millions of entries while
PTable 2 has a few thousand, but PTable 2 does have more distinct keys.
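
To make the trade-off in this example concrete, the sketch below simulates both
orientations in plain Java. It rests on some loud assumptions: an exact HashSet
stands in for the Bloom filter (so there are no false positives), the key names
k0, k1, ... are made up, entries are spread evenly across keys, and none of
this is Crunch code. For each orientation it reports how many left-hand entries
are scanned to build the filter, how many distinct keys the filter holds, and
how many right-hand entries survive the filter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: an exact set stands in for the Bloom filter.
class JoinOrientationSketch {

    // Keys of a hypothetical table: `entries` rows spread evenly over
    // `distinctKeys` keys named k0, k1, ...
    static List<String> keys(int entries, int distinctKeys) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < entries; i++) {
            out.add("k" + (i % distinctKeys));
        }
        return out;
    }

    // Returns {left entries scanned, distinct keys in filter,
    //          right entries that pass the filter}.
    static int[] simulate(List<String> left, List<String> right) {
        Set<String> filter = new HashSet<>(left);
        int surviving = 0;
        for (String key : right) {
            if (filter.contains(key)) {
                surviving++;
            }
        }
        return new int[] { left.size(), filter.size(), surviving };
    }

    public static void main(String[] args) {
        List<String> table1 = keys(1000, 10);  // PTable 1 from the example
        List<String> table2 = keys(100, 100);  // PTable 2 from the example

        int[] a = simulate(table1, table2);
        int[] b = simulate(table2, table1);
        System.out.println("PTable 1 on left: scanned=" + a[0]
            + " filterKeys=" + a[1] + " rightSurvivors=" + a[2]);
        System.out.println("PTable 2 on left: scanned=" + b[0]
            + " filterKeys=" + b[1] + " rightSurvivors=" + b[2]);
    }
}
```

Under these assumptions, putting PTable 1 on the left means scanning more
entries to build a filter with fewer keys, and far fewer right-hand entries
pass the filter; putting PTable 2 on the left reverses all three numbers. The
sketch says nothing about which orientation Crunch actually executes faster.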

Based on what I can gather, even though PTable 1 has fewer distinct keys, it
should still go on the right side of the join because of its significantly
larger size, but I wanted to verify. I am also curious what the impact of
swapping them would be. As I stated above, I assume this has performance
implications, but I’m also wondering whether it has memory implications as
well.
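
On the memory question, one rough way to reason about the filter itself is the
textbook Bloom filter sizing formula, m = ceil(-n * ln(p) / (ln 2)^2), where n
is the number of distinct keys and p is the target false-positive rate. The
sketch below evaluates it in plain Java. The class name, the 1% rate, and the
key counts are illustrative assumptions, and whether Crunch sizes its filter
exactly this way is also an assumption.

```java
// Illustration only: textbook Bloom filter sizing, not Crunch's code.
class BloomSizing {

    // Optimal number of bits for n distinct keys at false-positive rate p:
    // m = ceil(-n * ln(p) / (ln 2)^2)
    static long optimalBits(long distinctKeys, double falsePositiveRate) {
        double ln2 = Math.log(2.0);
        double bits = -distinctKeys * Math.log(falsePositiveRate) / (ln2 * ln2);
        return (long) Math.ceil(bits);
    }

    public static void main(String[] args) {
        double p = 0.01; // assumed target false-positive rate
        for (long n : new long[] { 10L, 100L, 1_000_000L }) {
            System.out.println(n + " distinct keys -> "
                + optimalBits(n, p) + " bits");
        }
    }
}
```

By this formula the filter's size grows with the number of distinct keys on the
left, not with the number of KV pairs. It does not account for any other memory
the join uses.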

Thanks,

Sean Griffin
Director, Revenue Cycle Reporting & Analytics
[email protected] | 816-201-1599
www.cerner.com