Hello,
I have a cluster with 11 nodes; each of them has 16 GB RAM, a 6-core CPU and
a 1 TB HDD. I am using the Cloudera distribution CDH4b with Pig. I have two Pig
join queries, a parallel and a replicated version of the Pig join, as well as
MapReduce reduce-side and map-side joins.

In theory the replicated join should be faster than the parallel join, but in
my case the parallel join is faster.
I have two questions:

1. Why is the replicated join so slow? How does it work, and what runs
behind a replicated join?
2. The MR reduce-side join was faster than the parallel Pig join. How is the
parallel Pig join implemented underneath? I guess Pig also implements it as an
MR reduce-side join.

Could you explain how the Pig joins work and what runs behind the Pig
scripts?
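
To make clear what I mean by the two strategies, here is my understanding as a rough sketch in plain Python. It is only an illustration, not Pig's or Hadoop's actual code; the function names and the (key, value) record layout are made up for this example. As far as I understand, a replicated join keeps the small relation in memory on the map side and streams the large one past it, while a reduce-side join groups both inputs by key first and combines them afterwards.

```python
from collections import defaultdict


def replicated_join(big, small):
    """Map-side idea: hold the small relation in memory, stream the big one."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    for key, value in big:
        # Probe the in-memory table; keys with no match produce no output.
        for other in table.get(key, []):
            yield (key, value, other)


def reduce_side_join(left, right):
    """Reduce-side idea: group both inputs by key (the shuffle), then combine."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # Each key's group corresponds to one reduce call in this sketch.
    for key in sorted(groups):
        lefts, rights = groups[key]
        for lv in lefts:
            for rv in rights:
                yield (key, lv, rv)
```

Both functions produce the same joined rows (possibly in a different order); the difference is only where the matching happens.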


Data set (size)              Replicated Join  Replicated Join  MR Reduce side  MR Joins
                             in HDFS          in Hbase         join            (Singleton pattern)
obr_wp_annotation (1786MB)   29 sec           50 sec           36 sec          19
obr_ct_annotation (5916MB)   799 sec          523 sec          108 sec         69
obr_pm_annotation (16983MB)  1794 sec         707 sec          248 sec         138

The relation file is 659 MB.

Thank you very much,

Byambajargal
