In a replicated join, the number of maps spawned should be the same as the number of splits for the first join input. With a default join, there would be additional map tasks for the second input's splits. But if you can run the replicated join without running out of memory, then the second input likely has only a handful of splits.
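For reference, a fragment-replicate join is requested with USING 'replicated' on the JOIN, and it is the last-listed (right) input that gets loaded into memory. A minimal sketch — the relation and path names here are made up:

```pig
-- The first (left) input drives the number of map tasks:
-- one map per split of 'big', after any split combination.
big   = LOAD 'big_input'   AS (id:int, val:chararray);
small = LOAD 'small_input' AS (id:int, name:chararray);

-- 'replicated' ships all of 'small' into each map task's memory,
-- so no reduce phase is needed for the join.
joined = JOIN big BY id, small BY id USING 'replicated';
```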

Questions for you:
1. Was the replicated join successful?
2. Do you have pig.splitCombination turned on? (It's on by default.)
3. What version of Pig are you using?
4. What is the size of each input to the join?
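If split combination is what collapses the first input down to a couple of splits, it can be disabled (or bounded) at the top of the script. A sketch: pig.splitCombination is the property named above; pig.maxCombinedSplitSize is an assumption on my part and may depend on your Pig version:

```pig
-- Turn split combination off entirely, so each underlying
-- input split gets its own map task:
set pig.splitCombination false;

-- Alternatively, leave it on but cap the combined split size
-- (value in bytes; 134217728 = 128 MB, illustrative only):
set pig.maxCombinedSplitSize 134217728;
```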

Thanks,
Thejas



On 5/1/12 10:58 PM, shan s wrote:
By "other steps", I mainly mean the other default joins in the script.

The point is that when I use a 'replicated' join, 2 map tasks are
scheduled. When I use the default join, 100+ map tasks are scheduled.
How do we explain this decision process?
How can I increase the actual number of maps scheduled for replicated joins?

On Mon, Apr 30, 2012 at 11:59 PM, Prashant Kommireddi<[email protected]>
wrote:

2 map tasks for the join vs. 100+ in other steps: what are the "other" steps here?

For your 2nd question, I think you are asking about the Map and Reduce Task
capacity shown on the JobTracker page? That is governed by configuration
properties set before Hadoop is started on the cluster.




On Mon, Apr 30, 2012 at 7:54 AM, shan s<[email protected]>  wrote:

Sorry for the previous incomplete message.
Here is the take 2:

When I use a replicated join, only 2 map tasks get scheduled (compared to
100+ tasks for the other steps).
What is the idea behind this? What setting do I use to override this
behaviour?


Also, a basic question:
does Hadoop decide the map task capacity, or does it simply follow the
configuration?

Map Task Capacity   Reduce Task Capacity   Avg. Tasks/Node   Blacklisted Nodes   Excluded Nodes
64                  20                     1.00

Thanks, Prashant.


