Prashant: thanks.
by "reducing the block size", do you mean split size? ---- block size is
fixed on a Hadoop HDFS. My application is not really data heavy; each line
of input takes a long while to process. As a result, the input size is
small, but the total processing time is long, and the potential parallelism
is high.

Yang

On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi <[email protected]> wrote:
> Hi Yang,
>
> You cannot really control the number of mappers directly (depends on
> input splits), but surely can spawn more mappers in various ways, such
> as reducing the block size or setting pig.splitCombination to false
> (this *might* create more maps).
>
> Level of parallelization depends on how much data the 2 mappers are
> handling. You would not want a lot of maps handling too little data.
> For e.g., if your input data set is only a few MB it would not be a good
> idea to have more than 1 or 2 maps.
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
>
>> I have a pig script that does basically a map-only job:
>>
>> raw = LOAD 'input.txt' ;
>>
>> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>>
>> store processed into 'output.txt';
>>
>>
>> I have many nodes on my cluster, so I want PIG to process the input in
>> more mappers, but it generates only 2 part-m-xxxxx files, i.e.
>> using 2 mappers.
>>
>> In a hadoop job it's possible to pass the mapper count and
>> -Dmapred.min.split.size= ; would this also work for PIG? The PARALLEL
>> keyword only works for reducers.
>>
>> Thanks
>> Yang
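[Editor's note: the two suggestions discussed in this thread — shrinking the split size and disabling Pig's split combination — can be sketched directly in the script. This is a hedged sketch, not from the thread itself: the 1 MB split size is illustrative, the property names assume Pig 0.8+ on a Hadoop 1.x-era cluster (where `set` statements and `mapred.max.split.size` are honored), and `convert_somehow` is the poster's placeholder UDF.]

```pig
-- Sketch only: lower the maximum split size so a small input is cut
-- into more splits (roughly one mapper per split). 1 MB is an
-- illustrative value, not a recommendation.
set mapred.max.split.size 1048576;

-- Keep Pig from re-combining the resulting small splits back into a
-- handful of maps (the pig.splitCombination setting Prashant mentions).
set pig.splitCombination false;

raw = LOAD 'input.txt';
processed = FOREACH raw GENERATE convert_somehow($1, $2);
STORE processed INTO 'output.txt';
```

The same properties can also be passed on the command line instead of in the script, e.g. `pig -Dmapred.max.split.size=1048576 script.pig`, which keeps the tuning out of the script for runs where the defaults are fine.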
