Hi Yang,

You cannot really control the number of mappers directly (it depends
on the input splits), but you can certainly encourage more mappers in
various ways, such as reducing the HDFS block size or setting
pig.splitCombination to false (this *might* create more maps).
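
As a minimal sketch (property names assume classic MapReduce-era
Hadoop configs; adjust the split size to your data), these can be set
at the top of the script itself:

```pig
-- Keep Pig from combining small input splits into fewer maps
set pig.splitCombination false;

-- Ask for smaller splits (value in bytes; 64 MB here) so more maps run
set mapred.max.split.size 67108864;

raw = LOAD 'input.txt';
```

Equivalently, pass them when launching, e.g.
pig -Dpig.splitCombination=false -Dmapred.max.split.size=67108864 script.pig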

The right level of parallelism depends on how much data the 2 mappers
are handling; you would not want a lot of maps each handling too
little data. For example, if your input data set is only a few MB, it
would not be a good idea to have more than 1 or 2 maps.

Thanks,
Prashant

Sent from my iPhone

On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:

> I have a pig script  that does basically a map-only job:
>
> raw = LOAD 'input.txt' ;
>
> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>
> store processed into 'output.txt';
>
>
>
> I have many nodes in my cluster, so I want Pig to process the input
> with more mappers, but it generates only 2 part-m-xxxxx files, i.e.
> it uses 2 mappers.
>
> In a Hadoop job it's possible to pass the mapper count and
> -Dmapred.min.split.size= ; would this also work for Pig? The PARALLEL
> keyword only works for reducers.
>
>
> Thanks
> Yang
