Prashant: thanks.
by "reducing the block size", do you mean split size? ---- block size is
fixed on a Hadoop HDFS. My application is not really data heavy; each line
of input takes a long while to process. As a result, the input size is
small, but the total processing time is long, and the potential parallelism
is high.

Yang

On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi <[email protected]> wrote:
> Hi Yang,
>
> You cannot really control the number of mappers directly (depends on
> input splits), but surely can spawn more mappers in various ways, such
> as reducing the block size or setting pig.splitCombination to false
> (this *might* create more maps).
>
> Level of parallelization depends on how much data the 2 mappers are
> handling. You would not want a lot of maps handling too little data.
> For e.g., if your input data set is only a few MB it would not be a good
> idea to have more than 1 or 2 maps.
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
>
>> I have a pig script that does basically a map-only job:
>>
>> raw = LOAD 'input.txt' ;
>>
>> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>>
>> store processed into 'output.txt';
>>
>>
>> I have many nodes on my cluster, so I want PIG to process the input in
>> more mappers, but it generates only 2 part-m-xxxxx files, i.e.
>> using 2 mappers.
>>
>> In a hadoop job it's possible to pass the mapper count and
>> -Dmapred.min.split.size= ; would this also work for PIG? The PARALLEL
>> keyword only works for reducers.
>>
>> Thanks
>> Yang
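[Editor's note: the two suggestions discussed in this thread — shrinking the split size and disabling Pig's split combination — can be sketched directly in the script. This is a hedged sketch, not from the thread itself: the 1 MB split size is illustrative, the property names assume Pig 0.8+ on a Hadoop 1.x-era cluster (where `set` statements and `mapred.max.split.size` are honored), and `convert_somehow` is the poster's placeholder UDF.]

```pig
-- Sketch only: lower the maximum split size so a small input is cut
-- into more splits (roughly one mapper per split). 1 MB is an
-- illustrative value, not a recommendation.
set mapred.max.split.size 1048576;

-- Keep Pig from re-combining the resulting small splits back into a
-- handful of maps (the pig.splitCombination setting Prashant mentions).
set pig.splitCombination false;

raw = LOAD 'input.txt';
processed = FOREACH raw GENERATE convert_somehow($1, $2);
STORE processed INTO 'output.txt';
```

The same properties can also be passed on the command line instead of in the script, e.g. `pig -Dmapred.max.split.size=1048576 script.pig`, which keeps the tuning out of the script for runs where the defaults are fine.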
