Prashant:

I tried splitting the input files; yes, that worked, and multiple mappers were indeed created.
But then I would have to create a separate stage simply to split the input files, which is a bit cumbersome. It would be nice if there were some control to directly limit the map input split size.

Thanks,
Yang

On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[email protected]> wrote:

> By block size I mean the actual HDFS block size. Based on your requirement
> it seems like the input files are extremely small, and reducing the block
> size is not an option.
>
> Specifying "mapred.min.split.size" would not work for either Hadoop/Java MR
> or Pig. Hadoop picks the maximum of (minSplitSize, blockSize), so raising
> the minimum can only make splits larger, never smaller.
>
> Your job is more CPU-intensive than I/O-bound. You could split your files
> into multiple input files (equal to the number of map tasks on your
> cluster) and turn off split combination (pig.splitCombination=false),
> though this is generally a terrible MR practice!
>
> Another thing you could try is giving your map tasks more memory by
> setting "mapred.child.java.opts" to a higher value.
>
> Thanks,
> Prashant
>
> On Wed, Jan 11, 2012 at 6:27 PM, Yang <[email protected]> wrote:
>
> > Prashant:
> >
> > Thanks.
> >
> > By "reducing the block size", do you mean the split size? The block size
> > is fixed on HDFS.
> >
> > My application is not really data heavy; each line of input takes a long
> > while to process. As a result, the input size is small, but the total
> > processing time is long, and the potential parallelism is high.
> >
> > Yang
> >
> > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > <[email protected]> wrote:
> > > Hi Yang,
> > >
> > > You cannot really control the number of mappers directly (it depends
> > > on the input splits), but you can certainly spawn more mappers in
> > > various ways, such as reducing the block size or setting
> > > pig.splitCombination to false (this *might* create more maps).
> > >
> > > The level of parallelism depends on how much data the 2 mappers are
> > > handling. You would not want a lot of maps each handling too little
> > > data. For example, if your input data set is only a few MB, it would
> > > not be a good idea to have more than 1 or 2 maps.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > Sent from my iPhone
> > >
> > > On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
> > >
> > >> I have a Pig script that does basically a map-only job:
> > >>
> > >> raw = LOAD 'input.txt';
> > >>
> > >> processed = FOREACH raw GENERATE convert_somehow($1, $2...);
> > >>
> > >> STORE processed INTO 'output.txt';
> > >>
> > >> I have many nodes in my cluster, so I want Pig to process the input
> > >> with more mappers, but it generates only 2 part-m-xxxxx files, i.e.
> > >> it uses only 2 mappers.
> > >>
> > >> In a Hadoop job it is possible to pass a mapper count and
> > >> -Dmapred.min.split.size= ; would this also work for Pig? The PARALLEL
> > >> keyword only works for reducers.
> > >>
> > >> Thanks
> > >> Yang
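Putting Prashant's suggestions together, here is a minimal sketch of what the map-only script could look like with split combination disabled and a bigger map-task heap. This is an untested sketch, not a recipe from the thread: the SET statements assume a Pig version that forwards properties to Hadoop, convert_somehow is the placeholder UDF from Yang's original message, and the heap size is an arbitrary example value.

    -- Sketch only: property names are the ones discussed in this thread
    -- (Pig 0.8+ / Hadoop 1.x era configuration).

    -- Stop Pig from combining small input files into one split, so each
    -- input file can be handled by its own mapper.
    SET pig.splitCombination false;

    -- Give each map task a larger heap (example value, tune as needed).
    SET mapred.child.java.opts '-Xmx1024m';

    raw = LOAD 'input.txt';

    -- convert_somehow is the placeholder UDF from the original post.
    processed = FOREACH raw GENERATE convert_somehow($1, $2);

    STORE processed INTO 'output.txt';

The same properties can also be set outside the script, for example as -D JVM options through PIG_OPTS, which is one common way to apply the pig.splitCombination=false suggestion without editing the script.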

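Yang's splitting workaround can itself be written as a small Pig stage, using the fact noted above that PARALLEL does work for reducers: force a reduce phase so that each reducer writes its own part file, then point the real job at those parts with split combination turned off. A hedged sketch, assuming the records have a reasonably well-distributed first field to group on; 'input_parts' and the reducer count of 20 are made-up example values.

    -- Stage 1 (fan-out): force a shuffle so the output is written as
    -- multiple part files, one per reducer.
    raw = LOAD 'input.txt';

    -- Group on the first field; any reasonably distributed key works,
    -- while a skewed key would produce unevenly sized part files.
    grouped = GROUP raw BY $0 PARALLEL 20;

    -- Flatten the bags back into the original records.
    fanned = FOREACH grouped GENERATE FLATTEN(raw);

    STORE fanned INTO 'input_parts';

Stage 2 would then LOAD 'input_parts' with pig.splitCombination set to false, so each small part file gets a mapper of its own. The cost is an extra MapReduce pass over the data, which is exactly the separate stage Yang calls cumbersome above.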