http://pig.apache.org/docs/r0.9.1/cmds.html#set

"All Pig and Hadoop properties can be set, either in the Pig script or
via the Grunt command line."

On Tue, Jan 17, 2012 at 12:53 PM, Yang <[email protected]> wrote:

> Prashant:
>
> I tried splitting the input files; yes, that worked, and multiple
> mappers were indeed created.
>
> But then I would have to create a separate stage simply to split the
> input files, which is a bit cumbersome. It would be nice if there were
> some control to directly limit the map input split size.
>
> Thanks
> Yang
>
> On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi
> <[email protected]> wrote:
>
> > By block size I mean the actual HDFS block size. Based on your
> > requirement, it seems the input files are extremely small and
> > reducing the block size is not an option.
> >
> > Specifying "mapred.min.split.size" would not help in either plain
> > Hadoop/Java MR or Pig: Hadoop picks the maximum of (minSplitSize,
> > blockSize), so a minimum split size can only make splits larger,
> > never smaller.
> >
> > Your job is more CPU-intensive than I/O-bound. I can think of
> > splitting your files into multiple input files (equal to the number
> > of map tasks on your cluster) and turning off split combination
> > (pig.splitCombination=false), though this is generally terrible MR
> > practice!
> >
> > Another thing you could try is giving more memory to your map tasks
> > by increasing "mapred.child.java.opts" to a higher value.
> >
> > Thanks,
> > Prashant
> >
> > On Wed, Jan 11, 2012 at 6:27 PM, Yang <[email protected]> wrote:
> >
> > > Prashant:
> > >
> > > Thanks.
> > >
> > > By "reducing the block size", do you mean the split size? The
> > > block size is fixed on an HDFS cluster.
> > >
> > > My application is not really data-heavy; each line of input takes
> > > a long while to process. As a result, the input size is small, but
> > > the total processing time is long, and the potential parallelism
> > > is high.
> > >
> > > Yang
> > >
> > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > > <[email protected]> wrote:
> > >
> > > > Hi Yang,
> > > >
> > > > You cannot really control the number of mappers directly (it
> > > > depends on the input splits), but you can certainly spawn more
> > > > mappers in various ways, such as reducing the block size or
> > > > setting pig.splitCombination to false (this *might* create more
> > > > maps).
> > > >
> > > > The level of parallelism depends on how much data the 2 mappers
> > > > are handling. You would not want a lot of maps each handling too
> > > > little data. For example, if your input data set is only a few
> > > > MB, it would not be a good idea to have more than 1 or 2 maps.
> > > >
> > > > Thanks,
> > > > Prashant
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
> > > >
> > > > > I have a Pig script that does basically a map-only job:
> > > > >
> > > > >     raw = LOAD 'input.txt';
> > > > >     processed = FOREACH raw GENERATE convert_somehow($1, $2...);
> > > > >     STORE processed INTO 'output.txt';
> > > > >
> > > > > I have many nodes in my cluster, so I want Pig to process the
> > > > > input with more mappers, but it generates only 2 part-m-xxxxx
> > > > > files, i.e. it uses only 2 mappers.
> > > > >
> > > > > With a plain Hadoop job it is possible to pass a mapper count
> > > > > and -Dmapred.min.split.size=; would this also work for Pig?
> > > > > The PARALLEL keyword only works for reducers.
> > > > >
> > > > > Thanks
> > > > > Yang
