ok, I see: I was using Pig 0.5; tried 0.9 and it works now.
thanks!

On Tue, Jan 17, 2012 at 1:20 PM, Yang <[email protected]> wrote:
> weird
>
> I tried
>
> # head a.pg
>
> set job.name 'blah';
> SET mapred.map.tasks.speculative.execution false;
> set mapred.min.split.size 10000;
>
> set mapred.tasktracker.map.tasks.maximum 10000;
>
>
> [root@]# pig a.pg
> 2012-01-17 16:19:18,407 [main] INFO org.apache.pig.Main - Logging error
> messages to: /mnt/pig_1326835158407.log
> 2012-01-17 16:19:18,564 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at:
> hdfs://ec2-107-22-118-169.compute-1.amazonaws.com:8020/
> 2012-01-17 16:19:18,749 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> ec2-107-22-118-169.compute-1.amazonaws.com:8021
> 2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
> Details at logfile: /mnt/pig_1326835158407.log
>
>
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
>
> org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set
> key: mapred.map.tasks.speculative.execution
>         at org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:397)
>
> ================================================================================
>
>
> so the job.name param is accepted, but the next one, mapred.map......, was
> unrecognized, even though it is the one I pasted from the docs page.
>
>
> On Tue, Jan 17, 2012 at 1:15 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> http://pig.apache.org/docs/r0.9.1/cmds.html#set
>>
>> "All Pig and Hadoop properties can be set, either in the Pig script or via
>> the Grunt command line."
>>
>> On Tue, Jan 17, 2012 at 12:53 PM, Yang <[email protected]> wrote:
>>
>> > Prashant:
>> >
>> > I tried splitting the input files, and yes, that worked: multiple mappers
>> > were indeed created.
>> >
>> > but then I would have to create a separate stage simply to split the input
>> > files, which is a bit cumbersome. it would be nice if there were some
>> > control to directly limit the map input split size etc.
>> >
>> > Thanks
>> > Yang
>> >
>> > On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[email protected]> wrote:
>> >
>> > > By block size I mean the actual HDFS block size. Based on your requirement
>> > > it seems like the input files are extremely small and reducing the block
>> > > size is not an option.
>> > >
>> > > Specifying "mapred.min.split.size" would not work for either Hadoop/Java MR
>> > > or Pig: Hadoop only picks the maximum of (minSplitSize, blockSize).
>> > >
>> > > Your job is more CPU-intensive than I/O-bound. You could split your input
>> > > into multiple files (equal to the # of map tasks on your cluster) and turn
>> > > off split combination (pig.splitCombination=false), though this is
>> > > generally a terrible MR practice!
>> > >
>> > > Another thing you could try is to give more memory to your map tasks by
>> > > increasing "mapred.child.java.opts" to a higher value.
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > >
>> > > On Wed, Jan 11, 2012 at 6:27 PM, Yang <[email protected]> wrote:
>> > >
>> > > > Prashant:
>> > > >
>> > > > thanks.
>> > > >
>> > > > by "reducing the block size", do you mean the split size? ---- the block
>> > > > size is fixed on a hadoop hdfs.
>> > > >
>> > > > my application is not really data heavy; each line of input takes a
>> > > > long while to process. as a result, the input size is small, but the
>> > > > total processing time is long, and the potential parallelism is high.
>> > > >
>> > > > Yang
>> > > >
>> > > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
>> > > > <[email protected]> wrote:
>> > > > > Hi Yang,
>> > > > >
>> > > > > You cannot really control the number of mappers directly (it depends
>> > > > > on the input splits), but you can certainly spawn more mappers in
>> > > > > various ways, such as reducing the block size or setting
>> > > > > pig.splitCombination to false (this *might* create more maps).
>> > > > >
>> > > > > The level of parallelization depends on how much data the 2 mappers
>> > > > > are handling. You would not want a lot of maps handling too little
>> > > > > data. For example, if your input data set is only a few MB it would
>> > > > > not be a good idea to have more than 1 or 2 maps.
>> > > > >
>> > > > > Thanks,
>> > > > > Prashant
>> > > > >
>> > > > > Sent from my iPhone
>> > > > >
>> > > > > On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
>> > > > >
>> > > > >> I have a pig script that does basically a map-only job:
>> > > > >>
>> > > > >> raw = LOAD 'input.txt';
>> > > > >>
>> > > > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>> > > > >>
>> > > > >> store processed into 'output.txt';
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> I have many nodes on my cluster, so I want Pig to process the input
>> > > > >> in more mappers, but it generates only 2 part-m-xxxxx files, i.e.
>> > > > >> it uses only 2 mappers.
>> > > > >>
>> > > > >> in a hadoop job it's possible to pass the mapper count and
>> > > > >> -Dmapred.min.split.size=; would this also work for Pig? the PARALLEL
>> > > > >> keyword only works for reducers.
>> > > > >>
>> > > > >>
>> > > > >> Thanks
>> > > > >> Yang
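For anyone reading this thread later: below is a minimal sketch of what the 0.9-era script could look like with the settings from this thread pulled together. The input/output paths and the convert_somehow UDF are just the placeholders from the original post, and the pig.splitCombination / mapred.child.java.opts lines come from Prashant's suggestions; the exact values are illustrative, not verified here.

    -- a.pg, run as: pig a.pg   (assumes Pig 0.9.x; Pig 0.5 rejects unknown set keys)
    set job.name 'blah';
    set mapred.map.tasks.speculative.execution false;

    -- Pig normally combines small input files into fewer splits; turning this off
    -- lets each pre-split input file get its own mapper (Prashant's suggestion)
    set pig.splitCombination false;

    -- give the CPU-heavy map tasks more heap (value is just an example)
    set mapred.child.java.opts '-Xmx1024m';

    -- map-only pipeline from the original post; REGISTER/DEFINE convert_somehow as needed
    raw = LOAD 'input.txt';
    processed = FOREACH raw GENERATE convert_somehow($1, $2);
    STORE processed INTO 'output.txt';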

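Per the docs page Dmitriy linked, the same properties can also be set from the Grunt command line instead of the script. A small sketch, again assuming Pig 0.9.x:

    grunt> set mapred.map.tasks.speculative.execution false;
    grunt> set pig.splitCombination false;

Properties set this way apply for the rest of the Grunt session, so statements entered afterwards (or a script started with run/exec) should pick them up.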