ok, I see: I was using Pig 0.5; tried 0.9 and it works now.
thanks!

On Tue, Jan 17, 2012 at 1:20 PM, Yang <[email protected]> wrote:
> weird
>
> I tried
>
> # head a.pg
>
> set job.name 'blah';
> SET mapred.map.tasks.speculative.execution false;
> set mapred.min.split.size 10000;
>
> set mapred.tasktracker.map.tasks.maximum 10000;
>
>
> [root@]# pig a.pg
> 2012-01-17 16:19:18,407 [main] INFO org.apache.pig.Main - Logging error
> messages to: /mnt/pig_1326835158407.log
> 2012-01-17 16:19:18,564 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at:
> hdfs://ec2-107-22-118-169.compute-1.amazonaws.com:8020/
> 2012-01-17 16:19:18,749 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> ec2-107-22-118-169.compute-1.amazonaws.com:8021
> 2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
> Details at logfile: /mnt/pig_1326835158407.log
>
>
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
>
> org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set
> key: mapred.map.tasks.speculative.execution
>         at org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:397)
>
> ================================================================================
>
>
> so the job.name param is accepted, but the next one, mapred.map......, was
> unrecognized, even though it is the one I pasted from the docs page.
>
>
> On Tue, Jan 17, 2012 at 1:15 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> http://pig.apache.org/docs/r0.9.1/cmds.html#set
>>
>> "All Pig and Hadoop properties can be set, either in the Pig script or via
>> the Grunt command line."
>>
>> On Tue, Jan 17, 2012 at 12:53 PM, Yang <[email protected]> wrote:
>>
>> > Prashant:
>> >
>> > I tried splitting the input files, and yes, that worked: multiple mappers
>> > were indeed created.
>> >
>> > but then I would have to create a separate stage simply to split the input
>> > files, which is a bit cumbersome. it would be nice if there were some
>> > control to directly limit the map input split size etc.
>> >
>> > Thanks
>> > Yang
>> >
>> > On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[email protected]> wrote:
>> >
>> > > By block size I mean the actual HDFS block size. Based on your requirement
>> > > it seems like the input files are extremely small and reducing the block
>> > > size is not an option.
>> > >
>> > > Specifying "mapred.min.split.size" would not work for either Hadoop/Java MR
>> > > or Pig: Hadoop only picks the maximum of (minSplitSize, blockSize).
>> > >
>> > > Your job is more CPU-intensive than I/O-bound. You could split your input
>> > > into multiple files (equal to the # of map tasks on your cluster) and turn
>> > > off split combination (pig.splitCombination=false), though this is
>> > > generally a terrible MR practice!
>> > >
>> > > Another thing you could try is to give more memory to your map tasks by
>> > > increasing "mapred.child.java.opts" to a higher value.
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > >
>> > > On Wed, Jan 11, 2012 at 6:27 PM, Yang <[email protected]> wrote:
>> > >
>> > > > Prashant:
>> > > >
>> > > > thanks.
>> > > >
>> > > > by "reducing the block size", do you mean the split size? ---- the block
>> > > > size is fixed on a hadoop hdfs.
>> > > >
>> > > > my application is not really data heavy; each line of input takes a
>> > > > long while to process. as a result, the input size is small, but the
>> > > > total processing time is long, and the potential parallelism is high.
>> > > >
>> > > > Yang
>> > > >
>> > > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
>> > > > <[email protected]> wrote:
>> > > > > Hi Yang,
>> > > > >
>> > > > > You cannot really control the number of mappers directly (it depends
>> > > > > on the input splits), but you can certainly spawn more mappers in
>> > > > > various ways, such as reducing the block size or setting
>> > > > > pig.splitCombination to false (this *might* create more maps).
>> > > > >
>> > > > > The level of parallelization depends on how much data the 2 mappers
>> > > > > are handling. You would not want a lot of maps handling too little
>> > > > > data. For example, if your input data set is only a few MB it would
>> > > > > not be a good idea to have more than 1 or 2 maps.
>> > > > >
>> > > > > Thanks,
>> > > > > Prashant
>> > > > >
>> > > > > Sent from my iPhone
>> > > > >
>> > > > > On Jan 11, 2012, at 6:13 PM, Yang <[email protected]> wrote:
>> > > > >
>> > > > >> I have a pig script that does basically a map-only job:
>> > > > >>
>> > > > >> raw = LOAD 'input.txt';
>> > > > >>
>> > > > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>> > > > >>
>> > > > >> store processed into 'output.txt';
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> I have many nodes on my cluster, so I want Pig to process the input
>> > > > >> in more mappers, but it generates only 2 part-m-xxxxx files, i.e.
>> > > > >> it uses only 2 mappers.
>> > > > >>
>> > > > >> in a hadoop job it's possible to pass the mapper count and
>> > > > >> -Dmapred.min.split.size=; would this also work for Pig? the PARALLEL
>> > > > >> keyword only works for reducers.
>> > > > >>
>> > > > >>
>> > > > >> Thanks
>> > > > >> Yang
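For anyone reading this thread later: below is a minimal sketch of what the 0.9-era script could look like with the settings from this thread pulled together. The input/output paths and the convert_somehow UDF are just the placeholders from the original post, and the pig.splitCombination / mapred.child.java.opts lines come from Prashant's suggestions; the exact values are illustrative, not verified here.

    -- a.pg, run as: pig a.pg   (assumes Pig 0.9.x; Pig 0.5 rejects unknown set keys)
    set job.name 'blah';
    set mapred.map.tasks.speculative.execution false;

    -- Pig normally combines small input files into fewer splits; turning this off
    -- lets each pre-split input file get its own mapper (Prashant's suggestion)
    set pig.splitCombination false;

    -- give the CPU-heavy map tasks more heap (value is just an example)
    set mapred.child.java.opts '-Xmx1024m';

    -- map-only pipeline from the original post; REGISTER/DEFINE convert_somehow as needed
    raw = LOAD 'input.txt';
    processed = FOREACH raw GENERATE convert_somehow($1, $2);
    STORE processed INTO 'output.txt';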

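Per the docs page Dmitriy linked, the same properties can also be set from the Grunt command line instead of the script. A small sketch, again assuming Pig 0.9.x:

    grunt> set mapred.map.tasks.speculative.execution false;
    grunt> set pig.splitCombination false;

Properties set this way apply for the rest of the Grunt session, so statements entered afterwards (or a script started with run/exec) should pick them up.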