Can you do a few successive ones? Also, please let us know which version of pig you are using, and which loaders.
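A few successive samples can be scripted; a minimal sketch (the pid, sample count, and interval are placeholders — point it at the Pig client JVM's process id, e.g. from `jps`):

```shell
# sample_stacks PID N INTERVAL: write N jstack dumps of process PID,
# INTERVAL seconds apart, to jstack.<PID>.<i>.txt for later comparison.
sample_stacks() {
  pid=$1; n=$2; interval=$3
  i=1
  while [ "$i" -le "$n" ]; do
    jstack "$pid" > "jstack.$pid.$i.txt"
    sleep "$interval"
    i=$((i + 1))
  done
}

# e.g.: sample_stacks 15640 5 10
```

Comparing where the "main" thread sits across the successive dumps usually shows which phase is burning the time.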
D

On Mon, Jun 18, 2012 at 2:51 PM, Danfeng Li <[email protected]> wrote:
> This is the jstack output during the setup time; not exactly sure how to
> interpret it.
>
> Thanks.
> Dan
>
> [dli@hmaster run]$ jstack 15640
> 2012-06-18 17:32:47
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait() [0x0000000041d7a000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>         - locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in Object.wait() [0x0000000041093000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:485)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
>         at org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
>         at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
>         at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>         at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
>         at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>         at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>         at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
>         at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
>         at org.apache.pig.PigServer.execute(PigServer.java:1201)
>         at org.apache.pig.PigServer.access$100(PigServer.java:129)
>         at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
>         at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
>         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
>         at org.apache.pig.Main.run(Main.java:396)
>         at org.apache.pig.Main.main(Main.java:107)
>
> "VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28 runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29 runnable
>
> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a runnable
>
> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b runnable
>
> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c runnable
>
> "GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d runnable
>
> "GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e runnable
>
> "GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f runnable
>
> "GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30 runnable
>
> "GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31 runnable
>
> "GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32 runnable
>
> "GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33 runnable
>
> "GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on condition
>
> JNI global references: 1463
>
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[email protected]]
> Sent: Saturday, June 16, 2012 8:24 AM
> To: [email protected]
> Subject: Re: Job setup for a pig run takes ages
>
> What loader are you using? The JT is not the place to look at; try jstacking your pig process.
> Most likely it's talking to the NameNode most of the time, because the loader is doing some per-file lookups.
>
> On Jun 13, 2012, at 11:24 AM, Danfeng Li <[email protected]> wrote:
>
>> We also run into the long setup time issue, but our problem is different:
>>
>> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
>> 2. Our data is saved in flat files, uncompressed.
>> 3. Our code consists of many small pig files; they are used in the following way in the main pig file:
>>
>>    data_1 = load ...
>>    data_2 = load ...
>>    ...
>>    data_n = load ...
>>
>>    run -param ... pigfile1.pig
>>    run -param ... pigfile2.pig
>>    ...
>>
>>    store out1 ..
>>    store out2 ..
>>    ...
>>
>> 4. Here's the part of the log file during the setup time. Notice the time difference between "13:46:42" and "14:05:23"; during that time, we can't see anything on the jobtracker.
>>
>> ...
>> 2012-06-13 13:46:30,488 [main] INFO org.apache.pig.Main - Logging error messages to: pig_1339609590477.log
>> 2012-06-13 13:46:30,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
>> 2012-06-13 13:46:30,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Account position (\\$
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value K,
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value %)
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_pre. Using value - matures on
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Maturity date
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value Account position (\\$
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value K,
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_pre. Using value matures on
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
>> 2012-06-13 13:46:42,749 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
>> 2012-06-13 13:46:42,749 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2012-06-13 14:05:23,460 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
>> 2012-06-13 14:05:23,474 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
>> 2012-06-13 14:05:23,475 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11
>> ...
>>
>> Any help will be appreciated.
>>
>> Thanks.
>> Dan
>>
>> -----Original Message-----
>> From: Markus Resch [mailto:[email protected]]
>> Sent: Wednesday, June 13, 2012 2:24 AM
>> To: [email protected]
>> Subject: Re: Job setup for a pig run takes ages
>>
>> Hey Alex,
>>
>> On one side I think you're right, but we need to keep in mind that the schema could change within some files of a glob (e.g. after a schema extension update). The Avro storage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>>
>> I'm currently about to check this issue with the Cloudera CDH4 pig version. I'll let you know if we see significantly different behavior.
>>
>> Best
>> Markus
>>
>> On Tuesday, Jun 12, 2012, at 19:16 -0400, Alex Rovner wrote:
>>> Thejas,
>>>
>>> Why can't Pig cache the results of getSchema instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context, as it gets cleared somewhere in between those calls.
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <[email protected]> wrote:
>>>
>>>> Can you do a jstack <pid> on the pig client process id a few times, see what it is doing when it is taking so long, and send the results? With that we should be able to easily identify what might be happening and suggest a fix.
>>>>
>>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay in generating the query plan.
>>>>
>>>> The second large lag, before the job is queued, seems to be time spent in input split calculation, not within pig logic.
>>>>
>>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>>
>>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data, you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
>>>> While reducing the number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and whether we can fix something in pig to improve the situation.
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>>> Hi Prashant, Hi Thejas,
>>>>>
>>>>> Thanks for your very quick answer. No, this is not a typo. Those time stamps are true, and as I said, the machines are not very busy during this time.
>>>>>
>>>>> As this is our test cluster, I am sure I am the only one running jobs on it. Another issue we have is that we are currently only able to run one job at a time, but this shouldn't be the topic of this request. We don't even have a continuous input stream to that cluster; we copied a bunch of data to it some time ago.
>>>>> From my perspective, the 464 GB of input data you mentioned is the uncompressed amount of the 160 GB of compressed files, which is what I get when I use hadoop fs -dus on that folder.
>>>>>
>>>>> Another interesting fact for you could be that we're running the Cloudera CDH3 Update 3 version on our systems.
>>>>>
>>>>> I suspect this could be due to some fancy avro schema validation implicitly executed by the avro storage. If so, can this be avoided?
>>>>>
>>>>> Sadly, I'm currently not able to provide the actual script, as it contains confidential information, but I will try to provide a version as soon as possible. I'd rather think of a configuration problem in hadoop or pig anyway, as the script works fine with a smaller amount of input data.
>>>>>
>>>>> I would ask the hadoop mailing list if this issue occurred during the actual mapred run, but as it occurs even before a single mapred job is launched, I suspect pig to have the problem.
>>>>>
>>>>> Thanks
>>>>> Markus
>>>>>
>>>>> This is the full log until the main work job starts:
>>>>>
>>>>> mapred@ournamenode$ pig OurScript.pig
>>>>> 2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1338384441037.log
>>>>> 2012-05-30 15:27:21,368 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>>> 2012-05-30 15:27:21,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dev-jobtracker001.eu-fra.adtech.com:54311
>>>>> 2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>>> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>>> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>>> 2012-05-30 16:06:55,304 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>>> 2012-05-30 16:06:55,308 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for dataImport: $2, $3, $4
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData1: Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521 Operator Key: scope-521)
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) - scope-524 Operator Key: scope-524)
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) - scope-483 Operator Key: scope-483)
>>>>> 2012-05-30 16:06:55,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>>>> 2012-05-30 16:06:55,467 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,483 [main] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
>>>>> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
>>>>> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
>>>>> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
>>>>> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: 1
>>>>> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
>>>>> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>>> 20

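On the combine-files suggestion from Thejas above: the relevant properties can be passed on the pig command line. A sketch, assuming Pig 0.8 or later; the script name and the 256 MB split-size value are placeholders to tune for your cluster and block size:

```shell
# Combine many small input files into fewer map tasks.
# pig.maxCombinedSplitSize is in bytes; 268435456 = 256 MB.
pig -Dpig.splitCombination=true \
    -Dpig.maxCombinedSplitSize=268435456 \
    YourScript.pig
```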