This is the jstack output during the setup time; I'm not exactly sure how to
interpret it.
Thanks.
Dan
[dli@hmaster run]$ jstack 15640
2012-06-18 17:32:47
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):
"Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable
[0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable
[0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait()
[0x0000000041d7a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in
Object.wait() [0x0000000041093000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
java.lang.Thread.State: RUNNABLE
at
org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
at
org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
at
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
at
org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
at
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
at
org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
at
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at
org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
at
org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
at
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
at org.apache.pig.PigServer.execute(PigServer.java:1201)
at org.apache.pig.PigServer.access$100(PigServer.java:129)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:396)
at org.apache.pig.Main.main(Main.java:107)
"VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26
runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27
runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28
runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29
runnable
"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a
runnable
"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b
runnable
"GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c
runnable
"GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d
runnable
"GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e
runnable
"GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f
runnable
"GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30
runnable
"GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31
runnable
"GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32
runnable
"GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33
runnable
"GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34
runnable
"VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on
condition
JNI global references: 1463
-----Original Message-----
From: Dmitriy Ryaboy [mailto:[email protected]]
Sent: Saturday, June 16, 2012 8:24 AM
To: [email protected]
Subject: Re: Job setup for a pig run takes ages
What loader are you using? The JobTracker is not the place to look; try
jstacking your Pig client process. Most likely it's talking to the NameNode
most of the time because the loader is doing some per-file lookups.
On Jun 13, 2012, at 11:24 AM, Danfeng Li <[email protected]> wrote:
> We also ran into the long setup time issue, but our problem is
> different:
>
> 1. The setup time takes about 20 minutes; we can't see anything on the
> jobtracker during this setup time.
> 2. Our data is saved in flat files, uncompressed.
> 3. Our code consists of many small pig files; they are used in the
> following way in the main pig file:
> data_1 = load ...
> data_2 = load ...
> ...
> data_n = load ...
>
> run -param ... pigfile1.pig
> run -param ... pigfile2.pig
> ...
>
> store out1 ..
> store out2 ..
> ...
> 4. Here's part of the log file during the setup time. Notice the time
> difference between "13:46:42" and "14:05:23"; during that time, we can't see
> anything on the jobtracker.
> ...
> 2012-06-13 13:46:30,488 [main] INFO org.apache.pig.Main - Logging
> error messages to: pig_1339609590477.log
> 2012-06-13 13:46:30,796 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://master:9000
> 2012-06-13 13:46:30,950 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at: master:9001
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_fir. Using value : Account
> position (\\$
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_sec. Using value K,
> 2012-06-13 13:46:32,766 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_thi. Using value %)
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for detail_statment_pre. Using value - matures
> on
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple
> values found for detail_statment_post. Using value .
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_fir. Using value : Maturity date
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_sec. Using value Account position
> (\\$
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for rationale_thi. Using value K,
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning :
> Multiple values found for catalyst_pre. Using value matures on
> 2012-06-13 13:46:32,767 [main] WARN
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple
> values found for catalyst_post. Using value .
> 2012-06-13 13:46:42,749 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script:
> REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
> 2012-06-13 13:46:42,749 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-06-13 14:05:23,460 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
> 2012-06-13 14:05:23,474 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17,
> $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33,
> $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
> 2012-06-13 14:05:23,475 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for
> starmine: $0, $3, $4, $5, $6, $9, $10, $11 ...
>
> Any help will be appreciated.
>
> Thanks.
> Dan
>
> -----Original Message-----
> From: Markus Resch [mailto:[email protected]]
> Sent: Wednesday, June 13, 2012 2:24 AM
> To: [email protected]
> Subject: Re: Job setup for a pig run takes ages
>
> Hey Alex,
>
> On one side I think you're right, but we need to keep in mind that the
> schema could change within some files of a glob (e.g. a schema extension
> update). The AvroStorage should check at least some hash of the schema to
> verify that the schemas of all input files are the same, and/or split them
> into groups of different schemas if required.
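That check could be as simple as fingerprinting each file's schema and grouping files by fingerprint. A minimal sketch in plain Java, where the file names and schema strings are hypothetical stand-ins for real Avro metadata reads (this is not AvroStorage's actual API):

```java
import java.util.*;

public class SchemaGrouper {
    // Group input files by a fingerprint of their schema, so files sharing a
    // schema can be read together and a mismatch is detected up front.
    public static Map<Integer, List<String>> groupBySchema(Map<String, String> fileSchemas) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : fileSchemas.entrySet()) {
            int fingerprint = e.getValue().hashCode(); // stand-in for a real schema hash
            groups.computeIfAbsent(fingerprint, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> fileSchemas = new LinkedHashMap<>();
        fileSchemas.put("part-00000.avro", "{\"name\":\"rec\",\"fields\":[\"a\",\"b\"]}");
        fileSchemas.put("part-00001.avro", "{\"name\":\"rec\",\"fields\":[\"a\",\"b\"]}");
        fileSchemas.put("part-00002.avro", "{\"name\":\"rec\",\"fields\":[\"a\",\"b\",\"c\"]}"); // extended schema
        Map<Integer, List<String>> groups = groupBySchema(fileSchemas);
        System.out.println(groups.size() + " schema group(s)");
    }
}
```

With one group, all files agree; more than one group signals the schema drift described above.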
>
> I'm currently about to check this issue with the Cloudera CDH4 Pig version.
> I'll let you know if we see significantly different behavior.
>
> Best
> Markus
>
> Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
>> Thejas,
>>
>> Why can't Pig cache the results of getSchema instead of repeatedly calling
>> the load function? This is causing us to put lots of unnecessary
>> workarounds in place, since during those calls one can't rely on the
>> UDFContext, as it gets cleared somewhere in between those calls.
>>
>> Sent from my iPhone
>>
>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <[email protected]> wrote:
>>
>>> Can you run jstack <pid> on the Pig client process a few times while it is
>>> taking so long, and send the results?
>>> With that we should be able to identify what is happening and suggest a
>>> fix.
>>>
>>> I think Pig calls getSchema() on load functions more times than actually
>>> necessary during the query optimization phase (each time after a
>>> transform). One theory I have is that your avro load function does a stat
>>> on all the 50k files each time to determine the common schema. This might
>>> be causing the delay in generating the query plan.
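If that theory holds, one mitigation is for the loader to compute the schema once and memoize it, so repeated getSchema() calls during optimization stay cheap. A self-contained sketch in plain Java, where expensiveComputeSchema() is a hypothetical stand-in for statting every input file (this is not Pig's actual LoadFunc API):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class MemoizedSchemaLoader {
    private String cachedSchema;                           // computed at most once
    public final AtomicInteger statCalls = new AtomicInteger();

    // Hypothetical stand-in for the expensive per-file stat/merge pass.
    private String expensiveComputeSchema() {
        statCalls.incrementAndGet();
        return "a:int,b:chararray";
    }

    // The optimizer may call this many times; only the first call pays
    // the cost of scanning the input files.
    public String getSchema() {
        if (cachedSchema == null) {
            cachedSchema = expensiveComputeSchema();
        }
        return cachedSchema;
    }

    public static void main(String[] args) {
        MemoizedSchemaLoader loader = new MemoizedSchemaLoader();
        for (int i = 0; i < 5; i++) loader.getSchema();
        // prints: schema computed 1 time(s)
        System.out.println("schema computed " + loader.statCalls.get() + " time(s)");
    }
}
```

The same idea applies whether the cache lives in the loader instance or, as Alex suggests, inside Pig itself.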
>>>
>>> The second large lag, before the job is queued, seems to be time spent in
>>> input split calculation, not within pig logic.
>>>
>>> These are just theories I have. The jstack output would help determine if
>>> this is actually the case.
>>>
>>> Also, I think this delay is likely to be caused by the number of input
>>> files, not the actual data size. You might also want to look at ways to
>>> reduce the number of input files. Is the data generated by a previous MR
>>> job? If so, given your cluster size, it works against you to have so many
>>> maps or reducers in the previous MR job. If pig is generating the data,
>>> you should consider setting the parallelism (reducers) or the number of
>>> maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
>>> While reducing number of input files will most likely solve your problem, I
>>> am still interested in knowing where the delays are coming from and if we
>>> can fix something in pig to improve the situation.
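If the data does come from Pig itself, the knobs Thejas mentions can be set directly in the script. A sketch; the values are illustrative, not recommendations, and the property names should be checked against the linked perf page for your Pig version:

```pig
-- Combine many small input files into larger splits.
set pig.splitCombination 'true';
set pig.maxCombinedSplitSize '134217728';  -- ~128 MB per map

-- Cap the number of reducers (and hence output files) of the producing job.
set default_parallel 20;
```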
>>>
>>> Thanks,
>>> Thejas
>>>
>>>
>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>> Hi Prashant, Hi Thejas,
>>>>
>>>> Thanks for your very quick answer.
>>>> No, this is not a typo. Those time stamps are correct, and as I said,
>>>> the machines are not very busy during this time.
>>>>
>>>> As this is our test cluster, I am sure I am the only one running
>>>> jobs on it. Another issue we have is that we are currently only able
>>>> to run one job at a time, but that shouldn't be the topic of this
>>>> request.
>>>> We don't even have a continuous input stream to that cluster, but
>>>> copied a bunch of data to it some time ago.
>>>> From my perspective, the 464 GB of input data you mentioned is the
>>>> uncompressed amount of the 160 GB of compressed files, which is what I
>>>> get when I use hadoop fs -dus on that folder.
>>>>
>>>> Another interesting fact for you could be that we're running the
>>>> Cloudera CDH3 Update 3 version on our systems.
>>>>
>>>> I suspect this could be due to some fancy avro schema validation
>>>> implicitly executed by the AvroStorage. If so, can it be avoided?
>>>>
>>>> Sadly, I'm not able to provide the actual script at the moment,
>>>> as it contains confidential information, but I will try to
>>>> provide a version as soon as possible. I'd rather suspect a
>>>> configuration problem in hadoop or pig anyway, as the script
>>>> works fine with a smaller amount of input data.
>>>>
>>>> I would ask the hadoop mailing list if this issue occurred during
>>>> the actual mapred run, but as it occurs even before a single
>>>> mapred job is launched, I suspect pig has the problem.
>>>>
>>>> Thanks
>>>> Markus
>>>>
>>>> This is the full log until the main work job starts:
>>>> mapred@ournamenode$ pig OurScript.pig
>>>> 2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging
>>>> error messages to: /tmp/pig_1338384441037.log
>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to map-reduce job tracker at:
>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>> 2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer -
>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>> the
>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for dataImport: $2, $3, $4
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData1:
>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-521 Operator Key: scope-521)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-524 Operator Key: scope-524)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-483 Operator Key: scope-483)
>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>>>> - File concatenation threshold: 100 optimistic? false
>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>> paths to process : 1
>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths to process : 1
>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library is available
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>> library
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library loaded
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths (combined) to process : 1
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>>>> - number of input files: 1
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>>> - MR plan size before optimization: 7
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>>> - Merged 2 map-only splittees.
>>>> 20