Can you do a few successive ones? Also, please let us know which version of pig you are using, and which loaders.
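A few successive samples can be scripted; a minimal sketch (the pid, sample count, and interval are placeholders — point it at the Pig client JVM's process id, e.g. from `jps`):

```shell
# sample_stacks PID N INTERVAL: write N jstack dumps of process PID,
# INTERVAL seconds apart, to jstack.<PID>.<i>.txt for later comparison.
sample_stacks() {
  pid=$1; n=$2; interval=$3
  i=1
  while [ "$i" -le "$n" ]; do
    jstack "$pid" > "jstack.$pid.$i.txt"
    sleep "$interval"
    i=$((i + 1))
  done
}

# e.g.: sample_stacks 15640 5 10
```

Comparing where the "main" thread sits across the successive dumps usually shows which phase is burning the time.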
D

On Mon, Jun 18, 2012 at 2:51 PM, Danfeng Li <[email protected]> wrote:
> This is the jstack output during the setup time; not exactly sure how to
> interpret it.
>
> Thanks.
> Dan
>
> [dli@hmaster run]$ jstack 15640
> 2012-06-18 17:32:47
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b17 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0000000055dcb800 nid=0x431d waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Low Memory Detector" daemon prio=10 tid=0x0000000055105000 nid=0x3d3b runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x0000000055103000 nid=0x3d3a waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x0000000055100000 nid=0x3d39 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00000000550fe000 nid=0x3d38 runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00000000550de800 nid=0x3d37 in Object.wait() [0x0000000041d7a000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>         - locked <0x00002aaab48a3cf8> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x00000000550dc800 nid=0x3d36 in Object.wait() [0x0000000041093000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:485)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x00002aaab48a3cb0> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000055065800 nid=0x3d25 runnable [0x0000000041653000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:164)
>         at org.apache.pig.newplan.logical.relational.LOInnerLoad.getSchema(LOInnerLoad.java:59)
>         at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
>         at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:109)
>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>         at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
>         at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>         at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>         at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:281)
>         at org.apache.pig.PigServer.compilePp(PigServer.java:1365)
>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1207)
>         at org.apache.pig.PigServer.execute(PigServer.java:1201)
>         at org.apache.pig.PigServer.access$100(PigServer.java:129)
>         at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
>         at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
>         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
>         at org.apache.pig.Main.run(Main.java:396)
>         at org.apache.pig.Main.main(Main.java:107)
>
> "VM Thread" prio=10 tid=0x00000000550d8800 nid=0x3d35 runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000055078800 nid=0x3d26 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000005507a800 nid=0x3d27 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000005507c000 nid=0x3d28 runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000005507e000 nid=0x3d29 runnable
>
> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000055080000 nid=0x3d2a runnable
>
> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000055081800 nid=0x3d2b runnable
>
> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000055083800 nid=0x3d2c runnable
>
> "GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000055085800 nid=0x3d2d runnable
>
> "GC task thread#8 (ParallelGC)" prio=10 tid=0x0000000055087000 nid=0x3d2e runnable
>
> "GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000055089000 nid=0x3d2f runnable
>
> "GC task thread#10 (ParallelGC)" prio=10 tid=0x000000005508b000 nid=0x3d30 runnable
>
> "GC task thread#11 (ParallelGC)" prio=10 tid=0x000000005508c800 nid=0x3d31 runnable
>
> "GC task thread#12 (ParallelGC)" prio=10 tid=0x000000005508e800 nid=0x3d32 runnable
>
> "GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000055090800 nid=0x3d33 runnable
>
> "GC task thread#14 (ParallelGC)" prio=10 tid=0x0000000055092800 nid=0x3d34 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x0000000055110000 nid=0x3d3c waiting on condition
>
> JNI global references: 1463
>
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[email protected]]
> Sent: Saturday, June 16, 2012 8:24 AM
> To: [email protected]
> Subject: Re: Job setup for a pig run takes ages
>
> What loader are you using? The JT is not the place to look at; try jstacking your pig process.
> Most likely it's talking to the NameNode most of the time, because the loader is doing some per-file lookups.
>
> On Jun 13, 2012, at 11:24 AM, Danfeng Li <[email protected]> wrote:
>
>> We also run into the long setup time issue, but our problem is different:
>>
>> 1. The setup time takes about 20 minutes; we can't see anything on the jobtracker during this setup time.
>> 2. Our data is saved in flat files, uncompressed.
>> 3. Our code consists of many small pig files; they are used in the following way in the main pig file:
>>
>>    data_1 = load ...
>>    data_2 = load ...
>>    ...
>>    data_n = load ...
>>
>>    run -param ... pigfile1.pig
>>    run -param ... pigfile2.pig
>>    ...
>>
>>    store out1 ..
>>    store out2 ..
>>    ...
>>
>> 4. Here's the part of the log file during the setup time. Notice the time difference between "13:46:42" and "14:05:23"; during that time, we can't see anything on the jobtracker.
>>
>> ...
>> 2012-06-13 13:46:30,488 [main] INFO org.apache.pig.Main - Logging error messages to: pig_1339609590477.log
>> 2012-06-13 13:46:30,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
>> 2012-06-13 13:46:30,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Account position (\\$
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value K,
>> 2012-06-13 13:46:32,766 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value %)
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_pre. Using value - matures on
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Maturity date
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value Account position (\\$
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value K,
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_pre. Using value matures on
>> 2012-06-13 13:46:32,767 [main] WARN org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
>> 2012-06-13 13:46:42,749 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
>> 2012-06-13 13:46:42,749 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2012-06-13 14:05:23,460 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
>> 2012-06-13 14:05:23,474 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
>> 2012-06-13 14:05:23,475 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11
>> ...
>>
>> Any help will be appreciated.
>>
>> Thanks.
>> Dan
>>
>> -----Original Message-----
>> From: Markus Resch [mailto:[email protected]]
>> Sent: Wednesday, June 13, 2012 2:24 AM
>> To: [email protected]
>> Subject: Re: Job setup for a pig run takes ages
>>
>> Hey Alex,
>>
>> On one side I think you're right, but we need to keep in mind that the schema could change within some files of a glob (e.g. after a schema extension update). The Avro storage should check at least some hash of the schema to verify that all schemas of all input files are the same, and/or split them into groups of different schemas if required.
>>
>> I'm currently about to check this issue with the Cloudera CDH4 pig version. I'll let you know if we see significantly different behavior.
>>
>> Best
>> Markus
>>
>> On Tuesday, Jun 12, 2012, at 19:16 -0400, Alex Rovner wrote:
>>> Thejas,
>>>
>>> Why can't Pig cache the results of getSchema instead of repeatedly calling the load function? This is causing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDF context, as it gets cleared somewhere in between those calls.
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <[email protected]> wrote:
>>>
>>>> Can you do a jstack <pid> on the pig client process id a few times, see what it is doing when it is taking so long, and send the results? With that we should be able to easily identify what might be happening and suggest a fix.
>>>>
>>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay in generating the query plan.
>>>>
>>>> The second large lag, before the job is queued, seems to be time spent in input split calculation, not within pig logic.
>>>>
>>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>>
>>>> Also, I think this delay is likely to be caused by the number of input files, and not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is it generated by a previous MR job? If yes, given your cluster size, it works against you to have so many maps or reducers in the previous MR job. If pig is generating the data, you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
>>>> While reducing the number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and whether we can fix something in pig to improve the situation.
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>>> Hi Prashant, Hi Thejas,
>>>>>
>>>>> Thanks for your very quick answer. No, this is not a typo. Those time stamps are true, and as I said, the machines are not very busy during this time.
>>>>>
>>>>> As this is our test cluster, I am sure I am the only one running jobs on it. Another issue we have is that we are currently only able to run one job at a time, but this shouldn't be the topic of this request. We don't even have a continuous input stream to that cluster; we copied a bunch of data to it some time ago.
>>>>> From my perspective, the 464 GB of input data you mentioned is the uncompressed amount of the 160 GB of compressed files, which is what I get when I use hadoop fs -dus on that folder.
>>>>>
>>>>> Another interesting fact for you could be that we're running the Cloudera CDH3 Update 3 version on our systems.
>>>>>
>>>>> I suspect this could be due to some fancy avro schema validation implicitly executed by the avro storage. If so, can this be avoided?
>>>>>
>>>>> Sadly, I'm currently not able to provide the actual script, as it contains confidential information, but I will try to provide a version as soon as possible. I'd rather think of a configuration problem in hadoop or pig anyway, as the script works fine with a smaller amount of input data.
>>>>>
>>>>> I would ask the hadoop mailing list if this issue occurred during the actual mapred run, but as it occurs even before a single mapred job is launched, I suspect pig to have the problem.
>>>>>
>>>>> Thanks
>>>>> Markus
>>>>>
>>>>> This is the full log until the main work job starts:
>>>>>
>>>>> mapred@ournamenode$ pig OurScript.pig
>>>>> 2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1338384441037.log
>>>>> 2012-05-30 15:27:21,368 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>>> 2012-05-30 15:27:21,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dev-jobtracker001.eu-fra.adtech.com:54311
>>>>> 2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>>> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>>> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>>> 2012-05-30 16:06:55,304 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>>> 2012-05-30 16:06:55,308 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for dataImport: $2, $3, $4
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData1: Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521 Operator Key: scope-521)
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) - scope-524 Operator Key: scope-524)
>>>>> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) - scope-483 Operator Key: scope-483)
>>>>> 2012-05-30 16:06:55,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>>>> 2012-05-30 16:06:55,467 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>>>>> 2012-05-30 16:06:55,483 [main] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
>>>>> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
>>>>> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
>>>>> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
>>>>> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: 1
>>>>> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
>>>>> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
>>>>> 20

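On the combine-files suggestion from Thejas above: the relevant properties can be passed on the pig command line. A sketch, assuming Pig 0.8 or later; the script name and the 256 MB split-size value are placeholders to tune for your cluster and block size:

```shell
# Combine many small input files into fewer map tasks.
# pig.maxCombinedSplitSize is in bytes; 268435456 = 256 MB.
pig -Dpig.splitCombination=true \
    -Dpig.maxCombinedSplitSize=268435456 \
    YourScript.pig
```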