What loader are you using? The JobTracker is not the place to look; try 
jstacking your Pig client process. Most likely it's talking to the NameNode 
most of the time because the loader is doing per-file lookups. 
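For reference, one quick way to see where the client spends its time: take a few jstack dumps of the Pig process and tally the hot frames. A minimal sketch in Python, using an illustrative dump excerpt (not from a real run) in place of actual `jstack <pid>` output:

```python
import collections
import re

# Illustrative jstack excerpt; in practice, read the output of `jstack <pid>`.
dump = """
"main" prio=10 tid=0x0001 runnable
   at org.apache.hadoop.ipc.Client.call(Client.java:1030)
   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:744)
   at org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema.java:297)
"""

# Count stack frames by top-level package to see where the client is busy;
# lots of hadoop.ipc / hdfs frames suggest per-file NameNode round trips.
counts = collections.Counter()
for match in re.finditer(r"at ((?:\w+\.){2}\w+)", dump):
    counts[match.group(1)] += 1

for package, n in counts.most_common():
    print(package, n)
```

Taking several dumps a few seconds apart and comparing them is what distinguishes "stuck in one slow call" from "making thousands of fast calls".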

On Jun 13, 2012, at 11:24 AM, Danfeng Li <[email protected]> wrote:

> We also ran into the long setup time issue, but our problem is different:
> 
> 1. The setup takes about 20 minutes, and we can't see anything on the 
> jobtracker during this time.
> 2. Our data is saved in flat files, uncompressed.
> 3. Our code consists of many small Pig scripts, used in the following 
> way in the main Pig script:
> data_1 = load ...
> data_2 = load ...
> ...
> data_n = load ...
> 
> run -param ... pigfile1.pig
> run -param ... pigfile2.pig
> ...
> 
> store out1 ..
> store out2 ..
> ...
> 4. Here's the relevant part of the log file from the setup period. Notice the 
> gap between "13:46:42" and "14:05:23"; during that time, we can't see 
> anything on the jobtracker.
> ...
> 2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: pig_1339609590477.log
> 2012-06-13 13:46:30,796 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://master:9000
> 2012-06-13 13:46:30,950 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: master:9001
> 2012-06-13 13:46:32,766 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_fir. Using value : Account position (\\$
> 2012-06-13 13:46:32,766 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_sec. Using value K,
> 2012-06-13 13:46:32,766 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_thi. Using value %)
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for detail_statment_pre. Using value  - matures on
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for detail_statment_post. Using value .
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_fir. Using value : Maturity date
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_sec. Using value  Account position (\\$
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for rationale_thi. Using value K,
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for catalyst_pre. Using value  matures on
> 2012-06-13 13:46:32,767 [main] WARN  
> org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple 
> values found for catalyst_post. Using value .
> 2012-06-13 13:46:42,749 [main] INFO  
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: 
> REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
> 2012-06-13 13:46:42,749 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-06-13 14:05:23,460 [main] INFO  
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for 
> var_raw: $0, $1, $2, $6, $7, $8, $9, $10
> 2012-06-13 14:05:23,474 [main] INFO  
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for 
> var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, 
> $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, 
> $39, $40, $41, $42, $43, $44, $45
> 2012-06-13 14:05:23,475 [main] INFO  
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for 
> starmine: $0, $3, $4, $5, $6, $9, $10, $11
> ...
> 
> Any help will be appreciated.
> 
> Thanks.
> Dan
> 
> -----Original Message-----
> From: Markus Resch [mailto:[email protected]]
> Sent: Wednesday, June 13, 2012 2:24 AM
> To: [email protected]
> Subject: Re: Job setup for a pig run takes ages
> 
> Hey Alex,
> 
> On one hand I think you're right, but we need to keep in mind that the schema 
> could change within some files of a glob (e.g. after a schema extension
> update). AvroStorage should check at least a hash of the schema to 
> verify that all input files share the same schema, and/or split them into 
> groups of different schemas if required.
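The grouping described here could be sketched as fingerprinting each file's schema; the file names and schemas below are hypothetical (a real check would read each Avro file's header):

```python
import hashlib
import json

# Hypothetical per-file schemas; a real check would read each file's header.
schemas = {
    "part-00000.avro": {"type": "record", "name": "R",
                        "fields": [{"name": "a", "type": "int"}]},
    "part-00001.avro": {"type": "record", "name": "R",
                        "fields": [{"name": "a", "type": "int"}]},
    "part-00002.avro": {"type": "record", "name": "R",
                        "fields": [{"name": "a", "type": "int"},
                                   {"name": "b", "type": "long"}]},
}

def fingerprint(schema):
    # Canonicalize (sorted keys) before hashing so key order doesn't matter.
    return hashlib.md5(
        json.dumps(schema, sort_keys=True).encode()).hexdigest()

# Bucket files by schema fingerprint: one group means the whole glob can be
# loaded with a single schema; more groups mean the glob must be split.
groups = {}
for path, schema in schemas.items():
    groups.setdefault(fingerprint(schema), []).append(path)

print(len(groups), "distinct schema(s)")
```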
> 
> I'm currently checking this issue with the Cloudera CDH4 Pig version. 
> I'll let you know if we see significantly different behavior.
> 
> Best
> Markus
> 
> Am Dienstag, den 12.06.2012, 19:16 -0400 schrieb Alex Rovner:
>> Thejas,
>> 
>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the 
>> load function? This forces us to put lots of unnecessary workarounds 
>> in place, since during those calls one can't rely on the UDFContext, as it 
>> gets cleared somewhere in between those calls.
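The caching being asked about amounts to memoizing the schema lookup by input location. A sketch, with hypothetical names that are not Pig's actual internals:

```python
import functools

# Hypothetical stand-in for a loader's schema lookup; not Pig's real API.
@functools.lru_cache(maxsize=None)
def get_schema(location):
    # Pretend this stats every file under `location` (the expensive part).
    get_schema.calls += 1
    return ("a:int", "b:long")

get_schema.calls = 0
# The optimizer asks for the schema again after every transform...
for _ in range(5):
    get_schema("/data/input")
print(get_schema.calls)  # ...but the expensive lookup ran once
```

The hard part in practice is invalidation: the cache key would have to capture everything the loader's answer depends on (location, UDF signature, parameters), which is presumably why Pig doesn't do this naively.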
>> 
>> Sent from my iPhone
>> 
>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <[email protected]> wrote:
>> 
>>> Can you do a jstack <pid> on the pig client process id a few times and see 
>>> what it is doing when it is taking so long and send the results ?
>>> With that we should be able to easily identify what might be happening and 
>>> suggest a fix.
>>> 
>>> I think Pig calls getSchema() on load functions more times than actually 
>>> necessary during the query optimization phase (each time after a 
>>> transform). One theory I have is that your Avro load function does a stat 
>>> on all 50k files each time to determine the common schema. This might 
>>> be causing the delay in generating the query plan.
>>> 
>>> The second large lag, before the job is queued, seems to be time spent in 
>>> input split calculation, not within Pig logic.
>>> 
>>> These are just theories I have. The jstack output would help determine if 
>>> this is actually the case.
>>> 
>>> Also, I think this delay is likely caused by the number of input 
>>> files, not the actual data size, so you might want to look at ways to 
>>> reduce the actual number of input files. Is it generated by a previous MR 
>>> job? If so, given your cluster size, it works against you to have so many 
>>> maps or reducers in that job. If Pig is generating the data, 
>>> you should consider setting the parallelism (reducers) or the number of 
>>> maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
>>> While reducing the number of input files will most likely solve your problem, I 
>>> am still interested in knowing where the delays are coming from and whether we 
>>> can fix something in Pig to improve the situation.
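If Pig is the job writing those files, the split-combination and parallelism knobs from the linked docs page can be set in the generating script; the values below are illustrative, not recommendations:

```
-- Illustrative values; tune for your cluster.
set default_parallel 20;                 -- cap reducers in the writing job
set pig.splitCombination true;           -- combine small input splits (Pig 0.8+)
set pig.maxCombinedSplitSize 134217728;  -- target ~128 MB per combined split
```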
>>> 
>>> Thanks,
>>> Thejas
>>> 
>>> 
>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>> Hi Prashant, Hi Thejas,
>>>> 
>>>> thanks for your very quick answer.
>>>> No, this is not a typo. Those timestamps are correct and, as I said,
>>>> the machines are not very busy during this time.
>>>> 
>>>> As this is our test cluster, I am sure I am the only one who is
>>>> running jobs on it. Another issue we have is that we are currently
>>>> only able to run one job at a time, but that shouldn't be the topic of this 
>>>> request.
>>>> We don't even have a continuous input stream to that cluster; we copied
>>>> a bunch of data to it some time ago.
>>>> From my perspective, the 464 GB of input data you mentioned is
>>>> the uncompressed size of the 160 GB of compressed files, which is
>>>> what I get when I run hadoop fs -dus on that folder.
>>>> 
>>>> Another interesting fact for you could be that we're running the
>>>> cloudera CDH3 Update 3 version on our systems.
>>>> 
>>>> I suspect this could be due to some fancy Avro schema validation
>>>> implicitly executed by AvroStorage. If so, can it be avoided?
>>>> 
>>>> Sadly, I'm currently not able to provide the actual script, as it
>>>> contains confidential information, but I will try to provide a
>>>> version as soon as possible. I'd rather suspect a
>>>> configuration problem in Hadoop or Pig anyway, as the script
>>>> works fine with a smaller amount of input data.
>>>> 
>>>> I would ask the Hadoop mailing list if this issue occurred
>>>> during the actual MapReduce run, but as it occurs before a single
>>>> MapReduce job is launched, I suspect Pig has the problem.
>>>> 
>>>> Thanks
>>>> Markus
>>>> 
>>>> This is the full log until the main work job starts:
>>>> mapred@ournamenode$ pig OurScript.pig
>>>> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging
>>>> error messages to: /tmp/pig_1338384441037.log
>>>> 2012-05-30 15:27:21,368 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to hadoop file system at: hdfs://OurNamenode:9000
>>>> 2012-05-30 15:27:21,609 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> Connecting to map-reduce job tracker at:
>>>> dev-jobtracker001.eu-fra.adtech.com:54311
>>>> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
>>>> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in
>>>> the
>>>> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
>>>> 2012-05-30 15:57:27,816 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>>> 2012-05-30 16:06:55,304 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
>>>> 2012-05-30 16:06:55,308 [main] INFO
>>>> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
>>>> pruned for dataImport: $2, $3, $4
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData1:
>>>> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-521 Operator Key: scope-521)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-524 Operator Key: scope-524)
>>>> 2012-05-30 16:06:55,441 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
>>>> OutputData2:
>>>> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
>>>> scope-483 Operator Key: scope-483)
>>>> 2012-05-30 16:06:55,453 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - File concatenation threshold: 100 optimistic? false
>>>> 2012-05-30 16:06:55,467 [main] INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>>>> paths to process : 1
>>>> 2012-05-30 16:06:55,471 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths to process : 1
>>>> 2012-05-30 16:06:55,483 [main] WARN
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library is available
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
>>>> library
>>>> 2012-05-30 16:06:55,484 [main] INFO
>>>> org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native
>>>> library loaded
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>> Total input paths (combined) to process : 1
>>>> 2012-05-30 16:06:55,486 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRComp
>>>> iler
>>>> - number of input files: 1
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQ
>>>> ueryOptimizer - MR plan size before optimization: 7
>>>> 2012-05-30 16:06:55,516 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>>>  - Merged 2 map-only splittees.
>>>> 20
