Hey Markus, I am also interested in looking at your Pig script. I think there is some insight to be gained here.
Thanks,
Ashutosh

On Fri, Jun 1, 2012 at 6:34 AM, Markus Resch <[email protected]> wrote:
> Hi Prashant, hi Thejas,
>
> thanks for your very quick answer.
> No, this is not a typo. Those timestamps are real, and as I said, the
> machines are not very busy during this time.
>
> As this is our test cluster, I am sure I am the only one running jobs
> on it. Another issue is that we are currently only able to run one job
> at a time, but that shouldn't be the topic of this request. We have no
> continuous input stream to this cluster either; we copied a bunch of
> data to it some time ago.
> From my perspective, the 464 GB of input data you mentioned is the
> uncompressed size of the 160 GB of compressed files, which is what I
> get when I run `hadoop fs -dus` on that folder.
>
> Another fact that may interest you: we're running the Cloudera CDH3
> Update 3 version on our systems.
>
> I suspect this could be due to some fancy Avro schema validation
> implicitly executed by the Avro storage. If so, can this be avoided?
>
> Sadly, I'm currently not able to provide you the actual script, as it
> contains confidential information, but I will try to provide a version
> as soon as possible. In any case, I'd rather suspect a configuration
> problem in Hadoop or Pig, as the script works fine with a smaller
> amount of input data.
>
> I would ask the Hadoop mailing list if this issue occurred during the
> actual MapReduce run, but as it occurs even before a single MapReduce
> job is launched, I suspect Pig has the problem.
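The schema-validation suspicion above can be probed without the confidential script. Assuming piggybank's AvroStorage is the loader in use (its option support varies by version, so treat this as a sketch, not a confirmed fix), the `'no_schema_check'` option skips cross-file schema validation, and Pig's split-combination properties reduce planning work over many small files:

```pig
-- Sketch only: jar path, input path, and option availability are
-- assumptions to be checked against the installed piggybank version.
REGISTER /path/to/piggybank.jar;

-- 'no_schema_check' skips AvroStorage's per-file schema validation,
-- which can be costly across tens of thousands of input files.
raw = LOAD '/data/avro/input'
      USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');

-- Combining many small files into fewer splits also shortens planning:
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 268435456;  -- 256 MB per split, illustrative
```

Timing a run of such a stripped-down load on the same input would show whether the half-hour gap in the log below comes from schema checking, from listing 50353 files, or from something else.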
> Thanks
> Markus
>
> This is the full log until the main job starts:
>
> mapred@ournamenode$ pig OurScript.pig
> 2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1338384441037.log
> 2012-05-30 15:27:21,368 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://OurNamenode:9000
> 2012-05-30 15:27:21,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dev-jobtracker001.eu-fra.adtech.com:54311
> 2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> 2012-05-30 15:57:27,816 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-05-30 16:06:55,304 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> 2012-05-30 16:06:55,308 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for dataImport: $2, $3, $4
> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData1: Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521 Operator Key: scope-521)
> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) - scope-524 Operator Key: scope-524)
> 2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) - scope-483 Operator Key: scope-483)
> 2012-05-30 16:06:55,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-05-30 16:06:55,467 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-05-30 16:06:55,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2012-05-30 16:06:55,483 [main] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
> 2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: 1
> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
> 2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
> 2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
> 2012-05-30 16:06:56,131 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-05-30 16:06:56,138 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:00,432 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-05-30 16:07:00,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
> 2012-05-30 16:07:00,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2012-05-30 16:07:01,462 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
> 2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
> 2012-05-30 16:07:12,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
> 2012-05-30 16:07:13,929 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
> 2012-05-30 16:07:16,969 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-05-30 16:07:16,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-05-30 16:07:26,888 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> 2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
> 2012-05-30 16:07:30,947 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2012-05-30 16:15:38,022 [Thread-16] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 50353
> 2012-05-30 16:15:54,421 [Thread-16] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6899
> 2012-05-30 16:15:57,844 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179
>
> On Thursday, 31 May 2012 at 02:57 -0700, Prashant Kommireddi wrote:
> > Hi Markus,
> >
> > Have you checked the JobTracker at the time of launching the job to see
> > whether map slots were available?
> >
> > Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> > jobs are running fine, there should be no reason a larger dataset should
> > be stuck, at least not on the Pig side.
> > I can't think of a good reason why the job does not take off, other
> > than the cluster being busy running some other job.
> >
> > I see that the number of files being processed is large: 50353. That
> > could be a reason for slowness, but the ~8 minutes shown in the logs
> > seems to be on the high end for that.
> >
> > Maybe also post your script here.
> >
> > On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > when we're running a Pig job to aggregate some amount of slightly
> > > compressed Avro data (~160 GB), it takes ages until the first actual
> > > MapReduce job starts:
> > > 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to:
> > > [...]
> > > 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script:
> > > [...]
> > > 16:07:00,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> > > [...]
> > > 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> > > [...]
> > > 16:15:38,022 [Thread-16] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 50353
> > >
> > > These log messages are from our test cluster, which has a dedicated
> > > jobtracker and namenode and 5 data nodes, with a map task capacity of
> > > 15 and a reduce task capacity of 10. There were 6899 map tasks and
> > > 464 reduce tasks set up.
> > >
> > > During the initialisation phase we observed the load and memory usage
> > > of the jobtracker, the namenode and some data nodes using top. They
> > > were nearly idle almost the whole time (e.g. 30% CPU load on the
> > > namenode, total idle on the data nodes). When the jobs were running,
> > > most of the tasks were in "waiting for IO" most of the time. Some
> > > swap space seemed to be reserved but was rarely used during those
> > > periods.
> > >
> > > To us it looks like a Hadoop config issue, but we have no idea what
> > > exactly it could be. Jobs with about 10 GB of input data were running
> > > quite well.
> > >
> > > Any hint on where to tweak will be appreciated.
> > >
> > > Thanks
> > > Markus
>
> --
> Markus Resch
> Software Developer
> P: +49 6103-5715-236 | F: +49 6103-5715-111
> ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
> www.adtech.com
>
> ADTECH | A Division of Advertising.com Group - Residence of the Company:
> Dreieich, Germany - Registration Office: Offenbach, HRB 46021
> Management Board: Erhard Neumann, Mark Thielen
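For reference, the reducer count in the logs above follows Pig's size-based estimate: reducers = min(maxReducers, ceil(totalInputFileSize / bytesPerReducer)) = min(999, ceil(463325937621 / 1000000000)) = 464, matching the "Setting number of reducers to 464" line. When that default is unsuitable, it can be tuned or bypassed; a sketch (property names as in Pig 0.8-era releases, values illustrative, to be checked against the installed version):

```pig
-- Tune the estimate itself:
SET pig.exec.reducers.bytes.per.reducer 2000000000;  -- 2 GB per reducer, illustrative
SET pig.exec.reducers.max 999;

-- ...or bypass it with an explicit parallelism, script-wide:
SET default_parallel 50;  -- illustrative value

-- ...or per operator, e.g.:
-- grouped = GROUP data BY key PARALLEL 50;
```

Note that a cluster with 5 data nodes and a reduce task capacity of 10 can only run a handful of the 464 reducers at once, so an explicit, smaller parallelism may be the more realistic setting here; the reducer count does not explain the pre-launch delay, though, which happens before any task runs.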
