Hi Prashant, hi Thejas,

thanks for your very quick answers. No, this is not a typo: those timestamps are real, and, as I said, the machines are not very busy during this time.
As this is our test cluster, I am sure I am the only one running jobs on it. Another issue we have is that we are currently only able to run one job at a time, but that shouldn't be the topic of this request. We also have no continuous input stream to that cluster; we just copied a bunch of data to it some time ago. From my perspective, the 464 GB of input data you mentioned is the uncompressed amount of the 160 GByte of compressed files, which is what I get when I use hadoop fs -dus on that folder. Another fact that could be interesting for you: we're running the Cloudera CDH3 Update 3 release on our systems.

I suspect this could be due to some fancy Avro schema validation implicitly executed by AvroStorage? If so, can this be avoided? Sadly, I'm currently not able to provide you with the actual script, as it contains confidential information, but I will try to provide a sanitized version as soon as possible. In any case, I rather suspect a configuration problem in Hadoop or Pig, since the script works fine with a smaller amount of input data. I would ask the Hadoop mailing list if this issue occurred during the actual MapReduce run, but since it occurs even before a single MapReduce job is launched, I suspect Pig has the problem.

Thanks
Markus

This is the full log until the main work job starts:

mapred@ournamenode$ pig OurScript.pig
2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1338384441037.log
2012-05-30 15:27:21,368 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://OurNamenode:9000
2012-05-30 15:27:21,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dev-jobtracker001.eu-fra.adtech.com:54311
2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
2012-05-30 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
2012-05-30 15:57:27,816 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-05-30 16:06:55,304 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
2012-05-30 16:06:55,308 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for dataImport: $2, $3, $4
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData1: Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521 Operator Key: scope-521)
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) - scope-524 Operator Key: scope-524)
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) - scope-483 Operator Key: scope-483)
2012-05-30 16:06:55,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-05-30 16:06:55,467 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-05-30 16:06:55,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-05-30 16:06:55,483 [main] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: 1
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
2012-05-30 16:06:56,131 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-05-30 16:06:56,138 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:00,432 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:00,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
2012-05-30 16:07:00,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-05-30 16:07:01,462 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
2012-05-30 16:07:12,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2012-05-30 16:07:13,929 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
2012-05-30 16:07:16,969 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-05-30 16:07:16,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:26,888 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
2012-05-30 16:07:30,947 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-05-30 16:15:38,022 [Thread-16] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 50353
2012-05-30 16:15:54,421 [Thread-16] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6899
2012-05-30 16:15:57,844 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179

On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
> Hi Markus,
>
> Have you checked the JobTracker at the time of launching the job that Map
> slots were available?
>
> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> jobs are running fine, there should be no reason a larger dataset should be
> stuck, at least not on the Pig side. I can't think of a good reason why the job
> does not take off other than the fact that the cluster was busy running some
> other job.
>
> I see that the number of files being processed is large, 50353. That could
> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> on the higher end for that.
>
> Maybe also post your script here.
>
> On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[email protected]> wrote:
>
> > Hi all,
> >
> > when we're running a Pig job to aggregate some amount of slightly
> > compressed Avro data (~160 GByte), it takes ages until the first actual
> > MapReduce job starts:
> > 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to:
> > [...]
> > 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState -
> > Pig features used in the script:
> > [...]
> > 16:07:00,969 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 0% complete
> > [...]
> > 16:07:30,886 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> > - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> > [...]
> > 16:15:38,022 [Thread-16] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 50353
> >
> > These log messages are from our test cluster, which has a dedicated
> > jobtracker, a dedicated namenode, and 5 data nodes with a map task
> > capacity of 15 and a reduce task capacity of 10. There were 6899 map
> > tasks and 464 reduce tasks set up.
> >
> > During the initialisation phase we observed the workload and memory
> > usage of the jobtracker, the namenode, and some data nodes using top.
> > They were nearly idle almost the whole time (e.g. 30% CPU load on the
> > namenode, totally idle on the data nodes). When the jobs were running,
> > most of the tasks were "waiting for IO" most of the time. Some swap
> > space seemed to be reserved, but it was rarely used during those times.
> >
> > In our eyes it looks like a Hadoop config issue, but we have no idea
> > what exactly it could be.
> > Jobs with about 10 GBytes of input data were running quite well.
> >
> > Any hint on where to tweak would be appreciated.
> >
> > Thanks
> > Markus

--
Markus Resch
Software Developer
P: +49 6103-5715-236 | F: +49 6103-5715-111
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com
ADTECH | A Division of Advertising.com Group - Residence of the Company: Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen
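[Editor's note] For anyone reproducing the size check Markus mentions: on a CDH3-era release the command is hadoop fs -dus (not hadoop -f dus). A minimal sketch, assuming a hypothetical input path /data/avro (substitute your own folder):

```shell
# Total size of the directory contents as stored in HDFS, i.e. the
# compressed on-disk bytes (CDH3-era syntax; newer Hadoop releases
# use `hadoop fs -du -s` instead).
hadoop fs -dus /data/avro

# Directory, file, and byte counts for the same path. Relevant here
# because split planning over ~50k input files is itself a plausible
# source of the multi-minute delay seen in the logs.
hadoop fs -count /data/avro
```

Note that -dus reports bytes as stored in HDFS, so a large gap between that number and Pig's totalInputFileSize is worth double-checking rather than assuming either one is the uncompressed size.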
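[Editor's note] As a side note on the numbers in the thread: the JobControlCompiler lines show how Pig estimates the reducer count when neither PARALLEL nor a default parallelism is set. A minimal sketch of that arithmetic, using only the values printed in the log (the rounding-and-cap behavior is an assumption based on those log lines, not taken from this cluster's config):

```python
import math

# Values taken verbatim from the JobControlCompiler log output above.
bytes_per_reducer = 1_000_000_000   # BytesPerReducer
max_reducers = 999                  # maxReducers
total_input_size = 463_325_937_621  # totalInputFileSize

# Round up to whole reducers, then cap at the maximum.
reducers = min(max_reducers, math.ceil(total_input_size / bytes_per_reducer))
print(reducers)  # 464, matching "Setting number of reducers to 464"
```

This also explains why the 464 GB figure reappears as the reducer count: it is simply the input size divided by 1 GB per reducer, rounded up.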
