Hi Prashant, Hi Thejas,
thanks for your very quick answer.
No, this is not a typo. Those timestamps are real, and as I said, the
machines are not very busy during that time.
As this is our test cluster, I am sure I am the only one running jobs
on it. (Another issue we have is that we are currently only able to run
one job at a time, but that shouldn't be the topic of this request.)
There is also no continuous input stream to that cluster; we just
copied a bunch of data to it some time ago.
From my perspective, the 464 GB of input data you mentioned is the
uncompressed size of the 160 GB of compressed files; that is the figure
I get when I run hadoop fs -dus on that folder.
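For what it's worth, the two figures are consistent with a plausible Avro compression ratio. A quick sketch using only numbers from this thread (160 GB is taken as 160 * 2**30 bytes, which is an approximation):

```python
# Figures from this thread: Pig's totalInputFileSize from the log below,
# versus the ~160 GB of compressed Avro files (approximated as 160 GiB).
uncompressed = 463325937621   # bytes, from the JobControlCompiler log line
compressed = 160 * 2**30      # bytes, approximate

ratio = uncompressed / compressed
print("compression ratio: %.1fx" % ratio)  # → compression ratio: 2.7x
```

A ~2.7x ratio is unremarkable for lightly compressed Avro, so the 464 GB number itself does not look wrong.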
Another fact that could be interesting for you: we are running Cloudera
CDH3 Update 3 on our systems.
Could this be caused by some Avro schema validation implicitly executed
by AvroStorage? If so, can it be avoided?
Sadly, I am currently not able to provide the actual script, as it
contains confidential information, but I will try to provide a
sanitized version as soon as possible. In any case, I rather suspect a
configuration problem in Hadoop or Pig, since the script works fine
with a smaller amount of input data.
I would ask the Hadoop mailing list if this issue occurred during the
actual MapReduce run, but since it occurs even before a single
MapReduce job is launched, I suspect the problem is in Pig.
Thanks
Markus
Here is the full log up to the point where the main job starts:
mapred@ournamenode$ pig OurScript.pig
2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error
messages to: /tmp/pig_1338384441037.log
2012-05-30 15:27:21,368 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://OurNamenode:9000
2012-05-30 15:27:21,609 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to map-reduce job tracker at:
dev-jobtracker001.eu-fra.adtech.com:54311
2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer -
Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
2012-05-30 15:57:27,816 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
2012-05-30 15:57:27,816 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-05-30 16:06:55,304 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
2012-05-30 16:06:55,308 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for dataImport: $2, $3, $4
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData1:
Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
Operator Key: scope-521)
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData2:
Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
scope-524 Operator Key: scope-524)
2012-05-30 16:06:55,441 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
OutputData2:
Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
scope-483 Operator Key: scope-483)
2012-05-30 16:06:55,453 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- File concatenation threshold: 100 optimistic? false
2012-05-30 16:06:55,467 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2012-05-30 16:06:55,471 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1
2012-05-30 16:06:55,483 [main] WARN
org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
is available
2012-05-30 16:06:55,484 [main] INFO
org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop
library
2012-05-30 16:06:55,484 [main] INFO
org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library
loaded
2012-05-30 16:06:55,486 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 1
2012-05-30 16:06:55,486 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- number of input files: 1
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 7
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- Merged 2 map-only splittees.
2012-05-30 16:06:55,516 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- Merged 2 map-only splittees.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- Merged 0 out of total 2 MR operators.
2012-05-30 16:06:55,517 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 3
2012-05-30 16:06:56,131 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2012-05-30 16:06:56,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:00,432 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-05-30 16:07:00,468 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission
2012-05-30 16:07:00,969 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2012-05-30 16:07:01,460 [Thread-4] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2012-05-30 16:07:01,460 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1
2012-05-30 16:07:01,462 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 1
2012-05-30 16:07:02,353 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201205080931_0178
2012-05-30 16:07:02,353 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at:
http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
2012-05-30 16:07:12,421 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 16% complete
2012-05-30 16:07:13,929 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 33% complete
2012-05-30 16:07:16,969 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2012-05-30 16:07:16,970 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:26,888 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-05-30 16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
2012-05-30 16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Neither PARALLEL nor default parallelism is set for this job. Setting number
of reducers to 464
2012-05-30 16:07:30,947 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
2012-05-30 16:15:38,022 [Thread-16] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 50353
2012-05-30 16:15:54,421 [Thread-16] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 6899
2012-05-30 16:15:57,844 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201205080931_0179
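The big gaps are visible directly in the timestamps above. A small sketch that reproduces the gap arithmetic from the milestones in the log, plus Pig's reducer estimate (the bytes-per-reducer heuristic shown in the JobControlCompiler line):

```python
import math
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
t = lambda s: datetime.strptime(s, fmt)

# Milestones taken from the log above
start     = t("2012-05-30 15:27:21")  # pig starts, connects to HDFS
parsed    = t("2012-05-30 15:57:27")  # script parsed (first ScriptState line)
planned   = t("2012-05-30 16:06:55")  # logical plan done (ColumnPruneVisitor)
listing   = t("2012-05-30 16:07:30")  # second job: reducer estimate printed
submitted = t("2012-05-30 16:15:38")  # 50353 input paths finally listed

print("parse/validate:", parsed - start)      # ~30 minutes
print("planning      :", planned - parsed)    # ~9.5 minutes
print("input listing :", submitted - listing) # ~8 minutes

# Pig's reducer estimate, matching the JobControlCompiler log line
total_input = 463325937621
bytes_per_reducer = 1000000000
max_reducers = 999
reducers = min(max_reducers,
               int(math.ceil(total_input / float(bytes_per_reducer))))
print("estimated reducers:", reducers)  # → 464
```

So roughly 48 of the 48-plus minutes before job_201205080931_0179 are spent before any map task runs, with the parse/validate phase alone accounting for 30 minutes.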
On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
Hi Markus,
Have you checked the JobTracker at the time of launching the job to see
whether map slots were available?
Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
jobs are running fine, there should be no reason a larger dataset would
be stuck, at least not on the Pig side. I can't think of a good reason
why the job does not take off, other than the cluster being busy
running some other job.
I see that the number of files being processed is large: 50353. That
could be a reason for slowness, but ~8 minutes as shown in the logs
seems to be on the high end for that.
Maybe also post your script here.
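The file count does look like the classic small-files problem. A rough sketch, using only figures that appear in the log, of how small the average input file is and how aggressively Pig's split combination has to work:

```python
# All three figures come from the log in this thread.
total_bytes = 463325937621   # Pig's totalInputFileSize
num_files = 50353            # "Total input paths to process"
num_splits = 6899            # "Total input paths (combined) to process"

avg_mb = total_bytes / num_files / 2**20
print("average file size: %.1f MB" % avg_mb)               # → 8.8 MB
print("files per combined split: %.1f" % (num_files / num_splits))  # → 7.3
```

Listing 50353 files also means tens of thousands of namenode round trips during input-split computation, which would explain why the listing phase alone takes about 8 minutes.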
On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[email protected]> wrote:
Hi all,
when we run a Pig job that aggregates a moderate amount of lightly
compressed Avro data (~160 GB), it takes ages until the first actual
MapReduce job starts:
15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages
to:
[...]
15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState -
Pig features used in the script:
[...]
16:07:00,969 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
[...]
16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
[...]
16:15:38,022 [Thread-16] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 50353
These log messages are from our test cluster, which has a dedicated
jobtracker, a dedicated namenode, and 5 data nodes with a map task
capacity of 15 and a reduce task capacity of 10. 6899 map tasks and
464 reduce tasks were set up.
During the initialisation phase we observed the load and memory usage
of the jobtracker, the namenode, and some data nodes using top. They
were fairly idle almost the whole time (e.g. 30% CPU load on the
namenode, completely idle data nodes). Once the jobs were running, most
of the tasks were in "waiting for IO" state most of the time. Some swap
space seemed to be reserved but was rarely used during that period.
To us this looks like a Hadoop configuration issue, but we have no idea
what exactly it could be. Jobs with about 10 GB of input data run quite
well.
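For scale: if the capacities above are per data node (an assumption on my part; they could also be cluster totals), the task counts imply many execution waves even once the job is actually running:

```python
import math

# Task counts from the log; slot math ASSUMES the stated capacities
# (15 map, 10 reduce) are per data node, not cluster-wide totals.
map_tasks = 6899
reduce_tasks = 464
datanodes = 5
map_slots = datanodes * 15
reduce_slots = datanodes * 10

print("map waves   :", math.ceil(map_tasks / map_slots))        # → 92
print("reduce waves:", math.ceil(reduce_tasks / reduce_slots))  # → 10
```

That is separate from the pre-launch latency discussed here, but it suggests the 6899 small map tasks will also dominate the runtime once the job does start.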
Any hint on what to tweak would be appreciated.
Thanks
Markus