Hi Prashant, hi Thejas,

thanks for your very quick answers. No, this is not a typo: those timestamps are real, and, as I said, the machines are not very busy during this time.
As this is our test cluster, I am sure I am the only one running jobs on it. Another issue we have is that we are currently only able to run one job at a time, but that shouldn't be the topic of this request. We also have no continuous input stream to that cluster; we just copied a bunch of data to it some time ago. From my perspective, the 464 GB of input data you mentioned is the uncompressed amount of the 160 GByte of compressed files, which is what I get when I use hadoop fs -dus on that folder. Another fact that could be interesting for you: we're running the Cloudera CDH3 Update 3 release on our systems.

I suspect this could be due to some fancy Avro schema validation implicitly executed by AvroStorage? If so, can this be avoided? Sadly, I'm currently not able to provide you with the actual script, as it contains confidential information, but I will try to provide a sanitized version as soon as possible. In any case, I rather suspect a configuration problem in Hadoop or Pig, since the script works fine with a smaller amount of input data. I would ask the Hadoop mailing list if this issue occurred during the actual MapReduce run, but since it occurs even before a single MapReduce job is launched, I suspect Pig has the problem.

Thanks
Markus

This is the full log until the main work job starts:

mapred@ournamenode$ pig OurScript.pig
2012-05-30 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1338384441037.log
2012-05-30 15:27:21,368 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://OurNamenode:9000
2012-05-30 15:27:21,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dev-jobtracker001.eu-fra.adtech.com:54311
2012-05-30 15:57:27,814 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
2012-05-30 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
2012-05-30 15:57:27,816 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-05-30 16:06:55,304 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
2012-05-30 16:06:55,308 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for dataImport: $2, $3, $4
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData1: Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521 Operator Key: scope-521)
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) - scope-524 Operator Key: scope-524)
2012-05-30 16:06:55,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: OutputData2: Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) - scope-483 Operator Key: scope-483)
2012-05-30 16:06:55,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-05-30 16:06:55,467 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-05-30 16:06:55,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-05-30 16:06:55,483 [main] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2012-05-30 16:06:55,484 [main] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-05-30 16:06:55,486 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: 1
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 7
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,516 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 0 out of total 2 MR operators.
2012-05-30 16:06:55,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
2012-05-30 16:06:56,131 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-05-30 16:06:56,138 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:00,432 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:00,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission
2012-05-30 16:07:00,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-05-30 16:07:01,460 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-05-30 16:07:01,462 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0178
2012-05-30 16:07:02,353 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ourJobtracker:50030/jobdetails.jsp?jobid=job_201205080931_0178
2012-05-30 16:07:12,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2012-05-30 16:07:13,929 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
2012-05-30 16:07:16,969 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-05-30 16:07:16,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-05-30 16:07:26,888 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
2012-05-30 16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 464
2012-05-30 16:07:30,947 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-05-30 16:15:38,022 [Thread-16] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 50353
2012-05-30 16:15:54,421 [Thread-16] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6899
2012-05-30 16:15:57,844 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201205080931_0179

On Thursday, 31.05.2012, at 02:57 -0700, Prashant Kommireddi wrote:
> Hi Markus,
>
> Have you checked the JobTracker at the time of launching the job that Map
> slots were available?
>
> Looks like the input dataset size is ~464 GB. Since you mentioned 10 GB
> jobs are running fine, there should be no reason a larger dataset should be
> stuck, at least not on the Pig side. I can't think of a good reason why the job
> does not take off other than the fact that the cluster was busy running some
> other job.
>
> I see that the number of files being processed is large, 50353. That could
> be a reason for slowness, but ~8 minutes as shown in the logs seems to be
> on the higher end for that.
>
> Maybe also post your script here.
>
> On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[email protected]> wrote:
>
> > Hi all,
> >
> > when we're running a Pig job to aggregate some amount of slightly
> > compressed Avro data (~160 GByte), it takes ages until the first actual
> > MapReduce job starts:
> > 15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to:
> > [...]
> > 15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState -
> > Pig features used in the script:
> > [...]
> > 16:07:00,969 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 0% complete
> > [...]
> > 16:07:30,886 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> > - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> > [...]
> > 16:15:38,022 [Thread-16] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> > paths to process : 50353
> >
> > These log messages are from our test cluster, which has a dedicated
> > jobtracker, a dedicated namenode, and 5 data nodes with a map task
> > capacity of 15 and a reduce task capacity of 10. There were 6899 map
> > tasks and 464 reduce tasks set up.
> >
> > During the initialisation phase we observed the workload and memory
> > usage of the jobtracker, the namenode, and some data nodes using top.
> > They were nearly idle almost the whole time (e.g. 30% CPU load on the
> > namenode, totally idle on the data nodes). When the jobs were running,
> > most of the tasks were "waiting for IO" most of the time. Some swap
> > space seemed to be reserved, but it was rarely used during those times.
> >
> > In our eyes it looks like a Hadoop config issue, but we have no idea
> > what exactly it could be.
> > Jobs with about 10 GBytes of input data were running quite well.
> >
> > Any hint on where to tweak would be appreciated.
> >
> > Thanks
> > Markus

--
Markus Resch
Software Developer
P: +49 6103-5715-236 | F: +49 6103-5715-111
ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany
www.adtech.com
ADTECH | A Division of Advertising.com Group - Residence of the Company: Dreieich, Germany - Registration Office: Offenbach, HRB 46021
Management Board: Erhard Neumann, Mark Thielen
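[Editor's note] For anyone reproducing the size check Markus mentions: on a CDH3-era release the command is hadoop fs -dus (not hadoop -f dus). A minimal sketch, assuming a hypothetical input path /data/avro (substitute your own folder):

```shell
# Total size of the directory contents as stored in HDFS, i.e. the
# compressed on-disk bytes (CDH3-era syntax; newer Hadoop releases
# use `hadoop fs -du -s` instead).
hadoop fs -dus /data/avro

# Directory, file, and byte counts for the same path. Relevant here
# because split planning over ~50k input files is itself a plausible
# source of the multi-minute delay seen in the logs.
hadoop fs -count /data/avro
```

Note that -dus reports bytes as stored in HDFS, so a large gap between that number and Pig's totalInputFileSize is worth double-checking rather than assuming either one is the uncompressed size.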
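[Editor's note] As a side note on the numbers in the thread: the JobControlCompiler lines show how Pig estimates the reducer count when neither PARALLEL nor a default parallelism is set. A minimal sketch of that arithmetic, using only the values printed in the log (the rounding-and-cap behavior is an assumption based on those log lines, not taken from this cluster's config):

```python
import math

# Values taken verbatim from the JobControlCompiler log output above.
bytes_per_reducer = 1_000_000_000   # BytesPerReducer
max_reducers = 999                  # maxReducers
total_input_size = 463_325_937_621  # totalInputFileSize

# Round up to whole reducers, then cap at the maximum.
reducers = min(max_reducers, math.ceil(total_input_size / bytes_per_reducer))
print(reducers)  # 464, matching "Setting number of reducers to 464"
```

This also explains why the 464 GB figure reappears as the reducer count: it is simply the input size divided by 1 GB per reducer, rounded up.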
