Hi all,

when we're running a Pig job that aggregates a moderate amount of slightly compressed Avro data (~160 GB), it takes ages until the first actual mapred job starts:

15:27:21,052 [main] INFO org.apache.pig.Main - Logging error messages to: [...]
15:57:27,816 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: [...]
16:07:00,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete [...]
16:07:30,886 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621 [...]
16:15:38,022 [Thread-16] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 50353
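As a sanity check on the JobControlCompiler line: to my understanding, Pig's default reducer estimate is one reducer per BytesPerReducer of input, capped at maxReducers, i.e. min(maxReducers, ceil(totalInputFileSize / BytesPerReducer)). Plugging in the logged values (the formula is my reading of Pig's default behaviour, not something from our logs):

```python
import math

# Values copied from the JobControlCompiler log line above
total_input_file_size = 463_325_937_621  # bytes
bytes_per_reducer = 1_000_000_000
max_reducers = 999

# Pig's default reducer estimate (as I understand it): one reducer per
# BytesPerReducer of input, capped at maxReducers
reducers = min(max_reducers, math.ceil(total_input_file_size / bytes_per_reducer))
print(reducers)  # 464
```

That matches the 464 reduce tasks that were actually set up, so the reducer count at least looks consistent with the input size.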
These log messages are from our test cluster, which has a dedicated jobtracker and a dedicated namenode plus 5 data nodes with a map task capacity of 15 and a reduce task capacity of 10. The job was set up with 6899 map tasks and 464 reduce tasks.

During the initialisation phase we observed the workload and memory usage of the jobtracker, the namenode and some data nodes using top. They were nearly all the time rather bored (e.g. 30% CPU load on the namenode, totally idle on the data nodes). Once the jobs were running, most of the tasks were in "waiting for IO" most of the time. Some swap space seemed to be reserved during these phases but was rarely used.

To us it looks like a Hadoop config issue, but we have no idea what exactly it could be. Jobs with about 10 GB of input data were running quite well. Any hint where to tweak would be appreciated.

Thanks
Markus
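One knob that might be relevant here, although we have not verified it on our side: with 50353 input paths, Pig spends a lot of time listing files and computing splits, and its split combination settings control how small files are merged into fewer map tasks. A sketch, assuming Pig 0.8+; the property names are from the Pig documentation, and the 512 MB value is only an example, not a recommendation:

```
-- Sketch only; the size below is an arbitrary example value.
set pig.splitCombination true;          -- combine small input files into fewer splits
set pig.maxCombinedSplitSize 536870912; -- target at most ~512 MB of input per combined split
```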
