Hi all,

when we run a Pig job to aggregate a fair amount of lightly
compressed Avro data (~160 GB), it takes ages until the first actual
MapReduce job starts:
15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
to:
[...]
15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
Pig features used in the script: 
[...]
16:07:00,969 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
[...]
16:07:30,886 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
[...]
16:15:38,022 [Thread-16] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 50353

These log messages are from our test cluster, which has a dedicated
jobtracker, a dedicated namenode, and 5 data nodes with a map task
capacity of 15 and a reduce task capacity of 10. 6899 map tasks and
464 reduce tasks were set up.

During the initialisation phase we observed the workload and memory
usage of the jobtracker, the namenode and some data nodes using top.
They were mostly bored nearly the whole time (e.g. 30% CPU load on the
namenode, totally idle on the data nodes). While the jobs were running,
most of the tasks were in "waiting for IO" state most of the time. Some
swap space seemed to be reserved but was rarely used.

To us it looks like a Hadoop configuration issue, but we have no idea
what exactly it could be. Jobs with about 10 GB of input data run
quite well.
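For context, since the log shows 50353 input paths being processed, one knob we were wondering about is Pig's split combination, which merges many small input files into fewer, larger splits. This is only a sketch of what we might set in the script (assuming these properties are supported by our Pig version; the 1 GB size is just a guess on our side, not something we have tested):

```
-- hypothetical tweak, untested on our cluster:
-- combine small input files into fewer splits so fewer map tasks are created
set pig.splitCombination true;
set pig.maxCombinedSplitSize 1000000000;  -- target ~1 GB per combined split
```

If someone knows whether this would also help with the long setup phase (and not just with the number of map tasks), that would be great to hear.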

Any hint on where to tweak would be appreciated.

Thanks
Markus
