I have successfully run the shortest-paths example using Avery's sample input
data. I am now attempting to run the same algorithm on a much larger
data set (300,000 nodes) and I am running into errors. I have a 4-node cluster
and am running the following command:
./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar
org.apache.giraph.examples.SimpleShortestPathsVertex -if
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip
/user/hduser/insight -of
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op
/user/hduser/insight-out -w 3
It appears as though the shortest-paths computation "finishes"; that is, I hit
"100%". Then the job hangs for about 30 seconds, its progress drops back to
75%, and it finally throws an exception:
No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
12/11/28 08:26:17 INFO mapred.JobClient: map 0% reduce 0%
12/11/28 08:26:33 INFO mapred.JobClient: map 25% reduce 0%
12/11/28 08:26:40 INFO mapred.JobClient: map 50% reduce 0%
12/11/28 08:26:42 INFO mapred.JobClient: map 75% reduce 0%
12/11/28 08:26:44 INFO mapred.JobClient: map 100% reduce 0%
12/11/28 08:27:45 INFO mapred.JobClient: map 75% reduce 0%
12/11/28 08:27:50 INFO mapred.JobClient: Task Id :
attempt_201211271542_0004_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
Digging a little deeper into the log files, I noticed that the last node in my
cluster generated more log directories than the previous three.
I see:
* attempt_201211280843_0001_m_000000_0 ->
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0
* attempt_201211280843_0001_m_000000_0.cleanup ->
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0.cleanup
* attempt_201211280843_0001_m_000005_0 ->
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000005_0
* job-acls.xml
The first three nodes, by contrast, contain only one log folder under the job,
something like "attempt_201211280843_0001_m_000003_0". I am assuming this is
because something went wrong on node 4 and some "cleanup logic" was attempted.
At any rate, when I cd into the first log folder on the bad node
(attempt_201211280843_0001_m_000000_0) and look at "syslog", I see the
following error:
2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster:
barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
collectAndProcessAggregatorValues: Processed aggregators
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
aggregateWorkerStats: Aggregation found
(vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on
superstep = 98
2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster:
coordinateSuperstep: Cleaning up old Superstep
/_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread:
masterThread: Coordination of superstep 98 took 0.445 seconds ended with state
THIS_SUPERSTEP_DONE and is now on superstep 99
2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper:
uncaughtException: OverrideExceptionHandler on thread
org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of
counters - Counters=120 Limit=120, exiting...
org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded
limits on number of counters - Counters=120 Limit=120
at
org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
at
org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager:
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
What exactly is this limit on MapReduce job "counters", and what is a MapReduce
job "counter" in the first place? I assume it is some variable threshold to keep
things in check, and I know that I can modify the value in mapred-site.xml (my
rough picture of what a counter actually is follows the snippet below):
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120</value>
  <description>I have no idea what this does!!!</description>
</property>
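
For what it's worth, my current working picture of a counter is just a named
long that each task increments through its task context and that the framework
aggregates across the whole job. Something like the sketch below, using the
org.apache.hadoop.mapreduce API -- this is only my understanding, and the
"MyStats"/"RECORDS_SEEN" names are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: a mapper that bumps a made-up counter once per record.
public class CounterSketchMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each named counter is a job-wide long that the framework aggregates
    // across tasks and reports in the JobTracker UI.
    context.getCounter("MyStats", "RECORDS_SEEN").increment(1);
    context.write(key, value);
  }
}

If that picture is roughly right, I still don't understand why the number of
counters in my Giraph job keeps growing until it hits whatever limit I
configure.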
I have tried both increasing and decreasing this value, and my subsequent jobs
pick up the change, but it makes no difference: I always hit whatever limit
I've set and the job crashes. Besides, from outward appearances it looks like
the computation finished before the crash. Can anyone give deeper insight into
what is happening here, or point me to where I can look for more help?
Thanks,
Bence