I have successfully run the shortest path example using Avery's sample input 
data.  I am now attempting to run the shortest-path algorithm on a much larger 
data set (300,000 nodes) and I am running into errors.  I have a 4-node cluster 
and am running the following command:


./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex -if 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip 
/user/hduser/insight -of 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op 
/user/hduser/insight-out -w 3


It appears as though the shortest path computation "finishes".  That is to say, 
I hit "100%".  Then the job just hangs for about 30 seconds, decreases its 
progress to 75%, and then finally throws an exception:

No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
12/11/28 08:26:17 INFO mapred.JobClient:  map 0% reduce 0%
12/11/28 08:26:33 INFO mapred.JobClient:  map 25% reduce 0%
12/11/28 08:26:40 INFO mapred.JobClient:  map 50% reduce 0%
12/11/28 08:26:42 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:26:44 INFO mapred.JobClient:  map 100% reduce 0%
12/11/28 08:27:45 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:27:50 INFO mapred.JobClient: Task Id : 
attempt_201211271542_0004_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


Digging into the log files a little deeper, I noticed that the last node in my 
cluster generated more log directories under the job than the previous three.

I see:


*        attempt_201211280843_0001_m_000000_0 -> 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0

*        attempt_201211280843_0001_m_000000_0.cleanup -> 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0.cleanup

*        attempt_201211280843_0001_m_000005_0 -> 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000005_0

*        job-acls.xml

By contrast, the first 3 nodes each contain only 1 log folder underneath the 
job, something like "attempt_201211280843_0001_m_000003_0".  I am assuming this 
is because something went wrong on node 4 and some "cleanup logic" was 
attempted.

At any rate, when I cd into the first log folder on the bad node 
(attempt_201211280843_0001_m_000000_0) and look at its "syslog", I see the 
following error:


2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster: 
barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: 
collectAndProcessAggregatorValues: Processed aggregators
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: 
aggregateWorkerStats: Aggregation found 
(vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on 
superstep = 98
2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster: 
coordinateSuperstep: Cleaning up old Superstep 
/_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread: 
masterThread: Coordination of superstep 98 took 0.445 seconds ended with state 
THIS_SUPERSTEP_DONE and is now on superstep 99
2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper: 
uncaughtException: OverrideExceptionHandler on thread 
org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of 
counters - Counters=120 Limit=120, exiting...
org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded 
limits on number of counters - Counters=120 Limit=120
        at 
org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
        at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
        at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
        at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
        at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
        at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.


What exactly is a MapReduce job "counter", and what is this limit on them?  I 
assume the limit is some configurable threshold to keep things in check, and I 
know that I can modify the value in mapred-site.xml:

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120</value>
  <description>I have no idea what this does!!!</description>
</property>
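My working hypothesis is that the job keeps registering new counters as the 
computation proceeds (say, one per superstep), so any fixed limit eventually 
gets exhausted on a long-running job.  Here is a toy sketch of that idea -- 
plain Java, not Hadoop code, and the per-superstep counter name is just my 
guess:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of my hypothesis, NOT Hadoop code: if the framework creates one
// new counter per superstep, a fixed counter limit is always exceeded once
// the superstep count is high enough, no matter what the limit is set to.
public class CounterLimitSketch {

    private final int limit;                       // mapreduce.job.counters.limit
    private final Set<String> counters = new HashSet<>();

    CounterLimitSketch(int limit) {
        this.limit = limit;
    }

    // Mimics a counter lookup that creates the counter on first use.
    void findCounter(String name) {
        if (!counters.contains(name) && counters.size() >= limit) {
            throw new IllegalStateException("Error: Exceeded limits on number"
                    + " of counters - Counters=" + counters.size()
                    + " Limit=" + limit);
        }
        counters.add(name);
    }

    // Returns the superstep at which a job with the given limit would die.
    static int superstepAtCrash(int limit) {
        CounterLimitSketch job = new CounterLimitSketch(limit);
        int superstep = 0;
        try {
            while (true) {
                // Hypothetical per-superstep counter name.
                job.findCounter("Superstep " + superstep + " (milliseconds)");
                superstep++;
            }
        } catch (IllegalStateException e) {
            return superstep;
        }
    }

    public static void main(String[] args) {
        // Raising the limit just delays the crash; it never prevents it.
        System.out.println(superstepAtCrash(120));   // prints 120
        System.out.println(superstepAtCrash(240));   // prints 240
    }
}
```

If that is roughly what is happening, then raising 
mapreduce.job.counters.limit only buys more supersteps before the crash, which 
would match what I'm seeing below.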

I have tried both increasing and decreasing this value, and my subsequent jobs 
pick up the change.  However, neither direction makes any difference: I always 
reach whatever limit I've set, and my job crashes.  Besides, from outward 
appearances it looks like the computation finished before the crash.  Can 
anyone please give deeper insight into what is happening here, or point me to 
where I can look for more help?

Thanks,

Bence
