Thank you Andre. Setting "giraph.useSuperstepCounters" = false solved my issue. The job still hung at 100% but eventually completed successfully.

-Bence

-----Original Message-----
From: André Kelpe [mailto:[email protected]]
Sent: Wednesday, November 28, 2012 10:45 AM
To: [email protected]
Subject: Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Hi Bence,

On older versions of Hadoop there is a hard limit on counters, which a job cannot modify. Since the counters are not crucial for the functioning of Giraph, you can turn them off by setting giraph.useSuperstepCounters to false in your job config.

I would also recommend looking into the GiraphConfiguration class, as it contains all the settings you might be interested in (checkpoint frequency, etc.):
https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java

HTH

-Andre
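For example, it should be possible to pass the setting as an extra -D option on the same command line quoted below (a sketch only, not verified against this Giraph build; it assumes the giraph launcher forwards -D options into the job configuration the same way it already does for SimpleShortestPathsVertex.sourceId):

# Same command as in the original mail, with the counter setting added.
./giraph -Dgiraph.useSuperstepCounters=false \
         -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar \
         org.apache.giraph.examples.SimpleShortestPathsVertex \
         -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat \
         -ip /user/hduser/insight \
         -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat \
         -op /user/hduser/insight-out -w 3

Setting it programmatically on the job's Hadoop Configuration, e.g. conf.setBoolean("giraph.useSuperstepCounters", false), should have the same effect.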
2012/11/28 Magyar, Bence (US SSA) <[email protected]>:
> I have successfully run the shortest path example using Avery's sample
> input data. I am now attempting to run the shortest-path algorithm on a
> much larger data set (300,000 nodes) and I am running into errors. I have
> a 4-node cluster and am running the following command:
>
> ./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar
>   org.apache.giraph.examples.SimpleShortestPathsVertex
>   -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat
>   -ip /user/hduser/insight
>   -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat
>   -op /user/hduser/insight-out -w 3
>
> It appears as though the shortest path computation "finishes". That is to
> say, I hit "100%". Then the job just hangs for about 30 seconds, decreases
> its progress to 75%, and then finally throws an exception:
>
> No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
> 12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
> 12/11/28 08:26:17 INFO mapred.JobClient: map 0% reduce 0%
> 12/11/28 08:26:33 INFO mapred.JobClient: map 25% reduce 0%
> 12/11/28 08:26:40 INFO mapred.JobClient: map 50% reduce 0%
> 12/11/28 08:26:42 INFO mapred.JobClient: map 75% reduce 0%
> 12/11/28 08:26:44 INFO mapred.JobClient: map 100% reduce 0%
> 12/11/28 08:27:45 INFO mapred.JobClient: map 75% reduce 0%
> 12/11/28 08:27:50 INFO mapred.JobClient: Task Id : attempt_201211271542_0004_m_000000_0, Status : FAILED
> java.lang.Throwable: Child Error
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> Digging into the log files a little deeper, I noticed that the last node
> in my cluster contains more log directories than the previous three.
>
> I see:
>
> · attempt_201211280843_0001_m_000000_0 ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0
> · attempt_201211280843_0001_m_000000_0.cleanup ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0.cleanup
> · attempt_201211280843_0001_m_000005_0 ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000005_0
> · job-acls.xml
>
> Whereas the first 3 nodes only contain 1 log folder underneath the job,
> something like "attempt_201211280843_0001_m_000003_0". I am assuming this
> is because something went wrong on node 4 and some "cleanup logic" was
> attempted.
>
> At any rate, when I cd into the first log folder on the bad node
> (attempt_201211280843_0001_m_000000_0) and look into "syslog", I see the
> following error:
>
> 2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster: barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
> 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: collectAndProcessAggregatorValues: Processed aggregators
> 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: aggregateWorkerStats: Aggregation found (vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on superstep = 98
> 2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Cleaning up old Superstep /_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
> 2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread: masterThread: Coordination of superstep 98 took 0.445 seconds ended with state THIS_SUPERSTEP_DONE and is now on superstep 99
> 2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of counters - Counters=120 Limit=120, exiting...
> org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded limits on number of counters - Counters=120 Limit=120
>         at org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
>         at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
>         at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
>         at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
>         at org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
>         at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
> 2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
>
> What exactly is this limit on MapReduce job "counters"? What is a
> MapReduce job "counter"? I assume it is some variable threshold to keep
> things in check, and I know that I can modify the value in mapred-site.xml:
>
> <property>
>   <name>mapreduce.job.counters.limit</name>
>   <value>120</value>
>   <description>I have no idea what this does!!!</description>
> </property>
>
> I have tried increasing and decreasing this value, and my subsequent jobs
> pick up the change. However, neither increasing nor decreasing it seems to
> make any difference: I always reach whatever limit I've set and my job
> crashes. Besides, from outward appearances it looks like the computation
> finished before the crash. Can anyone please give deeper insight into what
> is happening here, or where I can look for more help?
>
> Thanks,
>
> Bence
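On the question of what a MapReduce "counter" is: it is a named long value that a task increments through its task context and that the framework aggregates across all tasks of a job; the TaskInputOutputContext.getCounter call in the stack trace above is exactly that mechanism. A minimal illustration with a hypothetical mapper (the class and counter names below are made up for the example and are not part of Giraph or the job above):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that only exists to show how counters are used.
public class CounterDemoMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each distinct (group, name) pair is one counter; the limit in
        // mapreduce.job.counters.limit caps how many distinct counters a job
        // may create, not how high any single counter can count.
        context.getCounter("DemoCounters", "LINES_SEEN").increment(1);
        if (value.getLength() == 0) {
            context.getCounter("DemoCounters", "EMPTY_LINES").increment(1);
        }
    }
}

Giraph's superstep counters add new counter names as the job advances through supersteps, which is presumably why this job hits the 120-counter limit around superstep 99, why raising the limit only moves the failure to a later superstep, and why disabling giraph.useSuperstepCounters avoids it entirely.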
