Hi Claudio... I turned checkpointing on and executed the Giraph job:
hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dmapred.job.map.memory.mb=1500 \
  -Dmapred.map.child.java.opts=-Xmx1G \
  -Dgiraph.useSuperstepCounters=false \
  -Dgiraph.useOutOfCoreMessages=true \
  -Dgiraph.checkpointFrequency=1 \
  org.apache.giraph.examples.MyShortestDistance \
  -vif org.apache.giraph.examples.io.formats.MyShortestDistanceVertexInputFormat \
  -vip /user/hduser/big_input/my_line_rank_input6.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/sp_output530/sd_output \
  -w 1 \
  -mc org.apache.giraph.examples.MyShortestDistance\$MyMasterCompute

14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
14/01/31 09:48:21 INFO job.GiraphJob: run: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201401310947_0001
14/01/31 09:49:24 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer kanha-Vostro-1014:22181 --zkNode /_hadoopBsp/job_201401310947_0001/_haltComputation'
14/01/31 09:49:24 INFO mapred.JobClient: Running job: job_201401310947_0001
14/01/31 09:49:25 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 09:59:15 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/_bsp/_checkpoints/job_201401310947_0001/4.kanha-Vostro-1014_1.metadata could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377)
    at org.apache.hadoop.ipc.Client.call(Client.java:1030)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3104)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2975)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
Task attempt_201401310947_0001_m_000001_0 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201401310947_0001_m_000001_0: log4j:WARN Please initialize the log4j system properly.
14/01/31 09:59:19 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 09:59:31 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 10:14:15 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 10:14:20 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
attempt_201401310947_0001_m_000000_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
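The first failure above ("could only be replicated to 0 nodes, instead of 1") is thrown while the worker writes its checkpoint metadata under /user/hduser/_bsp/_checkpoints, so it looks like an HDFS-side problem rather than a Giraph one. As far as I know, that message usually means the namenode could not find a live datanode with free space at that moment. A quick sanity check on a single-node setup (plain Hadoop/JDK commands; the df path is only my assumption that dfs.data.dir sits under hadoop.tmp.dir, which the task logs suggest is /app/hadoop/tmp) would be:

jps                       # the DataNode process should be listed
hadoop dfsadmin -report   # "Datanodes available" should be 1, with DFS Remaining well above zero
df -h /app/hadoop/tmp     # free space on the disk backing dfs.data.dir (assuming the default layout)

The "failed to report status for 600 seconds. Killing!" lines look like a follow-on effect: the blocked checkpoint write exceeds mapred.task.timeout (600000 ms by default), so the TaskTracker kills the attempt; raising the timeout, e.g. with -Dmapred.task.timeout=1800000, would only mask the HDFS write failure. The remaining output, including the retried attempt, was: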
attempt_201401310947_0001_m_000000_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/01/31 10:14:30 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 10:24:14 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_1, Status : FAILED
java.lang.IllegalStateException: run: Caught an unrecoverable exception registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:627)
    at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:681)
    at org.apache.giraph.worker.BspServiceWorker.setup(BspServiceWorker.java:486)
    at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:246)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /_hadoopBsp/job_201401310947_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/kanha-Vostro-1014_1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
    at org.apache.giraph.zk.ZooKeeperExt.createExt(ZooKeeperExt.java:152)
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:611)
    ... 11 more
Task attempt_201401310947_0001_m_000001_1 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_1: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_1: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_1: log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ClientCnxn).
attempt_201401310947_0001_m_000001_1: log4j:WARN Please initialize the log4j system properly.
14/01/31 10:24:15 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 10:24:24 INFO mapred.JobClient: map 100% reduce 0%

Please suggest what I can do to fix this failure.

Thanks
Jyoti


On Wed, Jan 29, 2014 at 10:16 PM, Claudio Martella <[email protected]> wrote:

> looks like one of your workers died. If you expect such a long job, I'd
> suggest you turn checkpointing on.
>
>
> On Wed, Jan 29, 2014 at 5:30 PM, Jyoti Yadav <[email protected]> wrote:
>
>> Thanks all for your reply..
>> Actually I am working with an algorithm in which a single source shortest
>> path algorithm runs for thousands of vertices. Suppose on average this
>> algorithm takes 5-6 supersteps per vertex; then for thousands of vertices
>> the superstep count is extremely large. In that case the following error
>> is thrown at run time:
>>
>> ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
>> Missing chosen worker Worker(hostname=kanha-Vostro-1014, MRtaskID=1,
>> port=30001) on superstep 19528
>> 2014-01-28 05:11:36,852 INFO org.apache.giraph.master.MasterThread:
>> masterThread: Coordination of superstep 19528 took 636.831 seconds ended
>> with state WORKER_FAILURE and is now on superstep 19528
>> 2014-01-28 05:11:39,446 ERROR org.apache.giraph.master.MasterThread:
>> masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException
>> java.lang.ArrayIndexOutOfBoundsException: -1
>>
>> Any ideas??
>>
>> Thanks
>> Jyoti
>>
>>
>> On Wed, Jan 29, 2014 at 8:55 PM, Peter Grman <[email protected]> wrote:
>>
>>> Yes, but you can disable the counters per superstep if you don't need
>>> the data, and then I had around 2000, after which my algorithm stopped.
>>>
>>> Cheers
>>> Peter
>>>
>>> On Jan 29, 2014 4:22 PM, "Claudio Martella" <[email protected]> wrote:
>>>
>>>> the limit is currently defined by the maximum number of counters your
>>>> jobtracker allows. Hence, by default the max number of supersteps is
>>>> around 90.
>>>>
>>>> check http://giraph.apache.org/faq.html to see how to increase it.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 4:12 PM, Jyoti Yadav <[email protected]> wrote:
>>>>
>>>>> Hi folks..
>>>>>
>>>>> Is there any limit on the maximum number of supersteps while running a
>>>>> Giraph job??
>>>>>
>>>>> Thanks
>>>>> Jyoti
>>>>
>>>>
>>>> --
>>>> Claudio Martella
>>>
>>
>
> --
> Claudio Martella
>
