Hi Claudio... I turned checkpointing on and executed the Giraph job:
hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dmapred.job.map.memory.mb=1500 \
  -Dmapred.map.child.java.opts=-Xmx1G \
  -Dgiraph.useSuperstepCounters=false \
  -Dgiraph.useOutOfCoreMessages=true \
  -Dgiraph.checkpointFrequency=1 \
  org.apache.giraph.examples.MyShortestDistance \
  -vif org.apache.giraph.examples.io.formats.MyShortestDistanceVertexInputFormat \
  -vip /user/hduser/big_input/my_line_rank_input6.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/sp_output530/sd_output \
  -w 1 \
  -mc org.apache.giraph.examples.MyShortestDistance\$MyMasterCompute

14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
14/01/31 09:48:21 INFO job.GiraphJob: run: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201401310947_0001
14/01/31 09:49:24 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer kanha-Vostro-1014:22181 --zkNode /_hadoopBsp/job_201401310947_0001/_haltComputation'
14/01/31 09:49:24 INFO mapred.JobClient: Running job: job_201401310947_0001
14/01/31 09:49:25 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 09:59:15 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/_bsp/_checkpoints/job_201401310947_0001/4.kanha-Vostro-1014_1.metadata could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377)
    at org.apache.hadoop.ipc.Client.call(Client.java:1030)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3104)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2975)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
Task attempt_201401310947_0001_m_000001_0 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201401310947_0001_m_000001_0: log4j:WARN Please initialize the log4j system properly.
14/01/31 09:59:19 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 09:59:31 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 10:14:15 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 10:14:20 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
attempt_201401310947_0001_m_000000_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
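The first failure above ("could only be replicated to 0 nodes, instead of 1") is thrown while the worker writes its checkpoint metadata under /user/hduser/_bsp/_checkpoints, so it looks like an HDFS-side problem rather than a Giraph one. As far as I know, that message usually means the namenode could not find a live datanode with free space at that moment. A quick sanity check on a single-node setup (plain Hadoop/JDK commands; the df path is only my assumption that dfs.data.dir sits under hadoop.tmp.dir, which the task logs suggest is /app/hadoop/tmp) would be:

jps                       # the DataNode process should be listed
hadoop dfsadmin -report   # "Datanodes available" should be 1, with DFS Remaining well above zero
df -h /app/hadoop/tmp     # free space on the disk backing dfs.data.dir (assuming the default layout)

The "failed to report status for 600 seconds. Killing!" lines look like a follow-on effect: the blocked checkpoint write exceeds mapred.task.timeout (600000 ms by default), so the TaskTracker kills the attempt; raising the timeout, e.g. with -Dmapred.task.timeout=1800000, would only mask the HDFS write failure. The remaining output, including the retried attempt, was: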
attempt_201401310947_0001_m_000000_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/01/31 10:14:30 INFO mapred.JobClient: map 100% reduce 0%
14/01/31 10:24:14 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_1, Status : FAILED
java.lang.IllegalStateException: run: Caught an unrecoverable exception registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:627)
    at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:681)
    at org.apache.giraph.worker.BspServiceWorker.setup(BspServiceWorker.java:486)
    at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:246)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /_hadoopBsp/job_201401310947_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/kanha-Vostro-1014_1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
    at org.apache.giraph.zk.ZooKeeperExt.createExt(ZooKeeperExt.java:152)
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:611)
    ... 11 more
Task attempt_201401310947_0001_m_000001_1 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_1: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_1: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_1: log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ClientCnxn).
attempt_201401310947_0001_m_000001_1: log4j:WARN Please initialize the log4j system properly.
14/01/31 10:24:15 INFO mapred.JobClient: map 50% reduce 0%
14/01/31 10:24:24 INFO mapred.JobClient: map 100% reduce 0%

Please suggest what I can do to fix this failure.

Thanks
Jyoti


On Wed, Jan 29, 2014 at 10:16 PM, Claudio Martella <[email protected]> wrote:

> looks like one of your workers died. If you expect such a long job, I'd
> suggest you turn checkpointing on.
>
>
> On Wed, Jan 29, 2014 at 5:30 PM, Jyoti Yadav <[email protected]> wrote:
>
>> Thanks all for your reply..
>> Actually I am working with an algorithm in which a single source shortest
>> path algorithm runs for thousands of vertices. Suppose on average this
>> algorithm takes 5-6 supersteps per vertex; then for thousands of vertices
>> the superstep count is extremely large. In that case the following error
>> is thrown at run time:
>>
>> ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
>> Missing chosen worker Worker(hostname=kanha-Vostro-1014, MRtaskID=1,
>> port=30001) on superstep 19528
>> 2014-01-28 05:11:36,852 INFO org.apache.giraph.master.MasterThread:
>> masterThread: Coordination of superstep 19528 took 636.831 seconds ended
>> with state WORKER_FAILURE and is now on superstep 19528
>> 2014-01-28 05:11:39,446 ERROR org.apache.giraph.master.MasterThread:
>> masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException
>> java.lang.ArrayIndexOutOfBoundsException: -1
>>
>> Any ideas??
>>
>> Thanks
>> Jyoti
>>
>>
>> On Wed, Jan 29, 2014 at 8:55 PM, Peter Grman <[email protected]> wrote:
>>
>>> Yes, but you can disable the counters per superstep if you don't need
>>> the data, and then I had around 2000, after which my algorithm stopped.
>>>
>>> Cheers
>>> Peter
>>>
>>> On Jan 29, 2014 4:22 PM, "Claudio Martella" <[email protected]> wrote:
>>>
>>>> the limit is currently defined by the maximum number of counters your
>>>> jobtracker allows. Hence, by default the max number of supersteps is
>>>> around 90.
>>>>
>>>> check http://giraph.apache.org/faq.html to see how to increase it.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 4:12 PM, Jyoti Yadav <[email protected]> wrote:
>>>>
>>>>> Hi folks..
>>>>>
>>>>> Is there any limit on the maximum number of supersteps while running a
>>>>> Giraph job??
>>>>>
>>>>> Thanks
>>>>> Jyoti
>>>>
>>>>
>>>> --
>>>> Claudio Martella
>>>
>>
>
> --
> Claudio Martella
>
