Sergey Edunov created GIRAPH-972:
------------------------------------

             Summary: Race condition in checkpointing
                 Key: GIRAPH-972
                 URL: https://issues.apache.org/jira/browse/GIRAPH-972
             Project: Giraph
          Issue Type: Bug
            Reporter: Sergey Edunov


A couple of issues noticed with checkpointing of large jobs:
1) Checkpointing appears to depend on the task ID of the master. In most cases 
it is 0, but sometimes it is not, and since we cannot control it, checkpointing 
should not depend on it.

2) A race condition happens on the master when a worker dies:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
        at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
        at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)
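
For illustration only, here is a minimal sketch of the pattern behind the trace above: a znode is listed, the worker that owns it dies, and the later read fails with NoNode. FakeZk, NoNodeException and readIfPresent are hypothetical in-memory stand-ins written for this example, not ZooKeeper or Giraph APIs, and treating a vanished worker znode as "worker gone" instead of a fatal error is an assumption about one possible fix, not the actual patch.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class WorkerHealthRead {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a znode tree. */
    static class FakeZk {
        private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();

        void create(String path, byte[] data) {
            nodes.put(path, data);
        }

        void delete(String path) {
            nodes.remove(path);
        }

        byte[] getData(String path) throws NoNodeException {
            byte[] data = nodes.get(path);
            if (data == null) {
                throw new NoNodeException(path);
            }
            return data;
        }
    }

    /**
     * Reads a znode, returning empty if it disappeared between the moment
     * it was listed and the moment it is read (e.g. the worker died).
     */
    static Optional<byte[]> readIfPresent(FakeZk zk, String path) {
        try {
            return Optional.of(zk.getData(path));
        } catch (NoNodeException e) {
            // Assumption: a missing healthy-worker znode means the worker
            // is gone, so the caller can skip it instead of failing.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String path = "/_workerHealthyDir/worker_3"; // hypothetical path
        zk.create(path, new byte[] {1});
        System.out.println("before delete: " + readIfPresent(zk, path).isPresent());
        zk.delete(path); // simulates the worker dying after being listed
        System.out.println("after delete: " + readIfPresent(zk, path).isPresent());
    }
}
```

Whether the master should skip the worker, retry, or restart from the last checkpoint in this situation is a design decision this sketch does not settle.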



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)