Please check the ZooKeeper logs to find the cause of the
ConnectionLossException. There are many possible causes, such as full GC
pauses, heavy swap usage, or an expired session.
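
To check the full GC possibility from inside a task JVM, a rough sketch like the one below could be used. It relies only on the standard java.lang.management API. The class name GcPauseCheck, the 30000 ms session timeout, and the 0.5 fraction are assumptions made for illustration; they are not taken from Hama or from your configuration. Note that it compares cumulative GC time between two snapshots, which is only an upper bound on any single pause, so treat a positive result as a hint to look at the GC logs, not as proof.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseCheck {

    /** Cumulative GC activity of this JVM, summed over all collectors. */
    static final class GcSnapshot {
        final long count;
        final long timeMillis;

        GcSnapshot(long count, long timeMillis) {
            this.count = count;
            this.timeMillis = timeMillis;
        }
    }

    /** Reads the current cumulative GC count and time from the JVM. */
    static GcSnapshot snapshot() {
        long count = 0;
        long timeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            long c = gc.getCollectionCount();
            long t = gc.getCollectionTime();
            if (c > 0) {
                count += c;
            }
            if (t > 0) {
                timeMillis += t;
            }
        }
        return new GcSnapshot(count, timeMillis);
    }

    /**
     * Returns true if the GC time accumulated between two snapshots reaches
     * the given fraction of the (assumed) ZooKeeper session timeout.
     */
    static boolean gcTimeSuspicious(GcSnapshot before, GcSnapshot after,
                                    long sessionTimeoutMillis, double fraction) {
        long delta = after.timeMillis - before.timeMillis;
        return delta >= sessionTimeoutMillis * fraction;
    }

    public static void main(String[] args) {
        long sessionTimeoutMillis = 30_000; // hypothetical value, use your own setting
        GcSnapshot before = snapshot();
        // ... run one or more supersteps here ...
        GcSnapshot after = snapshot();
        System.out.println("GC runs: " + (after.count - before.count)
                + ", GC time ms: " + (after.timeMillis - before.timeMillis)
                + ", suspicious: "
                + gcTimeSuspicious(before, after, sessionTimeoutMillis, 0.5));
    }
}
```

This only covers the task side. For the ZooKeeper server process itself, the GC and ZooKeeper logs are still the place to look.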

I guess the answer lies in the sentence "stopped working after
4600 supersteps".

On Mon, Jun 17, 2013 at 6:11 PM, Sascha Jonas
<[email protected]> wrote:
> The servers are reserved for Apache Hama, so there is no other network
> traffic. I tested it on three other PCs at another location but with the
> same configuration and got the same errors :(
>
> On Sun, 16.06.2013, 16:44, Chia-Hung Lin wrote:
>> Have you checked whether the underlying network is busy when the error
>> happens?
>>
>> I can't be sure, but the symptom suggests that heavy network traffic
>> is causing the ZooKeeper connection to be lost.
>>
>>
>>
>> On 16 June 2013 20:22, Sascha Jonas <[email protected]>
>> wrote:
>>> Hey,
>>>
>>> I am using Apache Hama on a small cluster of two computers. It works
>>> fine with a small number of supersteps, but every time I try a large
>>> number of iterations, e.g. 10000, it crashes.
>>>
>>> Right now it stopped working after 4600 supersteps. 8 of 16 tasks are
>>> still running, while the log shows some errors.
>>>
>>> I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
>>> newer Hama or ZooKeeper version?
>>>
>>> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>>         at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>>>         at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>>>         at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>>>         at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>         at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>         at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>         at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>> 13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
>>> org.apache.hama.bsp.sync.SyncException
>>>         at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>>>         at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>>         at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>>         at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>>         at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>>>
>>
>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
