The servers are reserved for Apache Hama, so there is no other network traffic. I tested it on three other PCs at another location but with the same configuration and got the same errors :(
On Sun, 16 June 2013, 16:44, Chia-Hung Lin wrote:
> Have you checked whether the underlying network is busy when the error happens?
>
> I can't be sure, but the symptom suggests that heavy network
> traffic leads to the lost zk connection.
>
> On 16 June 2013 20:22, Sascha Jonas <[email protected]>
> wrote:
>> Hey,
>>
>> I am using Apache Hama on a small cluster with two computers. It works
>> fine with a small number of supersteps, but every time I try it with
>> lots of iterations, e.g. 10000, it crashes.
>>
>> Right now it stopped working after 4600 supersteps. 8 of 16 tasks are
>> still running while the log shows some errors.
>>
>> I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I go with a
>> newer Hama or ZooKeeper version?
>>
>> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>     at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>>     at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>> 13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
>> org.apache.hama.bsp.sync.SyncException
>>     at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>>     at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>>     at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>>     at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>>     at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>>     at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>>     at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>>     at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
