From what I can tell, the supervisors worked as designed. Storm is written to be fail-fast: if there is an error that the code does not explicitly know how to recover from, the daemon will exit. This is why we recommend running all daemon processes under supervision, so that if this does happen the daemons (in this case, the supervisors) will be restarted. The processes are designed and tested to be able to restart after failures and pick up where they left off.
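
To make the fail-fast plus supervision idea concrete, here is a rough in-process sketch. It is not Storm's code: `FailFastSketch`, `runDaemonOnce`, and `superviseUntilHealthy` are hypothetical names, the "daemon" is just a method that throws when its dependency is unreachable, and the "supervisor" is a plain loop standing in for whatever external process supervisor you run the daemons under.

```java
import java.util.function.IntPredicate;

public class FailFastSketch {

    /**
     * Hypothetical stand-in for a daemon start. It makes no attempt to
     * recover in-process: if the dependency is unreachable it simply fails.
     */
    public static void runDaemonOnce(boolean zkReachable) {
        if (!zkReachable) {
            throw new IllegalStateException("ConnectionLoss");
        }
    }

    /**
     * Hypothetical stand-in for an external process supervisor.
     * Starts the daemon and restarts it after each failure, up to maxRestarts.
     * Returns the number of restarts that were needed, or -1 if it gave up.
     */
    public static int superviseUntilHealthy(IntPredicate zkReachableAtAttempt, int maxRestarts) {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            try {
                runDaemonOnce(zkReachableAtAttempt.test(attempt));
                // Attempt 0 is the initial start, so the attempt index
                // equals the number of restarts performed so far.
                return attempt;
            } catch (IllegalStateException e) {
                // The daemon "exited"; the loop plays the supervisor and restarts it.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumption for the demo: the dependency becomes reachable on the 4th start.
        int restarts = superviseUntilHealthy(attempt -> attempt >= 3, 10);
        System.out.println("restarts needed: " + restarts);
    }
}
```

The point of the sketch is only the division of labour: the daemon stays simple by exiting on errors it does not understand, and recovery comes from something outside it restarting the process until its dependencies are back.
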
Did the cluster recover after ZooKeeper connectivity was restored? Did you need to take any manual steps to fix the Storm cluster besides restarting the supervisors? In all current versions of Storm the entire cluster is heavily reliant on ZK, so ZK being down will make the cluster unusable. There are some changes we could make to the design of Storm to make it less reliant on ZK, but they are non-trivial.

I hope this helps,

Bobby

On Wed, May 23, 2018 at 5:13 AM 정일영 <[email protected]> wrote:

> Hi all
>
> I have used Storm version 1.1.1 and ZooKeeper 3.4.11 without problems for a long time.
> A few days ago, the ZooKeeper service failed and connection timeouts occurred with Storm for about 2 minutes.
> So all supervisors halted and the Storm service failed for a long time.
> The supervisor log is below.
> How can I make Storm fault tolerant even if a ZooKeeper timeout occurs?
> My Storm configuration for connecting to ZooKeeper is the default.
>
> ============================================
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>     at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104) ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when processing event
> java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
>     at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54) ~[storm-core-1.1.1.jar:1.1.1]
> Caused by: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
>     at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
>     ... 1 more
> Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
>     at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
>     ... 1 more
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>     at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
>     at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63) ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to authenticate using SASL (unknown error)
> 2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down supervisor 43e735b5-f39d-493f-bd25-990e85812a8d
>
> =======================================
