The root cause of the ZooKeeper failure was a network switch error on the ZooKeeper uplink that lasted about 2 to 3 minutes. I suppose that even 5 ZooKeeper members would not have helped in this case. Would it be okay to set the Storm configuration (storm.zookeeper.session.timeout, storm.zookeeper.connection.timeout and storm.zookeeper.retry.times) to higher values, for example 1 hour or longer?
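For illustration, a minimal sketch of what such an override might look like in storm.yaml. The values are assumptions chosen only to cover an outage of a few minutes; they are not tested recommendations. The two retry interval keys are not mentioned in this thread; as far as I know they sit next to the other three in defaults.yaml. Note also that the ZooKeeper server caps the negotiated session timeout at its own maxSessionTimeout (by default 20 x tickTime), so a very large client-side session timeout such as 1 hour would likely be reduced unless the ZooKeeper servers are reconfigured as well.

```yaml
# storm.yaml overrides (illustrative values, assumptions only)
storm.zookeeper.session.timeout: 60000                 # ms; subject to the server's maxSessionTimeout
storm.zookeeper.connection.timeout: 30000              # ms
storm.zookeeper.retry.times: 30                        # number of retries before giving up
storm.zookeeper.retry.interval: 2000                   # ms; base wait between retries
storm.zookeeper.retry.intervalceiling.millis: 30000    # ms; upper bound on the wait between retries
```

The idea is that the retries, taken together, should span longer than the longest outage you expect, rather than relying on a single very long timeout.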
-----Original Message-----
From: "Koby Nachmany" <[email protected]>
To: <[email protected]>
Cc:
Sent: 2018-05-23 (Wed) 20:31:48
Subject: Re: Zookeeper connection time out and then supervisors halted

Hi,

What I would do is tweak the following params:
https://github.com/apache/storm/blob/1.1.x-branch/conf/defaults.yaml#L31-L35
Specifically storm.zookeeper.session(/connection).timeout and storm.zookeeper.retry.times.
That should allow your supervisors and nimbus to recover.
Try to keep a minimum of 5 ZK instances for better resiliency.

HTH

Koby Nachmany
BigData Production Engineering Team Lead
T: +972-74-700-4733

On Wed, May 23, 2018 at 1:13 PM 정일영 <[email protected]> wrote:

Hi all,

I have used Storm 1.1.1 and ZooKeeper 3.4.11 without problems for a long time. A few days ago, the ZooKeeper service failed and connections from Storm timed out for about 2 minutes. As a result, all supervisors halted and the Storm service was down for a long time. The supervisor log is below. How can I make Storm fault tolerant even when a ZooKeeper timeout occurs? My Storm configuration for the ZooKeeper connection is the default.

============================================
2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
java.lang.RuntimeException: Halting process: Error when processing an event
    at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104) ~[storm-core-1.1.1.jar:1.1.1]
2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when processing event
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
    at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54) ~[storm-core-1.1.1.jar:1.1.1]
Caused by: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
    at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
    ... 1 more
Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
    at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
    ... 1 more
2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
java.lang.RuntimeException: Halting process: Error when processing an event
    at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63) ~[storm-core-1.1.1.jar:1.1.1]
2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down supervisor 43e735b5-f39d-493f-bd25-990e85812a8d
=======================================
