The root cause of the ZK failure was an error on the ZK uplink network switch that lasted about 2-3 minutes.
I suppose that even if ZK had had 5 members, it would not have helped in this case.

Would it be okay to set the Storm configuration (storm.zookeeper.session.timeout, storm.zookeeper.connection.timeout, and storm.zookeeper.retry.times) to much higher values, for example an hour or longer?
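For illustration, the kind of storm.yaml overrides I have in mind would look roughly like this (the values are only examples picked for discussion, not tested recommendations):

    storm.zookeeper.session.timeout: 600000        # ms, example value only
    storm.zookeeper.connection.timeout: 600000     # ms, example value only
    storm.zookeeper.retry.times: 30                # example value only
    storm.zookeeper.retry.interval: 5000           # ms, example value only
    storm.zookeeper.retry.intervalceiling.millis: 60000   # ms, example value only

One thing I am not sure about: as far as I understand, the ZooKeeper server negotiates the session timeout within its minSessionTimeout/maxSessionTimeout bounds (tied to tickTime by default), so a very large client-side value such as one hour may be capped on the server side unless those limits are raised as well.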

-----Original Message-----
From: "Koby Nachmany"<[email protected]>
To: <[email protected]>;
Cc:
Sent: 2018-05-23 (Wed) 20:31:48
Subject: Re: Zookeeper connection time out and then supervisors halted
 
Hi,
 
What I would do is tweak the following params:
https://github.com/apache/storm/blob/1.1.x-branch/conf/defaults.yaml#L31-L35

Specifically storm.zookeeper.session.timeout, storm.zookeeper.connection.timeout, and storm.zookeeper.retry.times.
That should allow your supervisors and nimbus to recover.
Try to keep a minimum of 5 ZK instances for better resiliency.
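As a rough, back-of-the-envelope illustration of why the defaults give up during a 2-3 minute outage (the numbers below are what I read in the linked defaults.yaml, so please double-check them):

    retry.times = 5, retry.interval = 1000 ms, retry.intervalceiling.millis = 30000
    total backoff is on the order of tens of seconds (roughly 1 s, 2 s, 4 s, ... capped at 30 s)

Even allowing for the 15 s connection timeout on each attempt, a supervisor exhausts its retries within a minute or two, gets ConnectionLoss, and halts, so raising storm.zookeeper.retry.times and/or the retry interval ceiling is what stretches that window.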
 
HTH


Koby Nachmany
BigData Production Engineering Team Lead
T: +972-74-700-4733

Our mission is to make life easier by transforming how people communicate with brands.

On Wed, May 23, 2018 at 1:13 PM 정일영 <[email protected]> wrote: 

Hi all,
I have been running Storm 1.1.1 with ZooKeeper 3.4.11 without problems for a long time.
A few days ago the ZooKeeper service failed, and Storm hit connection timeouts for about 2 minutes.
As a result, all supervisors halted and the Storm service was down for a long time.
The supervisor log is below.
How can I make Storm fault tolerant even if a ZooKeeper timeout occurs?
My Storm configuration uses the defaults for connecting to ZooKeeper.
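For reference, I believe the ZooKeeper-related defaults in conf/defaults.yaml for 1.1.x are approximately the following (all timeouts in milliseconds; please correct me if they differ in your build):

    storm.zookeeper.session.timeout: 20000
    storm.zookeeper.connection.timeout: 15000
    storm.zookeeper.retry.times: 5
    storm.zookeeper.retry.interval: 1000
    storm.zookeeper.retry.intervalceiling.millis: 30000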
============================================
2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
java.lang.RuntimeException: Halting process: Error when processing an event
        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104) ~[storm-core-1.1.1.jar:1.1.1]
2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when processing event
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
        at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54) ~[storm-core-1.1.1.jar:1.1.1]
Caused by: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
        at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
        ... 1 more
Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /assignments
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) ~[storm-core-1.1.1.jar:1.1.1]
        ... 1 more
2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when processing an event
java.lang.RuntimeException: Halting process: Error when processing an event
        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773) ~[storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63) ~[storm-core-1.1.1.jar:1.1.1]
2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down supervisor 43e735b5-f39d-493f-bd25-990e85812a8d

=======================================





