Hi,

What I would do is tweak the following params:
https://github.com/apache/storm/blob/1.1.x-branch/conf/defaults.yaml#L31-L35

Specifically storm.zookeeper.session.timeout,
storm.zookeeper.connection.timeout and storm.zookeeper.retry.times.
Increasing these should allow your supervisors and nimbus to ride out a
short ZK outage and recover.
Try to keep a minimum of 5 ZK instances for better resiliency.
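
As a rough sketch, the overrides would go into storm.yaml on the nimbus and
supervisor nodes. The key names come from the defaults.yaml section linked
above. The values are illustrative assumptions sized for an outage of about
2 minutes, not tested recommendations, so compare them with the defaults in
that file before using them:

```yaml
# storm.yaml overrides (illustrative values only, not tuned recommendations)

# How long a ZK session may stay disconnected before it expires (ms).
# Note: the ZK server caps this at its own maxSessionTimeout
# (20 * tickTime unless configured otherwise), so the server side
# may need a matching change.
storm.zookeeper.session.timeout: 60000

# How long to wait when establishing a connection to ZK (ms).
storm.zookeeper.connection.timeout: 30000

# How many times a failed ZK operation is retried before giving up.
storm.zookeeper.retry.times: 15

# Base delay between retries and the upper bound on that delay (ms).
storm.zookeeper.retry.interval: 2000
storm.zookeeper.retry.intervalceiling.millis: 30000
```

The daemons need a restart to pick up the new values. The idea is only that
the retries together span more time than the outage you expect to survive.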

HTH
Koby Nachmany
BigData Production Engineering Team Lead
T: +972-74-700-4733

On Wed, May 23, 2018 at 1:13 PM 정일영 <[email protected]> wrote:

> Hi all,
> I have been running Storm 1.1.1 with ZooKeeper 3.4.11 without problems
> for a long time.
> A few days ago, the ZooKeeper service failed and Storm's connections to
> it timed out for about 2 minutes.
> As a result, all supervisors halted and the Storm service was down for a
> long time.
> The supervisor log is below.
> How can I make Storm fault tolerant even when a ZooKeeper timeout occurs?
> My Storm configuration uses the defaults for the ZooKeeper connection.
> ============================================
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when
> processing event
> java.lang.RuntimeException: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54)
> ~[storm-core-1.1.1.jar:1.1.1]
> Caused by: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> Caused by:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
> connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to
> authenticate using SASL (unknown error)
> 2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down
> supervisor 43e735b5-f39d-493f-bd25-990e85812a8d
> ============================================
>
