Hi,

You have to balance resiliency here. A very long session timeout means it
takes longer to detect a lost connection. I would keep the timeout low enough
and increase the retry limit instead.
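
For example, something like this in storm.yaml on every nimbus and supervisor
host (a rough sketch only; the exact values are illustrative guesses and should
be tuned to your network, but the keys are the ones from the defaults.yaml
linked below):

    # storm.yaml - illustrative values, not a recommendation
    storm.zookeeper.session.timeout: 30000        # keep the session timeout modest (30 s)
    storm.zookeeper.connection.timeout: 30000     # timeout for a single connect attempt
    storm.zookeeper.retry.times: 30               # raise the retry count instead...
    storm.zookeeper.retry.interval: 5000          # ...so the retries span well past a 2-3 minute outage
    storm.zookeeper.retry.intervalceiling.millis: 30000

As far as I know these are only read when the daemons start, so nimbus and the
supervisors need a restart to pick them up.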

Koby

On Thu, May 24, 2018, 10:44 정일영 <[email protected]> wrote:

> The root cause of the ZK failure was an error on the ZK uplink network switch
> lasting about 2-3 minutes.
> I suppose that even if ZK had 5 members, it would have been no use in this case.
>
> Would it be okay if I set the Storm configuration
> (storm.zookeeper.session/connection.timeout and
> storm.zookeeper.retry.times) to a higher value, for example 1 hour or more?
>
>
> -----Original Message-----
> *From:* "Koby Nachmany"<[email protected]>
> *To:* <[email protected]>;
> *Cc:*
> *Sent:* 2018-05-23 (Wed) 20:31:48
> *Subject:* Re: Zookeeper connection time out and then supervisors halted
>
> Hi,
>
> What I would do is tweak the following params:
>
> https://github.com/apache/storm/blob/1.1.x-branch/conf/defaults.yaml#L31-L35
>
> Specifically, storm.zookeeper.session(/connection).timeout and
> storm.zookeeper.retry.times.
> That should allow your supervisors and nimbus to recover.
> Try to keep a minimum of 5 ZK instances for better resiliency.
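>
> As a rough back-of-envelope check with the stock defaults (retry.times: 5,
> retry.interval: 1000 ms, retry.intervalceiling.millis: 30000), and assuming
> the retry sleep roughly doubles on each attempt up to that ceiling:
>
>     total back-off over 5 retries ≈ 1 + 2 + 4 + 8 + 16 s ≈ 31 s
>
> which is far short of a multi-minute outage, so the ConnectionLoss escapes
> the retry loop and the supervisor halts itself, as in the log below.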
>
> HTH
> Koby Nachmany
> BigData Production Engineering Team Lead
> T: +972-74-700-4733
>
>
> On Wed, May 23, 2018 at 1:13 PM 정일영 <[email protected]> wrote:
>
> Hi all,
>
> I have been using Storm 1.1.1 and ZooKeeper 3.4.11 without any problems for a
> long time.
> A few days ago, the ZooKeeper service failed and connection timeouts occurred
> with Storm for about 2 minutes.
> As a result, all supervisors halted and the Storm service was down for a long
> time.
> The supervisor log is below.
> How can I make Storm fault tolerant even if a ZooKeeper timeout occurs?
> My Storm configuration uses the defaults for connecting to ZooKeeper.
>
> ============================================
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when
> processing event
> java.lang.RuntimeException: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54)
> ~[storm-core-1.1.1.jar:1.1.1]
> Caused by: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> Caused by:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
> connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to authenticate
> using SASL (unknown error)
> 2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down
> supervisor 43e735b5-f39d-493f-bd25-990e85812a8d
>
> =======================================
>
>
