From what I can tell the supervisors worked as designed.  Storm is written
to be fail-fast: if a daemon hits an error that the code does not explicitly
know how to recover from, it exits.  This is why we recommend running all
daemon processes under supervision, so that when this does happen the
daemons, in this case the supervisors, are restarted automatically.  The
processes are designed and tested to be able to restart after failures and
pick up where they left off.
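
For example, here is a minimal sketch of running the supervisor under
supervisord (the paths and user name are assumptions for illustration;
daemontools, monit, or systemd work just as well):

; /etc/supervisor/conf.d/storm-supervisor.conf  (hypothetical path)
[program:storm-supervisor]
; assumes Storm is installed under /opt/storm and runs as a dedicated "storm" user
command=/opt/storm/bin/storm supervisor
user=storm
autostart=true
; restart the daemon whenever it exits, which is what fail-fast relies on
autorestart=true
startsecs=10

With something like that in place, the supervisor coming back up after a
fail-fast exit is automatic.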

Did the cluster recover after ZooKeeper connectivity was restored?  Did you
need to take any manual steps to fix the Storm cluster besides restarting
the supervisors?

In all current versions of Storm the entire cluster is heavily reliant on
ZooKeeper, so when it is down the cluster is unusable.  There are changes we
could make to Storm's design to reduce that reliance, but they are
non-trivial.
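
What you can tune today is how patient the ZooKeeper client in the Storm
daemons is before it gives up.  A rough sketch of the relevant storm.yaml
settings (the values here are illustrative, not recommendations; check
defaults.yaml for your release, and the daemons will still halt once the
retries are exhausted):

# storm.yaml - ZooKeeper client timeouts/retries (illustrative values)
storm.zookeeper.session.timeout: 30000
storm.zookeeper.connection.timeout: 30000
storm.zookeeper.retry.times: 10
storm.zookeeper.retry.interval: 2000
storm.zookeeper.retry.intervalceiling.millis: 60000

Raising these can let the daemons ride out a short ZooKeeper outage like the
two minutes you describe, at the cost of reacting more slowly to real
failures.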

I hope this helps,

Bobby

On Wed, May 23, 2018 at 5:13 AM 정일영 <[email protected]> wrote:

> Hi all
> I have been running storm version 1.1.1 with zookeeper 3.4.11 without any
> problems for a long time.
> A few days ago the zookeeper service failed, and connection timeouts with
> storm occurred for about 2 minutes.
> All the supervisors halted and the storm service was down for a long time.
> The supervisor log is below.
> How can I make storm fault tolerant even if a zookeeper timeout occurs?
> My storm configuration for connecting to zookeeper is the default.
> ============================================
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:29)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:104)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.348 o.a.s.e.EventManagerImp [ERROR] {} Error when
> processing event
> java.lang.RuntimeException: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:182)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54)
> ~[storm-core-1.1.1.jar:1.1.1]
> Caused by: java.lang.RuntimeException:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1531)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:265)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> Caused by:
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /assignments
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:226)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126)
> ~[storm-core-1.1.1.jar:1.1.1]
>        ... 1 more
> 2018-05-18 01:11:19.348 o.a.s.u.Utils [ERROR] Halting process: Error when
> processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>        at org.apache.storm.utils.Utils.exitProcess(Utils.java:1773)
> ~[storm-core-1.1.1.jar:1.1.1]
>        at
> org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:63)
> ~[storm-core-1.1.1.jar:1.1.1]
> 2018-05-18 01:11:19.350 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
> connection to server 1.2.3.4/1.2.3.4:10013. Will not attempt to
> authenticate using SASL (unknown error)
> 2018-05-18 01:11:19.351 o.a.s.d.s.Supervisor [INFO] Shutting down
> supervisor 43e735b5-f39d-493f-bd25-990e85812a8d
> =======================================
>
