Re:RE: Recovering From Zookeeper Failure

Saurabh Agarwal (BLOOMBERG/ 731 LEX -) Wed, 21 May 2014 16:18:56 -0700

Thanks.


----- Original Message -----
From: Simon Cooper 
At: Wednesday, May 21, 2014 06:16

              
  
This has already been reported:
https://issues.apache.org/jira/browse/STORM-307

As a workaround, our Storm init scripts always delete the
{storm.home}/supervisor and {storm.home}/workers directories before starting
the supervisor.
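
A minimal sketch of that init-script step, written in Python for illustration. The function name clear_supervisor_state is hypothetical, and the sketch assumes the supervisor and workers directories sit directly under the Storm home directory as described above. It should only be run while the supervisor is stopped.

```python
import shutil
from pathlib import Path


def clear_supervisor_state(storm_home):
    """Delete the supervisor and workers directories under storm_home.

    Hypothetical helper: the directory names come from the workaround
    described above. Returns the list of paths that were actually removed.
    """
    removed = []
    for name in ("supervisor", "workers"):
        target = Path(storm_home) / name
        if target.is_dir():
            # Remove the directory and everything beneath it.
            shutil.rmtree(target)
            removed.append(str(target))
    return removed
```

An init script could call clear_supervisor_state with its own Storm home path just before launching the supervisor. Other directories under that path are left untouched.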
 

From: Saurabh Agarwal (BLOOMBERG/ 731 LEX -) [mailto:[email protected]]
Sent: 21 May 2014 10:43
To: [email protected]
Subject: Re: Recovering From Zookeeper Failure
 

We had the same issue: we also have to delete the supervisor and worker
directories to restart the supervisor processes after they die. Is this a
known bug? Is there a cleaner way to restart? Thanks, Saurabh.


  ----- Original Message -----
  From: Ryan Chan 
  At: Friday, May 16, 2014 18:58

Hi Josh,

We have had the same issue for a long time, and the only solution we have
found is to restart the whole Storm cluster.

(Actually, I asked the same question on 12 May but got no response.)

In the meantime, we are evaluating a switch to Apache Spark for streaming;
you might also have a look.

On Wed, May 14, 2014 at 11:25 PM, Josh Walton <[email protected]> wrote:

Recently, we have had a couple of power failures on the servers running our
ZooKeeper cluster. When ZooKeeper dies, the nimbus and supervisor processes
eventually die as well. After the ZooKeeper failure, the only way I have gotten
the supervisor processes to start back up is to delete the supervisor and
worker directories specified in the storm.yaml file. Is there a
better/cleaner way to restart them?

I have also noticed that when I start nimbus and the UI process back up and
navigate to the Storm status page, the topologies we had started are still
shown as active (even though they are not).

This is the exception in the supervisor logs when I try to start them up after
the ZooKeeper failure:

2014-05-14 09:16:03 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
    at backtype.storm.utils.Utils.deserialize(Utils.java:69) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.utils.LocalState.snapshot(LocalState.java:28) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.utils.LocalState.get(LocalState.java:39) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:187) ~[storm-core-0.9.0-rc3.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
    at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
    at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
    at backtype.storm.event$event_manager$fn__3070.invoke(event.clj:24) ~[storm-core-0.9.0-rc3.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:722) [na:1.7.0_21]
Caused by: java.io.EOFException: null
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_21]
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_21]
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_21]
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_21]
    at backtype.storm.utils.Utils.deserialize(Utils.java:64) ~[storm-core-0.9.0-rc3.jar:na]
    ... 11 common frames omitted
2014-05-14 09:16:03 b.s.util [INFO] Halting process: ("Error when processing an event")
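
The EOFException raised from ObjectInputStream.readStreamHeader in the trace above is what Java serialization throws when the input ends before the 4-byte stream header (magic 0xACED, version 0x0005), which is consistent with an empty or truncated local state file. As a hedged diagnostic sketch in Python, the helper below lists files too short to hold that header. The function name find_truncated_files and the idea of scanning the supervisor directory with it are assumptions for illustration, not part of Storm.

```python
import os

# A Java serialization stream starts with a 4-byte header
# (2-byte magic 0xACED followed by 2-byte version 0x0005).
JAVA_STREAM_HEADER_SIZE = 4


def find_truncated_files(root, min_size=JAVA_STREAM_HEADER_SIZE):
    """Return a sorted list of files under root smaller than min_size bytes.

    Hypothetical helper: such files cannot contain a complete Java
    serialization header, so deserializing them would fail with EOFException.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.getsize(path) < min_size:
                suspects.append(path)
    return sorted(suspects)
```

Pointing this at the supervisor directory before deleting it may show which files are affected. A file that passes this size check can still be corrupt, so an empty result does not rule out the problem.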
