This happens in all our environments (dev, test, stage, and prod) every time we kill even one topology. It did not happen with the previous version of Storm.
It almost seems as though when we shut down a topology, the storm-local working directory is cleaned up, but the topology information is not cleaned up from ZooKeeper. So when the supervisor checks what it should be running, it finds a stale topology in ZooKeeper with no corresponding worker directory on the file system, hence the Java IO error for file not found. This then sends the supervisor into a restart loop. Shutting everything down and restarting seems to clean things up. I don't have hard evidence to support that theory, but the pattern seems to point in that direction.

Thanks,
Justin

Sent from my iPhone

> On Mar 26, 2015, at 9:36 AM, Bobby Evans <[email protected]> wrote:
>
> I'm not sure if it is a bug in Storm, or if your system is somehow messed up.
> I tracked down the exception being thrown and it indicates that someone
> deleted the current working directory that you are running the supervisor in.
> I'm not sure how that happens, but I have never seen it happen on any of my
> clusters, which leads me to believe that it is something that happened with
> your system, and not Storm itself.
>
> - Bobby
>
> On Wednesday, March 25, 2015 4:54 PM, Justin Workman
> <[email protected]> wrote:
>
> Sounds like your storm-local is pointed at /tmp and you are going through the
> same process as us.
>
> Is this a bug, or is there a better solution?
>
> Sent from my iPhone
>
>> On Mar 25, 2015, at 2:41 PM, Andres Gomez Ferrer <[email protected]> wrote:
>>
>> I stop all my topologies, stop all Storm nodes, remove /tmp/storm, and start
>> everything again!
>>
>> Andrés Gómez
>> Developer
>> redborder.net / [email protected]
>> Phone: +34 955 60 11 60
>>
>> This email, including attachments, is intended exclusively for its
>> addressee. It contains information that is CONFIDENTIAL whose disclosure is
>> prohibited by law and may be covered by legal privilege. If you have
>> received this email in error, please notify the sender and delete it from
>> your system.
>>
>> On March 25, 2015 at 21:31:50, Justin Workman ([email protected]) wrote:
>
> During each topology kill/restart? Where do you put that hook?
>
> Or was it a one-time removal of /tmp/storm, and future topology restarts were
> fine?
>
> Sent from my iPhone
>
> On Mar 25, 2015, at 1:59 PM, Andres Gomez Ferrer <[email protected]> wrote:
>
>> I solved it by removing all of /tmp/storm/*.
>>
>> Regards,
>>
>> Andrés Gómez
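If storm-local really does resolve to a path under /tmp, as suggested above, periodic tmp cleaners (tmpwatch, systemd-tmpfiles) can delete live worker state out from under the supervisor. A minimal storm.yaml fragment moving it to a persistent location; /var/storm is an example path, not something from this thread, so substitute any persistent directory owned by the storm user:

```yaml
# storm.yaml -- relocate the supervisor working directory out of /tmp
# so tmp cleaners cannot remove live worker state.
# /var/storm is an example path; any persistent, storm-owned directory works.
storm.local.dir: "/var/storm"
```

Note this only prevents a recurrence; stale state already left in ZooKeeper or storm-local from earlier crashes still needs the one-time cleanup described in the thread.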
>> On March 25, 2015 at 20:54:58, Justin Workman ([email protected]) wrote:
>>> We are currently running a 7-node Storm cluster (1 Nimbus and 6 supervisor
>>> nodes), all on Storm 0.9.2, running 3 topologies. Any time we kill a
>>> running topology, the supervisors across all nodes start flapping and we
>>> end up in a mess. To clean this up we kill all running topologies, shut
>>> down the supervisors, clean up the storm/storm-local directories on all
>>> supervisor nodes, restart the supervisor processes, and then restart the
>>> topologies.
>>>
>>> Has anyone experienced this issue, or have any ideas on how to resolve it?
>>>
>>> Log snippet we see in the supervisor logs when this happens:
>>>
>>> 2015-03-25 11:28:13 b.s.d.supervisor [INFO] Shutting down
>>> 4d971d4b-a208-4758-a55e-3e8b34d7531f:ce049d5c-fd4c-499c-ad8d-ef1d8f2b992b
>>> 2015-03-25 11:28:13 b.s.event [ERROR] Error when processing event
>>> java.io.IOException: . doesn't exist.
>>>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:157) ~[commons-exec-1.1.jar:1.1]
>>>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147) ~[commons-exec-1.1.jar:1.1]
>>>     at backtype.storm.util$exec_command_BANG_.invoke(util.clj:378) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.util$ensure_process_killed_BANG_.invoke(util.clj:394) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:175) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:240) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.5.1.jar:na]
>>>     at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.5.1.jar:na]
>>>     at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
>>>     at clojure.core$partial$fn__4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
>>>     at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
>>>     at backtype.storm.event$event_manager$fn__2378.invoke(event.clj:39) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.5.1.jar:na]
>>>     at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
>>> 2015-03-25 11:28:13 b.s.util [INFO] Halting process: ("Error when processing an event")
>>>
>>> There does not appear to be anything corresponding to this in the worker logs.
>>>
>>> Ideas?
>>>
>>> Thanks,
>>> Justin
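The manual recovery procedure the thread converged on (stop each supervisor, wipe its storm-local directory, start it again) can be sketched as a small command generator. The service name, the use of ssh, and the /tmp/storm default are assumptions about a typical deployment, not details from the thread; the function only prints the commands so you can review them before running anything:

```shell
#!/bin/sh
# Sketch of the recovery procedure described above. Assumptions:
# supervisors are reachable over ssh, the daemon is managed as the
# "storm-supervisor" service, and storm-local lives at /tmp/storm.
# The function only PRINTS the commands; pipe to sh to execute them.

STORM_LOCAL_DIR="${STORM_LOCAL_DIR:-/tmp/storm}"

reset_supervisor() {
    # Emit the three recovery steps for one supervisor host.
    host="$1"
    echo "ssh $host 'service storm-supervisor stop'"
    echo "ssh $host 'rm -rf $STORM_LOCAL_DIR'"
    echo "ssh $host 'service storm-supervisor start'"
}
```

Kill the topologies from Nimbus first (`storm kill <topology-name>`), then, after reviewing the output, something like `for h in sup1 sup2 sup3; do reset_supervisor "$h"; done | sh` applies the reset across the cluster.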
