Re: Nimbus dies

Derek Dagit Fri, 16 May 2014 15:04:34 -0700

Notes:
- Nimbus going down temporarily is not a threat to running topologies; they 
will stay up.
- If nimbus local state on disk is lost, topologies will be lost as well.
- Currently multiple nimbus instances conflict, however...
- ...there is work to provide high-availability to nimbus: 
https://issues.apache.org/jira/browse/STORM-166



In the scenario where the nimbus host is permanently lost, here is something 
that might work:
- Configure nimbus host file system as shared storage (like NFS)
- <nimbus host dies!>
- provision a second nimbus host, with the old host name, and using the shared 
storage
- restart nimbus on the new host

This should preserve running topologies, but make sure never to run two nimbus 
daemons concurrently or else the cluster will be unusable.
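
One way to guard against accidentally starting a second nimbus is a lock file 
on the shared storage. The sketch below is not part of Storm: the function 
names and the lock path are hypothetical, and it assumes that exclusive file 
creation is atomic on the shared file system, which should be verified for 
the NFS setup in use.

```python
import os
import socket


def acquire_nimbus_lock(lock_path):
    """Atomically create lock_path.

    Returns True if this process created the lock file, False if the
    file already exists (another nimbus may be running).
    """
    try:
        # O_CREAT | O_EXCL fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        # Record the owner so an operator can tell which host holds the lock.
        lock_file.write(f"{socket.gethostname()} {os.getpid()}\n")
    return True


def release_nimbus_lock(lock_path):
    """Remove the lock file; do nothing if it is already gone."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

A wrapper script could call acquire_nimbus_lock() and launch "storm nimbus" 
only when it returns True. If the old host dies without releasing the lock, 
the stale file has to be removed by hand once that host is confirmed dead.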
--
Derek

On 5/15/14, 7:46, Krzysztof Sadowski wrote:

Let's imagine the following scenario:

    - machine with one supervisor goes down
    - machine with nimbus goes down

Right now, because some workers go down as well, a few queues are not
drained properly, which causes these queues to grow continuously.

To avoid this situation we should rebalance the topology in order to
distribute the load across all of the remaining supervisors, but to do this
I need the nimbus to be up and running. Moreover the basic monitoring
information is not available because StormUI is also not working.

My question is: What is the recommended devops procedure when the machine
with nimbus dies, and what can be done to minimize its unavailability period?
Should we install nimbus on a second machine and run it after the first
machine dies, something similar to failover services? Can we run more than
one nimbus? Or maybe there is a better option?

Thanks for help
Krzysztof Sadowski
