Notes:
- Nimbus going down temporarily is not a threat to running topologies; they
will stay up.
- If nimbus local state on disk is lost, topologies will be lost as well.
- Currently, multiple nimbus instances conflict with each other; however,
there is work under way to provide high availability for nimbus:
https://issues.apache.org/jira/browse/STORM-166
If the nimbus host is permanently lost, here is something that might work:
- Configure the nimbus host's file system as shared storage (like NFS)
- <nimbus host dies!>
- Provision a second nimbus host, with the old host name and using the shared
storage
- Start nimbus on the new host
This should preserve running topologies, but make sure never to run two nimbus
daemons concurrently or else the cluster will be unusable.
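
As a rough guard against starting a second nimbus by accident, something like
the following could be run against the old host's address before starting the
daemon on the new host. This is only a sketch: it assumes nimbus listens on
the default Thrift port 6627 (nimbus.thrift.port), and the function names are
made up for illustration. A successful TCP connect only shows that something
is listening there, and a failed one can also mean a network problem rather
than a dead host, so treat the result as a hint, not proof.

```python
import socket

# Assumption: 6627 is the default nimbus.thrift.port; change this if your
# storm.yaml overrides it.
NIMBUS_THRIFT_PORT = 6627


def nimbus_reachable(host, port=NIMBUS_THRIFT_PORT, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def safe_to_start_nimbus(host, port=NIMBUS_THRIFT_PORT, timeout=3.0):
    """Return True only if no listener answers on the nimbus port."""
    return not nimbus_reachable(host, port, timeout)
```

For example, a provisioning script could call safe_to_start_nimbus() with the
old host's IP address and refuse to continue if it returns False.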
--
Derek
On 5/15/14, 7:46, Krzysztof Sadowski wrote:
Let's imagine the following scenario:
- machine with one supervisor goes down
- machine with nimbus goes down
Right now, because some workers go down as well, a few queues are not
drained properly, which causes these queues to grow continuously.
To avoid this situation we should rebalance the topology to distribute the
load across all of the remaining supervisors, but to do this I need nimbus to
be up and running. Moreover, basic monitoring information is not available
because the Storm UI is also not working.
My question is: What is the operational procedure when the machine running
nimbus dies, and what can be done to minimize its downtime? Should we install
nimbus on a second machine and start it after the first machine dies,
something similar to failover services? Can we run more than one nimbus? Or
maybe there is a better option?
Thanks for your help
Krzysztof Sadowski