I have attached a prototype of one idea for a safe cluster restart.
The attached file contains small changes in IMM and CLM.
The idea is that when a cluster restart is invoked through the CLM admin
operation, CLM first disables sync in IMM (this is the IMM change) and then
continues with rebooting the nodes.
If a rebooted node comes up too fast, before the last IMM veteran node has gone
down, IMM sync will not be possible, and the node will hang in the NID phase,
waiting for the sync.
When the last IMM veteran node goes down, IMMD will start electing a new
coordinator. Since there is no veteran node left in the cluster, the new IMM
coordinator will start loading data from PBE or from the XML file.
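The sequence above can be illustrated with a toy model. This is only a sketch in Python, not OpenSAF code: the class `ImmCluster`, its method names, and the "pick the lowest name" election rule are all hypothetical stand-ins for what IMMD/IMMND and CLM actually do.

```python
class ImmCluster:
    """Toy model of IMM sync during a cluster restart (all names hypothetical)."""

    def __init__(self, veterans):
        self.veterans = set(veterans)    # nodes holding a fully synced IMM
        self.waiting = set()             # rebooted nodes stuck in NID, waiting for sync
        self.sync_enabled = True
        self.coordinator = None
        self.loaded_from_backend = False

    def disable_sync(self):
        # Step 1: CLM asks IMM to stop syncing before nodes are rebooted.
        self.sync_enabled = False

    def node_joins(self, node):
        if self.sync_enabled and self.veterans:
            self.veterans.add(node)      # normal case: node is synced by a veteran
        else:
            self.waiting.add(node)       # sync disabled: node waits in NID

    def veteran_leaves(self, node):
        self.veterans.discard(node)
        if not self.veterans and self.waiting:
            # No veteran left: elect a new coordinator, which loads the data
            # from PBE or XML instead of syncing from the old cluster.
            # min() is an arbitrary placeholder for the real election.
            self.coordinator = min(self.waiting)
            self.waiting.discard(self.coordinator)
            self.veterans.add(self.coordinator)
            self.loaded_from_backend = True


cluster = ImmCluster(["SC-1", "SC-2"])
cluster.disable_sync()
cluster.veteran_leaves("SC-1")   # SC-1 reboots ...
cluster.node_joins("SC-1")       # ... and comes back before SC-2 has gone down
cluster.veteran_leaves("SC-2")   # last veteran goes down
```

In this model the fast node never receives state from the old cluster; it only becomes active after the last veteran is gone and the data has been loaded afresh.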
A side effect of the attached patch is that some nodes that joined before the
last veteran went down can be rebooted a second time, mostly because of the
QUIESCED role in RDE, or because they are payloads running without SC absence
allowed.
There is nothing wrong with rebooting those nodes again. They are still in the
OpenSAF startup phase, and no application is up and running yet, so rebooting
them is safe.
The attached file is only a proposal and needs to be split into two tickets:
one for IMM (the disable sync feature) and this ticket for CLM.
For the IMM part, I would like to make disable sync a one-way function: once
sync is disabled, it cannot be enabled again until the cluster restart is done.
In the attached file, the disable sync feature can still be switched on and
off.
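The one-way behaviour I have in mind could look roughly like the latch below. It is a minimal Python sketch, not the attached patch: `SyncGate` and its methods are hypothetical names, and how IMM would learn that the restart is done is left open.

```python
class SyncGate:
    """One-way latch for IMM sync (hypothetical sketch, not the attached patch)."""

    def __init__(self):
        self._disabled = False

    @property
    def sync_allowed(self):
        return not self._disabled

    def disable(self):
        # Called when the cluster restart admin operation starts.
        self._disabled = True

    def enable(self):
        # An explicit enable is refused while a restart is in progress.
        if self._disabled:
            raise RuntimeError("sync stays disabled until the cluster restart is done")

    def on_cluster_restart_done(self):
        # Only the completed restart releases the latch.
        self._disabled = False
```

The point of the latch is that no admin operation or stray request can re-enable sync halfway through the restart and let a fast node sync with the old cluster.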
Attachments:
- [clmrestart.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/d666d71b/3ab4/attachment/clmrestart.diff) (6.4 kB; application/octet-stream)
---
**[tickets:#2451] clm: Make the cluster reset admin op safe**
**Status:** review
**Milestone:** 5.17.10
**Created:** Wed May 03, 2017 10:51 AM UTC by Anders Widell
**Last Updated:** Fri Sep 15, 2017 06:01 AM UTC
**Owner:** Hans Nordebäck
The cluster reset admin operation that was implemented in ticket [#2053] is not
safe: if a node reboots very fast it can come up again and join the old cluster
before other nodes have rebooted. See mail discussion:
https://sourceforge.net/p/opensaf/mailman/message/35398725/
This can be solved by implementing a two-phase cluster reset or by introducing
a cluster generation number which is increased at each cluster reset (possibly
for both ordered and spontaneous cluster resets). A node that carries a
different cluster generation will not be allowed to join the cluster without
first rebooting.
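The generation number check can be sketched as follows. This is one possible reading of the rule, written in Python for brevity; the `Cluster` class is hypothetical, and representing a freshly rebooted node by a generation of `None` is an assumption, not something the ticket specifies.

```python
class Cluster:
    """Toy admission check based on a cluster generation number (hypothetical)."""

    def __init__(self):
        self.generation = 1
        self.members = set()

    def reset(self):
        # Each cluster reset starts a new generation with no members.
        self.generation += 1
        self.members.clear()

    def try_join(self, node, node_generation):
        # node_generation is None for a node that has just rebooted (assumption);
        # otherwise it is the generation the node last belonged to.
        if node_generation is not None and node_generation != self.generation:
            return False                 # stale node: must reboot before joining
        self.members.add(node)
        return True


cluster = Cluster()
cluster.try_join("PL-3", None)           # joins generation 1
cluster.reset()                          # cluster reset: generation becomes 2
stale = cluster.try_join("PL-3", 1)      # False: still carries generation 1
fresh = cluster.try_join("PL-3", None)   # True: has rebooted first
```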