I have attached a prototype of one idea for a safe cluster restart.
The attached file contains small changes in IMM and CLM.

The idea is that when a cluster restart is invoked through the CLM admin 
operation, CLM first disables sync in IMM (this is the change in IMM), and then 
continues with rebooting the nodes.

If a rebooted node comes up too fast, before the last IMM veteran node goes 
down, IMM sync will not be possible, and the node will hang in the NID phase 
waiting for the sync.
When the last IMM veteran node goes down, IMMD will start electing a new 
coordinator. Since there is no veteran node left in the cluster, the new IMM 
coordinator will start loading data from PBE or the XML file.
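
To make the sequence above concrete, here is a small toy model in Python. It is 
only a sketch of the behaviour as I described it, not the code in the attached 
diff: the class and method names are made up for illustration, and it assumes 
that sync becomes possible again once the data has been reloaded.

```python
from dataclasses import dataclass, field


@dataclass
class ImmClusterModel:
    """Toy model of IMM sync gating during a cluster restart (hypothetical names)."""
    veterans: set = field(default_factory=set)  # nodes holding a synced IMM copy
    waiting: set = field(default_factory=set)   # rebooted nodes stuck in NID, waiting for sync
    sync_enabled: bool = True
    loaded_from_store: bool = False             # True once data was reloaded from PBE/XML

    def disable_sync(self):
        # Called by CLM before it starts rebooting nodes.
        self.sync_enabled = False

    def node_joins(self, node):
        # A joining node is synced only if sync is allowed and a veteran exists.
        if self.sync_enabled and self.veterans:
            self.veterans.add(node)
            return "synced"
        self.waiting.add(node)
        if not self.veterans:
            self._load_from_store()
            return "loaded"
        return "waiting"

    def node_leaves(self, node):
        self.veterans.discard(node)
        self.waiting.discard(node)
        # When the last veteran is gone, a new coordinator is elected and
        # loads from PBE or XML instead of syncing from the old cluster.
        if not self.veterans and self.waiting:
            self._load_from_store()

    def _load_from_store(self):
        # Simplification: every waiting node takes part in the loading.
        # Assumption: sync is allowed again after the reload.
        self.loaded_from_store = True
        self.veterans = set(self.waiting)
        self.waiting.clear()
        self.sync_enabled = True


cluster = ImmClusterModel(veterans={"SC-1", "SC-2", "PL-3"})
cluster.disable_sync()               # CLM admin operation starts here
cluster.node_leaves("SC-1")          # SC-1 reboots very fast ...
state = cluster.node_joins("SC-1")   # ... and comes back before the others are down
cluster.node_leaves("SC-2")
cluster.node_leaves("PL-3")          # last veteran gone, data is reloaded
```

In this model the fast node SC-1 never receives data from the old cluster. It 
stays in the waiting set until the last veteran has left, and only then takes 
part in loading.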

The side effect of the attached change is that some nodes which joined before 
the last veteran went down can be rebooted again, mostly due to the QUIESCED 
role in RDE, or because they are payloads running without SC absence allowed.
There is nothing wrong with rebooting those nodes again. They are still in the 
OpenSAF startup phase, and no application is up and running yet. So rebooting 
those nodes is safe.

The attached file is only a proposal and needs to be split into two tickets, 
one for IMM (the disable sync feature) and this ticket for CLM.

For the IMM part, I would like to make the disable sync function a one-way 
function: once sync is disabled, it cannot be enabled again until the cluster 
restart is done.
In the attached file, the disable sync feature can be switched on and off.
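
The one-way behaviour I have in mind for IMM could look roughly like the latch 
below. This is a Python sketch with invented names (`SyncGate`, 
`on_cluster_restart_done`), not an existing IMM interface and not what the 
attached diff does today.

```python
class SyncGate:
    """One-way 'disable sync' latch (hypothetical sketch, not the IMM API)."""

    def __init__(self):
        self._disabled = False

    @property
    def sync_allowed(self):
        return not self._disabled

    def disable(self):
        # Can be called any number of times; the latch stays set.
        self._disabled = True

    def request_enable(self):
        # An explicit enable request is refused while the latch is set.
        # Returns True only if sync is allowed after the call.
        return not self._disabled

    def on_cluster_restart_done(self):
        # The only way to clear the latch: the restart has completed.
        self._disabled = False


gate = SyncGate()
gate.disable()
refused = not gate.request_enable()   # enable request has no effect
gate.on_cluster_restart_done()
```

The difference from the attached file is that there is no operation that 
switches sync back on during the restart. Only completion of the restart clears 
the flag.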



Attachments:

- [clmrestart.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/d666d71b/3ab4/attachment/clmrestart.diff) (6.4 kB; application/octet-stream)


---

**[tickets:#2451] clm: Make the cluster reset admin op safe**

**Status:** review
**Milestone:** 5.17.10
**Created:** Wed May 03, 2017 10:51 AM UTC by Anders Widell
**Last Updated:** Fri Sep 15, 2017 06:01 AM UTC
**Owner:** Hans Nordebäck


The cluster reset admin operation that was implemented in ticket [#2053] is not 
safe: if a node reboots very fast it can come up again and join the old cluster 
before other nodes have rebooted. See mail discussion:

https://sourceforge.net/p/opensaf/mailman/message/35398725/

This can be solved by implementing a two-phase cluster reset or by introducing 
a cluster generation number which is increased at each cluster reset (maybe 
for both ordered and spontaneous cluster resets). A node will not be allowed to 
join the cluster with a different cluster generation without first rebooting.
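
As a rough illustration of the generation number alternative, the join check 
could work like the Python sketch below. The names are hypothetical, and the 
ticket does not say how the number would be stored or exchanged. The sketch 
assumes a freshly booted node carries no generation (`None`) and adopts the 
current one when it joins.

```python
from typing import Optional


class ClusterView:
    """Toy model of a cluster generation check (hypothetical names)."""

    def __init__(self, generation: int = 1):
        self.generation = generation
        self.members = {}  # node name -> generation it joined with

    def reset(self):
        # Each cluster reset starts a new generation with no members.
        self.generation += 1
        self.members.clear()

    def try_join(self, name: str, node_generation: Optional[int]) -> bool:
        # A node still carrying another generation has not rebooted since
        # that generation was current, so it must reboot before joining.
        if node_generation is not None and node_generation != self.generation:
            return False
        self.members[name] = self.generation
        return True


cluster = ClusterView()
cluster.try_join("SC-1", None)            # fresh node adopts generation 1
cluster.reset()                           # cluster reset, generation is now 2
stale = cluster.try_join("PL-3", 1)       # PL-3 has not rebooted yet: refused
fresh = cluster.try_join("PL-3", None)    # after reboot PL-3 may join
```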

