[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
- **status**: assigned --> review
- **assigned_to**: Rafael
- **Milestone**: future --> 5.17.06

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** review
**Milestone:** 5.17.06
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Wed Apr 12, 2017 01:43 PM UTC
**Owner:** Rafael

When using the one-step upgrade feature with a cluster reboot, all nodes restart, including the SC nodes. This is done as the last action in the upgrade step. After the active SC node is up again, SMF continues with the procedure wrap-up. When collecting information to prepare the wrap-up, SMF requests the node destination for all nodes in the campaign. However, this information can only be collected from nodes that have started and joined the cluster (unlocked).

The problem is that SMF does not seem to wait to give all nodes a chance to join the cluster, and if SMF fails to get the node destination from any of the nodes, the campaign fails, as seen in the log below. When reading the node destination there is a 10-second “try again” loop waiting for “node up” for each node. It is not unlikely that the active SC node comes up before some of the other nodes, and that more than 10 seconds then pass before some of those nodes join the cluster. If that is the case, the campaign fails.

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
- **Milestone**: 5.0.1 --> 5.0.2

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 15, 2016 10:36 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
After some more investigation: SMF should not have to wait for all nodes in order to change the admin state. If the admin state is changed for a node that has not yet started or is not yet part of the cluster, the node should be handled according to the configured admin state when it comes up.

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Tue Sep 13, 2016 11:09 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
I think a separate AMF ticket should be written for the AMF part of this problem. However, even if the AMF problem is solved, I think SMF should be fixed to handle this in a better way, e.g. by having a configurable timeout for waiting for nodes.

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Fri Sep 09, 2016 12:38 PM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
The defect case:

1. AMFD gets node up from AMFND but rejects it because the CLM status is not available:

    Aug 22 6:54:24.099512 osafamfd [6694:ndfsm.cc:0281] >> avd_node_up_evh: from 2010f, safAmfNode=SC-1,safAmfCluster=myAmfCluster
    Aug 22 6:54:24.099519 osafamfd [6694:ndfsm.cc:0335] TR invalid node ID (2010f)
    Aug 22 6:54:24.099529 osafamfd [6694:ndfsm.cc:0477] << avd_node_up_evh

   It has to be found out why AMFND does not send the CLM info even though it started early.

2. The issue here is not the late start of the SC-2-1 node, but why the AMFND of SC-2-1 did not send the CLM info.

In general:

1. The campaign is a cluster reboot, so all nodes went down for reboot at the same time. Ideally they all join at the same time.
2. Node start-up is unpredictable only when nodes are in a bad state or are delayed by hardware issues.
3. If node start-up is unpredictable, then the timeout is also unpredictable.
4. This can be made configurable, or we can reuse an existing timeout such as "smfRebootTimeout", available in smfConfig=1,safApp=safSmfService.

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 08, 2016 12:28 PM UTC
**Owner:** nobody
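For illustration only, the idea in point 4 of the comment above (reusing an already configured timeout instead of a hard-coded one) could look roughly like the Python sketch below. This is not SMF code. The names `wait_for_nodes` and `is_up` are hypothetical, and the sketch assumes the configured value is an SaTimeT-style duration in nanoseconds, which should be verified against the actual smfRebootTimeout attribute before relying on it.

```python
import time
from typing import Callable, Iterable, List

NANOS_PER_SECOND = 1_000_000_000


def wait_for_nodes(nodes: Iterable[str],
                   is_up: Callable[[str], bool],
                   timeout_ns: int,
                   poll_interval_s: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll until every node is up or one shared deadline expires.

    timeout_ns is assumed to be a duration in nanoseconds taken from
    configuration (hypothetical mapping to smfRebootTimeout).
    Returns the nodes that are still down when the function gives up;
    an empty list means all nodes joined in time.
    """
    deadline = clock() + timeout_ns / NANOS_PER_SECOND
    pending = list(nodes)
    while True:
        # Drop every node that has reported "up" since the last poll.
        pending = [node for node in pending if not is_up(node)]
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a simulated clock; a caller would normally leave them at their defaults and fail the campaign only if the returned list is non-empty.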
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
When SMF is started after a reboot and is to continue with a campaign, it checks that all nodes that are part of the campaign are available. In this case the campaign has requested a cluster reboot after the procedure execute state is completed. After the restart the campaign shall continue with the procedure wrap-up state. The preparation for this includes asking for the node ID of all nodes that are part of the campaign, and when all nodes have answered the wrap-up is done. The problem is that each node is checked for node up with a timeout of 10 s (this is hard-coded), and if a node is not up within this time the campaign fails.

• Each node has a timeout of 10 s
• Nodes are checked in sequence, meaning that the last node checked may get a longer time to start if any waiting was done for the previous ones
• The check starts when smfd has started on the active SC node; by then some of the other nodes may already have started and some not

Altogether this means that the behavior is unpredictable, and since the worst case gives a rather short timeout it may also be considered unstable.

For 2) I suggest the following to be done:

1. Create a temporary (quick) fix by just using a longer (hard-coded) timeout if reboot upgrade is to be released with 5.1 (defect ticket). Will this create any NBC problem?
2. Define and implement a better handling of this, e.g. by making it possible to configure the timeout via a new attribute in the SMF configuration object. This can be released as an enhancement in 5.2.

Any better suggestions?

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 01, 2016 09:50 AM UTC
**Owner:** nobody
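The order dependence described in the comment above can be made concrete with a small simulation. The Python sketch below is not taken from smfd; the function name, node names and join times are made up for illustration, and it only models the stated behavior (nodes checked one after another, each with its own 10 s window).

```python
from typing import Dict, List, Optional

PER_NODE_TIMEOUT_S = 10.0  # the hard-coded per-node value described above


def sequential_check(join_times: Dict[str, float],
                     order: List[str],
                     timeout_s: float = PER_NODE_TIMEOUT_S) -> Optional[str]:
    """Simulate checking nodes one by one, each with its own timeout.

    join_times maps a node name to the number of seconds after smfd
    start at which that node joins the cluster.
    Returns the first node that times out, or None if every node is up
    by the time it is checked.
    """
    now = 0.0  # simulated clock: seconds since smfd started
    for node in order:
        joined_at = join_times[node]
        if joined_at <= now:
            continue  # node already joined while earlier nodes were checked
        if joined_at - now > timeout_s:
            return node  # not up within its window: the campaign would fail
        now = joined_at  # waited until this node joined
    return None


# Hypothetical cluster: same join times, two different checking orders.
join_times = {"PL-3": 9.0, "PL-4": 18.0, "SC-2": 25.0}
slow_node_last = sequential_check(join_times, ["PL-3", "PL-4", "SC-2"])
slow_node_first = sequential_check(join_times, ["SC-2", "PL-3", "PL-4"])
```

With these made-up numbers the campaign passes when the slowest node is checked last, because the waiting done for the earlier nodes gives it extra time, and fails when the same node is checked first.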
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
The standby AMFND (SC-2-1) did not instantiate SMF/SMFND for one minute. Complete AMFND logs from SC-2-1 are required to find out why SMF/SMFND was not instantiated. Apart from the amfnd logs, clmd logs are also required.

1. When amfnd is started, the following messages appear in the amfnd traces:

    Sep 1 11:07:18.936290 osafamfnd [31552:clm.cc:0156] << clm_to_amf_node: 1
    Sep 1 11:07:18.936295 osafamfnd [31552:di.cc:0454] >> avnd_send_node_up_msg
    Sep 1 11:07:18.936303 osafamfnd [31552:di.cc:1030] >> avnd_di_msg_send: Msg type '1'
    Sep 1 11:07:18.936307 osafamfnd [31552:di.cc:1225] >> avnd_diq_rec_add

   But in the logs shared so far, the following happens:

    Aug 22 6:40:34.199112 osafamfnd [7027:main.cc:0644] TR Evt Type:32 success
    Aug 22 6:40:34.199122 osafamfnd [7027:main.cc:0649] << avnd_evt_process
    Aug 22 6:54:24.145784 osafamfnd [6889:main.cc:0621] >> avnd_evt_process
    Aug 22 6:54:24.146447 osafamfnd [6889:main.cc:0638] TR Evt type:47
    Aug 22 6:54:24.146486 osafamfnd [6889:proxy.cc:0044] >> avnd_evt_mds_avnd_up_evh
    Aug 22 6:54:24.146499 osafamfnd [6889:proxy.cc:0053] << avnd_evt_mds_avnd_up_evh
    Aug 22 6:54:24.146505 osafamfnd [6889:main.cc:0644] TR Evt Type:47 success

   The complete logs are required to find out why amfnd is stuck for one minute.

2. The active AMFD received node up but discarded it because the CLM status was not available:

    Aug 22 6:54:24.099512 osafamfd [6694:ndfsm.cc:0281] >> avd_node_up_evh: from 2010f, safAmfNode=SC-1,safAmfCluster=myAmfCluster
    Aug 22 6:54:24.099519 osafamfd [6694:ndfsm.cc:0335] TR invalid node ID (2010f)
    Aug 22 6:54:24.099529 osafamfd [6694:ndfsm.cc:0477] << avd_node_up_evh

3. As seen from the active node, all nodes joined except SC-1, which joined approximately one minute late:

    Aug 22 6:54:25.014401 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2020f, version 2
    Aug 22 6:54:55.733680 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20c0f, version 2
    Aug 22 6:54:55.779699 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2080f, version 2
    Aug 22 6:54:55.785660 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2030f, version 2
    Aug 22 6:55:05.876639 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2070f, version 2
    Aug 22 6:55:05.935932 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20a0f, version 2
    Aug 22 6:55:05.940017 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2050f, version 2
    Aug 22 6:55:05.942162 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2090f, version 2
    Aug 22 6:55:05.944370 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2040f, version 2
    Aug 22 6:55:05.946108 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2060f, version 2
    Aug 22 6:55:05.959791 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20b0f, version 2
    Aug 22 6:55:23.658639 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2010f, version 2

4. Increasing the timeout or fixing this in SMF may not be a solution, as there is some problem with the SC-2-1 node joining that has to be sorted out.

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Aug 25, 2016 02:52 AM UTC
**Owner:** nobody
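The "approximately one minute late" figure in point 3 above can be checked directly from the SMFND UP trace lines. The Python sketch below is only a helper for reading such traces; the function names are made up, and it assumes all lines are from the same day and follow the trace format quoted in the comment.

```python
import re
from typing import Dict, Iterable

# Matches the time of day and node id in osafsmfd "SMFND UP" trace lines.
LINE_RE = re.compile(
    r"(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+)\s+osafsmfd\s.*"
    r"SMFND UP for node id (?P<node>[0-9a-f]+)"
)


def smfnd_up_seconds(lines: Iterable[str]) -> Dict[str, float]:
    """Map node id -> time of day (in seconds) when SMFND UP was logged."""
    times = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            times[match.group("node")] = (int(match.group("h")) * 3600
                                          + int(match.group("m")) * 60
                                          + float(match.group("s")))
    return times


def delay_after_first(times: Dict[str, float]) -> Dict[str, float]:
    """Return, per node, how many seconds after the first node it came up."""
    first = min(times.values())
    return {node: t - first for node, t in times.items()}


# First and last SMFND UP lines quoted in the comment above.
sample = [
    "Aug 22 6:54:25.014401 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2020f, version 2",
    "Aug 22 6:55:23.658639 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2010f, version 2",
]
delays = delay_after_first(smfnd_up_seconds(sample))
```

For these two lines the delay for node 2010f comes out at about 58.6 s, far more than a 10 s per-node wait would tolerate.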
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
- **summary**: log: One step upgrade with cluster reboot does not wait for nodes to start --> smf: One step upgrade with cluster reboot does not wait for nodes to start

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Wed Aug 24, 2016 01:01 PM UTC
**Owner:** nobody