[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2017-04-20 Thread elunlen
- **status**: assigned --> review
- **assigned_to**: Rafael
- **Milestone**: future --> 5.17.06



---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** review
**Milestone:** 5.17.06
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Wed Apr 12, 2017 01:43 PM UTC
**Owner:** Rafael


When using the one step upgrade feature with a cluster reboot, all nodes will 
restart, including the SC-nodes. This is done as the last action in the upgrade 
step. After the active SC-node is up again, SMF will continue with the procedure 
wrapup. When collecting information in order to prepare the wrapup, the node 
destination for all nodes in the campaign is requested. However, this 
information can only be collected from nodes that are started and have joined 
the cluster (unlocked).
The problem is that SMF does not seem to wait in order to give all nodes a 
chance to join the cluster, and if SMF fails to get the node destination from 
any of the nodes the campaign will fail, as seen in the log below. When reading 
the node destination there is a 10 sec “try again” loop waiting for “node up” 
for each node. It is not unlikely that the active SC-node comes up before some 
of the other nodes and that it will take more than 10 sec after that before 
some of the other nodes join the cluster. If that's the case, the campaign will 
fail.
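
The per-node “try again” loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not SMF's actual code: the function names (`wait_for_node_up`, `collect_node_destinations`), the one-second poll interval and the injected `is_up`, `clock` and `sleep` callables are all hypothetical. Only the 10 sec per-node limit comes from the ticket.

```python
import time

NODE_UP_TIMEOUT_S = 10.0  # per-node limit described in the ticket


def wait_for_node_up(node, is_up, timeout_s=NODE_UP_TIMEOUT_S,
                     poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll is_up(node) until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if is_up(node):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def collect_node_destinations(nodes, is_up, **kwargs):
    """Check nodes in sequence.

    Returns (True, None) if every node came up within its own timeout,
    otherwise (False, first_node_that_timed_out).
    """
    for node in nodes:
        if not wait_for_node_up(node, is_up, **kwargs):
            return False, node
    return True, None
```

In this sketch a single node that needs more than 10 sec is enough to make the whole collection fail, which matches the failure mode reported here.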


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-20 Thread Anders Widell
- **Milestone**: 5.0.1 --> 5.0.2



---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 15, 2016 10:36 AM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-15 Thread elunlen
After some more investigation:
SMF should not have to wait for all nodes in order to change admin state. If 
admin state is changed for a node that is not yet started or part of the 
cluster it should be handled according to the set admin state when it comes up.


---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Tue Sep 13, 2016 11:09 AM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-13 Thread elunlen
I think a separate AMF ticket should be written for the AMF part of this 
problem. However, even if the AMF problem is solved, I think SMF should be fixed 
to handle this in a better way, e.g. by having a configurable timeout for 
waiting for nodes.


---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Fri Sep 09, 2016 12:38 PM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-09 Thread Neelakanta Reddy
The defect case:

1. AMFD gets node up from AMFND but rejects it because the CLM status is not available:

Aug 22 6:54:24.099512 osafamfd [6694:ndfsm.cc:0281] >> avd_node_up_evh: from 2010f, safAmfNode=SC-1,safAmfCluster=myAmfCluster
Aug 22 6:54:24.099519 osafamfd [6694:ndfsm.cc:0335] TR invalid node ID (2010f)
Aug 22 6:54:24.099529 osafamfd [6694:ndfsm.cc:0477] << avd_node_up_evh

It has to be found out why the AMFND does not send CLM info even though it 
started early.

2. The issue here is not the late start of the SC-2-1 node, but why the AMFND 
of SC-2-1 did not send CLM info.


In general:

1. The campaign is a cluster reboot, so all nodes went for reboot at the 
same time. Ideally all will join at the same time.

2. Node start is unpredictable only when the nodes are in a bad state or are 
delayed due to hardware issues.

3. If node start is unpredictable, then the timeout is also unpredictable.

4. The timeout can be made configurable, or otherwise we can use an already 
existing timeout like "smfRebootTimeout", available in 
smfConfig=1,safApp=safSmfService.


---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 08, 2016 12:28 PM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-08 Thread elunlen
When SMF is started after a reboot and shall continue with a campaign, it 
checks that all nodes that are part of the campaign are available. In this case 
the campaign has requested a cluster reboot after the procedure execute state 
is completed. After the restart the campaign shall continue with the procedure 
wrap-up state. The preparation for this includes asking for the node Id of all 
nodes that are part of the campaign, and when all nodes have answered the 
wrap-up will be done.
The problem here is that each node is checked for node up with a 
timeout of 10 s (this is hard coded), and if a node is not up within this time 
the campaign will fail.
• Each node has a timeout of 10 s
• Nodes are checked in sequence, meaning that the last node checked may 
have a longer time to start if there has been any waiting for any of the 
previous ones
• The check starts when smfd has started on the active SC node; some 
of the other nodes may already have started by then and some not
Altogether this means that the behavior is unpredictable, and since the worst 
case gives a rather short timeout it may also be considered unstable.
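
The effect of checking nodes in sequence, each with its own 10 s timeout, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `sequential_check`, the node names and the join times are made up for illustration and are not taken from the logs in this ticket. Time 0 is taken to be the moment smfd starts checking.

```python
def sequential_check(join_times, order, per_node_timeout=10.0):
    """Simulate checking nodes one after another, each with its own timeout.

    join_times maps node name -> seconds after the check starts at which
    the node is up. Returns (ok, failed_node, elapsed_seconds).
    """
    now = 0.0
    for node in order:
        joined_at = join_times[node]
        if joined_at <= now:
            continue                      # already up, no waiting needed
        if joined_at - now <= per_node_timeout:
            now = joined_at               # waited until the node came up
        else:
            return False, node, now + per_node_timeout
    return True, None, now


# Hypothetical join times (seconds after smfd starts checking).
joins = {"PL-3": 8.0, "PL-4": 17.0}

print(sequential_check(joins, ["PL-3", "PL-4"]))  # PL-4 effectively gets 8 s + 10 s
print(sequential_check(joins, ["PL-4", "PL-3"]))  # PL-4 gets only 10 s
```

With identical join times, the first order succeeds and the second fails on PL-4, so the outcome depends only on the order in which the nodes happen to be checked.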

For 2) I suggest the following to be done:
1. Create a temporary (quick) fix by just using a longer (hard coded) 
timeout in the reboot upgrade case, to be released with 5.1 (defect ticket).
Will this create any NBC problem?
2. Define and implement a better handling of this, e.g. by making it 
possible to configure the timeout via a new attribute in the smf configuration 
object. This can be released as an enhancement in 5.2.
Any better suggestions?
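
As a rough illustration of the configurable-timeout idea, the timeout could be resolved from the SMF configuration object with a fallback to the current hard-coded value. Everything in this sketch is an assumption: the configuration is modelled as a plain Python mapping, the attribute name `nodeUpTimeout` is hypothetical, values are treated as seconds (the real attribute types and units are not stated in this thread), and the fallback to `smfRebootTimeout` only reflects the suggestion made elsewhere in this thread.

```python
DEFAULT_NODE_UP_TIMEOUT_S = 10.0  # the hard-coded value discussed in the ticket


def resolve_node_up_timeout(smf_config, dedicated_attr="nodeUpTimeout",
                            fallback_attr="smfRebootTimeout"):
    """Pick the node-up timeout from a config mapping.

    Prefers the (hypothetical) dedicated attribute, then the fallback
    attribute, then the hard-coded default. Values are assumed to be
    positive numbers of seconds.
    """
    for attr in (dedicated_attr, fallback_attr):
        value = smf_config.get(attr)
        if value is not None and value > 0:
            return float(value)
    return DEFAULT_NODE_UP_TIMEOUT_S
```

Keeping the old value as the last fallback means a configuration without the new attribute behaves as before in this sketch.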



---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 01, 2016 09:50 AM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-09-01 Thread Neelakanta Reddy
The standby AMFND (SC-2-1) did not instantiate SMF/SMFND for one minute.
Complete AMFND logs from SC-2-1 are required to see why SMF/SMFND is not 
instantiated.
Apart from the amfnd logs, clmd logs are also required.

1. When amfnd is started, the following messages will appear in the amfnd traces:
Sep  1 11:07:18.936290 osafamfnd [31552:clm.cc:0156] << clm_to_amf_node: 1
Sep  1 11:07:18.936295 osafamfnd [31552:di.cc:0454] >> avnd_send_node_up_msg
Sep  1 11:07:18.936303 osafamfnd [31552:di.cc:1030] >> avnd_di_msg_send: Msg type '1'
Sep  1 11:07:18.936307 osafamfnd [31552:di.cc:1225] >> avnd_diq_rec_add

But, in the present shared logs, the following is happening:

Aug 22  6:40:34.199112 osafamfnd [7027:main.cc:0644] TR Evt Type:32 success
Aug 22  6:40:34.199122 osafamfnd [7027:main.cc:0649] << avnd_evt_process
Aug 22  6:54:24.145784 osafamfnd [6889:main.cc:0621] >> avnd_evt_process
Aug 22  6:54:24.146447 osafamfnd [6889:main.cc:0638] TR Evt type:47
Aug 22  6:54:24.146486 osafamfnd [6889:proxy.cc:0044] >> avnd_evt_mds_avnd_up_evh
Aug 22  6:54:24.146499 osafamfnd [6889:proxy.cc:0053] << avnd_evt_mds_avnd_up_evh
Aug 22  6:54:24.146505 osafamfnd [6889:main.cc:0644] TR Evt Type:47 success

The complete logs are required to see why the amfnd is stuck for one minute.

2. The active AMFD received node up, but discarded it because the CLM status 
is not available.

AMFD gets node up from AMFND but rejects it because the CLM status is not available:
Aug 22  6:54:24.099512 osafamfd [6694:ndfsm.cc:0281] >> avd_node_up_evh: from 2010f, safAmfNode=SC-1,safAmfCluster=myAmfCluster
Aug 22  6:54:24.099519 osafamfd [6694:ndfsm.cc:0335] TR invalid node ID (2010f)
Aug 22  6:54:24.099529 osafamfd [6694:ndfsm.cc:0477] << avd_node_up_evh

3. From the active AMFD, all nodes also joined except SC-1, which 
joined approximately one minute late:


Aug 22  6:54:25.014401 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2020f, version 2
Aug 22  6:54:55.733680 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20c0f, version 2
Aug 22  6:54:55.779699 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2080f, version 2
Aug 22  6:54:55.785660 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2030f, version 2
Aug 22  6:55:05.876639 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2070f, version 2
Aug 22  6:55:05.935932 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20a0f, version 2
Aug 22  6:55:05.940017 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2050f, version 2
Aug 22  6:55:05.942162 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2090f, version 2
Aug 22  6:55:05.944370 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2040f, version 2
Aug 22  6:55:05.946108 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2060f, version 2
Aug 22  6:55:05.959791 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 20b0f, version 2
Aug 22  6:55:23.658639 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2010f, version 2
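
To quantify how late each node joined, the "SMFND UP" trace entries above can be turned into delays relative to the first one. The following is a small helper sketch, not part of OpenSAF: the function name `smfnd_up_delays` and the regular expression are assumptions based only on the trace format shown here, each entry is assumed to have been joined onto a single line, and all entries are assumed to be from the same day.

```python
import re
from datetime import datetime

# Matches e.g.
# "Aug 22  6:54:25.014401 osafsmfd [6727:smfd_smfnd.c:0123] TR SMFND UP for node id 2020f, version 2"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<time>\d+:\d+:\d+\.\d+)\s+osafsmfd\s+\[[^\]]*\]\s+"
    r"TR SMFND UP for node id (?P<node>[0-9a-f]+)"
)


def smfnd_up_delays(lines):
    """Return [(node_id, seconds after the first SMFND UP)] in log order."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            # Only the time of day is parsed; same-day entries are assumed.
            ts = datetime.strptime(m.group("time"), "%H:%M:%S.%f")
            events.append((m.group("node"), ts))
    if not events:
        return []
    first = min(ts for _, ts in events)
    return [(node, (ts - first).total_seconds()) for node, ts in events]


sample = [
    "Aug 22  6:54:25.014401 osafsmfd [6727:smfd_smfnd.c:0123] "
    "TR SMFND UP for node id 2020f, version 2",
    "Aug 22  6:55:23.658639 osafsmfd [6727:smfd_smfnd.c:0123] "
    "TR SMFND UP for node id 2010f, version 2",
]
for node, delay in smfnd_up_delays(sample):
    print(node, delay)
```

For the first and last entries of the trace, this gives a delay of about 58.6 s for node id 2010f, consistent with the "approximately one minute late" observation.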

4. Increasing the timeout or fixing this in SMF may not be a solution, as there 
is some problem with the SC-2-1 node joining that has to be sorted out.



---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Aug 25, 2016 02:52 AM UTC
**Owner:** nobody



[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start

2016-08-24 Thread Vu Minh Nguyen
- **summary**: log: One step upgrade with cluster reboot does not wait for 
nodes to start --> smf: One step upgrade with cluster reboot does not wait for 
nodes to start



---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for 
nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Wed Aug 24, 2016 01:01 PM UTC
**Owner:** nobody

