[devel] [PATCH 1/1] amfnd: change log message severity [#2945]

2018-10-24 Thread Gary Lee
---
 src/amf/amfnd/clm.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/amf/amfnd/clm.cc b/src/amf/amfnd/clm.cc
index f1f65bcef..06eb229c7 100644
--- a/src/amf/amfnd/clm.cc
+++ b/src/amf/amfnd/clm.cc
@@ -124,7 +124,7 @@ static void clm_to_amf_node(void) {
 
   error = saImmOmInitialize_cond(, nullptr, );
   if (SA_AIS_OK != error) {
-LOG_CR("saImmOmInitialize failed. Use previous value of nodeName.");
+LOG_WA("saImmOmInitialize failed. Use previous value of nodeName.");
 osafassert(avnd_cb->amf_nodeName.empty() == false);
 goto done1;
   }
-- 
2.17.1



___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


[devel] [PATCH 0/1] Review Request for amfnd: change log message severity [#2945]

2018-10-24 Thread Gary Lee
Summary: amfnd: change log message severity [#2945]
Review request for Ticket(s): 2945
Peer Reviewer(s): Hans, Minh, Nagu
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-2945
Base revision: 3b80698770d599bc15b97119cbfd4098943d7643
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area        Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
-

revision c7de076e4efbcc2c4822e7ad4f8eafa0cdf61f46
Author: Gary Lee 
Date:   Thu, 25 Oct 2018 05:24:25 +

amfnd: change log message severity [#2945]



Complete diffstat:
--
 src/amf/amfnd/clm.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


Testing Commands:
-
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***


Testing, Expected Results:
--
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***


Conditions of Submission:
-
Ack from any reviewer, or in 5 days

Arch        Built     Started    Linux distro
---------------------------------------------
mips          n          n
mips64        n          n
x86           n          n
x86_64        y          y
powerpc       n          n
powerpc64     n          n


Reviewer Checklist:
---
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
too much content in a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer has a badly configured date and time, confusing the
threaded patch review.

___ Your changes affect the IPC mechanism, and you don't present any results
for an in-service upgradability test.

___ Your changes affect the user manual and documentation, but your patch
series does not contain the patch that updates the Doxygen manual.





[devel] [PATCH 1/1] mds: Send NCSMDS_DOWN with vdest if there is no any adest [#2941]

2018-10-24 Thread Minh Chau
If a split brain happens and the network then merges back, several mds
events reach the payloads at about the same time: SVC UP from the
other controller, and SVC DOWN from the services on both controllers,
which reboot because of split-brain detection.
In the ticket description, the first partition includes SC1 and PL3;
the second partition includes SC2, PL4 and PL5. The amfnd on PL3
misses NCSMDS_DOWN with vdest in the scenario below:

- SVC up event from the other amfd (on SC2).
- SVC down event from amfd (SC1). From mds-PL3's view this is the
active adest, so the await_active timer is started, but no NCSMDS_DOWN
with vdest is sent because the adest on SC2 still exists.
- SVC down event from amfd (SC2), which is a different adest from the
active one.

Because the payloads reside in different partitions, they do not
have the same view of the active adest at mds level. When both SCs go
down due to split-brain detection, the same SVC down events reach
all payloads, but because their views differ, the payloads in one
partition behave differently from those in the other.

The patch adds a condition so that NCSMDS_DOWN is also sent when no
adest remains for the service.
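
The added condition can be sketched outside of MDS roughly as follows.
This is only an illustration of the check, not the MDS code:
SubscriptionEntry, kVdestIdForAdestEntry (and its value), VdestPolicy,
AdestExists and ShouldSendDown are hypothetical stand-ins for the
subscription result table, m_VDEST_ID_FOR_ADEST_ENTRY and the
vdest_policy check used in the diff.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for m_VDEST_ID_FOR_ADEST_ENTRY; the value is assumed.
constexpr uint16_t kVdestIdForAdestEntry = 0xffff;

// Hypothetical stand-in for the vdest policy (NCS_VDEST_TYPE_MxN and others).
enum class VdestPolicy { kMxN, kNWayActive };

// Hypothetical, reduced view of one subscription result entry.
struct SubscriptionEntry {
  uint16_t vdest_id;
  uint64_t adest;
};

// Mirrors the while loop in the patch: reports true as soon as a remaining
// entry has a vdest_id other than the ADEST-entry marker.
bool AdestExists(const std::vector<SubscriptionEntry>& remaining) {
  for (const auto& entry : remaining) {
    if (entry.vdest_id != kVdestIdForAdestEntry) return true;
  }
  return false;
}

// The new condition: the adest that went down is not the active one, the
// policy is MxN, and the scan above found nothing.
bool ShouldSendDown(uint64_t active_adest, uint64_t down_adest,
                    VdestPolicy policy,
                    const std::vector<SubscriptionEntry>& remaining) {
  return active_adest != down_adest && policy == VdestPolicy::kMxN &&
         !AdestExists(remaining);
}
```

In the scenario above, the second SVC down (from SC2) would be the event
for which this check returns true on PL3.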
---
 src/mds/mds_c_api.c | 80 ++---
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/src/mds/mds_c_api.c b/src/mds/mds_c_api.c
index f5ba318..73849cc 100644
--- a/src/mds/mds_c_api.c
+++ b/src/mds/mds_c_api.c
@@ -3644,13 +3644,58 @@ uint32_t mds_mcm_svc_down(PW_ENV_ID pwe_id, MDS_SVC_ID svc_id, V_DEST_RL role,
local_svc_hdl, svc_id, vdest_id,
_adest, _running,
_result_info, true);
-
+   m_MDS_LOG_INFO("MCM:API: svc_down: "
+ "active_adest:%lu", active_adest);
/* First delete the entry */
mds_subtn_res_tbl_del(
local_svc_hdl, svc_id, vdest_id, adest,
vdest_policy, svc_sub_part_ver,
archword_type);
 
+   MDS_SUBSCRIPTION_RESULTS_INFO *s_info = NULL;
+   bool adest_exists = false;
+
+   /* if no adest remains for this svc
+* send MDS_DOWN
+*/
+   status = mds_subtn_res_tbl_getnext_any(
+   local_svc_hdl, svc_id,
+   _info);
+
+   while (status != NCSCC_RC_FAILURE) {
+   if (s_info->key.vdest_id !=
+   m_VDEST_ID_FOR_ADEST_ENTRY) {
+   adest_exists = true;
+   break;
+   }
+
+   status = mds_subtn_res_tbl_getnext_any(
+   local_svc_hdl, svc_id, _info);
+   }
+
+   if (active_adest != adest
+ && vdest_policy == NCS_VDEST_TYPE_MxN
+   && adest_exists == false) {
+   m_MDS_LOG_INFO("MCM:API: svc_down : "
+   "svc_id = %s(%d) on DEST id = %d "
+   "got NO_ACTIVE for svc_id = %s(%d) "
+"on Vdest id = %d Adest = %s, rem_svc_pvt_ver=%d",
+   get_svc_names(
+   m_MDS_GET_SVC_ID_FROM_SVC_HDL(local_svc_hdl)),
+   m_MDS_GET_SVC_ID_FROM_SVC_HDL(
+   local_svc_hdl),
+   m_MDS_GET_VDEST_ID_FROM_SVC_HDL(
+   local_svc_hdl),
+   get_svc_names(svc_id), svc_id,
+   vdest_id,
+   log_subtn_result_info->sub_adest_details,
+   svc_sub_part_ver);
+   status = mds_mcm_user_event_callback(
+ local_svc_hdl, pwe_id, svc_id,
+ role, vdest_id, 0, NCSMDS_DOWN,
+   svc_sub_part_ver, archword_type);
+   }
+
if (active_adest == adest) {
if (vdest_policy ==
NCS_VDEST_TYPE_MxN) {

[devel] [PATCH 4/4] amfd: add support for delaying node failover [#2918]

2018-10-24 Thread Gary Lee
OpenSAF has relied on reliable, redundant links between nodes in a cluster.
This can no longer be assumed in virtualised environments.

In order to avoid duplicate assignments, we need to delay
node failover in environments where temporary network partitioning is expected.

When delayed node failover is enabled, AMF will not perform a node
failover until the node has been fenced (if remote fencing is available)
or until the configured period (osafAmfDelayNodeFailoverTimeout) has elapsed.

If MDS connectivity is re-established while waiting, AMF will wait up to
osafAmfDelayNodeFailoverNodeUpWait seconds for a node up message
(with leds_set == false), which indicates that the node has already
been rebooted, and then finish the node failover.

Otherwise, AMF will send a message to the node
asking it to reboot itself. When AMF sees that the MDS connectivity is
lost again, or after osafAmfDelayNodeFailoverNodeUpWait seconds,
it can consider the fencing to be complete and finish the node failover.
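
The behaviour described above can be read as a small state machine.
The sketch below is one possible reading, not the contents of
node_state.cc: the State and Event names and the Next function are
hypothetical, and it assumes that the reboot request is sent when the
node-up wait expires without a node up message.

```cpp
// Hypothetical states of a node whose amfnd has been lost.
enum class State {
  kUp,                 // normal operation
  kWaitFailoverTimer,  // MDS down seen, osafAmfDelayNodeFailoverTimeout running
  kWaitNodeUp,         // MDS up again, waiting for node up (leds_set == false)
  kWaitReboot,         // reboot requested, waiting for MDS down or timeout
  kFailedOver          // node failover has been performed
};

// Hypothetical events driving the state machine.
enum class Event { kMdsDown, kMdsUp, kNodeUpAfterReboot, kFenced, kTimerExpired };

struct Transition {
  State next;
  bool send_reboot;  // true when AMF should ask the node to reboot itself
};

// Returns the next state for (state, event); unknown pairs keep the state.
Transition Next(State state, Event event) {
  switch (state) {
    case State::kUp:
      if (event == Event::kMdsDown) return {State::kWaitFailoverTimer, false};
      break;
    case State::kWaitFailoverTimer:
      if (event == Event::kFenced || event == Event::kTimerExpired)
        return {State::kFailedOver, false};
      if (event == Event::kMdsUp) return {State::kWaitNodeUp, false};
      break;
    case State::kWaitNodeUp:
      if (event == Event::kNodeUpAfterReboot) return {State::kFailedOver, false};
      if (event == Event::kTimerExpired) return {State::kWaitReboot, true};
      break;
    case State::kWaitReboot:
      if (event == Event::kMdsDown || event == Event::kTimerExpired)
        return {State::kFailedOver, false};
      break;
    case State::kFailedOver:
      break;
  }
  return {state, false};
}
```

Keeping the transition function free of side effects makes each path
(fenced, timeout, node already rebooted, reboot requested) easy to check
in isolation.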
---
 src/amf/Makefile.am|   6 +
 src/amf/amfd/cb.h  |  24 +-
 src/amf/amfd/clm.cc|  12 +-
 src/amf/amfd/cluster.cc|  18 ++
 src/amf/amfd/cluster.h |   1 +
 src/amf/amfd/config.cc |  35 ++-
 src/amf/amfd/evt.h |   1 +
 src/amf/amfd/main.cc   |  13 +-
 src/amf/amfd/ndfsm.cc  |  70 +-
 src/amf/amfd/ndproc.cc |  14 +-
 src/amf/amfd/node.cc   |   2 +
 src/amf/amfd/node_state.cc | 338 +
 src/amf/amfd/node_state.h  | 101 +
 src/amf/amfd/node_state_machine.cc |  98 +
 src/amf/amfd/node_state_machine.h  |  39 
 src/amf/amfd/proc.h|   2 +-
 src/amf/amfd/role.cc   |   9 +-
 src/amf/amfd/timer.cc  |   6 +-
 src/amf/amfd/timer.h   |   1 +
 19 files changed, 761 insertions(+), 29 deletions(-)
 create mode 100644 src/amf/amfd/node_state.cc
 create mode 100644 src/amf/amfd/node_state.h
 create mode 100644 src/amf/amfd/node_state_machine.cc
 create mode 100644 src/amf/amfd/node_state_machine.h

diff --git a/src/amf/Makefile.am b/src/amf/Makefile.am
index 413571a52..8544effd4 100644
--- a/src/amf/Makefile.am
+++ b/src/amf/Makefile.am
@@ -107,6 +107,8 @@ noinst_HEADERS += \
src/amf/amfd/mds.h \
src/amf/amfd/msg.h \
src/amf/amfd/node.h \
+   src/amf/amfd/node_state.h \
+   src/amf/amfd/node_state_machine.h \
src/amf/amfd/ntf.h \
src/amf/amfd/pg.h \
src/amf/amfd/proc.h \
@@ -225,6 +227,8 @@ bin_testamfd_LDFLAGS = \
src/amf/amfd/bin_osafamfd-ndmsg.o \
src/amf/amfd/bin_osafamfd-ndproc.o \
src/amf/amfd/bin_osafamfd-node.o \
+   src/amf/amfd/bin_osafamfd-node_state.o \
+   src/amf/amfd/bin_osafamfd-node_state_machine.o \
src/amf/amfd/bin_osafamfd-nodegroup.o \
src/amf/amfd/bin_osafamfd-nodeswbundle.o \
src/amf/amfd/bin_osafamfd-ntf.o \
@@ -327,6 +331,8 @@ bin_osafamfd_SOURCES = \
src/amf/amfd/ndmsg.cc \
src/amf/amfd/ndproc.cc \
src/amf/amfd/node.cc \
+   src/amf/amfd/node_state.cc \
+   src/amf/amfd/node_state_machine.cc \
src/amf/amfd/nodegroup.cc \
src/amf/amfd/nodeswbundle.cc \
src/amf/amfd/ntf.cc \
diff --git a/src/amf/amfd/cb.h b/src/amf/amfd/cb.h
index 3b7e6d13f..d3d88c1ed 100644
--- a/src/amf/amfd/cb.h
+++ b/src/amf/amfd/cb.h
@@ -37,18 +37,21 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+#include 
+#include 
+
 #include "base/ncssysf_lck.h"
 #include "mds/mds_papi.h"
 #include "mbc/mbcsv_papi.h"
 #include "base/ncs_edu_pub.h"
 
 #include "amf/amfd/ckpt.h"
+#include "amf/amfd/node_state_machine.h"
 #include "amf/amfd/timer.h"
 
-#include 
-#include 
-#include 
-
 class AVD_SI;
 class AVD_AVND;
 
@@ -248,6 +251,19 @@ typedef struct cl_cb_tag {
   /* The duration that amfd should tolerate the absence of SCs */
   uint32_t scs_absence_max_duration;
   AVD_IMM_INIT_STATUS avd_imm_status;
+
+  // MDS_DOWN received for node, we are delaying node failover by this
+  // number of seconds (timer1)
+  SaTimeT node_failover_delay;
+
+  // after receiving MDS_UP, we will wait for NODE_UP up to this number
+  // of seconds (timer2)
+  SaTimeT node_failover_nodeup_wait;
+
+  using FailedNodeMap = std::map>;
+  // We received amfnd down for these nodes
+  FailedNodeMap failover_list;
+
 } AVD_CL_CB;
 
 extern AVD_CL_CB *avd_cb;
diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index 1e67ff389..aeae93931 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -202,8 +202,11 @@ static void clm_node_exit_complete(SaClmNodeIdT nodeId) {
 goto done;
   }
 
-  avd_node_failover(node);
-  avd_node_delete_nodeid(node);
+  if (avd_cb->failover_list.count(node->node_info.nodeId) == 0 &&
+avd_cb->node_failover_delay == 0) {
+avd_node_failover(node);
+avd_node_delete_nodeid(node);
+  }
   

[devel] [PATCH 0/4] Review Request for amfd: add support for delaying node failover [#2918]

2018-10-24 Thread Gary Lee
Summary: amfd: add support for delaying node failover [#2918] 
Review request for Ticket(s): 2918
Peer Reviewer(s): Hans, Minh, Nagu 
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-2918
Base revision: 3b80698770d599bc15b97119cbfd4098943d7643
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area        Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
-

Please see ticket for more details and a state diagram is available there.

revision 7e04f9bc5aea4f5580e3bdf0551b37c05bfc4025
Author: Gary Lee 
Date:   Wed, 24 Oct 2018 11:37:04 +

amfd: add support for delaying node failover [#2918]

OpenSAF has relied on reliable, redundant links between nodes in a cluster.
This can no longer be assumed in virtualised environments.

In order to avoid duplicate assignments, we need to delay
node failover in environments where temporary network partitioning is expected.

When delayed node failover is enabled, AMF will not perform a node
failover until the node has been fenced (if remote fencing is available)
or until the configured period (osafAmfDelayNodeFailoverTimeout) has elapsed.

If MDS connectivity is re-established while waiting, AMF will wait up to
osafAmfDelayNodeFailoverNodeUpWait seconds for a node up message
(with leds_set == false), which indicates that the node has already
been rebooted, and then finish the node failover.

Otherwise, AMF will send a message to the node
asking it to reboot itself. When AMF sees that the MDS connectivity is
lost again, or after osafAmfDelayNodeFailoverNodeUpWait seconds,
it can consider the fencing to be complete and finish the node failover.



revision 184835903e2c0d4544c69b2348d7095afb91219f
Author: Gary Lee 
Date:   Wed, 24 Oct 2018 11:37:04 +

amfd: add checkpointing of node failover state [#2918]



revision 7052963a7b555d256c2674aee0cfa2cb2497dd68
Author: Gary Lee 
Date:   Wed, 24 Oct 2018 11:37:04 +

amfnd: allow reboot from any director [#2918]

allow reboot msg to be sent from any director, for
split brain recovery situations



revision 7aeb96aebae4dec85b59a83e0755337ff6be3c28
Author: Gary Lee 
Date:   Wed, 24 Oct 2018 11:36:56 +

amfd: add class definitions for new timers [#2918]

osafAmfDelayNodeFailoverTimeout - the number of seconds we wait
after MDS down is received before we consider the node truly down.

osafAmfDelayNodeFailoverNodeUpWait - the number of seconds we
wait for Node Up after receiving MDS up, before we send reboot
to the node. After sending reboot to a node, also wait up to
this number of seconds before we consider the node to be
down (unless MDS down is received first).



Added Files:

 src/amf/amfd/node_state.cc
 src/amf/amfd/node_state.h
 src/amf/amfd/node_state_machine.cc
 src/amf/amfd/node_state_machine.h


Complete diffstat:
--
 src/amf/Makefile.am|   6 +
 src/amf/amfd/cb.h  |  24 ++-
 src/amf/amfd/chkop.cc  |  10 ++
 src/amf/amfd/ckpt.h|   3 +-
 src/amf/amfd/ckpt_dec.cc   |  40 -
 src/amf/amfd/ckpt_enc.cc   |  26 ++-
 src/amf/amfd/ckpt_msg.h|   1 +
 src/amf/amfd/clm.cc|  12 +-
 src/amf/amfd/cluster.cc|  18 ++
 src/amf/amfd/cluster.h |   1 +
 src/amf/amfd/config.cc |  35 +++-
 src/amf/amfd/evt.h |   1 +
 src/amf/amfd/main.cc   |  13 +-
 src/amf/amfd/ndfsm.cc  |  70 ++--
 src/amf/amfd/ndproc.cc |  14 +-
 src/amf/amfd/node.cc   |   2 +
 src/amf/amfd/node_state.cc | 338 +
 src/amf/amfd/node_state.h  | 101 +++
 src/amf/amfd/node_state_machine.cc |  98 +++
 src/amf/amfd/node_state_machine.h  |  39 +
 src/amf/amfd/proc.h|   2 +-
 src/amf/amfd/role.cc   |   9 +-
 src/amf/amfd/timer.cc  |   6 +-
 src/amf/amfd/timer.h   |   1 +
 src/amf/amfnd/mds.cc   |   3 +-
 src/amf/config/amf_classes.xml |  14 +-
 src/amf/config/amf_objects.xml |   8 +
 27 files changed, 860 insertions(+), 35 deletions(-)


Testing Commands:
-

Test Case 1:

0. Set 'osafAmfDelayNodeFailoverTimeout' to 15s
1. 2N app on PL3 (active) and PL4 (standby)
2. Reboot PL3 (assuming it comes back within 15s)
3. Ensure PL4 is only assigned active after PL3 is up

Test Case 2:

1. NwayActive app on PL3, PL4 and PL5
2. Isolate PL3 from the rest of network
3. Remove isolation
4. Ensure PL3 is rebooted by AMF

[devel] [PATCH 2/4] amfnd: allow reboot from any director [#2918]

2018-10-24 Thread Gary Lee
allow reboot msg to be sent from any director, for
split brain recovery situations
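
The changed condition can be sketched as a predicate. This is an
illustration only: MsgType and ShouldReject are hypothetical names, and
the sketch assumes that a message matching the condition is rejected
after the mismatch is logged.

```cpp
#include <cstdint>

// Hypothetical message types standing in for AVSV_D2N_HEARTBEAT_MSG,
// AVSV_D2N_REBOOT_MSG and every other director-to-node message.
enum class MsgType : uint32_t { kHeartbeat, kReboot, kOther };

// True when a message from a director should be rejected because it did not
// come from the active director. Heartbeats were already exempt; with this
// patch, reboot requests are exempt as well.
bool ShouldReject(uint64_t from_dest, uint64_t active_avd_adest, MsgType type) {
  return from_dest != active_avd_adest && type != MsgType::kHeartbeat &&
         type != MsgType::kReboot;
}
```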
---
 src/amf/amfnd/mds.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/amf/amfnd/mds.cc b/src/amf/amfnd/mds.cc
index 1ee24cf5b..d179ff40e 100644
--- a/src/amf/amfnd/mds.cc
+++ b/src/amf/amfnd/mds.cc
@@ -328,7 +328,8 @@ uint32_t avnd_mds_rcv(AVND_CB *cb, MDS_CALLBACK_RECEIVE_INFO *rcv_info) {
* from any other anchor than Active (except for HB message).
*/
   if ((rcv_info->i_fr_dest != cb->active_avd_adest) &&
-  (msg.info.avd->msg_type != AVSV_D2N_HEARTBEAT_MSG)) {
+  (msg.info.avd->msg_type != AVSV_D2N_HEARTBEAT_MSG) &&
+  (msg.info.avd->msg_type != AVSV_D2N_REBOOT_MSG)) {
 LOG_ER("Received dest: %" PRIu64 " and cb active AVD adest:%" PRIu64
" mismatch, message type = %u",
rcv_info->i_fr_dest, cb->active_avd_adest,
-- 
2.17.1





[devel] [PATCH 1/4] amfd: add class definitions for new timers [#2918]

2018-10-24 Thread Gary Lee
osafAmfDelayNodeFailoverTimeout - the number of seconds we wait
after MDS down is received before we consider the node truly down.

osafAmfDelayNodeFailoverNodeUpWait - the number of seconds we
wait for Node Up after receiving MDS up, before we send reboot
to the node. After sending reboot to a node, also wait up to
this number of seconds before we consider the node to be
down (unless MDS down is received first).
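
As a rough illustration of how the two attributes relate, the sketch
below holds them in a plain struct with the default values from
amf_objects.xml in this patch (0 and 180). DelayFailoverConfig and
DelayedFailoverEnabled are hypothetical names, and treating a timeout
of 0 as "delayed failover disabled" is an assumption based on the
clm.cc hunk in patch 4/4.

```cpp
#include <cstdint>

// Hypothetical holder for the two new attributes, in seconds.
struct DelayFailoverConfig {
  int64_t failover_timeout_s = 0;  // osafAmfDelayNodeFailoverTimeout default
  int64_t node_up_wait_s = 180;    // osafAmfDelayNodeFailoverNodeUpWait default
};

// Assumption: a timeout of 0 keeps the previous, immediate failover behaviour.
bool DelayedFailoverEnabled(const DelayFailoverConfig& config) {
  return config.failover_timeout_s != 0;
}
```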
---
 src/amf/config/amf_classes.xml | 14 +-
 src/amf/config/amf_objects.xml |  8 
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/src/amf/config/amf_classes.xml b/src/amf/config/amf_classes.xml
index df5cbd92a..182bd97e5 100644
--- a/src/amf/config/amf_classes.xml
+++ b/src/amf/config/amf_classes.xml
@@ -1452,5 +1452,17 @@
SA_CONFIG
SA_WRITABLE

-   
+   
+   osafAmfDelayNodeFailoverTimeout
+   SA_TIME_T
+   SA_CONFIG
+   SA_WRITABLE
+   
+   
+   osafAmfDelayNodeFailoverNodeUpWait
+   SA_TIME_T
+   SA_CONFIG
+   SA_WRITABLE
+   
+
 
diff --git a/src/amf/config/amf_objects.xml b/src/amf/config/amf_objects.xml
index 6ed68d83d..c008c7520 100644
--- a/src/amf/config/amf_objects.xml
+++ b/src/amf/config/amf_objects.xml
@@ -6,6 +6,14 @@
osafAmfRestrictAutoRepairEnable
1

+   
+   osafAmfDelayNodeFailoverTimeout
+   0
+   
+   
+   osafAmfDelayNodeFailoverNodeUpWait
+   180
+   


safAppType=OpenSafApplicationType
-- 
2.17.1





[devel] [PATCH 3/4] amfd: add checkpointing of node failover state [#2918]

2018-10-24 Thread Gary Lee
---
 src/amf/amfd/chkop.cc| 10 ++
 src/amf/amfd/ckpt.h  |  3 ++-
 src/amf/amfd/ckpt_dec.cc | 40 +++-
 src/amf/amfd/ckpt_enc.cc | 26 --
 src/amf/amfd/ckpt_msg.h  |  1 +
 5 files changed, 76 insertions(+), 4 deletions(-)
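
The chkop.cc hunk in this patch only sends the new checkpoint when the
peer director can decode it. A minimal sketch of that gate follows;
kSubPartVersion9 and ShouldCheckpointFailoverState are hypothetical
names that mirror AVD_MBCSV_SUB_PART_VERSION_9 and the check on
other_avd_adest and avd_peer_ver.

```cpp
#include <cstdint>

// Mirrors AVD_MBCSV_SUB_PART_VERSION_9 from ckpt.h in this patch.
constexpr uint32_t kSubPartVersion9 = 9;

// True when the node failover state should be checkpointed. The checkpoint
// is skipped only when a peer is present and its MBCSV version is below 9.
bool ShouldCheckpointFailoverState(uint64_t other_avd_adest,
                                   uint32_t peer_version) {
  const bool peer_present = other_avd_adest != 0;
  return !(peer_present && peer_version < kSubPartVersion9);
}
```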

diff --git a/src/amf/amfd/chkop.cc b/src/amf/amfd/chkop.cc
index 1ba4140c7..e9a68f4cd 100644
--- a/src/amf/amfd/chkop.cc
+++ b/src/amf/amfd/chkop.cc
@@ -1042,6 +1042,16 @@ uint32_t avsv_send_ckpt_data(AVD_CL_CB *cb, uint32_t action,
 return NCSCC_RC_SUCCESS;
   }
   break;
+case AVSV_CKPT_NODE_FAILOVER_STATE:
+  if ((avd_cb->other_avd_adest != 0) &&
+  (avd_cb->avd_peer_ver < AVD_MBCSV_SUB_PART_VERSION_9)) {
+TRACE(
+"No ckpt for AVSV_CKPT_NODE_FAILOVER_STATE as peer AMFD has"
+" lower version:%d",
+avd_cb->avd_peer_ver);
+return NCSCC_RC_SUCCESS;
+  }
+  break;
 default:
   return NCSCC_RC_SUCCESS;
   }
diff --git a/src/amf/amfd/ckpt.h b/src/amf/amfd/ckpt.h
index c006f9a69..875776a21 100644
--- a/src/amf/amfd/ckpt.h
+++ b/src/amf/amfd/ckpt.h
@@ -35,9 +35,10 @@
 #define AMF_AMFD_CKPT_H_
 
 // current version
-#define AVD_MBCSV_SUB_PART_VERSION 8
+#define AVD_MBCSV_SUB_PART_VERSION 9
 
 // supported versions
+#define AVD_MBCSV_SUB_PART_VERSION_9 9
 #define AVD_MBCSV_SUB_PART_VERSION_8 8
 #define AVD_MBCSV_SUB_PART_VERSION_7 7
 #define AVD_MBCSV_SUB_PART_VERSION_6 6
diff --git a/src/amf/amfd/ckpt_dec.cc b/src/amf/amfd/ckpt_dec.cc
index 9f3949a15..022fa8f4b 100644
--- a/src/amf/amfd/ckpt_dec.cc
+++ b/src/amf/amfd/ckpt_dec.cc
@@ -49,6 +49,7 @@ static uint32_t dec_oper_su(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_up_info(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_admin_state(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_oper_state(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
+static uint32_t dec_node_failover_state(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_state(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_rcv_msg_id(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
 static uint32_t dec_node_snd_msg_id(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec);
@@ -160,7 +161,8 @@ const AVSV_DECODE_CKPT_DATA_FUNC_PTR avd_dec_data_func_list[] = {
 dec_comp_curr_num_csi_stby, dec_comp_oper_state, dec_comp_readiness_state,
 dec_comp_pres_state, dec_comp_restart_count, nullptr, /* AVSV_SYNC_COMMIT */
 dec_su_restart_count, dec_si_dep_state, dec_ng_admin_state,
-dec_avd_to_avd_job_queue_status
+dec_avd_to_avd_job_queue_status,
+dec_node_failover_state
 
 };
 
@@ -2958,3 +2960,39 @@ static uint32_t dec_avd_to_avd_job_queue_status(AVD_CL_CB *cb,
   TRACE_LEAVE();
   return NCSCC_RC_SUCCESS;
 }
+
+static uint32_t dec_node_failover_state(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec) {
+  TRACE_ENTER();
+
+  uint32_t state;
+  SaNameT name;
+
+  osaf_decode_sanamet(>i_uba, );
+  const std::string node_name(Amf::to_string());
+  osaf_extended_name_free();
+
+  AVD_AVND* node;
+  node = avd_node_get(node_name);
+
+  if (node == nullptr) {
+LOG_ER("%s: node not found, nodeid=%s", __FUNCTION__, node_name.c_str());
+return NCSCC_RC_FAILURE;
+  }
+
+  osaf_decode_uint32(>i_uba,
+ reinterpret_cast());
+
+  auto failed_node = cb->failover_list.find(node->node_info.nodeId);
+  if (failed_node != cb->failover_list.end()) {
+failed_node->second->SetState(state);
+  } else {
+LOG_NO("Node '%s' not found in failover_list. Create new entry",
+node->node_name.c_str());
+auto new_node = std::make_shared(cb,
+  node->node_info.nodeId);
+new_node->SetState(state);
+cb->failover_list[node->node_info.nodeId] = new_node;
+  }
+
+  return NCSCC_RC_SUCCESS;
+}
\ No newline at end of file
diff --git a/src/amf/amfd/ckpt_enc.cc b/src/amf/amfd/ckpt_enc.cc
index 0a2d73698..0e675aed5 100644
--- a/src/amf/amfd/ckpt_enc.cc
+++ b/src/amf/amfd/ckpt_enc.cc
@@ -48,6 +48,7 @@ static uint32_t enc_oper_su(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_up_info(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_admin_state(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_oper_state(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
+static uint32_t enc_node_failover_state(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_state(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_rcv_msg_id(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
 static uint32_t enc_node_snd_msg_id(AVD_CL_CB *cb, NCS_MBCSV_CB_ENC *enc);
@@ -163,7 +164,8 @@ const AVSV_ENCODE_CKPT_DATA_FUNC_PTR avd_enc_ckpt_data_func_list[] = {
 enc_comp_curr_num_csi_stby, enc_comp_oper_state, enc_comp_readiness_state,
 enc_comp_pres_state, enc_comp_restart_count, nullptr, /* AVSV_SYNC_COMMIT */
 enc_su_restart_count, enc_si_dep_state, enc_ng_admin_state,
-enc_avd_to_avd_job_queue_status};
+  
