[devel] OpenSAF 5.24.02 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.24.02 release. The source code for OpenSAF 5.24.02 and the corresponding documentation can be downloaded using the following links: [opensaf-5.24.02.tar.gz](http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.24.02.tar.gz/download), [opensaf-documentation-5.24.02.tar.gz](http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.24.02.tar.gz/download). For a complete list of new features in this release, please refer to the [NEWS](https://sourceforge.net/p/opensaf/wiki/NEWS-5.24.02/) at the wiki. See the [ChangeLog](https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.24.02/) for a full list of changes in this release. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] Announcement of the OpenSAF 5.23.07 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.23.07 release. The source code for OpenSAF 5.23.07 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.23.07.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.23.07.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.23.07/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.23.07/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.23.03 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.23.03 release. The source code for OpenSAF 5.23.03 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.23.03.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.23.03.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.23.03/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.23.03/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.22.11 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.22.11 release. The source code for OpenSAF 5.22.11 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.22.11.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.22.11.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.22.11/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.22.11/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.22.06 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.22.06 release. The source code for OpenSAF 5.22.06 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.22.06.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.22.06.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.22.06/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.22.06/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.22.01 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.22.01 release. The source code for OpenSAF 5.22.01 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.22.01.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.22.01.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.22.01/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.22.01/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.21.09 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.21.09 release. The source code for OpenSAF 5.21.09 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.21.09.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.21.09.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.21.09/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.21.09/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.21.06 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.21.06 release. The source code for OpenSAF 5.21.06 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.21.06.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.21.06.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.21.06/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.21.06/ Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.20.11 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.20.11 release. The source code for OpenSAF 5.20.11 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.20.11.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.20.11.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.20.11/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.20.11/ Note that starting from the August 2017 release, we are using a new version numbering scheme for OpenSAF. The components in the OpenSAF version number 5.20.11 represent the major release (5), followed by the year (20) and month (11) when the release was made. This change was made as a step towards introducing continuous delivery in the OpenSAF project. Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.20.08 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.20.08 release. The source code for OpenSAF 5.20.08 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.20.08.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.20.08.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.20.08/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.20.08/ Note that starting from the August 2017 release, we are using a new version numbering scheme for OpenSAF. The components in the OpenSAF version number 5.20.08 represent the major release (5), followed by the year (20) and month (08) when the release was made. This change was made as a step towards introducing continuous delivery in the OpenSAF project. Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
[devel] Announcement of the OpenSAF 5.20.05 release
The OpenSAF community is pleased to announce the availability of the OpenSAF 5.20.05 release. The source code for OpenSAF 5.20.05 and the corresponding documentation can be downloaded using the following links: http://sourceforge.net/projects/opensaf/files/releases/opensaf-5.20.05.tar.gz/download http://sourceforge.net/projects/opensaf/files/docs/opensaf-documentation-5.20.05.tar.gz/download For a complete list of new features in this release, please refer to the NEWS at the wiki: https://sourceforge.net/p/opensaf/wiki/NEWS-5.20.05/ See the ChangeLog for a full list of changes in this release: https://sourceforge.net/p/opensaf/wiki/ChangeLog-5.20.05/ Note that starting from the August 2017 release, we are using a new version numbering scheme for OpenSAF. The components in the OpenSAF version number 5.20.05 represent the major release (5), followed by the year (20) and month (05) when the release was made. This change was made as a step towards introducing continuous delivery in the OpenSAF project. Thank you for your continued interest in OpenSAF and to everyone who has contributed to this release.
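The numbering scheme described in the note above (major release, then two-digit year, then month) can be illustrated with a small sketch. The names `OsafVersion` and `ParseOsafVersion` are hypothetical and not part of OpenSAF; the code only assumes the `major.YY.MM` layout stated in the announcement.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper type, not part of OpenSAF.
struct OsafVersion {
  int major_release;  // e.g. 5
  int year;           // two-digit year, e.g. 20 for 2020
  int month;          // 1..12
};

// Parses strings such as "5.20.08". Returns false if the text is not exactly
// three dot-separated decimal numbers or if the month is out of range.
bool ParseOsafVersion(const std::string& text, OsafVersion* out) {
  OsafVersion v{};
  char trailing = '\0';
  // %d reads decimal numbers, so "08" is parsed as 8 (not as octal).
  int fields = std::sscanf(text.c_str(), "%d.%d.%d%c", &v.major_release,
                           &v.year, &v.month, &trailing);
  if (fields != 3) return false;  // too few fields, or trailing characters
  if (v.major_release < 0 || v.year < 0 || v.year > 99) return false;
  if (v.month < 1 || v.month > 12) return false;
  *out = v;
  return true;
}
```

For "5.20.11" this yields major release 5, year 20 and month 11, matching the breakdown given in the announcement.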
Re: [devel] [PATCH 1/1] amf: Debug info logged at Emergency level [#3179]
ack (review only)

Thanks

From: Peter McIntyre
Sent: 30 April 2020 18:55
To: Minh Hon Chau ; Thang Duc Nguyen
Cc: opensaf-devel@lists.sourceforge.net
Subject: [devel] [PATCH 1/1] amf: Debug info logged at Emergency level [#3179]

In many places in the amf code, debug info is logged with LOG_EMERG, although it is not informative enough to warrant emergency level. These should be moved to LOG_ERR level.

The fix is to change LOG_EM to LOG_ER.
---
 src/amf/amfd/ndfsm.cc | 2 +-
 src/amf/amfd/ndproc.cc | 8 +++---
 src/amf/amfd/role.cc | 8 +++---
 src/amf/amfd/sg_2n_fsm.cc | 50 +-
 src/amf/amfd/sg_nored_fsm.cc | 36
 src/amf/amfd/sg_npm_fsm.cc | 40 +--
 src/amf/amfd/sg_nway_fsm.cc | 18 ++--
 src/amf/amfd/sg_nwayact_fsm.cc | 32 +++---
 src/amf/amfd/timer.cc | 4 +--
 src/amf/amfd/util.cc | 16 +--
 10 files changed, 107 insertions(+), 107 deletions(-)

diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index e2235b2e9..674ef863a 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -1145,7 +1145,7 @@ uint32_t avd_node_down(AVD_CL_CB *cb, SaClmNodeIdT node_id) {
   if ((avnd = avd_node_find_nodeid(node_id)) == nullptr) {
     /* log error that the node id is invalid */
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, node_id);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, node_id);
     return NCSCC_RC_FAILURE;
   }
diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
index 0d30dfe71..29c574167 100644
--- a/src/amf/amfd/ndproc.cc
+++ b/src/amf/amfd/ndproc.cc
@@ -202,7 +202,7 @@ void avd_reg_su_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
     /* log an error since this shouldn't happen */
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, n2d_msg->msg_info.n2d_reg_su.error);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, n2d_msg->msg_info.n2d_reg_su.error);
     /* call the routine to failover all the effected nodes
      * due to restarting this node
@@ -1041,7 +1041,7 @@ void avd_data_update_req_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
             break;
           default:
             /* log error that a the object value is invalid */
-            LOG_EM("%s:%u: %u", __FILE__, __LINE__,
+            LOG_ER("%s:%u: %u", __FILE__, __LINE__,
                    n2d_msg->msg_info.n2d_data_req.param_info.attr_id);
             break;
         } /* switch(n2d_msg->msg_info.n2d_data_req.param_info.obj_id) */
@@ -1168,7 +1168,7 @@ void avd_data_update_req_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
             break;
           default:
             /* log error that a the object value is invalid */
-            LOG_EM("%s:%u: %u", __FILE__, __LINE__,
+            LOG_ER("%s:%u: %u", __FILE__, __LINE__,
                    n2d_msg->msg_info.n2d_data_req.param_info.attr_id);
             break;
         } /* switch(n2d_msg->msg_info.n2d_data_req.param_info.obj_id) */
@@ -1177,7 +1177,7 @@ void avd_data_update_req_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
     }
     default:
       /* log error that a the table value is invalid */
-      LOG_EM("%s:%u: %u", __FILE__, __LINE__,
+      LOG_ER("%s:%u: %u", __FILE__, __LINE__,
              n2d_msg->msg_info.n2d_data_req.param_info.class_id);
       goto done;
       break;
diff --git a/src/amf/amfd/role.cc b/src/amf/amfd/role.cc
index 15b0458d2..24374de7c 100644
--- a/src/amf/amfd/role.cc
+++ b/src/amf/amfd/role.cc
@@ -598,7 +598,7 @@ static uint32_t avd_role_failover_qsd_actv(AVD_CL_CB *cb, SaAmfHAStateT role) {
        do node down processing for other node */
     avd_node_mark_absent(avnd_other);
   } else {
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, NCSCC_RC_FAILURE);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, NCSCC_RC_FAILURE);
   }
   return NCSCC_RC_SUCCESS;
@@ -701,7 +701,7 @@ void avd_role_switch_ncs_su_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
   /* get the avnd from node_id */
   if (nullptr == (avnd = avd_node_find_nodeid(cb->node_id_avd))) {
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, cb->node_id_avd);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, cb->node_id_avd);
     return;
   }
   other_avnd = avd_node_find_nodeid(cb->node_id_avd_other);
@@ -852,12 +852,12 @@ try_again:
   if (NCSCC_RC_SUCCESS !=
       (status = avsv_set_ckpt_role(cb, SA_AMF_HA_QUIESCED))) {
     /* Log error */
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, status);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, status);
   }
   /* Now Dispatch all the messages from the MBCSv mail-box */
   if (NCSCC_RC_SUCCESS != (rc = avsv_mbcsv_dispatch(cb, SA_DISPATCH_ALL))) {
-    LOG_EM("%s:%u: %u", __FILE__, __LINE__, cb->node_id_avd_other);
+    LOG_ER("%s:%u: %u", __FILE__, __LINE__, cb->node_id_avd_other);
     cb->swap_switch = false;
     return;
   }
diff --git a/src/amf/amfd/sg_2n_fsm.cc b/src/amf/amfd/sg_2n_fsm.cc
index e38288db7..525e30049 100644
--- a/src/amf/amfd/sg_2n_fsm.cc
+++
Re: [devel] [PATCH 1/1] amfnd: fix unexpected reboot after split-brain recovery [#3162]
Hi Thuan

One comment inline with [GL].

Thanks
Gary

From: Thuan Tran
Sent: 04 March 2020 18:28
To: Thang Duc Nguyen ; Minh Hon Chau ; Gary Lee
Cc: opensaf-devel@lists.sourceforge.net ; Thuan Tran
Subject: [PATCH 1/1] amfnd: fix unexpected reboot after split-brain recovery [#3162]

- During split-brain recovery with headless mode enabled, an IMMND restart may be expected. If AMFND does not wait for the IMMND restart but reinitializes CLM, the CLM callback is triggered and clm_to_amf_node() is called. AMFND then gets stuck initializing IMM OM, which delays the IMMND restart and the resending of node_up, so AMFD orders a node reboot.
- Do not trigger saClmDispatch() if immnd is down.
---
 src/amf/amfnd/avnd_cb.h |  1 +
 src/amf/amfnd/clc.cc    | 10 ++++++++++
 src/amf/amfnd/main.cc   |  4 +++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/amf/amfnd/avnd_cb.h b/src/amf/amfnd/avnd_cb.h
index 8b0cc2304..0fa0590ff 100644
--- a/src/amf/amfnd/avnd_cb.h
+++ b/src/amf/amfnd/avnd_cb.h
@@ -125,6 +125,7 @@ typedef struct avnd_cb_tag {
   SaTimeT scs_absence_max_duration;
   /* the timer for supervision of the absence of SC */
   AVND_TMR sc_absence_tmr;
+  bool immnd_down;
 } AVND_CB;

 #define AVND_CB_NULL ((AVND_CB *)0)
diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
index f78e1a707..227bf6a5a 100644
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -3106,6 +3106,9 @@ uint32_t avnd_comp_clc_cmd_execute(AVND_CB *cb, AVND_COMP *comp,
   unsigned int i;
   SaStringT env;
   size_t env_set_nmemb;
+  size_t comma = comp->saAmfCompType.find_last_of(",");
+  size_t end = comp->saAmfCompType.length();
+  std::string compBaseType = comp->saAmfCompType.substr(comma + 1, end);

   TRACE_ENTER2("'%s':CLC CLI command type:'%s'", comp->name.c_str(),
                clc_cmd_type[cmd_type]);
@@ -,6 +3336,13 @@ uint32_t avnd_comp_clc_cmd_execute(AVND_CB *cb, AVND_COMP *comp,
     // outcome of command is reported in comp_clc_resp_callback()
   }

+  if (compBaseType.compare("safCompType=OpenSafCompTypeIMMND") == 0) {
+    if (cmd_type == AVND_COMP_CLC_CMD_TYPE_CLEANUP)
+      cb->immnd_down = true;
+    else if (cmd_type == AVND_COMP_CLC_CMD_TYPE_INSTANTIATE)
+      cb->immnd_down = false;
+  }
+
   TRACE_2("success");
   goto done;

diff --git a/src/amf/amfnd/main.cc b/src/amf/amfnd/main.cc
index d7857fabe..447e2aa82 100644
--- a/src/amf/amfnd/main.cc
+++ b/src/amf/amfnd/main.cc
@@ -334,6 +334,7 @@ AVND_CB *avnd_cb_create() {

   cb->is_avd_down = true;
   cb->amfd_sync_required = false;
+  cb->immnd_down = false;

   // retrieve hydra configuration from IMM
   hydra_config_get(cb);
@@ -609,7 +610,8 @@ void avnd_main_process(void) {
       exit(0);
     }

-    if (avnd_cb->clmHandle && (fds[FD_CLM].revents & POLLIN)) {
+    if (!avnd_cb->immnd_down && avnd_cb->clmHandle &&
+        (fds[FD_CLM].revents & POLLIN)) {

[GL] I think, in general, it's probably bad practice to skip an event when it is ready to be processed. This could end up in a tight loop, spiking CPU usage.

       // LOG_NO("DEBUG-> CLM event fd: %d sel_obj: %llu, clm handle: %llu",
       // fds[FD_CLM].fd, avnd_cb->clm_sel_obj, avnd_cb->clmHandle);
       result = saClmDispatch(avnd_cb->clmHandle, SA_DISPATCH_ALL);
--
2.17.1
Re: [devel] [PATCH 1/1] osaf: fix etcd3.plugin watch takeover_request [#3158]
Hi Thuan

ack (review) with minor comment. Here -a is used for AND but && is used elsewhere in the file. We could be more consistent.

  #value is cleaned after a lease time, keep watching

Maybe change to

  #value is cleared after lease time, keep watching

Thanks
Gary

From: Thuan Tran
Sent: 20 February 2020 22:21
To: Gary Lee ; Vu Minh Nguyen ; Minh Hon Chau ; Thang Duc Nguyen
Cc: opensaf-devel@lists.sourceforge.net ; Thuan Tran
Subject: [PATCH 1/1] osaf: fix etcd3.plugin watch takeover_request [#3158]

After a takeover_request is rejected, its value is cleared once the lease time expires, and the watch mistakenly reports this as a change to an empty value. osafrded then handles it as lost connectivity to the consensus service and reboots itself.
---
 src/osaf/consensus/plugins/etcd3.plugin | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/osaf/consensus/plugins/etcd3.plugin b/src/osaf/consensus/plugins/etcd3.plugin
index 60559a0e9..e8fa6b6e7 100644
--- a/src/osaf/consensus/plugins/etcd3.plugin
+++ b/src/osaf/consensus/plugins/etcd3.plugin
@@ -362,6 +362,14 @@ watch() {
           return 1
         fi
       elif [ "$orig_value" != "$current_value" ]; then
+        if [ "$watch_key" == "$takeover_request" ]; then
+          state=$(echo $orig_value | awk '{print $4}')
+          if [ "$state" == "REJECTED" -a -z "$current_value" ]; then
+            #value is cleaned after a lease time, keep watching
+            orig_value=""
+            continue
+          fi
+        fi
         echo $current_value
         return 0
       fi
--
2.17.1
Re: [devel] [PATCH 1/1] clmd: retry once to send message to clmna [#3156]
Hi Thuan

Would this be simpler?

+  while (retry < 1) {
+    rc = clms_mds_msg_send(cb, &clm_msg, &evt->fr_dest, &evt->mds_ctxt,
+                           MDS_SEND_PRIORITY_HIGH, NCSMDS_SVC_ID_CLMNA);
+    if (rc != NCSCC_RC_SUCCESS) {
+      ...
+      osaf_nanosleep();
+      ++retry;
+    } else {
+      break;
+    }
+  }

Thanks
Gary

From: Thuan Tran
Sent: 18 February 2020 17:38
To: Vu Minh Nguyen ; Minh Hon Chau ; Thang Duc Nguyen ; Gary Lee
Cc: opensaf-devel@lists.sourceforge.net ; Thuan Tran
Subject: [PATCH 1/1] clmd: retry once to send message to clmna [#3156]

- When a node comes up after a reboot, clmd may receive the join request before the clmna svc_up has arrived, so sending the response back to clmna fails. This leads to amfnd timing out while initializing the CLM agent, which delays sending node up. It may cause amfd to order a reboot of that node if the node up delay (osafAmfDelayNodeFailoverNodeWaitTimeout) is set smaller than the total time amfnd retries until the CLM agent is initialized successfully.
- Retrying once to send the message to clmna helps avoid this scenario.
---
 src/clm/clmd/clms_evt.cc | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/clm/clmd/clms_evt.cc b/src/clm/clmd/clms_evt.cc
index 1059c6cfa..59e9c4156 100644
--- a/src/clm/clmd/clms_evt.cc
+++ b/src/clm/clmd/clms_evt.cc
@@ -34,6 +34,7 @@
 #include "base/logtrace.h"
 #include "base/ncsgl_defs.h"
 #include "base/osaf_utility.h"
+#include "base/osaf_time.h"
 #include "clm/clmd/clms.h"

 static uint32_t process_api_evt(CLMSV_CLMS_EVT *evt);
@@ -535,6 +536,7 @@ uint32_t proc_node_up_msg(CLMS_CB *cb, CLMSV_CLMS_EVT *evt) {
   SaNameT node_name = {0};
   CLMSV_MSG clm_msg;
   SaBoolT check_member;
+  int retry = 0;

   TRACE_ENTER2("Node up mesg for nodename length %d %s",
                nodeup_info->node_name.length, nodeup_info->node_name.value);
@@ -636,8 +638,20 @@ uint32_t proc_node_up_msg(CLMS_CB *cb, CLMSV_CLMS_EVT *evt) {
   clm_msg.info.api_resp_info.type = CLMSV_CLUSTER_JOIN_RESP;
   clm_msg.info.api_resp_info.param.node_name = node_name;
   /*rc will be updated down in the positive flow */
-  rc = clms_mds_msg_send(cb, &clm_msg, &evt->fr_dest, &evt->mds_ctxt,
-                         MDS_SEND_PRIORITY_HIGH, NCSMDS_SVC_ID_CLMNA);
+  do {
+    rc = clms_mds_msg_send(cb, &clm_msg, &evt->fr_dest, &evt->mds_ctxt,
+                           MDS_SEND_PRIORITY_HIGH, NCSMDS_SVC_ID_CLMNA);
+    if (rc != NCSCC_RC_SUCCESS && retry < 1) {
+      /* If a node reboot up, clmna svc_up is not yet come but clmd
+       * get message join request then send message back clmna failed.
+       * It leads to amfnd timeout init clm agent and delay send node up.
+       * This may cause amfd order reboot that node if node up delay
+       * (osafAmfDelayNodeFailoverNodeWaitTimeout) is set smaller than
+       * total time amfnd retry until init clm agent successfully.
+       * If retry here, it would help avoid this scenario */
+      osaf_nanosleep();
+    }
+  } while (rc != NCSCC_RC_SUCCESS && retry++ < 1);
   /*if mds send failed, we need to report failure */
   if (rc != NCSCC_RC_SUCCESS) {
     LOG_NO("%s: send failed. dest:%" PRIx64, __FUNCTION__, evt->fr_dest);
--
2.17.1
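The retry-once idea discussed in this thread can be expressed as a small generic helper. This is only a sketch under assumptions: `SendWithRetry` is a hypothetical name, the send operation is modelled as a callable returning `true` on success, and the delay uses `std::this_thread::sleep_for` in place of the OpenSAF sleep call used in the patch.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of OpenSAF. Calls `send` once and, if it
// fails, waits for `delay` and tries again, up to `max_retries` extra
// attempts. Returns true as soon as one attempt succeeds, false otherwise.
bool SendWithRetry(const std::function<bool()>& send, int max_retries,
                   std::chrono::milliseconds delay) {
  for (int attempt = 0;; ++attempt) {
    if (send()) return true;                    // success, stop retrying
    if (attempt >= max_retries) return false;   // retries exhausted
    std::this_thread::sleep_for(delay);         // give the peer time to come up
  }
}
```

With `max_retries` set to 1 the callable runs at most twice, which corresponds to the "retry once" behaviour described in the commit message.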
Re: [devel] [PATCH 1/1] rde: correct to promote node to active [#3108]
Hi

Ack (tested)

-----Original Message-----
From: thang.d.nguyen [mailto:thang.d.ngu...@dektech.com.au]
Sent: Tuesday, 4 February 2020 1:37 PM
To: Gary Lee
Cc: opensaf-devel@lists.sourceforge.net; Thang Duc Nguyen
Subject: [PATCH 1/1] rde: correct to promote node to active [#3108]

If relaxed node promotion is enabled, allow this node to be promoted active if it can see a peer SC and this node has the lowest node ID.
---
 src/rde/rded/role.cc | 14 +++++++++++++-
 src/rde/rded/role.h  |  1 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index 593ccf0eb..7ca020d5d 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -260,7 +260,8 @@ bool Role::IsCandidate() {
   // if relaxed node promotion is enabled, allow this node to be promoted
   // active if it can see a peer SC and this node has the lowest node ID
   if (consensus_service.IsRelaxedNodePromotionEnabled() == true &&
-      cb->state == State::kNotActiveSeenPeer) {
+      cb->state == State::kNotActiveSeenPeer &&
+      IsLowestNodeid() == true) {
     LOG_NO("Relaxed node promotion enabled. This node is a candidate.");
     result = true;
   }
@@ -279,6 +280,17 @@ bool Role::IsPeerPresent() {
   return result;
 }

+bool Role::IsLowestNodeid() {
+  bool result = true;
+  RDE_CONTROL_BLOCK* cb = rde_get_control_block();
+
+  for (auto peer_id : cb->peer_controllers) {
+    if (peer_id < own_node_id_)
+      return false;
+  }
+  return result;
+}
+
 uint32_t Role::SetRole(PCS_RDA_ROLE new_role) {
   TRACE_ENTER();
   PCS_RDA_ROLE old_role = role_;
diff --git a/src/rde/rded/role.h b/src/rde/rded/role.h
index 9c63cbe7b..9bf1b10bd 100644
--- a/src/rde/rded/role.h
+++ b/src/rde/rded/role.h
@@ -38,6 +38,7 @@ class Role {
   void AddPeer(NODE_ID node_id);
   bool IsCandidate();
   bool IsPeerPresent();
+  bool IsLowestNodeid();
   void SetPeerState(PCS_RDA_ROLE node_role, NODE_ID node_id);
   timespec* Poll(timespec* ts);
   uint32_t SetRole(PCS_RDA_ROLE new_role);
--
2.17.1
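The lowest-node-ID rule added by this patch can be tried out in isolation with the following sketch. The alias `NodeId`, the free function `IsLowestNodeId` and the use of `std::set` for the peer list are assumptions made for illustration; the patch itself iterates over `cb->peer_controllers`, whose container type is not shown in this message.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>

using NodeId = std::uint32_t;  // stand-in for OpenSAF's NODE_ID (assumption)

// Returns true when no known peer has a node ID lower than own_id.
// An empty peer set yields true, the same result the loop in the patch gives.
bool IsLowestNodeId(NodeId own_id, const std::set<NodeId>& peers) {
  return std::none_of(peers.begin(), peers.end(),
                      [own_id](NodeId peer) { return peer < own_id; });
}
```

Because every controller applies the same comparison to the same set of IDs, at most one of two controllers that see each other considers itself the candidate under this rule.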
Re: [devel] [PATCH 1/1] fmd: Do not send RDE to set active role if opensaf_quick_reboot is executed [#3146]
Hi Minh

ack

From: Minh Chau
Sent: Friday, January 24, 2020 11:35:29 AM
To: Gary Lee
Cc: opensaf-devel@lists.sourceforge.net ; Minh Hon Chau
Subject: [PATCH 1/1] fmd: Do not send RDE to set active role if opensaf_quick_reboot is executed [#3146]

If an SC is separated from the cluster, fmd calls opensaf_quick_reboot(). The reboot script returns before the node has actually gone down. In the code after opensaf_quick_reboot(), fmd tells rde to promote to active. Hence, there is a short period with two active SCs.

This patch makes fmd stop sending to RDE to set the active role after opensaf_quick_reboot().

Note: There are a few places where the function does not return after opensaf_quick_reboot(). However, this patch only fixes the issue in fm; the other places will be revisited.
---
 src/fm/fmd/fm_rda.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index fca417f79..479eb2149 100644
--- a/src/fm/fmd/fm_rda.cc
+++ b/src/fm/fmd/fm_rda.cc
@@ -86,6 +86,7 @@ void promote_node(FM_CB *fm_cb) {
     LOG_ER("Unable to set active controller in consensus service");
     opensaf_quick_reboot("Unable to set active controller "
                          "in consensus service");
+    return;
   } else if (rc == SA_AIS_ERR_EXIST) {
     // @todo if we don't reboot, we don't seem to recover from this. Can we
     // improve?
@@ -94,6 +95,7 @@ void promote_node(FM_CB *fm_cb) {
            "cluster?");
     opensaf_quick_reboot("A controller is already active. We were separated "
                          "from the cluster?");
+    return;
   }

   PCS_RDA_REQ rda_req;
--
2.20.1
Re: [devel] [PATCH 1/1] rde: Reboot node if another active controller is detected [#3142]
Hi Minh

Ack with comment. Please include this explanation.

+    // a reboot is required, as clmna on other nodes may not start
+    // an election because it thinks this node is going to be active
+    opensaf_quick_reboot("Another controller is already active");

Thanks
Gary

From: Minh Chau
Sent: 16 January 2020 13:06
To: Gary Lee ; hans.nordeb...@ericsson.com ; Vu Minh Nguyen
Cc: opensaf-devel@lists.sourceforge.net ; Minh Hon Chau
Subject: [PATCH 1/1] rde: Reboot node if another active controller is detected [#3142]

---
 src/rde/rded/role.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index b890117..9446ccb 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -107,6 +107,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
   rc = consensus_service.PromoteThisNode(true, cluster_size);
   if (rc == SA_AIS_ERR_EXIST) {
     LOG_WA("Another controller is already active");
+    opensaf_quick_reboot("Another controller is already active");
     return;
   } else if (rc != SA_AIS_OK && relaxed_mode == true) {
     LOG_WA("Unable to set active controller in consensus service");
--
2.7.4
Re: [devel] [PATCH 1/1] rde: Reboot node if another active controller is detected [#3142]
Hi Minh

ack

From: Minh Chau
Sent: 16 January 2020 13:06
To: Gary Lee ; hans.nordeb...@ericsson.com ; Vu Minh Nguyen
Cc: opensaf-devel@lists.sourceforge.net ; Minh Hon Chau
Subject: [PATCH 1/1] rde: Reboot node if another active controller is detected [#3142]

---
 src/rde/rded/role.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index b890117..9446ccb 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -107,6 +107,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
   rc = consensus_service.PromoteThisNode(true, cluster_size);
   if (rc == SA_AIS_ERR_EXIST) {
     LOG_WA("Another controller is already active");
+    opensaf_quick_reboot("Another controller is already active");
     return;
   } else if (rc != SA_AIS_OK && relaxed_mode == true) {
     LOG_WA("Unable to set active controller in consensus service");
--
2.7.4
Re: [devel] [PATCH 1/1] log: fix memory leak that was introduced in 3116 [#3138]
Hi Vu

ack (review only)

From: Vu Minh Nguyen
Sent: 09 January 2020 21:51
To: Minh Hon Chau ; Gary Lee
Cc: opensaf-devel@lists.sourceforge.net ; Vu Minh Nguyen
Subject: [PATCH 1/1] log: fix memory leak that was introduced in 3116 [#3138]

---
 src/log/logd/lgs_evt.cc         | 3 +++
 src/log/logd/lgs_mbcsv_cache.cc | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/src/log/logd/lgs_evt.cc b/src/log/logd/lgs_evt.cc
index 7501a282b..f169ea1e9 100644
--- a/src/log/logd/lgs_evt.cc
+++ b/src/log/logd/lgs_evt.cc
@@ -1348,6 +1348,7 @@ static uint32_t proc_write_log_async_msg(lgs_cb_t *cb, lgsv_lgs_evt_t *evt) {
            stream->fixedLogRecordSize, buf_size, logOutputString,
            ++stream->logRecordId, node_name)) == 0) {
     AckToWriteAsync(param, evt->fr_dest, SA_AIS_ERR_INVALID_PARAM);
+    free(logOutputString);
     return NCSCC_RC_SUCCESS;
   }

@@ -1356,6 +1357,8 @@ static uint32_t proc_write_log_async_msg(lgs_cb_t *cb, lgsv_lgs_evt_t *evt) {
                 evt->fr_dest, node_name);
   auto data = std::make_shared(info, logOutputString, n);
   Cache::instance()->Write(data);
+
+  lgs_free_write_log(param);
   return NCSCC_RC_SUCCESS;
 }

diff --git a/src/log/logd/lgs_mbcsv_cache.cc b/src/log/logd/lgs_mbcsv_cache.cc
index cde26432a..b190c5bea 100644
--- a/src/log/logd/lgs_mbcsv_cache.cc
+++ b/src/log/logd/lgs_mbcsv_cache.cc
@@ -230,6 +230,8 @@ uint32_t ckpt_proc_pop_write_async(lgs_cb_t* cb, void* data) {
   if (top->seq_id_ != seq_id) {
     LOG_ER("Out of sync! Expected seq: (%" PRIu64 "), Got: (%" PRIu64 ")",
            seq_id, top->seq_id_);
+    lgs_free_edu_mem(param->log_record);
+    lgs_free_edu_mem(param->log_file);
     return NCSCC_RC_FAILURE;
   }

--
2.17.1
Re: [devel] [PATCH 1/1] amf: allow update node failover state in cold sync [#3136]
Hi Thuan Ack Thanks Gary From: thuan.tran Sent: 30 December 2019 21:20 To: Thang Duc Nguyen ; Gary Lee ; Minh Hon Chau Cc: opensaf-devel@lists.sourceforge.net ; Thuan Tran Subject: [PATCH 1/1] amf: allow update node failover state in cold sync [#3136] The failover state of nodes that join during cold sync is not updated on the standby amfd. As a result, when the standby amfd later fails over to active, it mistakenly orders a reboot of these nodes. --- src/amf/amfd/chkop.cc | 1 + 1 file changed, 1 insertion(+) diff --git a/src/amf/amfd/chkop.cc b/src/amf/amfd/chkop.cc index 15408b657..1ed6dd632 100644 --- a/src/amf/amfd/chkop.cc +++ b/src/amf/amfd/chkop.cc @@ -1347,6 +1347,7 @@ static uint32_t avsv_validate_reo_type_in_csync(AVD_CL_CB *cb, case AVSV_CKPT_AVND_NODE_STATE: case AVSV_CKPT_AVND_RCV_MSG_ID: case AVSV_CKPT_AVND_SND_MSG_ID: +case AVSV_CKPT_NODE_FAILOVER_STATE: if (cb->synced_reo_type >= AVSV_CKPT_AVD_NODE_CONFIG) status = NCSCC_RC_SUCCESS; break; -- 2.17.1
Re: [devel] [PATCH 1/5] log: improve the resilience of log service [#3116]
Hi Vu Very, very minor comments with [GL]. Thanks Gary -Original Message- From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] Sent: Thursday, 28 November 2019 7:24 PM To: lennart.l...@ericsson.com; Gary Lee ; Minh Hon Chau Cc: opensaf-devel@lists.sourceforge.net; Vu Minh Nguyen Subject: [PATCH 1/5] log: improve the resilience of log service [#3116] To improve the resilience of the OpenSAF LOG service when the underlying file system is unresponsive, a queue is introduced to hold async write requests for a configurable time of around 15 - 30 seconds. The readiness of the I/O thread is checked periodically; once it becomes ready, the element at the front of the queue is written first. SA_AIS_ERR_TRY_AGAIN is returned to the client if an element stays in the queue longer than the configured time. The queue capacity and the resilience timeout are configurable via the attributes `logMaxPendingWriteRequests` and `logResilienceTimeout`. By default, this feature is disabled to keep the log server backward compatible. 
--- src/log/Makefile.am | 21 +- src/log/config/logsv_classes.xml | 43 ++- src/log/logd/lgs_cache.cc| 469 +++ src/log/logd/lgs_cache.h | 287 +++ src/log/logd/lgs_config.cc | 78 - src/log/logd/lgs_config.h| 10 +- src/log/logd/lgs_evt.cc | 161 +++ src/log/logd/lgs_evt.h | 10 + src/log/logd/lgs_file.cc | 8 +- src/log/logd/lgs_filehdl.cc | 58 ++-- src/log/logd/lgs_imm.cc | 40 ++- src/log/logd/lgs_main.cc | 24 +- src/log/logd/lgs_mbcsv.cc| 447 +++-- src/log/logd/lgs_mbcsv.h | 19 +- src/log/logd/lgs_mbcsv_cache.cc | 372 src/log/logd/lgs_mbcsv_cache.h | 110 src/log/logd/lgs_mbcsv_v1.cc | 1 + src/log/logd/lgs_mbcsv_v2.cc | 2 + 18 files changed, 1889 insertions(+), 271 deletions(-) create mode 100644 src/log/logd/lgs_cache.cc create mode 100644 src/log/logd/lgs_cache.h create mode 100644 src/log/logd/lgs_mbcsv_cache.cc create mode 100644 src/log/logd/lgs_mbcsv_cache.h diff --git a/src/log/Makefile.am b/src/log/Makefile.am index f63a4a053..3367ef4f6 100644 --- a/src/log/Makefile.am +++ b/src/log/Makefile.am @@ -95,7 +95,9 @@ noinst_HEADERS += \ src/log/logd/lgs_nildest.h \ src/log/logd/lgs_unixsock_dest.h \ src/log/logd/lgs_common.h \ - src/log/logd/lgs_amf.h + src/log/logd/lgs_amf.h \ + src/log/logd/lgs_cache.h \ + src/log/logd/lgs_mbcsv_cache.h bin_PROGRAMS += bin/saflogger @@ -123,6 +125,15 @@ bin_osaflogd_CPPFLAGS = \ -DSA_EXTENDED_NAME_SOURCE \ $(AM_CPPFLAGS) +# Enable this flag to simulate the case that file system is unresponsive +# during write log record. Mainly for testing the following enhancement: +# log: improve the resilience of log service [#3116]. +# When enabled, log handle thread will be suspended 17 seconds every 02 write +# requests and only take affect if the `logMaxPendingWriteRequests` is set +# to an non-zero value. 
+bin_osaflogd_CPPFLAGS += -DSIMULATE_NFS_UNRESPONSE + + bin_osaflogd_SOURCES = \ src/log/logd/lgs_amf.cc \ src/log/logd/lgs_clm.cc \ @@ -147,7 +158,9 @@ bin_osaflogd_SOURCES = \ src/log/logd/lgs_util.cc \ src/log/logd/lgs_dest.cc \ src/log/logd/lgs_nildest.cc \ - src/log/logd/lgs_unixsock_dest.cc + src/log/logd/lgs_unixsock_dest.cc \ + src/log/logd/lgs_cache.cc \ + src/log/logd/lgs_mbcsv_cache.cc bin_osaflogd_LDADD = \ lib/libosaf_common.la \ @@ -183,6 +196,10 @@ bin_logtest_CPPFLAGS = \ -DSA_EXTENDED_NAME_SOURCE \ $(AM_CPPFLAGS) +# Enable this flag to add test cases for following enhancement: +# log: improve the resilience of log service [#3116]. +bin_logtest_CPPFLAGS += -DSIMULATE_NFS_UNRESPONSE + bin_logtest_SOURCES = \ src/log/apitest/logtest.c \ src/log/apitest/logutil.c \ diff --git a/src/log/config/logsv_classes.xml b/src/log/config/logsv_classes.xml index 9359823ff..084e8915d 100644 --- a/src/log/config/logsv_classes.xml +++ b/src/log/config/logsv_classes.xml @@ -195,7 +195,7 @@ to ensure that default global values in the implementation are also changed acco SA_UINT32_T SA_CONFIG SA_WRITABLE -1024 + 1024 logStreamFileFormat @@ -208,42 +208,42 @@ to ensure that default global values in the implementation are also changed acco SA_UINT32_T SA_CONFIG SA_WRITABLE -0 + 0 logStreamSystemLowLimit SA_UINT32_T SA_CONFIG SA_WRITABLE
Re: [devel] [PATCH 4/5] log: update README file for improvement of log resilience [#3116]
Hi Vu Very minor comments with [GL]. Gary -Original Message- From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] Sent: Thursday, 28 November 2019 7:25 PM To: lennart.l...@ericsson.com; Gary Lee ; Minh Hon Chau Cc: opensaf-devel@lists.sourceforge.net; Vu Minh Nguyen Subject: [PATCH 4/5] log: update README file for improvement of log resilience [#3116] --- src/log/README | 38 ++ 1 file changed, 38 insertions(+) diff --git a/src/log/README b/src/log/README index b83d472e4..ab96a8157 100644 --- a/src/log/README +++ b/src/log/README @@ -764,3 +764,41 @@ on AMF role is unnecessary delay the CLM state of a Node (CLM state will available as soon as CLM started), so LGS is a taking AVD Up event as trigger to do CLM initialize. + +4. Improve the resilience of OpenSAF LOG service (#3116) +- +When the file system is unresponsive, log client gets try-again from +write callback very shortly after I/O timeout reaches the setting; the [GL] "reaches the setting" sounds confusing. What setting? +value of I/O timeout is configurable via the attribute logFileIoTimeout +within this valid range [500ms – 5000ms]. This is legacy behavior. + +This ticket improves the resilience of LOG service, so that log service +can cache async write requests up to an configurable time that is [GL] a configurable +around 15-30 seconds before returning status to log client via write async callback. + +The cache size is configurable via a new attribute `logMaxPendingWriteRequests`. +Default value is zero (0) - means this feature is disabled. The valid +range is [current queue size - 1000]. To know what is the current size +of the queue, fetching the value of pure runtime attribute [GL] To find the current size of the queue, fetch the ... +`logCurrentPendingWriteRequests` of `OpenSafLogCurrentConfig` class. +When the cache size reaches the limit, all coming requests will get +acknowledgement right away with SA_AIS_ERR_TRY_AGAIN. 
+ +The resilient timeout can also be configurable via a new attribute +`logResilienceTimeout`. The valid range is [15-30] seconds. When a +pending write async can be dropped and removed from the queue in cases: +a) Stays in the queue longer than the given resilient timeout. +b) The targeting stream has been closed. + +The queue is always kept in sync with standby. + +Besides, log agent has a light list keeping track all invocations which +not yet get acknowledgements from log server. If cluster goes to +headless; in other words, log server is disappeared and all cached data +has been lost, log agent (library) will notify all lost invocations to +log client via write async callback with SA_AIS_ERR_TRY_AGAIN error code. + +To test this feature, a gcc flag is added during compile time to +simulate the case the underlying file system is unresponsive, and it +only takes affect when the cache size is given to an non-zero value. [GL] it only takes effect when the cache size is set to a non-zero value +With that, the I/O thread will sleep *16 seconds* every 02 write requests. \ No newline at end of file -- 2.17.1
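The queueing behaviour described in the README text above can be sketched in a few lines of Python. This is only a model under stated assumptions, not the logd implementation: the class PendingWriteQueue, its method names and the injected clock are hypothetical, and the sketch does not cover the disabled case (logMaxPendingWriteRequests set to 0), closed streams, or synchronisation with the standby.

```python
import collections
import time

SA_AIS_OK = "SA_AIS_OK"
SA_AIS_ERR_TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"


class PendingWriteQueue:
    """Hypothetical model of the pending async-write queue."""

    def __init__(self, max_pending, resilience_timeout, clock=time.monotonic):
        self.max_pending = max_pending                # models logMaxPendingWriteRequests
        self.resilience_timeout = resilience_timeout  # models logResilienceTimeout (seconds)
        self.clock = clock                            # injectable so tests can control time
        self.pending = collections.deque()

    def submit(self, invocation, record):
        """Queue a write; a full queue is acknowledged right away with TRY_AGAIN."""
        if len(self.pending) >= self.max_pending:
            return SA_AIS_ERR_TRY_AGAIN
        self.pending.append((self.clock(), invocation, record))
        return None  # the acknowledgement is deferred until poll()

    def poll(self, io_ready, write):
        """Periodic check: expire old entries, then flush from the front."""
        acks = []
        now = self.clock()
        # Entries are in arrival order, so expired ones are always at the front.
        while self.pending and now - self.pending[0][0] > self.resilience_timeout:
            _, invocation, _ = self.pending.popleft()
            acks.append((invocation, SA_AIS_ERR_TRY_AGAIN))
        # While the I/O thread reports ready, the front element goes first.
        while self.pending and io_ready():
            _, invocation, record = self.pending.popleft()
            write(record)
            acks.append((invocation, SA_AIS_OK))
        return acks
```

Passing the clock in as a parameter is a choice made for this sketch only, so that the timeout path can be exercised without waiting 15 seconds.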
Re: [devel] [PATCH 1/1] amfd: Fix the data types of attributes inconsistency in get_config() [#3128]
Hi Ack ( review ) thanks Gary From: phuc.h.chau Sent: Monday, December 16, 2019 6:59:38 PM To: Vu Minh Nguyen Cc: opensaf-devel@lists.sourceforge.net Subject: [devel] [PATCH 1/1] amfd: Fix the data types of attributes inconsistency in get_config() [#3128] In amfd, the attributes osafAmfDelayNodeFailoverTimeout and osafAmfDelayNodeFailoverNodeWaitTimeout are of type time_t, but Configuration::get_config() uses a uint32_t to hold their values, which corrupts stack memory. --- src/amf/amfd/config.cc | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) mode change 100644 => 100755 src/amf/amfd/config.cc diff --git a/src/amf/amfd/config.cc b/src/amf/amfd/config.cc old mode 100644 new mode 100755 index af72840..375f050 --- a/src/amf/amfd/config.cc +++ b/src/amf/amfd/config.cc @@ -43,20 +43,20 @@ static void ccb_apply_modify_hdlr(struct CcbUtilOperationData *opdata) { configuration->restrict_auto_repair(enabled); } else if (!strcmp(attr_mod->modAttr.attrName, "osafAmfDelayNodeFailoverTimeout")) { - uint32_t delay = 0; // default to 0 if attribute is blank + time_t delay = 0; // default to 0 if attribute is blank if (attr_mod->modType != SA_IMM_ATTR_VALUES_DELETE && attr_mod->modAttr.attrValues != nullptr) { -delay = (*((SaUint32T *)attr_mod->modAttr.attrValues[0])); +delay = (*((time_t *)attr_mod->modAttr.attrValues[0])); } avd_cb->node_failover_delay = delay; TRACE("osafAmfDelayNodeFailoverTimeout changed to '%llu'", avd_cb->node_failover_delay); } else if (!strcmp(attr_mod->modAttr.attrName, "osafAmfDelayNodeFailoverNodeWaitTimeout")) { - uint32_t delay = kDefaultNodeWaitTime; + time_t delay = kDefaultNodeWaitTime; if (attr_mod->modType != SA_IMM_ATTR_VALUES_DELETE && attr_mod->modAttr.attrValues != nullptr) { -delay = (*((SaUint32T *)attr_mod->modAttr.attrValues[0])); +delay = (*((time_t *)attr_mod->modAttr.attrValues[0])); } avd_cb->node_failover_node_wait = delay; TRACE("osafAmfDelayNodeFailoverNodeWaitTimeout changed to '%llu'", @@ 
-166,18 +166,19 @@ SaAisErrorT Configuration::get_config(void) { (SaImmAttrValuesT_2 ***)&attributes) == SA_AIS_OK) { uint32_t value; +time_t time_value; TRACE("reading configuration '%s'", osaf_extended_name_borrow()); if (immutil_getAttr("osafAmfRestrictAutoRepairEnable", attributes, 0, &value) == SA_AIS_OK) { configuration->restrict_auto_repair(static_cast(value)); } if (immutil_getAttr("osafAmfDelayNodeFailoverTimeout", attributes, 0, -&value) == SA_AIS_OK) { - avd_cb->node_failover_delay = value; +&time_value) == SA_AIS_OK) { + avd_cb->node_failover_delay = time_value; } if (immutil_getAttr("osafAmfDelayNodeFailoverNodeWaitTimeout", attributes, 0, -&value) == SA_AIS_OK) { - avd_cb->node_failover_node_wait = value; +&time_value) == SA_AIS_OK) { + avd_cb->node_failover_node_wait = time_value; } } -- 2.7.4
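Why a too-small destination variable corrupts neighbouring memory can be illustrated with a byte buffer standing in for the stack frame. This is a sketch only: the 8-byte layout, the helper names and the marker value are hypothetical, and it assumes the attribute value is written as 8 bytes (as time_t is on typical 64-bit Linux) into a slot that was declared as a 4-byte uint32_t.

```python
import struct

NEIGHBOUR_MARKER = 0xDEADBEEF  # stands in for whatever lives next to the variable


def make_frame():
    """Model a frame: a 4-byte uint32_t slot followed by a 4-byte neighbour."""
    frame = bytearray(8)
    struct.pack_into("<I", frame, 4, NEIGHBOUR_MARKER)
    return frame


def store_time_value(frame, value):
    """Write an 8-byte value at the start of the 4-byte slot, as the buggy call would."""
    struct.pack_into("<q", frame, 0, value)


def read_frame(frame):
    """Return (slot, neighbour) as two unsigned 32-bit integers."""
    return struct.unpack_from("<II", frame, 0)
```

After store_time_value(frame, 5) the slot reads 5 but the neighbour no longer holds its marker: the upper four bytes of the 8-byte write landed on it. Declaring the destination with the attribute's real width, as the patch does, removes the overlap.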
[devel] [PATCH 0/1] Review Request for osaf: return a help message if no parameter is specified [#3118]
Summary: osaf: return a help message if no parameter is specified [#3118] Review request for Ticket(s): 3118 Peer Reviewer(s): Minh, Thuan Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-3118 Base revision: 2a7ec1f63710f9e8f679bbceb18032e0ebb1b46a Personal repository: git://git.code.sf.net/u/userid-2226215/review Impacted area Impact y/n Docs n Build system n RPM/packaging n Configuration files n Startup scripts n SAF services n OpenSAF services n Core libraries y Samples n Tests n Other n Comments (indicate scope for each "y" above): - revision 4fd8ba91a1943a6ed696f86763b6ee804bccc27c Author: Gary Lee Date: Wed, 13 Nov 2019 17:09:35 +1100 osaf: return a help message if no parameter is specified [#3118] Complete diffstat: -- src/osaf/consensus/plugins/tcp/tcp.plugin | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) Testing Commands: - Run tcp.plugin without an argument Testing, Expected Results: -- A help message should be printed instead of crashing Conditions of Submission: - Ack or in 7 days Arch Built Started Linux distro --- mips n n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. 
Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] osaf: return a help message if no parameter is specified [#3118]
--- src/osaf/consensus/plugins/tcp/tcp.plugin | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/src/osaf/consensus/plugins/tcp/tcp.plugin b/src/osaf/consensus/plugins/tcp/tcp.plugin index 1b5ddf5..0be20fc 100755 --- a/src/osaf/consensus/plugins/tcp/tcp.plugin +++ b/src/osaf/consensus/plugins/tcp/tcp.plugin @@ -149,7 +149,12 @@ class ArbitratorPlugin(object): params = [] if args: params.append(args) -return getattr(self, command)(*params) +if command: +return getattr(self, command)(*params) +else: +ret = {'code': 0, + 'output': parser.format_help()} +return ret def get_node_name(self): node_file = open(self.node_name_file) -- 2.7.4 ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
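The dispatch pattern this patch guards can be sketched as below. The class, its single "get" command and the program name are hypothetical and are not the real tcp.plugin; the point is only that getattr() with a missing command name fails, so an absent command is answered with the parser's help text instead.

```python
import argparse


class ExamplePlugin:
    """Hypothetical plugin that dispatches a command name to a method."""

    def __init__(self):
        self.parser = argparse.ArgumentParser(prog="example.plugin")
        # nargs="?" makes the command optional, so it is None when omitted.
        self.parser.add_argument("command", nargs="?")
        self.parser.add_argument("args", nargs="*")

    def get(self, key):
        # Placeholder command used only for this illustration.
        return {"code": 0, "output": "value-of-" + key}

    def run(self, argv):
        options = self.parser.parse_args(argv)
        if options.command:
            return getattr(self, options.command)(*options.args)
        # Without this branch, getattr(self, None) raises TypeError.
        return {"code": 0, "output": self.parser.format_help()}
```

For example, ExamplePlugin().run([]) returns the usage text, while ExamplePlugin().run(["get", "k"]) calls the get method.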
Re: [devel] [PATCH 1/1] amf: amfnd should ignore amfd down event during shutting down [#3117]
ack (review only) On 7/11/19 8:33 pm, thuan.tran wrote: When the cluster is stopped by immadm, amfnd (which is shutting down) may see an amfd down event and order a node reboot. --- src/amf/amfnd/di.cc | 6 ++ 1 file changed, 6 insertions(+) diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc index 2043c6064..1f310b949 100644 --- a/src/amf/amfnd/di.cc +++ b/src/amf/amfnd/di.cc @@ -664,6 +664,12 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, AVND_EVT *evt) { LOG_WA("AMF director unexpectedly crashed"); + if (m_AVND_IS_SHUTTING_DOWN(cb)) { +LOG_WA("Ignore because AMFND is in SHUTDOWN state"); +TRACE_LEAVE(); +return rc; + } + // if headless is disabled OR if the amfd down came from the local node, just // reboot if (cb->scs_absence_max_duration == 0 ||
Re: [devel] [PATCH 1/1] amfnd: reset transition descriptor during comp restart [#3103]
Hi Alex ack Thanks Gary On 18/10/19 2:56 am, Jones, Alex wrote: If a component is configured to restart, instead of failover, on failure, the previous transition descriptor is passed to the CSI set callback after the restart. The transition descriptor is not reset by amfnd in this case. Always reset the transition descriptor to NEW_ASSIGN during a reassignment due to restart. --- src/amf/amfnd/comp.cc | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/amf/amfnd/comp.cc b/src/amf/amfnd/comp.cc index a12171c28..10c77a462 100644 --- a/src/amf/amfnd/comp.cc +++ b/src/amf/amfnd/comp.cc @@ -1448,6 +1448,9 @@ uint32_t avnd_comp_csi_reassign(AVND_CB *cb, AVND_COMP *comp) { m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET( curr, AVND_COMP_CSI_ASSIGN_STATE_ASSIGNING); + // reset the transition descriptor + curr->trans_desc = SA_AMF_CSI_NEW_ASSIGN; + /* invoke the callback */ rc = avnd_comp_cbk_send(cb, curr->comp, AVSV_AMF_CSI_SET, 0, curr); } -- 2.20.1
Re: [devel] [PATCH 1/1] mds: Disable mds flow control for mds broadcast/multicast message [#3101]
Hi Minh ack (review only) Thanks On 17/10/19 2:00 pm, Minh Chau wrote: The mds flow control has already been disabled for unfragmented broadcast/multicast messages if tipc multicast is enabled. This patch revisits that change and extends it to fragmented messages. --- src/mds/mds_tipc_fctrl_intf.cc | 47 src/mds/mds_tipc_fctrl_msg.h | 11 +++--- src/mds/mds_tipc_fctrl_portid.cc | 47 ++-- src/mds/mds_tipc_fctrl_portid.h | 3 ++- 4 files changed, 69 insertions(+), 39 deletions(-) diff --git a/src/mds/mds_tipc_fctrl_intf.cc b/src/mds/mds_tipc_fctrl_intf.cc index b803bfe..fe3dbd5 100644 --- a/src/mds/mds_tipc_fctrl_intf.cc +++ b/src/mds/mds_tipc_fctrl_intf.cc @@ -133,7 +133,7 @@ uint32_t process_flow_event(const Event& evt) { kChunkAckSize, sock_buf_size); portid_map[TipcPortId::GetUniqueId(evt.id_)] = portid; rc = portid->ReceiveData(evt.mseq_, evt.mfrag_, -evt.fseq_, evt.svc_id_); +evt.fseq_, evt.svc_id_, evt.snd_type_, is_mcast_enabled); } else if (evt.type_ == Event::Type::kEvtRcvIntro) { portid = new TipcPortId(evt.id_, data_sock_fd, kChunkAckSize, sock_buf_size); @@ -147,7 +147,7 @@ uint32_t process_flow_event(const Event& evt) { } else { if (evt.type_ == Event::Type::kEvtRcvData) { rc = portid->ReceiveData(evt.mseq_, evt.mfrag_, - evt.fseq_, evt.svc_id_); + evt.fseq_, evt.svc_id_, evt.snd_type_, is_mcast_enabled); } if (evt.type_ == Event::Type::kEvtRcvChunkAck) { portid->ReceiveChunkAck(evt.fseq_, evt.chunk_size_); @@ -430,6 +430,7 @@ uint32_t mds_tipc_fctrl_drop_data(uint8_t *buffer, uint16_t len, HeaderMessage header; header.Decode(buffer); + Event* pevt = nullptr; // if mds support flow control if ((header.pro_ver_ & MDS_PROT_VER_MASK) == MDS_PROT_FCTRL) { if (header.pro_id_ == MDS_PROT_FCTRL_ID) { @@ -438,9 +439,10 @@ uint32_t mds_tipc_fctrl_drop_data(uint8_t *buffer, uint16_t len, ChunkAck ack; ack.Decode(buffer); // send to the event thread -if (m_NCS_IPC_SEND(&mbx_events, -new Event(Event::Type::kEvtSendChunkAck, id, ack.svc_id_, -header.mseq_, header.mfrag_, ack.acked_fseq_, 
ack.chunk_size_), +pevt = new Event(Event::Type::kEvtSendChunkAck, id, ack.svc_id_, +header.mseq_, header.mfrag_, ack.acked_fseq_); +pevt->chunk_size_ = ack.chunk_size_; +if (m_NCS_IPC_SEND(&mbx_events, pevt, NCS_IPC_PRIORITY_HIGH) != NCSCC_RC_SUCCESS) { m_MDS_LOG_ERR("FCTRL: Failed to send msg to mbx_events, Error[%s]", strerror(errno)); @@ -453,9 +455,9 @@ uint32_t mds_tipc_fctrl_drop_data(uint8_t *buffer, uint16_t len, DataMessage data; data.Decode(buffer); // send to the event thread - if (m_NCS_IPC_SEND(&mbx_events, - new Event(Event::Type::kEvtDropData, id, data.svc_id_, - header.mseq_, header.mfrag_, header.fseq_), + pevt = new Event(Event::Type::kEvtDropData, id, data.svc_id_, + header.mseq_, header.mfrag_, header.fseq_); + if (m_NCS_IPC_SEND(&mbx_events, pevt, NCS_IPC_PRIORITY_HIGH) != NCSCC_RC_SUCCESS) { m_MDS_LOG_ERR("FCTRL: Failed to send msg to mbx_events, Error[%s]", strerror(errno)); @@ -474,6 +476,7 @@ uint32_t mds_tipc_fctrl_rcv_data(uint8_t *buffer, uint16_t len, HeaderMessage header; header.Decode(buffer); + Event* pevt = nullptr; // if mds support flow control if ((header.pro_ver_ & MDS_PROT_VER_MASK) == MDS_PROT_FCTRL) { if (header.pro_id_ == MDS_PROT_FCTRL_ID) { @@ -482,9 +485,10 @@ uint32_t mds_tipc_fctrl_rcv_data(uint8_t *buffer, uint16_t len, ChunkAck ack; ack.Decode(buffer); // send to the event thread -if (m_NCS_IPC_SEND(&mbx_events, -new Event(Event::Type::kEvtRcvChunkAck, id, ack.svc_id_, -header.mseq_, header.mfrag_, ack.acked_fseq_, ack.chunk_size_), +pevt = new Event(Event::Type::kEvtRcvChunkAck, id, ack.svc_id_, +header.mseq_, header.mfrag_, ack.acked_fseq_); +pevt->chunk_size_ = ack.chunk_size_; +if (m_NCS_IPC_SEND(&mbx_events, pevt, NCS_IPC_PRIORITY_HIGH) != NCSCC_RC_SUCCESS) { m_MDS_LOG_ERR("FCTRL: Failed to send msg to mbx_events, Error[%s]", strerror(errno)); @@ -494,9 +498,9 @@ uint32_t mds_tipc_fctrl_rcv_data(uint8_t *buffer, uint16_t len, Nack nack; nack.Decode(buffer); // send to the event thread -if (m_NCS_IPC_SEND(&mbx_events, -new 
Event(Event::Type::kEvtRcvNack, id, nack.svc_id_, -header.mseq_, header.mfrag_, nack.nacked_fseq_), +pevt = new Event(Event::Type::kEvtRcvNack, id, nack.svc_id_, +header.mseq_, header.mfrag_, nack.nacked_fseq_); +if
Re: [devel] [PATCH 1/1] mds: add more tests for mds flow control [#3091]
Hi Thuan Looks OK (review only). Thanks Gary On 14/10/19 8:44 pm, thuan.tran wrote: mdstest for overload - 2 senders overload one receivers - one sender overloads 2 receivers mdstest for SNA (Serial Number Arithmetic) - without overload, mds sender gradually sends more than 65535 messages and receivers should receive them all - with overload, mds sender sends a burst of greater than 65535 messages and receivers should receive them all mdstest for #1960 backward compatibility, in order to test the txprob timer - sender enables, receiver disables - sender disables, receiver enables --- src/mds/apitest/mdstipc.h | 6 + src/mds/apitest/mdstipc_api.c | 480 +- 2 files changed, 421 insertions(+), 65 deletions(-) diff --git a/src/mds/apitest/mdstipc.h b/src/mds/apitest/mdstipc.h index 2bd44b4fa..5fd7b9c6e 100644 --- a/src/mds/apitest/mdstipc.h +++ b/src/mds/apitest/mdstipc.h @@ -145,6 +145,12 @@ typedef struct tet_mds_recvd_msg_info { uint16_t len; } TET_MDS_RECVD_MSG_INFO; +typedef struct COUNTER { + MDS_DEST fr_dest; + uint32_t msg_count; + struct COUNTER *next; +} COUNTER; + /* GLOBAL variables / TET_ADEST gl_tet_adest; TET_VDEST diff --git a/src/mds/apitest/mdstipc_api.c b/src/mds/apitest/mdstipc_api.c index f667d7385..5c0e28ab2 100644 --- a/src/mds/apitest/mdstipc_api.c +++ b/src/mds/apitest/mdstipc_api.c @@ -31,6 +31,7 @@ #define MSG_SIZE MDS_DIRECT_BUF_MAXSIZE static MDS_CLIENT_MSG_FORMAT_VER gl_set_msg_fmt_ver; +COUNTER *gl_head_counters = NULL; MDS_SVC_ID svc_ids[3] = {2006, 2007, 2008}; @@ -13105,9 +13106,62 @@ void tet_create_default_PWE_VDEST_tp() test_validate(FAIL, 0); } -void tet_sender(uint32_t msg_count, uint32_t msg_size) +static void reset_counters(void) +{ + COUNTER *tmp = gl_head_counters; + while (tmp != NULL) { + gl_head_counters = tmp->next; + free(tmp); + tmp = gl_head_counters; + } +} + +static uint32_t increase_counters(MDS_DEST dest) +{ + COUNTER *tmp = gl_head_counters; + while (tmp != NULL) { + if (tmp->fr_dest == dest) { + tmp->msg_count++; + 
printf("\nGot %d message from %x\n", + tmp->msg_count, dest); + return tmp->msg_count; + } + tmp = tmp->next; + } + if (tmp == NULL) { + COUNTER *new = (COUNTER *)malloc(sizeof(COUNTER)); + new->fr_dest = dest; + new->msg_count = 1; + new->next = gl_head_counters; + gl_head_counters = new; + printf("\nGot %d message from %x\n", + new->msg_count, dest); + return new->msg_count; + } + return 0; +} + +static bool verify_counters(uint32_t expect_num) +{ + COUNTER *tmp = gl_head_counters; + if (tmp == NULL) { + printf("\nNo message\n"); + return false; + } + while (tmp != NULL) { + if (tmp->msg_count != expect_num) { + printf("\nGot %d message from %x\n", + tmp->msg_count, tmp->fr_dest); + return false; + } + tmp = tmp->next; + } + return true; +} + +void tet_sender(MDS_SVC_ID svc_id, uint32_t msg_num, uint32_t msg_size, + int svc_num, MDS_SVC_ID to_svcids[]) { - int live = 100; // sender live max 100s TET_MDS_MSG *mesg; if (msg_size > TET_MSG_SIZE_MIN) { printf("\nSender: msg_size > TET_MSG_SIZE_MIN\n"); @@ -13117,72 +13171,84 @@ void tet_sender(uint32_t msg_count, uint32_t msg_size) memset(mesg, 0, sizeof(TET_MDS_MSG)); printf("\nStarted Sender (pid:%d) svc_id=%d\n", - (int)getpid(), NCSMDS_SVC_ID_INTERNAL_MIN); + (int)getpid(), svc_id); if (adest_get_handle() != NCSCC_RC_SUCCESS) { printf("\n: Sender FAIL to get adest handle\n"); exit(1); } if (mds_service_install(gl_tet_adest.mds_pwe1_hdl, - NCSMDS_SVC_ID_INTERNAL_MIN, 1, + svc_id, 1, NCSMDS_SCOPE_NONE, false, false) != NCSCC_RC_SUCCESS) { printf("\nSender FAIL to install the service\n"); exit(1); } - MDS_SVC_ID svcids[] = {NCSMDS_SVC_ID_EXTERNAL_MIN}; if (mds_service_subscribe( - gl_tet_adest.mds_pwe1_hdl, NCSMDS_SVC_ID_INTERNAL_MIN, - NCSMDS_SCOPE_INTRANODE, 1, svcids) != NCSCC_RC_SUCCESS) { + gl_tet_adest.mds_pwe1_hdl, svc_id, + NCSMDS_SCOPE_INTRANODE, + svc_num, to_svcids) != NCSCC_RC_SUCCESS) { printf("\nSender
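The SNA (Serial Number Arithmetic) cases named in the commit message of this patch depend on sequence numbers staying comparable across wrap-around. The sketch below shows that comparison in the style of RFC 1982. It assumes a 16-bit sequence number, which the "more than 65535 messages" wording suggests but the quoted patch does not state; all function names are hypothetical and this is not the mds code.

```python
SERIAL_BITS = 16                 # assumption: 16-bit sequence numbers
MODULUS = 1 << SERIAL_BITS       # 65536
HALF = 1 << (SERIAL_BITS - 1)    # 32768


def serial_add(seq, n):
    """Advance a sequence number, wrapping at the modulus."""
    return (seq + n) % MODULUS


def serial_lt(a, b):
    """True if a comes before b in serial-number order (RFC 1982 style)."""
    return a != b and ((b - a) % MODULUS) < HALF


def count_accepted(total, start=0):
    """Send `total` in-order messages and count those the receiver accepts."""
    expected = start
    seq = start
    accepted = 0
    for _ in range(total):
        # A message is stale only if it is older than the expected one.
        if not serial_lt(seq, expected):
            accepted += 1
            expected = serial_add(seq, 1)
        seq = serial_add(seq, 1)
    return accepted
```

With a plain integer comparison, sequence number 0 would look older than 65535 and every message after the wrap would be rejected; serial_lt(65535, 0) is true, so the receiver in this model keeps accepting.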
Re: [devel] [PATCH 1/1] osaf: perform handshake in tcp_server in new thread [#3099]
Hi I should have put one more comment in. Currently, the handshake is done in the equivalent of accept() running in the 'main thread'. If a client is malicious or faulty, then no one else can connect. But finish_request() is run from the thread created for each client. Gary On 11/10/19 2:22 pm, Gary Lee wrote: --- src/osaf/consensus/plugins/tcp/tcp_server.py | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/src/osaf/consensus/plugins/tcp/tcp_server.py b/src/osaf/consensus/plugins/tcp/tcp_server.py index a7f22f2..c10859c 100755 --- a/src/osaf/consensus/plugins/tcp/tcp_server.py +++ b/src/osaf/consensus/plugins/tcp/tcp_server.py @@ -73,10 +73,15 @@ class ThreadedRPCServer(ThreadingMixIn, certfile=CERTFILE, keyfile=KEYFILE, cert_reqs=ssl.CERT_NONE, -ssl_version=ssl.PROTOCOL_TLSv1_2) +ssl_version=ssl.PROTOCOL_TLSv1_2, +do_handshake_on_connect=False) self.server_bind() self.server_activate() +def finish_request(self, request, client_address): + request.do_handshake() + return SimpleXMLRPCServer.finish_request(self, request, client_address) + class Arbitrator(object): """ Implementation of a simple arbitrator """ smime.p7s Description: S/MIME Cryptographic Signature ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for osaf: perform handshake in tcp_server in new thread [#3099]
Summary: osaf: perform handshake in tcp_server in new thread [#3099] Review request for Ticket(s): 3099 Peer Reviewer(s): Hans, Minh, Thuan Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-3099 Base revision: e4c3c0c95644238fc84f31352e8ef289d9820ab4 Personal repository: git://git.code.sf.net/u/userid-2226215/review Impacted area Impact y/n Docs n Build system n RPM/packaging n Configuration files n Startup scripts n SAF services n OpenSAF services n Core libraries n Samples y Tests n Other n Comments (indicate scope for each "y" above): - revision fed332c489eb687982071013a8cb64e1932960e0 Author: Gary Lee Date: Fri, 11 Oct 2019 14:08:50 +1100 osaf: perform handshake in tcp_server in new thread [#3099] Complete diffstat: -- src/osaf/consensus/plugins/tcp/tcp_server.py | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) Testing Commands: - 1) Run tcp_server.py manually 2) telnet localhost and don't enter anything 3) Run tcp.plugin and make sure it receives a response from the server Testing, Expected Results: -- As above. Without this patch, Step 3 will not work Conditions of Submission: - Ack from anyone Arch Built Started Linux distro --- mips n n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. 
internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] osaf: perform handshake in tcp_server in new thread [#3099]
---
 src/osaf/consensus/plugins/tcp/tcp_server.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/osaf/consensus/plugins/tcp/tcp_server.py b/src/osaf/consensus/plugins/tcp/tcp_server.py
index a7f22f2..c10859c 100755
--- a/src/osaf/consensus/plugins/tcp/tcp_server.py
+++ b/src/osaf/consensus/plugins/tcp/tcp_server.py
@@ -73,10 +73,15 @@ class ThreadedRPCServer(ThreadingMixIn,
                                       certfile=CERTFILE,
                                       keyfile=KEYFILE,
                                       cert_reqs=ssl.CERT_NONE,
-                                      ssl_version=ssl.PROTOCOL_TLSv1_2)
+                                      ssl_version=ssl.PROTOCOL_TLSv1_2,
+                                      do_handshake_on_connect=False)
         self.server_bind()
         self.server_activate()
 
+    def finish_request(self, request, client_address):
+        request.do_handshake()
+        return SimpleXMLRPCServer.finish_request(self, request, client_address)
+
 class Arbitrator(object):
     """ Implementation of a simple arbitrator """
-- 
2.7.4
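The patch above defers the TLS handshake from accept() to finish_request(), which a threading server runs in the per-connection worker thread. A minimal sketch of the same pattern follows, assuming a plain socketserver-based server rather than the actual tcp_server.py. The names DeferredHandshakeMixIn and ThreadedTLSServer are hypothetical, and the ssl.SSLContext (certificate, key, protocol version) is assumed to be prepared by the caller.

```python
import socketserver
import ssl


class DeferredHandshakeMixIn:
    """Run the TLS handshake in finish_request() instead of in accept().

    With socketserver.ThreadingMixIn, finish_request() is called from the
    per-connection worker thread, so a client that connects and then stays
    silent only blocks its own thread, not the accept loop.
    """

    def finish_request(self, request, client_address):
        # 'request' is expected to be an SSLSocket that was created with
        # do_handshake_on_connect=False, so the handshake is still pending.
        request.do_handshake()
        super().finish_request(request, client_address)


class ThreadedTLSServer(DeferredHandshakeMixIn,
                        socketserver.ThreadingMixIn,
                        socketserver.TCPServer):
    """Hypothetical server; 'context' is an ssl.SSLContext from the caller."""

    def __init__(self, address, handler, context):
        super().__init__(address, handler, bind_and_activate=False)
        # Wrap the listening socket. Accepted sockets inherit the
        # do_handshake_on_connect setting, so accept() returns without
        # waiting for the peer to start the handshake.
        self.socket = context.wrap_socket(self.socket,
                                          server_side=True,
                                          do_handshake_on_connect=False)
        self.server_bind()
        self.server_activate()
```

If the handshake fails or the peer disconnects, do_handshake() raises in the worker thread, where socketserver's normal per-request error handling applies; the listening thread is not affected.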
[devel] [PATCH 1/1] osaf: return new takeover_request immediately [#3098]
If a takeover_request is created just before the active controller
calls 'watch takeover_request', then it's possible that the active
rded instance is not informed of the request.

When 'watch takeover_request' is called, check if there's
already a takeover_request in 'NEW' state and return immediately.
---
 src/osaf/consensus/plugins/etcd3.plugin | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/osaf/consensus/plugins/etcd3.plugin b/src/osaf/consensus/plugins/etcd3.plugin
index d926885..4e09ef6 100644
--- a/src/osaf/consensus/plugins/etcd3.plugin
+++ b/src/osaf/consensus/plugins/etcd3.plugin
@@ -337,13 +337,22 @@ watch() {
   orig_value=$(get "$watch_key")
   result=$?
 
-  if [ "$result" -le "1" ]; then
+  if [ "$result" -le 1 ]; then
+    if [ "$result" -eq 0 ] && [ "$watch_key" == "$takeover_request" ]; then
+      state=$(echo $orig_value | awk '{print $4}')
+      if [ "$state" == "NEW" ]; then
+        # takeover_request already exists; maybe it was written created
+        # while this node was being promoted
+        echo $orig_value
+        return 0
+      fi
+    fi
     while true
     do
       sleep $heartbeat_interval
       current_value=$(get "$watch_key")
       result=$?
-      if [ "$result" -gt "1" ]; then
+      if [ "$result" -gt 1 ]; then
         # etcd down?
         if [ "$watch_key" == "$takeover_request" ]; then
           hostname=`cat $node_name_file`
-- 
2.7.4
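The race described in this commit message comes from a watch that only reports changes: a request written just before the watch starts is taken as the baseline and never reported. The sketch below illustrates the fix in Python under simplifying assumptions. The `get` callback is a hypothetical stand-in for the plugin's key-value lookup, the state is assumed to be the fourth whitespace-separated field (mirroring awk '{print $4}' in the patch), and handling of an unreachable store is left out.

```python
import time


def watch(get, key, takeover_key="takeover_request",
          interval=0.1, sleep=time.sleep):
    """Block until the value stored under 'key' changes, then return it.

    'get' is a caller-supplied function returning the current value as a
    string, or None if the key does not exist.
    """
    original = get(key)

    # A request written just before the watch started would otherwise go
    # unnoticed, because the loop below only reports *changes*.
    if key == takeover_key and original is not None:
        fields = original.split()
        # Assumption: the request state is the fourth field of the value.
        if len(fields) >= 4 and fields[3] == "NEW":
            return original

    while True:
        sleep(interval)
        current = get(key)
        if current != original:
            return current
```

Passing `sleep` as a parameter is only a convenience for exercising the function without real delays.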
[devel] [PATCH 0/1] Review Request for osaf: return new takeover_request immediately [#3098]
Summary: osaf: return new takeover_request immediately [#3098]
Review request for Ticket(s): 3098
Peer Reviewer(s): Minh, Thuan, Thang, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3098
Base revision: cafbc5d02c90b57c7c94a7735ce8e002224b3d6b
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            n
 OpenSAF services        n
 Core libraries          n
 Samples                 y
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 903ebd435993cce00350c60827e35b15a78ca3c8
Author: Gary Lee
Date:   Thu, 10 Oct 2019 14:53:41 +1100

osaf: return new takeover_request immediately [#3098]

If a takeover_request is created just before the active controller
calls 'watch takeover_request', then it's possible that the active
rded instance is not informed of the request.

When 'watch takeover_request' is called, check if there's
already a takeover_request in 'NEW' state and return immediately.

Complete diffstat:
------------------
 src/osaf/consensus/plugins/etcd3.plugin | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
ack from anyone

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n
Re: [devel] [PATCH 1/1] osaf: add tcp arbitrator [#3064]
Hi Hans

OK that’s a good idea.

thanks

> On 4 Oct 2019, at 11:06 pm, Hans Nordebäck wrote:
>
> Hi Gary,
>
> ack, review only. One comment/suggestion: can we provide a small script
> that generates the x509 certificate (use e.g. openssl X509 ... ) instead
> of including a self signed cert?
>
> /BR Hans
>
>> On Tue, 2019-10-01 at 12:53 +1000, Gary Lee wrote:
>> ---
>>  src/osaf/consensus/plugins/tcp/README          |  41
>>  src/osaf/consensus/plugins/tcp/certificate.pem |  20
>>  src/osaf/consensus/plugins/tcp/key.pem         |  28
>>  src/osaf/consensus/plugins/tcp/tcp.plugin      | 520
>>  src/osaf/consensus/plugins/tcp/tcp_server.py   | 157
>>  5 files changed, 766 insertions(+)
>>  create mode 100644 src/osaf/consensus/plugins/tcp/README
>>  create mode 100644 src/osaf/consensus/plugins/tcp/certificate.pem
>>  create mode 100644 src/osaf/consensus/plugins/tcp/key.pem
>>  create mode 100755 src/osaf/consensus/plugins/tcp/tcp.plugin
>>  create mode 100755 src/osaf/consensus/plugins/tcp/tcp_server.py
>>
>> diff --git a/src/osaf/consensus/plugins/tcp/README b/src/osaf/consensus/plugins/tcp/README
>> new file mode 100644
>> index 0000000..6f739e8
>> --- /dev/null
>> +++ b/src/osaf/consensus/plugins/tcp/README
>> @@ -0,0 +1,41 @@
>> +TCP arbitrator
>> +
>> +The TCP arbitrator may be useful for deployments where deploying etcd is not
>> +feasible. An example arbitrator is provided to help prevent split brain in
>> +clusters that contain up to 2 system controllers.
>> +
>> +The example arbitrator is a simple python based program that can be deployed on
>> +a single payload or a node external to the cluster.
>> +
>> +Two main pieces of information are stored on the arbitrator: the hostname of the
>> +current active controller and a heartbeat timestamp.
>> +
>> +An active controller sends a heartbeat to the controller every 100ms using TLs
>> +over a persistent TCP connection. It should self-fence if it is unable to
>> +heartbeat, as it is likely to be separated from the arbitrator.
>> +
>> +A candidate active controller must check the existing controller is not
>> +heartbeating before promoting itself active. On a cluster using TIPC,
>> +the timeout value is the TIPC link tolerance timeout. On a TCP based cluster,
>> +the timeout is calculated from FMS_TAKEOVER_REQUEST_VALID_TIME.
>> +
>> +Suggested fmd.conf configuration:
>> +
>> +export FMS_SPLIT_BRAIN_PREVENTION=1
>> +export FMS_KEYVALUE_STORE_PLUGIN_CMD=/full/path/to/tcp.plugin
>> +export FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE=0 (any other setting is ignored)
>> +export FMS_RELAXED_NODE_PROMOTION=1
>> +
>> +The above settings will allow a controller to be elected active during
>> +cluster startup, even if the arbitrator is not yet running.
>> +If the arbitrator becomes temporarily unavailable, the controllers will
>> +remain running if they can see each other. If an active controller becomes
>> +isolated from the standby *and* the arbitrator, it will self-fence and the
>> +standby will become active (if located in the same network partition as
>> +the arbitrator).
>> +
>> +The provided self-signed certificate is an example only, and was generated using:
>> +
>> +openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 10 -out certificate.pem
>> +
>> +It must be replaced in an actual deployment!!
>> diff --git a/src/osaf/consensus/plugins/tcp/certificate.pem b/src/osaf/consensus/plugins/tcp/certificate.pem
>> new file mode 100644
>> index 0000000..e0b4993
>> --- /dev/null
>> +++ b/src/osaf/consensus/plugins/tcp/certificate.pem
>> @@ -0,0 +1,20 @@
>> +-----BEGIN CERTIFICATE-----
>> +MIIDUTCCAjmgAwIBAgIJANrPYThNMllvMA0GCSqGSIb3DQEBCwUAMD4xCzAJBgNV
>> +BAYTAkFVMQ4wDAYDVQQIDAVTdGF0ZTENMAsGA1UEBwwEQ2l0eTEQMA4GA1UECgwH
>> +T3BlblNBRjAgFw0xOTA5MzAwMDMxNTRaGA8yMjkzMDcxNTAwMzE1NFowPjELMAkG
>> +A1UEBhMCQVUxDjAMBgNVBAgMBVN0YXRlMQ0wCwYDVQQHDARDaXR5MRAwDgYDVQQK
>> +DAdPcGVuU0FGMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5pCFKYnS
>> ++pi0gzrRWPRYg1sak9VpNK+MkKbj+m0bptRt/8JvosV62js4q5Da3ldq2AAcEJyf
>> +gd02YZ4HUDdCMgMtlWT1CAx89rNpozRwyj5g+4cfmOqiz7ApeZ9yqltInjG720DT
>> +lam2/R4/00zmFGAqD2ZGPiOY93bjYx+GhtiHcDvpJuZS2Z2vQ/Dd09v6Omhus0rZ
>> +WMrENyfavc7HwFv2z/qi4Hsb/Aa9ZuAXUKp1Q2cvC0XWdRJMdZaZfGUlTfY6X8ar
>> +hSnswHJJKIjBq/0jYpztntOubceOuGVyezxPVXPw5qiBLO7ZyYNgN9IMoF6Rbu9y
>> +
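The README quoted in this message says the arbitrator stores two things, the hostname of the current active controller and a heartbeat timestamp, and that a candidate may only promote itself once the existing active has stopped heartbeating. A toy in-memory model of that rule is sketched below. It is not the code in tcp_server.py: the class and method names are hypothetical, the timeout is passed in by the caller, and networking, TLS and persistence are left out.

```python
import time


class Arbitrator:
    """Toy arbitrator: one active hostname plus its last heartbeat time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.active = None
        self.last_heartbeat = None

    def heartbeat(self, hostname):
        """Record a heartbeat; only the current active may refresh it.

        A False result tells the caller it is no longer the active
        controller as far as the arbitrator is concerned.
        """
        if hostname != self.active:
            return False
        self.last_heartbeat = self._clock()
        return True

    def try_promote(self, hostname, timeout):
        """Make 'hostname' active if there is no active controller, or if
        the current one has not sent a heartbeat for 'timeout' seconds."""
        now = self._clock()
        stale = (self.last_heartbeat is None
                 or now - self.last_heartbeat >= timeout)
        if self.active is None or self.active == hostname or stale:
            self.active = hostname
            self.last_heartbeat = now
            return True
        return False
```

In this model the timeout would correspond to the TIPC link tolerance or to a value derived from FMS_TAKEOVER_REQUEST_VALID_TIME, as the README describes; how that value is derived is not shown here.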
[devel] [PATCH 0/1] Review Request for amf: add asserts to problematic areas identified by codechecker [#3077]
Summary: amf: add asserts to problematic areas identified by codechecker [#3077]
Review request for Ticket(s): 3077
Peer Reviewer(s): Hans, Minh, Thuan
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3077
Base revision: 05064a1cfd0aeaf824dce7602d535654b3457e30
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 39c8ca156da2acbaecb83ae76ce7d9bc480a4c64
Author: Gary Lee
Date:   Thu, 3 Oct 2019 15:07:30 +1000

amf: add asserts to problematic areas identified by codechecker [#3077]

Complete diffstat:
------------------
 src/amf/amfd/sg_nway_fsm.cc | 2 ++
 src/amf/amfd/sgtype.cc      | 1 +
 src/amf/amfnd/comp.cc       | 2 ++
 src/amf/amfnd/susm.cc       | 1 +
 4 files changed, 6 insertions(+)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
ack from anyone

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n
Re: [devel] [PATCH 0/1] Review Request for osaf: add tcp arbitrator [#3064]
Hi Alex

> I see in the README this is usable for "clusters that contain up to 2
> system controllers." What are the limiting factors for applying it to a
> cluster with more than 2 system controllers (where the others are running
> as spares)?

The TCP arbitrator is intended for use with FMS_RELAXED_NODE_PROMOTION=1.
Otherwise, it becomes a single point of failure. With this setting, two SCs
can remain up if they can see each other.

If roaming SC is enabled, consider the case where two spare SCs become
isolated in a network partition (partition 2), while the existing
active/standby/arbitrator is in partition 1. We would end up with dual
actives, as the SCs in partition 2 will also become active/standby.

Hope that explains it better.

Gary

On 1/10/19 12:53 pm, Gary Lee wrote:
> Summary: osaf: add tcp arbitrator [#3064]
> Review request for Ticket(s): 3064
> Peer Reviewer(s): Minh, Hans, AndersW
> [...]
[devel] [PATCH 0/1] Review Request for osaf: add tcp arbitrator [#3064]
Summary: osaf: add tcp arbitrator [#3064]
Review request for Ticket(s): 3064
Peer Reviewer(s): Minh, Hans, AndersW
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3064
Base revision: 46e9e0f310a6c21dbc89a9ffd8bee26829342c0c
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            n
 OpenSAF services        y
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision feea45602df54671c8e769f2e234b03ad6dcdaeb
Author: Gary Lee
Date:   Tue, 1 Oct 2019 12:47:13 +1000

osaf: add tcp arbitrator [#3064]

Added Files:
------------
 src/osaf/consensus/plugins/tcp/certificate.pem
 src/osaf/consensus/plugins/tcp/key.pem
 src/osaf/consensus/plugins/tcp/README
 src/osaf/consensus/plugins/tcp/tcp.plugin
 src/osaf/consensus/plugins/tcp/tcp_server.py

Complete diffstat:
------------------
 src/osaf/consensus/plugins/tcp/README          |  41
 src/osaf/consensus/plugins/tcp/certificate.pem |  20
 src/osaf/consensus/plugins/tcp/key.pem         |  28
 src/osaf/consensus/plugins/tcp/tcp.plugin      | 520
 src/osaf/consensus/plugins/tcp/tcp_server.py   | 157
 5 files changed, 766 insertions(+)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
ack from anyone

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n
Re: [devel] [PATCH 1/1] amfd: correct handling complete/apply callback on standby sc [#3082]
Hi Thang

ack (review only)

Thanks
Gary

On 16/9/19 4:44 pm, thang.d.nguyen wrote:

During stanby SC comes up, AMF config objects are deleted on active SC.
It causes NOT_EXIST error on standby node.
AMFD on standby should ignore this error in this case.
---
 src/amf/amfd/app.cc        | 29
 src/amf/amfd/comp.cc       | 18
 src/amf/amfd/compcstype.cc | 14
 src/amf/amfd/csi.cc        | 24
 src/amf/amfd/nodegroup.cc  |  7
 src/amf/amfd/sg.cc         | 32
 src/amf/amfd/sgtype.cc     | 11
 src/amf/amfd/si.cc         | 29
 src/amf/amfd/su.cc         | 35
 src/amf/amfd/sutype.cc     | 12
 10 files changed, 162 insertions(+), 49 deletions(-)

diff --git a/src/amf/amfd/app.cc b/src/amf/amfd/app.cc
index 67e5e3e9d..17a259199 100644
--- a/src/amf/amfd/app.cc
+++ b/src/amf/amfd/app.cc
@@ -296,6 +296,11 @@ static void app_ccb_apply_cb(CcbUtilOperationData_t *opdata) {
     case CCBUTIL_MODIFY: {
       const SaImmAttrModificationT_2 *attr_mod;
       app = app_db->find(Amf::to_string(&opdata->objectName));
+      if (app == nullptr && avd_cb->is_active() == false) {
+        LOG_WA("App modify apply (STDBY): app does not exist");
+        break;
+      }
+      assert(app != nullptr);
       int i = 0;
       while ((attr_mod = opdata->param.modify.attrMods[i++]) != nullptr) {
         const SaImmAttrValuesT_2 *attribute = &attr_mod->modAttr;
@@ -448,11 +453,12 @@ SaAisErrorT avd_app_config_get(void) {
   searchParam.searchOneAttr.attrValueType = SA_IMM_ATTR_SASTRINGT;
   searchParam.searchOneAttr.attrValue =
 
-  if (immutil_saImmOmSearchInitialize_2(
+  if ((rc = immutil_saImmOmSearchInitialize_2(
           avd_cb->immOmHandle, nullptr, SA_IMM_SUBTREE,
           SA_IMM_SEARCH_ONE_ATTR | SA_IMM_SEARCH_GET_SOME_ATTR, ,
-          configAttributes, ) != SA_AIS_OK) {
-    LOG_ER("%s: saImmOmSearchInitialize_2 failed: %u", __FUNCTION__, error);
+          configAttributes, )) != SA_AIS_OK) {
+    LOG_ER("%s: saImmOmSearchInitialize_2 failed: %u", __FUNCTION__, rc);
+    error = rc;
     goto done1;
   }
 
@@ -468,9 +474,22 @@ SaAisErrorT avd_app_config_get(void) {
 
     app_add_to_model(app);
 
-    if (avd_sg_config_get(Amf::to_string(), app) != SA_AIS_OK) goto done2;
+    if ((rc = avd_sg_config_get(Amf::to_string(), app)) != SA_AIS_OK) {
+      if ((rc == SA_AIS_ERR_NOT_EXIST) && (avd_cb->is_active() == false)) {
+        avd_app_delete(app);
+        continue;
+      } else {
+        goto done2;
+      }
+    }
 
-    if (avd_si_config_get(app) != SA_AIS_OK) goto done2;
+    if ((rc = avd_si_config_get(app)) != SA_AIS_OK) {
+      if ((rc == SA_AIS_ERR_NOT_EXIST) && (avd_cb->is_active() == false)) {
+        avd_app_delete(app);
+      } else {
+        goto done2;
+      }
+    }
   }
 
   if (rc == SA_AIS_ERR_NOT_EXIST) {
diff --git a/src/amf/amfd/comp.cc b/src/amf/amfd/comp.cc
index 0ff365e55..7e46584db 100644
--- a/src/amf/amfd/comp.cc
+++ b/src/amf/amfd/comp.cc
@@ -1509,6 +1509,7 @@ SaAisErrorT avd_comp_config_get(const std::string _name, AVD_SU *su) {
           SA_IMM_SEARCH_ONE_ATTR | SA_IMM_SEARCH_GET_SOME_ATTR, ,
           configAttributes, )) != SA_AIS_OK) {
     LOG_ER("%s: saImmOmSearchInitialize_2 failed: %u", __FUNCTION__, rc);
+    error = rc;
     goto done1;
   }
 
@@ -1524,9 +1525,15 @@ SaAisErrorT avd_comp_config_get(const std::string _name, AVD_SU *su) {
     num_of_comp_in_su++;
     comp_add_to_model(comp);
 
-    if (avd_compcstype_config_get(Amf::to_string(_name), comp) !=
-        SA_AIS_OK)
-      goto done2;
+    if ((rc = avd_compcstype_config_get(Amf::to_string(_name), comp)) !=
+        SA_AIS_OK) {
+      if ((rc == SA_AIS_ERR_NOT_EXIST) && (avd_cb->is_active() == false)) {
+        avd_comp_delete(comp);
+        num_of_comp_in_su--;
+      } else {
+        goto done2;
+      }
+    }
   }
 
   /* If there are no component in the SU, we treat it as invalid configuration.
@@ -1695,6 +1702,10 @@ static SaAisErrorT ccb_completed_modify_hdlr(CcbUtilOperationData_t *opdata) {
   TRACE_ENTER();
 
   comp = comp_db->find(Amf::to_string(&opdata->objectName));
+  if (comp == nullptr && avd_cb->is_active() == false) {
+    LOG_WA("Comp modify completed (STDBY): comp does not exist");
+    return SA_AIS_OK;
+  }
 
   while ((attr_mod = opdata->param.modify.attrMods[i++]) != nullptr) {
     const SaImmAttrValuesT_2 *attribute = &attr_mod->modAttr;
@@ -2479,6 +2490,7 @@ void comp_ccb_apply_delete_hdlr(struct CcbUtilOperationData *opdata) {
 
   AVD_COMP *comp = comp_db->find(Amf::to_string(&opdata->objectName));
   if (comp == nullptr && avd_cb->is_active() == false) {
+    LOG_WA("Comp modify apply (STDBY): comp does not exist");
     return;
   }
   /* comp should be found in the database even if it was
diff --git
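The recurring pattern in this patch is: while a standby reads configuration, an object may be deleted on the active, so a "does not exist" result for a child lookup is tolerated on the standby and the half-loaded parent is dropped, while on the active it remains an error. A language-neutral illustration in Python follows. The names NotExist, load_apps and load_children are hypothetical and are not part of AMF; NotExist merely stands in for SA_AIS_ERR_NOT_EXIST.

```python
class NotExist(Exception):
    """Stand-in for an SA_AIS_ERR_NOT_EXIST result."""


def load_apps(app_names, load_children, is_active):
    """Build a model mapping each application to its child objects.

    'load_children' is a caller-supplied function that returns the
    children of one application, or raises NotExist if the application
    disappeared between listing and loading.
    """
    model = {}
    for name in app_names:
        model[name] = []          # add the parent to the model first
        try:
            model[name] = load_children(name)
        except NotExist:
            if is_active:
                raise             # on the active node this is a real error
            del model[name]       # standby: object was deleted concurrently
    return model
```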
[devel] [PATCH 0/1] Review Request for amfd: fix coredump during downgrade if delayed failover is enabled V2 [#3078]
Summary: amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
Review request for Ticket(s): 3078
Peer Reviewer(s): Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3078
Base revision: 4ac5b9921c64657900a029774636a00de41d8232
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 4a13618129f61b3a24502722d8c7b84bb465639e
Author: Gary Lee
Date:   Thu, 12 Sep 2019 17:17:51 +1000

amfd: fix coredump during downgrade if delayed failover is enabled [#3078]

If delayed failover is enabled, and a downgrade to a version
without #3060 occurs, then the standby running a newer version
with #3060 may complain about an out of sync error during warm sync.

Complete diffstat:
------------------
 src/amf/amfd/ckpt_dec.cc | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n
[devel] [PATCH 1/1] amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
If delayed failover is enabled, and a downgrade to a version
without #3060 occurs, then the standby running a newer version
with #3060 may complain about an out of sync error during warm sync.
---
 src/amf/amfd/ckpt_dec.cc | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/amf/amfd/ckpt_dec.cc b/src/amf/amfd/ckpt_dec.cc
index 6288b4f..75213f8 100644
--- a/src/amf/amfd/ckpt_dec.cc
+++ b/src/amf/amfd/ckpt_dec.cc
@@ -2721,10 +2721,25 @@ uint32_t avd_dec_warm_sync_rsp(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec) {
     if (updt_cnt->ng_updt != cb->async_updt_cnt.ng_updt)
       LOG_ER("ng_updt counters mismatch: Active: %u Standby: %u",
              updt_cnt->ng_updt, cb->async_updt_cnt.ng_updt);
-    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt)
-      LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
-             updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
-
+    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt) {
+      if (dec->i_peer_version >= AVD_MBCSV_SUB_PART_VERSION_10) {
+        LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
+               updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
+      } else {
+        // Versions before 10 did not support failover_updt
+        // After a downgrade scenario, where the active is < v10
+        // and this node is >= v10, then there will be failover_updt mismatch
+        // If so, just set the value to what's on the older active
+        cb->async_updt_cnt.failover_updt = updt_cnt->failover_updt;
+
+        // check again
+        if (0 == memcmp(updt_cnt, &cb->async_updt_cnt,
+                        sizeof(AVSV_ASYNC_UPDT_CNT))) {
+          cb->stby_sync_state = AVD_STBY_IN_SYNC;
+          return status;
+        }
+      }
+    }
     LOG_ER("Out of sync detected in warm sync response, exiting");
     osafassert(0);
-- 
2.7.4
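The logic of this patch can be summarised as a version-gated comparison: a counter that an older peer never sends is adopted from the peer before the full comparison is repeated. The Python sketch below models only that decision. The two-field UpdateCounters type and the constant name are hypothetical simplifications (the real structure has many more counters), and the value 10 is taken from AVD_MBCSV_SUB_PART_VERSION_10 in the diff.

```python
from dataclasses import dataclass, replace

# Assumption: first peer version that checkpoints failover_updt.
FAILOVER_UPDT_MIN_VERSION = 10


@dataclass(frozen=True)
class UpdateCounters:
    ng_updt: int = 0
    failover_updt: int = 0


def in_sync_after_warm_sync(active, standby, peer_version):
    """Compare warm-sync counters; return (in_sync, standby_counters).

    If the active peer predates failover_updt, a mismatch in that one
    counter is expected, so the standby adopts the active's value before
    all counters are compared.
    """
    if (active.failover_updt != standby.failover_updt
            and peer_version < FAILOVER_UPDT_MIN_VERSION):
        standby = replace(standby, failover_updt=active.failover_updt)
    return active == standby, standby
```

A mismatch in any other counter, or a failover_updt mismatch against a peer that does support it, still yields an out-of-sync result, which matches the intent of keeping the final assert in the patch.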
[devel] [PATCH 1/1] amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
If delayed failover is enabled, and a downgrade to a version without #3060 occurs, then the standby running a newer version with #3060 may complain about an out of sync error during warm sync.
---
 src/amf/amfd/ckpt_dec.cc | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/amf/amfd/ckpt_dec.cc b/src/amf/amfd/ckpt_dec.cc
index 6288b4f..5d4b3f5 100644
--- a/src/amf/amfd/ckpt_dec.cc
+++ b/src/amf/amfd/ckpt_dec.cc
@@ -2721,10 +2721,25 @@ uint32_t avd_dec_warm_sync_rsp(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec) {
     if (updt_cnt->ng_updt != cb->async_updt_cnt.ng_updt)
       LOG_ER("ng_updt counters mismatch: Active: %u Standby: %u",
              updt_cnt->ng_updt, cb->async_updt_cnt.ng_updt);
-    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt)
-      LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
-             updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
-
+    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt) {
+      if (dec->i_peer_version >= AVD_MBCSV_SUB_PART_VERSION_10) {
+        LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
+               updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
+      } else {
+        // Versions before 10 did not support failover_updt
+        // After a downgrade scenario, where the active is < v10
+        // and this node is >= v10, then there will be failover_updt mismatch
+        // If so, just set the value to what's on the older active
+        cb->async_updt_cnt.failover_updt = updt_cnt->failover_updt;
+
+        // check again
+        if (0 == memcmp(updt_cnt, &cb->async_updt_cnt,
+                        sizeof(AVSV_ASYNC_UPDT_CNT))) {
+          cb->stby_sync_state = AVD_STBY_IN_SYNC;
+          return status;
+        }
+      }
+    }
     LOG_ER("Out of sync detected in warm sync response, exiting");
     osafassert(0);
--
2.7.4
[devel] [PATCH 0/1] Review Request for amfd: fix coredump during downgrade if delayed failover is enabled V2 [#3078]
Summary: amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
Review request for Ticket(s): 3078
Peer Reviewer(s): Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3078
Base revision: 4ac5b9921c64657900a029774636a00de41d8232
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area         Impact y/n
--------------------------------
 Docs                     n
 Build system             n
 RPM/packaging            n
 Configuration files      n
 Startup scripts          n
 SAF services             y
 OpenSAF services         n
 Core libraries           n
 Samples                  n
 Tests                    n
 Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision c6c9d6b8efcd9c8b992b82621bbf7ea8f53865a1
Author: Gary Lee
Date: Thu, 12 Sep 2019 17:08:56 +1000

amfd: fix coredump during downgrade if delayed failover is enabled [#3078]

If delayed failover is enabled, and a downgrade to a version without #3060 occurs, then the standby running a newer version with #3060 may complain about an out of sync error during warm sync.

Complete diffstat:
------------------
 src/amf/amfd/ckpt_dec.cc | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual.
Re: [devel] [PATCH 3/9] mds: Add implementation for TIPC buffer overflow solution [#1960]
Please ignore the Encode/Decode comment.

On 10/9/19 6:02 pm, Gary Lee wrote:

Hi Minh & Thuan

Some minor comments marked with [GL].

On 14/8/19 4:38 pm, Minh Chau wrote:

This is a collaborative patch of two participants: Thuan, Minh.

Main changes:
- Add mds_tipc_fctrl_intf.h, mds_tipc_fctrl_intf.cc: These two files introduce new functions which are called in mds_dt_tipc.c if the flow control is enabled
- Add mds_tipc_fctrl_portid.h, mds_tipc_fctrl_portid.cc: These files implement the tipc portid instance, which supports the sliding window, mds msg queue
- Add mds_tipc_fctrl_msg.h, mds_tipc_fctrl_msg.cc: These files define the event and messages which are used for this solution.
---
 src/mds/Makefile.am              |  10 +-
 src/mds/mds_dt.h                 |   8 +-
 src/mds/mds_dt_tipc.c            | 188 +---
 src/mds/mds_tipc_fctrl_intf.cc   | 376 +++
 src/mds/mds_tipc_fctrl_intf.h    |  47 +
 src/mds/mds_tipc_fctrl_msg.cc    | 142 +++
 src/mds/mds_tipc_fctrl_msg.h     | 129 ++
 src/mds/mds_tipc_fctrl_portid.cc | 261 +++
 src/mds/mds_tipc_fctrl_portid.h  |  87 +
 9 files changed, 1184 insertions(+), 64 deletions(-)
 create mode 100644 src/mds/mds_tipc_fctrl_intf.cc
 create mode 100644 src/mds/mds_tipc_fctrl_intf.h
 create mode 100644 src/mds/mds_tipc_fctrl_msg.cc
 create mode 100644 src/mds/mds_tipc_fctrl_msg.h
 create mode 100644 src/mds/mds_tipc_fctrl_portid.cc
 create mode 100644 src/mds/mds_tipc_fctrl_portid.h

diff --git a/src/mds/Makefile.am b/src/mds/Makefile.am
index 2d7b652..d849e8f 100644
--- a/src/mds/Makefile.am
+++ b/src/mds/Makefile.am
@@ -48,10 +48,16 @@ lib_libopensaf_core_la_SOURCES += \
 if ENABLE_TIPC_TRANSPORT
 noinst_HEADERS += src/mds/mds_dt_tipc.h \
     src/mds/mds_tipc_recvq_stats.h \
-    src/mds/mds_tipc_recvq_stats_impl.h
+    src/mds/mds_tipc_recvq_stats_impl.h \
+    src/mds/mds_tipc_fctrl_intf.h \
+    src/mds/mds_tipc_fctrl_portid.h \
+    src/mds/mds_tipc_fctrl_msg.h
 lib_libopensaf_core_la_SOURCES += src/mds/mds_dt_tipc.c \
     src/mds/mds_tipc_recvq_stats.cc \
-    src/mds/mds_tipc_recvq_stats_impl.cc
+    src/mds/mds_tipc_recvq_stats_impl.cc \
+    src/mds/mds_tipc_fctrl_intf.cc \
+    src/mds/mds_tipc_fctrl_portid.cc \
+    src/mds/mds_tipc_fctrl_msg.cc
 endif

 if ENABLE_TESTS
diff --git a/src/mds/mds_dt.h b/src/mds/mds_dt.h
index b645bb4..d9e8633 100644
--- a/src/mds/mds_dt.h
+++ b/src/mds/mds_dt.h
@@ -162,7 +162,7 @@ uint32_t mdtm_del_from_ref_tbl(MDS_SUBTN_REF_VAL ref);
 uint32_t mds_tmr_mailbox_processing(void);
 uint32_t mdtm_get_from_ref_tbl(MDS_SUBTN_REF_VAL ref, MDS_SVC_HDL *svc_hdl);
 uint32_t mdtm_add_frag_hdr(uint8_t *buf_ptr, uint16_t len, uint32_t seq_num,
-                           uint16_t frag_byte);
+                           uint16_t frag_byte, uint16_t fctrl_seq_num);
 uint32_t mdtm_free_reassem_msg_mem(MDS_ENCODED_MSG *msg);
 uint32_t mdtm_process_recv_data(uint8_t *buf, uint16_t len, uint64_t tipc_id,
                                 uint32_t *buff_dump);
@@ -240,9 +240,13 @@ bool mdtm_mailbox_mbx_cleanup(NCSCONTEXT arg, NCSCONTEXT msg);
 #define MDS_PROT 0xA0
 #define MDS_VERSION 0x08
-#define MDS_PROT_VER_MASK (MDS_PROT | MDS_VERSION)
+#define MDS_PROT_VER_MASK 0xFC
 #define MDTM_PRI_MASK 0x3

+/* MDS protocol/version for flow control */
+#define MDS_PROT_FCTRL (0xB0 | MDS_VERSION)
+#define MDS_PROT_FCTRL_ID 0x00AC13F5
+
 /* Added for the subscription changes */
 #define MDS_NCS_CHASSIS_ID (m_NCS_GET_NODE_ID & 0x00ff)
 #define MDS_TIPC_COMMON_ID 0x01001000
diff --git a/src/mds/mds_dt_tipc.c b/src/mds/mds_dt_tipc.c
index 86b52bb..fef1c50 100644
--- a/src/mds/mds_dt_tipc.c
+++ b/src/mds/mds_dt_tipc.c
@@ -47,6 +47,7 @@
 #include "mds_dt_tipc.h"
 #include "mds_dt_tcp_disc.h"
 #include "mds_core.h"
+#include "mds_tipc_fctrl_intf.h"
 #include "mds_tipc_recvq_stats.h"
 #include "base/osaf_utility.h"
 #include "base/osaf_poll.h"
@@ -165,20 +166,22 @@ NCS_PATRICIA_TREE mdtm_reassembly_list;
 uint32_t mdtm_global_frag_num;

 const unsigned int MAX_RECV_THRESHOLD = 30;
+uint8_t gl_mds_pro_ver = MDS_PROT | MDS_VERSION;

-static bool get_tipc_port_id(int sock, uint32_t* port_id) {
+static bool get_tipc_port_id(int sock, struct tipc_portid* port_id) {
   struct sockaddr_tipc addr;
   socklen_t sz = sizeof(addr);

   memset(&addr, 0, sizeof(addr));
-  *port_id = 0;
+  port_id->node = 0;
+  port_id->ref = 0;
   if (0 > getsockname(sock, (struct sockaddr *)&addr, &sz)) {
     syslog(LOG_ERR, "MDTM:TIPC Failed to get socket name, err: %s",
            strerror(errno));
     return false;
   }

-  *port_id = addr.addr.id.ref;
+  *port_id = addr.addr.id;
   return true;
 }

@@ -240,12 +243,13 @@ uint32_t mdtm_tipc_init(NODE_ID
Re: [devel] [PATCH 3/9] mds: Add implementation for TIPC buffer overflow solution [#1960]
Hi Minh & Thuan

Some minor comments marked with [GL].

On 14/8/19 4:38 pm, Minh Chau wrote:

This is a collaborative patch of two participants: Thuan, Minh.

Main changes:
- Add mds_tipc_fctrl_intf.h, mds_tipc_fctrl_intf.cc: These two files introduce new functions which are called in mds_dt_tipc.c if the flow control is enabled
- Add mds_tipc_fctrl_portid.h, mds_tipc_fctrl_portid.cc: These files implement the tipc portid instance, which supports the sliding window, mds msg queue
- Add mds_tipc_fctrl_msg.h, mds_tipc_fctrl_msg.cc: These files define the event and messages which are used for this solution.
---
 src/mds/Makefile.am              |  10 +-
 src/mds/mds_dt.h                 |   8 +-
 src/mds/mds_dt_tipc.c            | 188 +---
 src/mds/mds_tipc_fctrl_intf.cc   | 376 +++
 src/mds/mds_tipc_fctrl_intf.h    |  47 +
 src/mds/mds_tipc_fctrl_msg.cc    | 142 +++
 src/mds/mds_tipc_fctrl_msg.h     | 129 ++
 src/mds/mds_tipc_fctrl_portid.cc | 261 +++
 src/mds/mds_tipc_fctrl_portid.h  |  87 +
 9 files changed, 1184 insertions(+), 64 deletions(-)
 create mode 100644 src/mds/mds_tipc_fctrl_intf.cc
 create mode 100644 src/mds/mds_tipc_fctrl_intf.h
 create mode 100644 src/mds/mds_tipc_fctrl_msg.cc
 create mode 100644 src/mds/mds_tipc_fctrl_msg.h
 create mode 100644 src/mds/mds_tipc_fctrl_portid.cc
 create mode 100644 src/mds/mds_tipc_fctrl_portid.h

diff --git a/src/mds/Makefile.am b/src/mds/Makefile.am
index 2d7b652..d849e8f 100644
--- a/src/mds/Makefile.am
+++ b/src/mds/Makefile.am
@@ -48,10 +48,16 @@ lib_libopensaf_core_la_SOURCES += \
 if ENABLE_TIPC_TRANSPORT
 noinst_HEADERS += src/mds/mds_dt_tipc.h \
     src/mds/mds_tipc_recvq_stats.h \
-    src/mds/mds_tipc_recvq_stats_impl.h
+    src/mds/mds_tipc_recvq_stats_impl.h \
+    src/mds/mds_tipc_fctrl_intf.h \
+    src/mds/mds_tipc_fctrl_portid.h \
+    src/mds/mds_tipc_fctrl_msg.h
 lib_libopensaf_core_la_SOURCES += src/mds/mds_dt_tipc.c \
     src/mds/mds_tipc_recvq_stats.cc \
-    src/mds/mds_tipc_recvq_stats_impl.cc
+    src/mds/mds_tipc_recvq_stats_impl.cc \
+    src/mds/mds_tipc_fctrl_intf.cc \
+    src/mds/mds_tipc_fctrl_portid.cc \
+    src/mds/mds_tipc_fctrl_msg.cc
 endif

 if ENABLE_TESTS
diff --git a/src/mds/mds_dt.h b/src/mds/mds_dt.h
index b645bb4..d9e8633 100644
--- a/src/mds/mds_dt.h
+++ b/src/mds/mds_dt.h
@@ -162,7 +162,7 @@ uint32_t mdtm_del_from_ref_tbl(MDS_SUBTN_REF_VAL ref);
 uint32_t mds_tmr_mailbox_processing(void);
 uint32_t mdtm_get_from_ref_tbl(MDS_SUBTN_REF_VAL ref, MDS_SVC_HDL *svc_hdl);
 uint32_t mdtm_add_frag_hdr(uint8_t *buf_ptr, uint16_t len, uint32_t seq_num,
-                           uint16_t frag_byte);
+                           uint16_t frag_byte, uint16_t fctrl_seq_num);
 uint32_t mdtm_free_reassem_msg_mem(MDS_ENCODED_MSG *msg);
 uint32_t mdtm_process_recv_data(uint8_t *buf, uint16_t len, uint64_t tipc_id,
                                 uint32_t *buff_dump);
@@ -240,9 +240,13 @@ bool mdtm_mailbox_mbx_cleanup(NCSCONTEXT arg, NCSCONTEXT msg);
 #define MDS_PROT 0xA0
 #define MDS_VERSION 0x08
-#define MDS_PROT_VER_MASK (MDS_PROT | MDS_VERSION)
+#define MDS_PROT_VER_MASK 0xFC
 #define MDTM_PRI_MASK 0x3

+/* MDS protocol/version for flow control */
+#define MDS_PROT_FCTRL (0xB0 | MDS_VERSION)
+#define MDS_PROT_FCTRL_ID 0x00AC13F5
+
 /* Added for the subscription changes */
 #define MDS_NCS_CHASSIS_ID (m_NCS_GET_NODE_ID & 0x00ff)
 #define MDS_TIPC_COMMON_ID 0x01001000
diff --git a/src/mds/mds_dt_tipc.c b/src/mds/mds_dt_tipc.c
index 86b52bb..fef1c50 100644
--- a/src/mds/mds_dt_tipc.c
+++ b/src/mds/mds_dt_tipc.c
@@ -47,6 +47,7 @@
 #include "mds_dt_tipc.h"
 #include "mds_dt_tcp_disc.h"
 #include "mds_core.h"
+#include "mds_tipc_fctrl_intf.h"
 #include "mds_tipc_recvq_stats.h"
 #include "base/osaf_utility.h"
 #include "base/osaf_poll.h"
@@ -165,20 +166,22 @@ NCS_PATRICIA_TREE mdtm_reassembly_list;
 uint32_t mdtm_global_frag_num;

 const unsigned int MAX_RECV_THRESHOLD = 30;
+uint8_t gl_mds_pro_ver = MDS_PROT | MDS_VERSION;

-static bool get_tipc_port_id(int sock, uint32_t* port_id) {
+static bool get_tipc_port_id(int sock, struct tipc_portid* port_id) {
   struct sockaddr_tipc addr;
   socklen_t sz = sizeof(addr);

   memset(&addr, 0, sizeof(addr));
-  *port_id = 0;
+  port_id->node = 0;
+  port_id->ref = 0;
   if (0 > getsockname(sock, (struct sockaddr *)&addr, &sz)) {
     syslog(LOG_ERR, "MDTM:TIPC Failed to get socket name, err: %s",
            strerror(errno));
     return false;
   }

-  *port_id = addr.addr.id.ref;
+  *port_id = addr.addr.id;
   return true;
 }

@@ -240,12 +243,13 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, uint32_t *mds_tipc_ref)
   }

   /* Code for getting the self tipc random number */
-  if
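The change of `MDS_PROT_VER_MASK` from `(MDS_PROT | MDS_VERSION)` to `0xFC` in the mds_dt.h hunk can be illustrated with a small sketch. The constant values are copied from that hunk; the helper `prot_ver` and the assumption that the protocol/version bits share one byte with the two priority bits (`MDTM_PRI_MASK`) are illustrative only and not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the mds_dt.h hunk of the patch.
constexpr uint8_t kMdsProt = 0xA0;
constexpr uint8_t kMdsVersion = 0x08;
constexpr uint8_t kMdsProtVerMaskOld = kMdsProt | kMdsVersion;  // 0xA8
constexpr uint8_t kMdsProtVerMaskNew = 0xFC;
constexpr uint8_t kMdsProtFctrl = 0xB0 | kMdsVersion;           // 0xB8

// Hypothetical helper: keep only the protocol/version bits of a byte that
// is assumed to also carry the priority in its two lowest bits.
constexpr uint8_t prot_ver(uint8_t first_byte, uint8_t mask) {
  return static_cast<uint8_t>(first_byte & mask);
}
```

With the old mask, a flow-control byte (`0xB8`) and a regular byte (`0xA8`) both reduce to `0xA8`, so they cannot be told apart; masking with `0xFC` preserves the difference while still dropping the priority bits.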
[devel] [PATCH 0/1] Review Request for amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
Summary: amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
Review request for Ticket(s): 3078
Peer Reviewer(s): Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3078
Base revision: 88ba98b8e45621508b528010e524b89068a05d8e
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area         Impact y/n
--------------------------------
 Docs                     n
 Build system             n
 RPM/packaging            n
 Configuration files      n
 Startup scripts          n
 SAF services             y
 OpenSAF services         n
 Core libraries           n
 Samples                  n
 Tests                    n
 Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision f3aac6813bc4fa002f3dbc726f325ed26a70fda4
Author: Gary Lee
Date: Mon, 9 Sep 2019 11:20:34 +1000

amfd: fix coredump during downgrade if delayed failover is enabled [#3078]

If delayed failover is enabled, and a downgrade to a version without #3060 occurs, then the standby running a newer version with #3060 may complain about an out of sync error during warm sync.

Complete diffstat:
------------------
 src/amf/amfd/ckpt_dec.cc | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual.
[devel] [PATCH 1/1] amfd: fix coredump during downgrade if delayed failover is enabled [#3078]
If delayed failover is enabled, and a downgrade to a version without #3060 occurs, then the standby running a newer version with #3060 may complain about an out of sync error during warm sync.
---
 src/amf/amfd/ckpt_dec.cc | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/src/amf/amfd/ckpt_dec.cc b/src/amf/amfd/ckpt_dec.cc
index 6288b4f..3c253d2 100644
--- a/src/amf/amfd/ckpt_dec.cc
+++ b/src/amf/amfd/ckpt_dec.cc
@@ -2721,10 +2721,21 @@ uint32_t avd_dec_warm_sync_rsp(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec) {
     if (updt_cnt->ng_updt != cb->async_updt_cnt.ng_updt)
       LOG_ER("ng_updt counters mismatch: Active: %u Standby: %u",
              updt_cnt->ng_updt, cb->async_updt_cnt.ng_updt);
-    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt)
-      LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
-             updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
-
+    if (updt_cnt->failover_updt != cb->async_updt_cnt.failover_updt) {
+      if (dec->i_peer_version >= AVD_MBCSV_SUB_PART_VERSION_10) {
+        LOG_ER("failover_updt counters mismatch: Active: %u Standby: %u",
+               updt_cnt->failover_updt, cb->async_updt_cnt.failover_updt);
+      } else {
+        // Versions before 10 did not support failover_updt
+        // After a downgrade scenario, where the active is < v10
+        // and this node is >= v10, then there will be failover_updt mismatch
+        // If so, just set the value to what's on the older active
+        cb->async_updt_cnt.failover_updt = updt_cnt->failover_updt;
+        // failover_updt must be the LAST comparison made, otherwise
+        // these if statements will need some refactoring
+        return status;
+      }
+    }
     LOG_ER("Out of sync detected in warm sync response, exiting");
     osafassert(0);
--
2.7.4
[devel] [PATCH 1/1] amf: handle errors identified by codechecker [#3077]
add assertions where pointers should not be null
fix a couple of typos
---
 src/amf/amfd/comp.cc           |  1 +
 src/amf/amfd/csi.cc            |  3 ++-
 src/amf/amfd/cstype.cc         |  2 ++
 src/amf/amfd/hlt.cc            |  1 +
 src/amf/amfd/nodeswbundle.cc   |  2 +-
 src/amf/amfd/ntf.cc            |  1 +
 src/amf/amfd/sg_npm_fsm.cc     | 34 +++---
 src/amf/amfd/sg_nway_fsm.cc    |  2 +-
 src/amf/amfd/sgproc.cc         |  1 +
 src/amf/amfd/su.cc             |  1 +
 src/amf/amfd/sutype.cc         |  3 ++-
 src/amf/amfd/svctype.cc        |  1 +
 src/amf/amfd/svctypecstypes.cc |  1 +
 src/amf/amfnd/cbq.cc           |  2 ++
 src/amf/amfnd/clc.cc           |  1 +
 src/amf/amfnd/comp.cc          |  4
 src/amf/amfnd/compdb.cc        |  2 +-
 src/amf/amfnd/susm.cc          | 11 +++
 18 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/src/amf/amfd/comp.cc b/src/amf/amfd/comp.cc
index 0ff365e..5c6a283 100644
--- a/src/amf/amfd/comp.cc
+++ b/src/amf/amfd/comp.cc
@@ -2117,6 +2117,7 @@ static void comp_ccb_apply_modify_hdlr(struct CcbUtilOperationData *opdata) {
            attribute->attrValuesNumber);

     if (!strcmp(attribute->attrName, "saAmfCompType")) {
+      osafassert(value != nullptr);
       SaNameT *dn = (SaNameT *)value;
       const std::string oldType(comp->saAmfCompType);
       if (oldType.compare(Amf::to_string(dn)) == 0) {
diff --git a/src/amf/amfd/csi.cc b/src/amf/amfd/csi.cc
index f7e3730..1856610 100644
--- a/src/amf/amfd/csi.cc
+++ b/src/amf/amfd/csi.cc
@@ -913,7 +913,8 @@ static void ccb_apply_delete_hdlr(CcbUtilOperationData_t *opdata) {
     goto done;
   }

-  TRACE("'%s'", csi ? csi->name.c_str() : nullptr);
+  osafassert(csi != nullptr);
+  TRACE("'%s'", csi->name.c_str());

   /* Check whether si has been assigned to any SU. */
   if ((nullptr != csi->si->list_of_sisu) && (csi->compcsi_cnt != 0)) {
diff --git a/src/amf/amfd/cstype.cc b/src/amf/amfd/cstype.cc
index cadc6df..683d3cd 100644
--- a/src/amf/amfd/cstype.cc
+++ b/src/amf/amfd/cstype.cc
@@ -62,6 +62,7 @@ static AVD_CS_TYPE *cstype_create(const std::string ,
  * @param cst
  */
 static void cstype_delete(AVD_CS_TYPE *cst) {
+  osafassert(cst != nullptr);
   cstype_db->erase(cst->name);
   cst->saAmfCSAttrName.clear();
   delete cst;
@@ -205,6 +206,7 @@ static SaAisErrorT cstype_ccb_completed_hdlr(CcbUtilOperationData_t *opdata) {
         opdata->userData = nullptr;
         break;
       }
+      osafassert(cst != nullptr);
       if (cst->list_of_csi != nullptr) {
         /* check whether there exists a delete operation for
          * each of the CSI in the cs_type list in the current CCB
diff --git a/src/amf/amfd/hlt.cc b/src/amf/amfd/hlt.cc
index 27863db..4c2737e 100644
--- a/src/amf/amfd/hlt.cc
+++ b/src/amf/amfd/hlt.cc
@@ -75,6 +75,7 @@ static SaAisErrorT ccb_completed_delete_hdlr(CcbUtilOperationData_t *opdata) {
     opdata->userData = nullptr;
     goto done;
   }
+  osafassert(comp != nullptr);
   for (curr_susi = comp->su->list_of_susi; curr_susi != nullptr;
        curr_susi = curr_susi->su_next)
     for (compcsi = curr_susi->list_of_csicomp; compcsi;
diff --git a/src/amf/amfd/nodeswbundle.cc b/src/amf/amfd/nodeswbundle.cc
index 4ab79f7..cf280cb 100644
--- a/src/amf/amfd/nodeswbundle.cc
+++ b/src/amf/amfd/nodeswbundle.cc
@@ -125,7 +125,7 @@ static int is_swbdl_delete_ok(const std::string &bundle_dn,
   if (node == nullptr && avd_cb->is_active() == false) {
     return 1;
   }
-
+  osafassert(node != nullptr);
   if (!is_swbdl_delete_ok_for_node(bundle_dn, node_dn, node->list_of_ncs_su,
                                    opdata))
     return 0;
diff --git a/src/amf/amfd/ntf.cc b/src/amf/amfd/ntf.cc
index eb2654a..52ee745 100644
--- a/src/amf/amfd/ntf.cc
+++ b/src/amf/amfd/ntf.cc
@@ -505,6 +505,7 @@ SaAisErrorT avd_try_send_notification(NtfSend* job) {
         &job->notification.alarmNotification.notificationHandle;
   }

+  osafassert(notificationHandle != nullptr);
   // Try to send the notification if not sent.
   if (job->already_sent == false) {
     rc = saNtfNotificationSend(*notificationHandle);
diff --git a/src/amf/amfd/sg_npm_fsm.cc b/src/amf/amfd/sg_npm_fsm.cc
index 0ef094d..0e91eb5 100644
--- a/src/amf/amfd/sg_npm_fsm.cc
+++ b/src/amf/amfd/sg_npm_fsm.cc
@@ -2773,23 +2773,26 @@ static uint32_t avd_sg_npm_susi_sucss_si_oper(AVD_CL_CB *cb, AVD_SU *su,
        * modify standby all to the Quiesced SU. Remove the SI from
        * admin pointer and add the quiesced SU to the SU oper list.
        */
-      if (su->sg_of_su->admin_si->list_of_sisu == i_susi) {
-        o_susi = i_susi->si_next;
-      } else {
-        o_susi = su->sg_of_su->admin_si->list_of_sisu;
-      }
+      i_susi = avd_su_susi_find(cb, su, su->sg_of_su->admin_si->name);
+      if (i_susi != nullptr) {
+        if (su->sg_of_su->admin_si->list_of_sisu == i_susi) {
+          o_susi = i_susi->si_next;
+        } else {
+
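The csi.cc hunk replaces a conditional that could hand a null pointer to a `%s` conversion (undefined behaviour in the printf family) with an assertion followed by an unconditional dereference. A minimal sketch of that pattern is shown below; `Csi` and `trace_csi` are hypothetical stand-ins, and the standard `assert` is used in place of `osafassert`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the CSI object; only the name is modelled.
struct Csi {
  std::string name;
};

// Formats the CSI name the way the TRACE call does. The pointer is asserted
// non-null first, so "%s" never receives a null argument.
std::string trace_csi(const Csi *csi) {
  assert(csi != nullptr);
  char buf[128];
  std::snprintf(buf, sizeof(buf), "'%s'", csi->name.c_str());
  return std::string(buf);
}
```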
[devel] [PATCH 0/1] Review Request for amf: handle errors identified by codechecker [#3077]
Summary: amf: handle errors identified by codechecker [#3077]
Review request for Ticket(s): 3077
Peer Reviewer(s): Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3077
Base revision: 2bc054ca85b56bc03bdc9be965593b56124aad00
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area         Impact y/n
--------------------------------
 Docs                     n
 Build system             n
 RPM/packaging            n
 Configuration files      n
 Startup scripts          n
 SAF services             y
 OpenSAF services         y
 Core libraries           n
 Samples                  n
 Tests                    n
 Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 24b75d78a013c554d5f9731e69a7150c11217ad7
Author: Gary Lee
Date: Tue, 3 Sep 2019 12:06:36 +1000

amf: handle errors identified by codechecker [#3077]

add assertions where pointers should not be null
fix a couple of typos

Complete diffstat:
------------------
 src/amf/amfd/comp.cc           |  1 +
 src/amf/amfd/csi.cc            |  3 ++-
 src/amf/amfd/cstype.cc         |  2 ++
 src/amf/amfd/hlt.cc            |  1 +
 src/amf/amfd/nodeswbundle.cc   |  2 +-
 src/amf/amfd/ntf.cc            |  1 +
 src/amf/amfd/sg_npm_fsm.cc     | 34 +++---
 src/amf/amfd/sg_nway_fsm.cc    |  2 +-
 src/amf/amfd/sgproc.cc         |  1 +
 src/amf/amfd/su.cc             |  1 +
 src/amf/amfd/sutype.cc         |  3 ++-
 src/amf/amfd/svctype.cc        |  1 +
 src/amf/amfd/svctypecstypes.cc |  1 +
 src/amf/amfnd/cbq.cc           |  2 ++
 src/amf/amfnd/clc.cc           |  1 +
 src/amf/amfnd/comp.cc          |  4
 src/amf/amfnd/compdb.cc        |  2 +-
 src/amf/amfnd/susm.cc          | 11 +++
 18 files changed, 53 insertions(+), 20 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual.
Re: [devel] [PATCH 1/1] util: Fenced should only write a log record when two active controllers are seen [#3073]
Hi Hans

ack (review only)

Thanks
Gary

On 22/8/19 5:49 pm, Hans Nordebäck wrote:
---
 tools/devel/fenced/node_state_hdlr_pl.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/devel/fenced/node_state_hdlr_pl.cc b/tools/devel/fenced/node_state_hdlr_pl.cc
index c74fe72b9..6bf032e5a 100644
--- a/tools/devel/fenced/node_state_hdlr_pl.cc
+++ b/tools/devel/fenced/node_state_hdlr_pl.cc
@@ -169,8 +169,8 @@ void NodeStateHdlrPl::check_isolation() {
       isolated_ = NodeIsolationState::kNotIsolated;
       syslog(LOG_NOTICE, "one active controller detected");
     } else {
-      isolated_ = NodeIsolationState::kIsolated;
-      syslog(LOG_NOTICE, "%d active controllers detected, split brain", no_of_active);
+      isolated_ = NodeIsolationState::kNotIsolated;
+      syslog(LOG_NOTICE, "%d active controllers detected", no_of_active);
     }
   }
 notify:
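The effect of this patch on the isolation decision can be sketched as a pure function. The names below are hypothetical and only mirror the two branches visible in the hunk; the zero-controller branch is not shown in the patch, so treating it as isolated is an assumption made here for completeness.

```cpp
#include <cassert>
#include <string>

enum class NodeIsolationState { kIsolated, kNotIsolated };

// Hypothetical result type: the state plus the text that would be logged.
struct IsolationResult {
  NodeIsolationState state;
  std::string log;
};

IsolationResult check_isolation(int no_of_active) {
  if (no_of_active == 0) {
    // Assumption: not part of the hunk, included only to cover all inputs.
    return {NodeIsolationState::kIsolated, "no active controller detected"};
  }
  if (no_of_active == 1) {
    return {NodeIsolationState::kNotIsolated, "one active controller detected"};
  }
  // After the patch: several active controllers are only logged; the node
  // is no longer marked as isolated.
  return {NodeIsolationState::kNotIsolated,
          std::to_string(no_of_active) + " active controllers detected"};
}
```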
[devel] [PATCH 1/1] amfd: set failover_state on standby [#3072]
Otherwise, after two controller failovers, unexpected reboot of previously rebooted payloads may occur.
---
 src/amf/amfd/node_state_machine.cc | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/amf/amfd/node_state_machine.cc b/src/amf/amfd/node_state_machine.cc
index efe2085..d38f79e 100644
--- a/src/amf/amfd/node_state_machine.cc
+++ b/src/amf/amfd/node_state_machine.cc
@@ -63,6 +63,12 @@ void NodeStateMachine::SetState(uint32_t state) {
     LOG_NO("New state '%u'", state);
   }

+  // this is needed for cold sync, in case this node (currently standby)
+  // becomes active later
+  AVD_AVND *node = avd_node_find_nodeid(node_id_);
+  osafassert(node != nullptr);
+  node->failover_state = state;
+
   switch (state) {
     case NodeState::kStart:
       state_ = std::make_shared(this);
--
2.7.4
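The idea of the patch, mirroring each state change into the node record so that a later cold sync carries it, can be sketched as follows. Everything here is a simplified assumption: `Node` stands in for `AVD_AVND`, a `std::map` stands in for the node database behind `avd_node_find_nodeid`, and the state is a plain integer rather than a state object.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for AVD_AVND; only the mirrored field is modelled.
struct Node {
  uint32_t failover_state = 0;
};

class NodeStateMachine {
 public:
  NodeStateMachine(std::map<uint32_t, Node> *db, uint32_t node_id)
      : db_(db), node_id_(node_id) {}

  void SetState(uint32_t state) {
    // Mirror the state into the node record so it is available even if this
    // (currently standby) instance becomes active later.
    auto it = db_->find(node_id_);
    assert(it != db_->end());
    it->second.failover_state = state;
    state_ = state;
  }

  uint32_t state() const { return state_; }

 private:
  std::map<uint32_t, Node> *db_;  // assumed node database, keyed by node id
  uint32_t node_id_;
  uint32_t state_ = 0;
};
```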
[devel] [PATCH 0/1] Review Request for amfd: set failover_state on standby [#3072]
Summary: amfd: set failover_state on standby [#3072]
Review request for Ticket(s): 3072
Peer Reviewer(s): Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3072
Base revision: 729f71fbfff0eea6d4a6a394780142b87a9fb472
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area         Impact y/n
--------------------------------
 Docs                     n
 Build system             n
 RPM/packaging            n
 Configuration files      n
 Startup scripts          n
 SAF services             y
 OpenSAF services         n
 Core libraries           n
 Samples                  n
 Tests                    n
 Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 252c36529095306e57a859177f9a74f47809b50d
Author: Gary Lee
Date: Thu, 22 Aug 2019 14:08:39 +1000

amfd: set failover_state on standby [#3072]

Otherwise, after two controller failovers, unexpected reboot of previously rebooted payloads may occur.

Complete diffstat:
------------------
 src/amf/amfd/node_state_machine.cc | 6 ++
 1 file changed, 6 insertions(+)

Testing Commands:
-----------------
1) set failover delay to 5s, node wait timeout to 15s
2) reboot PL-3
3) reboot active SC
4) reboot active SC again

Testing, Expected Results:
--------------------------
Ensure PL-3 does not get rebooted 15s after step 4 above.

Conditions of Submission:
-------------------------
ack from anyone

Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual.
Re: [devel] [PATCH 1/1] mbc: fix some coding errors [#3070]
Hi Thuan

ack (review only)

Thanks
Gary

On 14/8/19 8:24 pm, thuan.tran wrote:
---
 src/mbc/mbcsv_api.c  | 6 +++---
 src/mbc/mbcsv_peer.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/mbc/mbcsv_api.c b/src/mbc/mbcsv_api.c
index 84a2b8771..3a84fdfda 100644
--- a/src/mbc/mbcsv_api.c
+++ b/src/mbc/mbcsv_api.c
@@ -619,7 +619,7 @@ uint32_t mbcsv_process_close_request(NCS_MBCSV_ARG *arg)
     if (NULL ==
         (mbc_reg = (MBCSV_REG *)m_MBCSV_TAKE_HANDLE(arg->i_mbcsv_hdl))) {
         TRACE_2("bad handle specified");
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        return SA_AIS_ERR_BAD_HANDLE;
     }

     m_NCS_LOCK(&mbc_reg->svc_lock, NCS_LOCK_WRITE);
@@ -685,7 +685,7 @@ uint32_t mbcsv_process_chg_role_request(NCS_MBCSV_ARG *arg)
     if (NULL ==
         (mbc_reg = (MBCSV_REG *)m_MBCSV_TAKE_HANDLE(arg->i_mbcsv_hdl))) {
         TRACE_2("bad handle specified");
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        return SA_AIS_ERR_BAD_HANDLE;
     }

     m_NCS_LOCK(&mbc_reg->svc_lock, NCS_LOCK_READ);
@@ -804,7 +804,7 @@ uint32_t mbcsv_process_snd_ckpt_request(NCS_MBCSV_ARG *arg)
     if (NULL ==
         (mbc_reg = (MBCSV_REG *)m_MBCSV_TAKE_HANDLE(arg->i_mbcsv_hdl))) {
         TRACE_2("bad handle specified");
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        return SA_AIS_ERR_BAD_HANDLE;
     }

     m_NCS_LOCK(&mbc_reg->svc_lock, NCS_LOCK_READ);
diff --git a/src/mbc/mbcsv_peer.c b/src/mbc/mbcsv_peer.c
index 1d4b257a3..1a9eeb125 100644
--- a/src/mbc/mbcsv_peer.c
+++ b/src/mbc/mbcsv_peer.c
@@ -54,7 +54,7 @@ the messages received from the peer.

 static const char *disc_trace[] = {"Peer UP msg", "Peer DOWN msg",
                                    "Peer INFO msg", "Peer INFO resp msg",
-                                   "Peer Role change msg"
+                                   "Peer Role change msg",
                                    "Invalid peer discovery msg"};

 typedef enum {ANCHOR_SEARCH, NODE_ID_SEARCH} SearchMode;
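Editor's note: the mbcsv_api.c hunks replace "record the error and carry on" with an early return, so the lock call that follows is never reached with a failed handle lookup. The sketch below illustrates that pattern only; Registration, kOk, kBadHandle and ProcessCloseRequest are hypothetical stand-ins, not MBCSV names or values.

```cpp
#include <cstdint>

// Hypothetical stand-in for the registration record looked up by handle.
struct Registration {
  int lock_depth = 0;
};

// Hypothetical return codes (not the real SA_AIS_* values).
constexpr uint32_t kOk = 1;
constexpr uint32_t kBadHandle = 9;

// Returns as soon as the handle lookup has failed, so the code below never
// dereferences a null registration.
uint32_t ProcessCloseRequest(Registration *reg) {
  if (reg == nullptr) {
    return kBadHandle;  // previously: rc = ...; execution continued
  }
  ++reg->lock_depth;  // stands in for taking the service lock
  return kOk;
}
```

With only an assignment to rc, the function would have continued to the lock call with a null pointer, which is presumably what the patch guards against.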
Re: [devel] [PATCH 1/1] rde: missing comma between elements in array [#3069]
Hi Thuan

ack, will push on your behalf.

Thanks

On 14/8/19 7:42 pm, thuan.tran wrote:
---
 src/rde/rded/rde_main.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/rde/rded/rde_main.cc b/src/rde/rded/rde_main.cc
index 1a7e58792..6594b3d49 100644
--- a/src/rde/rded/rde_main.cc
+++ b/src/rde/rded/rde_main.cc
@@ -53,7 +53,7 @@ const char *rde_msg_name[] = {"-",
                               "RDE_MSG_PEER_DOWN(2)",
                               "RDE_MSG_PEER_INFO_REQ(3)",
                               "RDE_MSG_PEER_INFO_RESP(4)",
-                              "RDE_MSG_NEW_ACTIVE_CALLBACK(5)"
+                              "RDE_MSG_NEW_ACTIVE_CALLBACK(5)",
                               "RDE_MSG_NODE_UP(6)",
                               "RDE_MSG_NODE_DOWN(7)",
                               "RDE_MSG_TAKEOVER_REQUEST_CALLBACK(8)",
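Editor's note: a missing comma between two string literals is not a syntax error in C or C++; the compiler joins the adjacent literals into one element, so the table ends up one entry short and every later index is off by one. A minimal sketch with hypothetical message names (kBroken, kFixed and ArraySize are illustrative, not OpenSAF identifiers):

```cpp
#include <cstddef>
#include <cstring>

// The comma after "MSG_B(1)" is missing on purpose: the compiler
// concatenates it with the following literal into a single element.
static const char *kBroken[] = {"MSG_A(0)",
                                "MSG_B(1)"
                                "MSG_C(2)",
                                "MSG_D(3)"};

// Same table with the comma restored.
static const char *kFixed[] = {"MSG_A(0)", "MSG_B(1)", "MSG_C(2)", "MSG_D(3)"};

// Number of elements in a built-in array, deduced at compile time.
template <typename T, std::size_t N>
constexpr std::size_t ArraySize(T (&)[N]) {
  return N;
}
```

Here kBroken has three elements and kBroken[1] is "MSG_B(1)MSG_C(2)", so looking up message type 3 would read past the end of the table. Some compilers can warn about this pattern, but it still compiles.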
Re: [devel] [PATCH 1/1] nid: use the tipc command instead of tipc-config [#2104]
Hi Vu

ack (review only)

Thanks

On 1/8/19 12:53 pm, Vu Minh Nguyen wrote:
The tipc-config command is obsolete and no longer maintained.
We should switch to using the "tipc" command instead.
---
 Makefile.am                                 |  3 ++-
 opensaf.spec.in                             |  1 +
 .../archive/scripts => scripts}/tipc-config | 15 --
 src/nid/configure_tipc.in                   | 16 ++-
 src/nid/opensafd.in                         | 20 +++
 tools/cluster_sim_uml/build_uml             |  2 +-
 6 files changed, 35 insertions(+), 22 deletions(-)
 rename {tools/cluster_sim_uml/archive/scripts => scripts}/tipc-config (83%)

diff --git a/Makefile.am b/Makefile.am
index b3d6553c1..6d86ec180 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -159,7 +159,8 @@ dist_osaf_execbin_SCRIPTS += \
     $(top_srcdir)/scripts/opensaf_reboot \
     $(top_srcdir)/scripts/opensaf_sc_active \
     $(top_srcdir)/scripts/opensaf_scale_out \
-    $(top_srcdir)/scripts/plm_scale_out
+    $(top_srcdir)/scripts/plm_scale_out \
+    $(top_srcdir)/scripts/tipc-config

 include $(top_srcdir)/src/ais/Makefile.am
 include $(top_srcdir)/src/base/Makefile.am
diff --git a/opensaf.spec.in b/opensaf.spec.in
index 0effd59cd..37be5de6d 100644
--- a/opensaf.spec.in
+++ b/opensaf.spec.in
@@ -950,6 +950,7 @@ fi
 %{_pkglibdir}/plm_scale_out
 %{_pkglibdir}/opensaf_sc_active
 %{_pkglibdir}/configure_tipc
+%{_pkglibdir}/tipc-config

 %files amf-libs

diff --git a/tools/cluster_sim_uml/archive/scripts/tipc-config b/scripts/tipc-config
similarity index 83%
rename from tools/cluster_sim_uml/archive/scripts/tipc-config
rename to scripts/tipc-config
index f9fd47937..34eb9a539 100755
--- a/tools/cluster_sim_uml/archive/scripts/tipc-config
+++ b/scripts/tipc-config
@@ -1,4 +1,4 @@
-#!/bin/ash
+#!/bin/bash
 #
 #      -*- OpenSAF  -*-
 #
@@ -39,7 +39,18 @@ fi
 while [ $# -gt 0 ]; do
     case "$1" in
         -addr)
-            echo "node address: $(/sbin/tipc node get address)"
+            addr=$(/sbin/tipc node get address)
+            hex_pattern="^[0-9a-fA-F]+$"
+            if [[ $addr =~ $hex_pattern ]]; then
+                dec_addr=$((16#$addr))
+                # the algorithm is based on /usr/include/linux/tipc.h
+                # to form tipc node address into 'Z.C.N' format.
+                tipc_zone=$((dec_addr >> 24))
+                tipc_cluster=$(((dec_addr >> 12) & 0xfff))
+                tipc_node=$((dec_addr & 0xfff))
+                addr="<$tipc_zone.$tipc_cluster.$tipc_node>"
+            fi
+            echo "node address: $addr"
             ;;
         -a=*)
             /sbin/tipc node set address "$(echo "$1" | cut -d= -f2)"
diff --git a/src/nid/configure_tipc.in b/src/nid/configure_tipc.in
index 33621a0ef..5d0bf6efb 100644
--- a/src/nid/configure_tipc.in
+++ b/src/nid/configure_tipc.in
@@ -78,12 +78,13 @@ if ! [ -x "${tipc}" ] && ! [ -x "${tipc_config}" ]; then
     exit 1
 fi

+# Prefer using `tipc` over the obsoleted `tipc-config`
+if [ -x "${tipc}" ]; then
+    tipc_config="${pkglibdir}"/tipc-config
+fi
+
 if [ "$MANAGE_TIPC" != "yes" ] && ! [ -s "$pkglocalstatedir/node_id" ]; then
-    if [ -x "${tipc}" ]; then
-        addr=$(tipc node get address | cut -d'<' -f2 | cut -d'>' -f1)
-    else
-        addr=$(tipc-config -addr | cut -d'<' -f2 | cut -d'>' -f1)
-    fi
+    addr=$(${tipc_config} -addr | cut -d'<' -f2 | cut -d'>' -f1)
     addr=$(echo "$addr" | cut -d. -f3)
     CHASSIS_ID=2
     SLOT_ID=$((addr & 255))
@@ -98,11 +99,6 @@ fi
 ETH_NAME=$2
 TIPC_NETID=$3

-if ! [ -x "${tipc_config}" ]; then
-    echo "error: tipc-config is not available"
-    exit 1
-fi
-
 # Get the Chassis Id and Slot Id from @sysconfdir@/@PACKAGE_NAME@/chassis_id and @sysconfdir@/@PACKAGE_NAME@/slot_id
 if ! test -f "$CHASSIS_ID_FILE"; then
     echo "$CHASSIS_ID_FILE doesnt exists, exiting "
diff --git a/src/nid/opensafd.in b/src/nid/opensafd.in
index 94888039a..f85cf5b0c 100644
--- a/src/nid/opensafd.in
+++ b/src/nid/opensafd.in
@@ -50,7 +50,7 @@ osafcshash=@INTERNAL_VERSION_ID@

 unload_tipc() {
     # Unload TIPC if already loaded
-    if [ $MANAGE_TIPC = "yes" ] && grep tipc /proc/modules >/dev/null 2>&1; then
+    if [ "$MANAGE_TIPC" = "yes" ] && grep tipc /proc/modules >/dev/null 2>&1; then
         modprobe -r tipc >/dev/null 2>&1
         if [ $? -eq 1 ]; then
             logger -t $osafprog "warning: TIPC module unloading failed"
@@ -59,13 +59,17 @@ unload_tipc() {
 }

 check_tipc() {
-    # Exit if tipc-config is not installed
-    if [ "$MANAGE_TIPC" = "yes" ] && [ ! -x /sbin/tipc-config ]; then
-        which tipc-config >/dev/null 2>&1
-        if [ $? -eq 1 ] ; then
-            logger -s -t $osafprog "Can't find tipc-config in the PATH, exiting."
-
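Editor's note: the wrapper script in the patch above turns a bare hex node address into the legacy 'Z.C.N' form by splitting the 32-bit value into an 8-bit zone, a 12-bit cluster and a 12-bit node. The same arithmetic is sketched below in C++ for illustration; FormatTipcAddress is a hypothetical helper, and the 8-digit length limit is an assumption added here so the value always fits in 32 bits.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Convert a bare hex node address into "<Z.C.N>". Input that is not plain
// hex (for example an address already in <Z.C.N> form) is returned
// unchanged, mirroring the regex check in the script.
std::string FormatTipcAddress(const std::string &addr) {
  if (addr.empty() || addr.size() > 8) return addr;  // assumption: <= 32 bits
  for (unsigned char c : addr) {
    if (!std::isxdigit(c)) return addr;
  }
  const uint32_t value = static_cast<uint32_t>(std::stoul(addr, nullptr, 16));
  const uint32_t zone = value >> 24;               // top 8 bits
  const uint32_t cluster = (value >> 12) & 0xfff;  // middle 12 bits
  const uint32_t node = value & 0xfff;             // low 12 bits
  return "<" + std::to_string(zone) + "." + std::to_string(cluster) + "." +
         std::to_string(node) + ">";
}
```

For example, the hex string 1001003 splits into zone 1, cluster 1, node 3. configure_tipc then takes the third dot-separated field of that form as the node number.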
Re: [devel] [PATCH 0/1] Review Request for amfd: add support for dynamically changing saAmfRank of SaAmfSIRankedSU [#3058]
Hi Alex

Ack, review only.

Thanks
Gary

On 19/7/19 5:04 am, Jones, Alex wrote:
Summary: amfd: add support for dynamically changing saAmfRank of SaAmfSIRankedSU [#3058]
Review request for Ticket(s): 3058
Peer Reviewer(s): Nagu, Hans, Gary
Pull request to:
Affected branch(es): develop
Development branch: ticket-3058
Base revision: ec296cbb38761831929a97a8d94d177130f656c9
Personal repository: git://git.code.sf.net/u/trguitar/review


Impacted area         Impact y/n

 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 620fd473bfa6f28598a6171ac82b8a7e19056d1b
Author: Alex Jones
Date:   Thu, 18 Jul 2019 14:43:29 -0400

amfd: add support for dynamically changing saAmfRank of SaAmfSIRankedSU [#3058]

Allow saAmfRank of SaAmfSIRankedSU to be changed at runtime


Complete diffstat:
------------------
 src/amf/amfd/si.cc         | 103 +
 src/amf/amfd/si.h          |   3 ++
 src/amf/amfd/siass.cc      |  38 +
 src/amf/amfd/sirankedsu.cc |  73 +++-
 src/amf/amfd/util.cc       |  30 -
 5 files changed, 243 insertions(+), 4 deletions(-)


Testing Commands:
-----------------
1) create N-way service group with SUs and components
2) create SaAmfSIRankedSU objects for the SUs
3) Once the assignments have been made on the components, change the value
   of saAmfRank in one of the SaAmfSIRankedSU objects


Testing, Expected Results:
--------------------------
1) changing the value should be accepted
2) failover choice by amfd should reflect the new rank


Conditions of Submission:
-------------------------
Aug 5, or ack from developer


Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, but your patch series does not contain the patch that updates the Doxygen manual.
[devel] [PATCH 0/1] Review Request for amfd: include failover info in coldsync [#3060]
Summary: amfd: include failover info in coldsync [#3060]
Review request for Ticket(s): 3060
Peer Reviewer(s): Minh, Hans, Thang, Thuan
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3060
Base revision: ec296cbb38761831929a97a8d94d177130f656c9
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area         Impact y/n

 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 9443abefdeaae481dbe483b708db8d467619b8c1
Author: Gary Lee
Date:   Fri, 19 Jul 2019 16:02:19 +1000

amfd: include failover info in coldsync [#3060]

Failover information is not currently included in coldsync. This means if a
delayed failover is in progress *before* a standby controller is available,
*and* a controller failover occurs, then information about the delayed
failover is lost.


Complete diffstat:
------------------
 src/amf/amfd/chkop.cc              |  4 ++
 src/amf/amfd/ckpt.h                |  4 +-
 src/amf/amfd/ckpt_dec.cc           | 77 --
 src/amf/amfd/ckpt_edu.cc           |  2 +
 src/amf/amfd/ckpt_enc.cc           |  5 ++-
 src/amf/amfd/node.h                |  3 ++
 src/amf/amfd/node_state_machine.cc |  2 +
 src/amf/amfd/util.cc               |  1 +
 8 files changed, 76 insertions(+), 22 deletions(-)


Testing Commands:
-----------------
1. Enable delayed node failover and network fence a PL while there is no
   standby SC. Before the failover occurs, power up the standby SC, and force
   a controller failover.
2. Ensure different versions of amfd can cold sync with each other.


Testing, Expected Results:
--------------------------
1. The standby SC (now active) should continue the node failover.
2. It works.


Conditions of Submission:
-------------------------
ack from any reviewer


Arch        Built   Started   Linux distro
------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test.
___ Your changes affect user manual and documentation, but your patch series does not contain the patch that updates the Doxygen manual.
[devel] [PATCH 1/1] amfd: include failover info in coldsync [#3060]
Failover information is not currently included in coldsync. This means if a
delayed failover is in progress *before* a standby controller is available,
*and* a controller failover occurs, then information about the delayed
failover is lost.
---
 src/amf/amfd/chkop.cc              |  4 ++
 src/amf/amfd/ckpt.h                |  4 +-
 src/amf/amfd/ckpt_dec.cc           | 77 --
 src/amf/amfd/ckpt_edu.cc           |  2 +
 src/amf/amfd/ckpt_enc.cc           |  5 ++-
 src/amf/amfd/node.h                |  3 ++
 src/amf/amfd/node_state_machine.cc |  2 +
 src/amf/amfd/util.cc               |  1 +
 8 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/src/amf/amfd/chkop.cc b/src/amf/amfd/chkop.cc
index e9a68f4..56b0142 100644
--- a/src/amf/amfd/chkop.cc
+++ b/src/amf/amfd/chkop.cc
@@ -1051,6 +1051,10 @@ uint32_t avsv_send_ckpt_data(AVD_CL_CB *cb, uint32_t action,
             avd_cb->avd_peer_ver);
         return NCSCC_RC_SUCCESS;
       }
+      if (avd_cb->avd_peer_ver >= AVD_MBCSV_SUB_PART_VERSION_10) {
+        cb->async_updt_cnt.failover_updt++;
+      }
+
       break;
     default:
       return NCSCC_RC_SUCCESS;
diff --git a/src/amf/amfd/ckpt.h b/src/amf/amfd/ckpt.h
index 875776a..2e15387 100644
--- a/src/amf/amfd/ckpt.h
+++ b/src/amf/amfd/ckpt.h
@@ -35,9 +35,10 @@
 #define AMF_AMFD_CKPT_H_

 // current version
-#define AVD_MBCSV_SUB_PART_VERSION 9
+#define AVD_MBCSV_SUB_PART_VERSION 10

 // supported versions
+#define AVD_MBCSV_SUB_PART_VERSION_10 10
 #define AVD_MBCSV_SUB_PART_VERSION_9 9
 #define AVD_MBCSV_SUB_PART_VERSION_8 8
 #define AVD_MBCSV_SUB_PART_VERSION_7 7
@@ -109,6 +110,7 @@ typedef struct avsv_async_updt_cnt {
   uint32_t compcstype_updt;
   uint32_t si_trans_updt;
   uint32_t ng_updt;
+  uint32_t failover_updt;
 } AVSV_ASYNC_UPDT_CNT;

 /*
diff --git a/src/amf/amfd/ckpt_dec.cc b/src/amf/amfd/ckpt_dec.cc
index a46f6d3..6288b4f 100644
--- a/src/amf/amfd/ckpt_dec.cc
+++ b/src/amf/amfd/ckpt_dec.cc
@@ -178,6 +178,31 @@ const AVSV_DECODE_COLD_SYNC_RSP_DATA_FUNC_PTR dec_cs_data_func_list[] = {
     dec_cs_comp_config, dec_cs_comp_cs_type_config, dec_cs_siass,
     dec_cs_si_trans,dec_cs_async_updt_cnt};

+void set_node_failover_state(AVD_CL_CB *cb, const SaClmNodeIdT node_id,
+                             const uint32_t state) {
+  TRACE_ENTER();
+
+  if (state == NodeState::NodeStates::kUndefined) {
+    // not in failover list
+    return;
+  }
+
+  auto failed_node = cb->failover_list.find(node_id);
+  if (failed_node != cb->failover_list.end()) {
+    failed_node->second->SetState(state);
+  } else {
+    LOG_NO("Node '%u' not found in failover_list. Create new entry",
+           node_id);
+    auto new_node = std::make_shared<NodeStateMachine>(cb, node_id);
+    // node must be added to failover_list before SetState() is called.
+    // If the state is 'end', then it will be deleted by SetState().
+    // Otherwise, we will leave a node in 'End' state mistakenly in
+    // failover_list.
+    cb->failover_list[node_id] = new_node;
+    new_node->SetState(state);
+  }
+}
+
 void decode_cb(NCS_UBAID *ub, AVD_CL_CB *cb, const uint16_t peer_version) {
   osaf_decode_uint32(ub, reinterpret_cast<uint32_t *>(&cb->init_state));
   osaf_decode_satimet(ub, &cb->cluster_init_time);
@@ -254,6 +279,9 @@ void decode_node_config(NCS_UBAID *ub, AVD_AVND *avnd,
   osaf_decode_uint32(ub, &avnd->rcv_msg_id);
   osaf_decode_uint32(ub, &avnd->snd_msg_id);
   osaf_extended_name_free(_name);
+  if (peer_version >= AVD_MBCSV_SUB_PART_VERSION_10) {
+    osaf_decode_uint32(ub, &avnd->failover_state);
+  }
   TRACE_LEAVE();
 }

@@ -585,7 +613,7 @@ void decode_siass(NCS_UBAID *ub, AVSV_SU_SI_REL_CKPT_MSG *su_si_ckpt,
     su_si_ckpt->csi_add_rem = static_cast(csi_add_rem);
     osaf_decode_sanamet(ub, &su_si_ckpt->comp_name);
     osaf_decode_sanamet(ub, &su_si_ckpt->csi_name);
-  };
+  }
 }

 /\
@@ -2199,6 +2227,7 @@ static uint32_t dec_cs_node_config(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec,
   for (count = 0; count < num_of_obj; count++) {
     decode_node_config(&dec->i_uba, &avnd, dec->i_peer_version);
     status = avd_ckpt_node(cb, &avnd, dec->i_action);
+    set_node_failover_state(cb, avnd.node_info.nodeId, avnd.failover_state);
     osafassert(status == NCSCC_RC_SUCCESS);
   }

@@ -2552,14 +2581,23 @@ static uint32_t dec_cs_async_updt_cnt(AVD_CL_CB *cb, NCS_MBCSV_CB_DEC *dec,
   /*
    * Decode and send async update counts for all the data structures.
    */
-  if (dec->i_peer_version >= AVD_MBCSV_SUB_PART_VERSION_7) {
+  if (dec->i_peer_version >= AVD_MBCSV_SUB_PART_VERSION_10) {
     TRACE(
-        "Peer AMFD version is >= AVD_MBCSV_SUB_PART_VERSION_7,"
+        "Peer AMFD version is >= AVD_MBCSV_SUB_PART_VERSION_10,"
         "peer ver:%d",
         avd_cb->avd_peer_ver);
     status = m_NCS_EDU_VER_EXEC(&cb->edu_hdl, avsv_edp_ckpt_msg_async_updt_cnt,
                                 &dec->i_uba, EDP_OP_TYPE_DEC, _cnt, ,
                                 dec->i_peer_version);
+
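Editor's note: the patch above keeps cold sync compatible between amfd versions by reading the new failover_state field only when the peer's checkpoint version is 10 or later. The sketch below shows that version-gating idea in isolation. NodeRecord, EncodeNode, DecodeNode and the vector-of-words wire format are hypothetical simplifications, not the MBCSV encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors AVD_MBCSV_SUB_PART_VERSION_10 from the patch.
constexpr uint16_t kVersionWithFailoverState = 10;

struct NodeRecord {
  uint32_t rcv_msg_id = 0;
  uint32_t snd_msg_id = 0;
  uint32_t failover_state = 0;  // only on the wire for version >= 10
};

// Append the new field only if the peer is known to understand it.
std::vector<uint32_t> EncodeNode(const NodeRecord &node,
                                 uint16_t peer_version) {
  std::vector<uint32_t> wire{node.rcv_msg_id, node.snd_msg_id};
  if (peer_version >= kVersionWithFailoverState) {
    wire.push_back(node.failover_state);
  }
  return wire;
}

// Read the new field only if the peer's version says it was sent;
// otherwise the field keeps its default value.
NodeRecord DecodeNode(const std::vector<uint32_t> &wire,
                      uint16_t peer_version) {
  NodeRecord node;
  std::size_t pos = 0;
  node.rcv_msg_id = wire.at(pos++);
  node.snd_msg_id = wire.at(pos++);
  if (peer_version >= kVersionWithFailoverState) {
    node.failover_state = wire.at(pos++);
  }
  return node;
}
```

Both sides must gate on the same version number; if only one side did, the decoder would misread every field that follows the new one.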
[devel] [PATCH 2/4] fmd: add active promotion supervision timer [#3029]
Add supervision timer so controller will reboot if it cannot obtain consensus
lock within the allocation period (2* FMS_TAKEOVER_REQUEST_VALID_TIME).

The peer controller can then safely perform a node failover after this period
of time.
---
 src/fm/fmd/fm_cb.h    |  2 ++
 src/fm/fmd/fm_main.cc | 14 -
 src/fm/fmd/fm_rda.cc  | 87 ++-
 3 files changed, 74 insertions(+), 29 deletions(-)

diff --git a/src/fm/fmd/fm_cb.h b/src/fm/fmd/fm_cb.h
index 6eb0d54..b5ea5ae 100644
--- a/src/fm/fmd/fm_cb.h
+++ b/src/fm/fmd/fm_cb.h
@@ -39,6 +39,7 @@ typedef enum {
   FM_TMR_TYPE_MIN,
   FM_TMR_PROMOTE_ACTIVE,
   FM_TMR_ACTIVATION_SUPERVISION,
+  FM_TMR_CONSENSUS_SERVICE_SUPERVISION,
   FM_TMR_TYPE_MAX
 } FM_TMR_TYPE;

@@ -83,6 +84,7 @@ struct FM_CB {
   /* Timers */
   FM_TMR promote_active_tmr{};
   FM_TMR activation_supervision_tmr{};
+  FM_TMR consensus_service_supervision_tmr{};

   /* Time in terms of one hundredth of seconds (500 for 5 secs.) */
   uint32_t active_promote_tmr_val{};
diff --git a/src/fm/fmd/fm_main.cc b/src/fm/fmd/fm_main.cc
index 2eb3c16..4a843cc 100644
--- a/src/fm/fmd/fm_main.cc
+++ b/src/fm/fmd/fm_main.cc
@@ -59,7 +59,8 @@ static uint32_t fm_get_args(FM_CB *);
 static uint32_t fms_fms_exchange_node_info(FM_CB *);
 static uint32_t fms_fms_inform_terminating(FM_CB *fm_cb);
 static uint32_t fm_nid_notify(uint32_t);
-static uint32_t fm_tmr_start(FM_TMR *, SaTimeT);
+uint32_t fm_tmr_start(FM_TMR *, SaTimeT);
+void fm_tmr_stop(FM_TMR *tmr);
 static SaAisErrorT get_peer_clm_node_name(NODE_ID);
 static SaAisErrorT fm_clm_init();
 static void fm_mbx_msg_handler(FM_CB *, FM_EVT *);
@@ -449,6 +450,8 @@ static uint32_t fm_get_args(FM_CB *fm_cb) {
   /* Set timer variables */
   fm_cb->promote_active_tmr.type = FM_TMR_PROMOTE_ACTIVE;
   fm_cb->activation_supervision_tmr.type = FM_TMR_ACTIVATION_SUPERVISION;
+  fm_cb->consensus_service_supervision_tmr.type =
+      FM_TMR_CONSENSUS_SERVICE_SUPERVISION;

   char *node_isolation_timeout = getenv("FMS_NODE_ISOLATION_TIMEOUT");
   if (node_isolation_timeout != NULL) {
@@ -704,6 +707,11 @@ static void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT *fm_mbx_evt) {
             "Activation timer supervision "
             "expired: no ACTIVE assignment received "
             "within the time limit");
+      } else if (fm_mbx_evt->info.fm_tmr->type ==
+                 FM_TMR_CONSENSUS_SERVICE_SUPERVISION) {
+        opensaf_quick_reboot("Consensus service supervision "
+                             "expired: controller was not promoted "
+                             "within the time limit");
       }
       break;

@@ -728,6 +736,10 @@ static void fm_evt_proc_rda_callback(FM_CB *cb, FM_EVT *evt) {
   uint32_t rc = NCSCC_RC_SUCCESS;

   TRACE_ENTER2("%d", (int)evt->info.rda_info.role);
+  if (evt->info.rda_info.role == PCS_RDA_ACTIVE) {
+    LOG_NO("Controller promoted. Stop supervision timer");
+    fm_tmr_stop(&fm_cb->consensus_service_supervision_tmr);
+  }
   if (evt->info.rda_info.role != PCS_RDA_ACTIVE &&
       cb->activation_supervision_tmr.status == FM_TMR_RUNNING) {
     fm_tmr_stop(&cb->activation_supervision_tmr);
diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index d3063ba..c072cb0 100644
--- a/src/fm/fmd/fm_rda.cc
+++ b/src/fm/fmd/fm_rda.cc
@@ -23,6 +23,8 @@
 #include "osaf/consensus/consensus.h"
 #include "rde/agent/rda_papi.h"

+extern uint32_t fm_tmr_start(FM_TMR *tmr, SaTimeT period);
+extern void fm_tmr_stop(FM_TMR *tmr);
 extern void rda_cb(uint32_t cb_hdl, PCS_RDA_CB_INFO *cb_info,
                    PCSRDA_RETURN_CODE error_code);
 /
@@ -64,6 +66,47 @@ done:
   return rc;
 }

+void promote_node(FM_CB *fm_cb) {
+  TRACE_ENTER();
+
+  Consensus consensus_service;
+  if (consensus_service.PrioritisePartitionSize() == true) {
+    // Allow topology events to be processed first. The MDS thread may
+    // be processing MDS down events and updating cluster_size concurrently.
+    // We need cluster_size to be as accurate as possible, without waiting
+    // too long for node down events.
+    std::this_thread::sleep_for(std::chrono::seconds(2));
+  }
+
+  uint32_t rc;
+  rc = consensus_service.PromoteThisNode(true, fm_cb->cluster_size);
+  if (rc != SA_AIS_OK && rc != SA_AIS_ERR_EXIST) {
+    LOG_ER("Unable to set active controller in consensus service");
+    opensaf_quick_reboot("Unable to set active controller "
+                         "in consensus service");
+  } else if (rc == SA_AIS_ERR_EXIST) {
+    // @todo if we don't reboot, we don't seem to recover from this. Can we
+    // improve?
+    LOG_ER(
+        "A controller is already active. We were separated from the "
+        "cluster?");
+    opensaf_quick_reboot("A controller is already active. We were separated "
+                         "from the cluster?");
+  }
+
+  PCS_RDA_REQ rda_req;
+
+  /* set the RDA role to active */
+
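Editor's note: the commit message describes a deadline rule. A controller that starts promoting itself must become active within 2 * FMS_TAKEOVER_REQUEST_VALID_TIME, otherwise it reboots, so the peer can assume it is gone after that period. The sketch below models only that decision. SupervisionAction, SupervisionPeriod and Supervise are hypothetical names, and the real fmd code uses its own timer events rather than a polled function.

```cpp
#include <chrono>

enum class SupervisionAction { kKeepWaiting, kStopTimer, kReboot };

// Allocation period from the commit message: twice the takeover request
// valid time.
std::chrono::seconds SupervisionPeriod(
    std::chrono::seconds takeover_valid_time) {
  return 2 * takeover_valid_time;
}

// Decide what the supervision logic should do at a given point in time.
SupervisionAction Supervise(bool promoted_to_active,
                            std::chrono::seconds elapsed,
                            std::chrono::seconds takeover_valid_time) {
  if (promoted_to_active) {
    return SupervisionAction::kStopTimer;  // promotion succeeded in time
  }
  if (elapsed >= SupervisionPeriod(takeover_valid_time)) {
    return SupervisionAction::kReboot;  // self-fence so the peer can proceed
  }
  return SupervisionAction::kKeepWaiting;
}
```

With the default FMS_TAKEOVER_REQUEST_VALID_TIME of 20 shown in fmd.conf, the period in this sketch is 40 seconds.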
[devel] [PATCH 4/4] osaf: make wait time configurable [#3029]
If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is enabled, make the time that we
wait for MDS node events configurable.
---
 src/fm/fmd/fm_rda.cc            | 4 +++-
 src/fm/fmd/fmd.conf             | 5 +
 src/osaf/consensus/consensus.cc | 9 +
 src/osaf/consensus/consensus.h  | 2 ++
 src/rde/rded/role.cc            | 4 +++-
 5 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index c072cb0..fca417f 100644
--- a/src/fm/fmd/fm_rda.cc
+++ b/src/fm/fmd/fm_rda.cc
@@ -75,7 +75,9 @@ void promote_node(FM_CB *fm_cb) {
     // be processing MDS down events and updating cluster_size concurrently.
     // We need cluster_size to be as accurate as possible, without waiting
     // too long for node down events.
-    std::this_thread::sleep_for(std::chrono::seconds(2));
+    std::this_thread::sleep_for(
+        std::chrono::seconds(
+            consensus_service.PrioritisePartitionSizeWaitTime()));
   }

   uint32_t rc;
diff --git a/src/fm/fmd/fmd.conf b/src/fm/fmd/fmd.conf
index 209e484..4dbf53a 100644
--- a/src/fm/fmd/fmd.conf
+++ b/src/fm/fmd/fmd.conf
@@ -36,6 +36,11 @@ export FMS_TAKEOVER_REQUEST_VALID_TIME=20
 # Default is 1
 #export FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE=1

+# If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 1, wait until
+# this number of seconds for MDS events before making a decision
+# on partition size. Default is 4 seconds
+#export FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE_MDS_WAIT_TIME=4
+
 # Default behaviour is not to allow promotion of this node to Active
 # unless a lock can be obtained, if split brain prevention is enabled.
 # Uncomment the next line to allow promotion of this node at cluster startup,
diff --git a/src/osaf/consensus/consensus.cc b/src/osaf/consensus/consensus.cc
index 814885e..0e37fa3 100644
--- a/src/osaf/consensus/consensus.cc
+++ b/src/osaf/consensus/consensus.cc
@@ -207,6 +207,10 @@ bool Consensus::PrioritisePartitionSize() const {
   return prioritise_partition_size_;
 }

+uint32_t Consensus::PrioritisePartitionSizeWaitTime() const {
+  return prioritise_partition_size_mds_wait_time_;
+}
+
 uint32_t Consensus::TakeoverValidTime() const {
   return takeover_valid_time_;
 }
@@ -253,6 +257,8 @@ void Consensus::ProcessEnvironmentSettings() {
   uint32_t use_remote_fencing = base::GetEnv("FMS_USE_REMOTE_FENCING", 0);
   uint32_t prioritise_partition_size =
       base::GetEnv("FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE", 1);
+  uint32_t prioritise_partition_size_mds_wait_time =
+      base::GetEnv("FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE_MDS_WAIT_TIME", 4);
   uint32_t relaxed_node_promotion =
       base::GetEnv("FMS_RELAXED_NODE_PROMOTION", 0);
   config_file_ = base::GetEnv("FMS_CONF_FILE", "");
@@ -281,6 +287,9 @@ void Consensus::ProcessEnvironmentSettings() {
   if (use_consensus_ == true && relaxed_node_promotion == 1) {
     relaxed_node_promotion_ = true;
   }
+
+  prioritise_partition_size_mds_wait_time_ =
+      prioritise_partition_size_mds_wait_time;
 }

 bool Consensus::ReloadConfiguration() {
diff --git a/src/osaf/consensus/consensus.h b/src/osaf/consensus/consensus.h
index 1fabf90..1aba561 100644
--- a/src/osaf/consensus/consensus.h
+++ b/src/osaf/consensus/consensus.h
@@ -61,6 +61,7 @@ class Consensus {
   bool IsRelaxedNodePromotionEnabled() const;

   bool PrioritisePartitionSize() const;
+  uint32_t PrioritisePartitionSizeWaitTime() const;

   uint32_t TakeoverValidTime() const;

@@ -100,6 +101,7 @@ class Consensus {
   bool use_consensus_{false};
   bool use_remote_fencing_{false};
   bool prioritise_partition_size_{true};
+  uint32_t prioritise_partition_size_mds_wait_time_{4};
   bool relaxed_node_promotion_{false};
   uint32_t takeover_valid_time_{20};
   uint32_t max_takeover_retry_{0};
diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index b8c8157..b890117 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -83,7 +83,9 @@ void Role::MonitorCallback(const std::string& key, const std::string& new_value,
         consensus_service.PrioritisePartitionSize() == true) {
       // don't send this to the main thread straight away, as it will
       // need some time to process topology changes.
-      std::this_thread::sleep_for(std::chrono::seconds(4));
+      std::this_thread::sleep_for(
+          std::chrono::seconds(
+              consensus_service.PrioritisePartitionSizeWaitTime()));
     }
   } else {
     msg->type = RDE_MSG_NEW_ACTIVE_CALLBACK;
-- 
2.7.4
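Editor's note: the patch reads the wait time from the environment with a default of 4 seconds, replacing the hard-coded 2 (fmd) and 4 (rded) second sleeps. The sketch below shows one way such a lookup can work using only the standard library. GetEnvU32 and MdsWaitTimeSeconds are hypothetical helpers, not the OpenSAF base::GetEnv implementation, and falling back to the default on malformed input is an assumption made here.

```cpp
#include <cstdint>
#include <cstdlib>

// Read an unsigned integer from the environment, returning default_value
// when the variable is unset, empty, not a plain decimal number, or too
// large for 32 bits.
uint32_t GetEnvU32(const char *name, uint32_t default_value) {
  const char *text = std::getenv(name);
  if (text == nullptr || *text == '\0') return default_value;
  char *end = nullptr;
  const unsigned long value = std::strtoul(text, &end, 10);
  if (*end != '\0' || value > UINT32_MAX) return default_value;
  return static_cast<uint32_t>(value);
}

// Wait time in seconds for MDS node events, defaulting to 4 as in fmd.conf.
uint32_t MdsWaitTimeSeconds() {
  return GetEnvU32("FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE_MDS_WAIT_TIME", 4);
}
```

A caller would then sleep for std::chrono::seconds(MdsWaitTimeSeconds()) instead of a literal, which is the shape of the change in fm_rda.cc and role.cc.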
[devel] [PATCH 0/4] Review Request for amfd: improve controller failover behavior V2 [#3029]
Summary: amfd: improve controller failover behavior [#3029]
Review request for Ticket(s): 3029
Peer Reviewer(s): Canh, Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3029
Base revision: 71852f322b42437f074bfa4c618c021798357143
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area          Impact y/n
---------------------------------
 Docs                      n
 Build system              n
 RPM/packaging             n
 Configuration files       n
 Startup scripts           n
 SAF services              y
 OpenSAF services          y
 Core libraries            y
 Samples                   n
 Tests                     n
 Other                     n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 4feee2b631afa3393ae9e53fd6575c3768861dca
Author: Gary Lee
Date: Tue, 9 Jul 2019 14:38:49 +1000

osaf: make wait time configurable [#3029]

If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is enabled, make the time that we wait for MDS node events configurable.

revision 2c419ba5fffb85272f0d15118b561bcfc1de4814
Author: Gary Lee
Date: Tue, 9 Jul 2019 14:38:49 +1000

amfd: improve controller failover behavior [#3029]

If consensus service is enabled, only perform node failover after peer controller has self-fenced (after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds).

This also means if node failover delay is set to a large value, we do not unnecessarily wait too long before failing over assignments previously assigned to the peer controller.

Remove unused fmd_conf_file variable.

Change some LOG_ER calls to LOG_WA.

revision 7c4fff483477082ca66a26f921a50b3bc1240538
Author: Gary Lee
Date: Tue, 9 Jul 2019 14:38:49 +1000

fmd: add active promotion supervision timer [#3029]

Add supervision timer so controller will reboot if it cannot obtain consensus lock within the allocation period (2 * FMS_TAKEOVER_REQUEST_VALID_TIME). The peer controller can then safely perform a node failover after this period of time.

revision 8b596a228402ff99b26906138daf920c23e965e7
Author: Gary Lee
Date: Tue, 9 Jul 2019 14:38:49 +1000

osaf: add function to return takeover request expiry time [#3029]

Complete diffstat:
------------------
 src/amf/amfd/cb.h                  |  1 -
 src/amf/amfd/clm.cc                |  4 +-
 src/amf/amfd/main.cc               |  1 -
 src/amf/amfd/ndfsm.cc              |  8 ++--
 src/amf/amfd/ndproc.cc             | 19
 src/amf/amfd/node_state.cc         | 23 +-
 src/amf/amfd/node_state_machine.cc | 19
 src/amf/amfd/node_state_machine.h  |  2 +
 src/amf/amfd/proc.h                |  1 +
 src/fm/fmd/fm_cb.h                 |  2 +
 src/fm/fmd/fm_main.cc              | 14 +-
 src/fm/fmd/fm_rda.cc               | 89 ++
 src/fm/fmd/fmd.conf                |  5 +++
 src/osaf/consensus/consensus.cc    | 13 ++
 src/osaf/consensus/consensus.h     |  4 ++
 src/rde/rded/role.cc               |  4 +-
 16 files changed, 160 insertions(+), 49 deletions(-)

Testing Commands:
-----------------
1) Ensure a 2N application is active on the standby controller, and standby on the active controller
2) Isolate active & standby controller

Testing, Expected Results:
--------------------------
amfd should fail over the 2N application only after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds

Conditions of Submission:
-------------------------
ack from any reviewer

Arch        Built   Started   Linux distro
-------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code c
Re: [devel] [PATCH 1/1] amfd: disallow delete of CtCs object if Ct maps to comp [#3028]
Hi Phuc

ack, will push on your behalf.

Thanks
Gary

On 25/6/19 7:13 pm, phuc.h.chau wrote:

amfd crashes when an SU is unlocked. The crash originates in avd_snd_susi_msg(), which calls get_comp_capability() with csi and comp as input parameters. In get_comp_capability() there is no CtCs object available, so ctcstype_db->find() returns NULL for ctcs_type. amfd then crashes when it accesses ctcs_type->saAmfCtCompCapability, because ctcs_type is NULL.
---
 src/amf/amfd/ctcstype.cc | 46 +-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/src/amf/amfd/ctcstype.cc b/src/amf/amfd/ctcstype.cc
index 5dffdae..7e62358 100644
--- a/src/amf/amfd/ctcstype.cc
+++ b/src/amf/amfd/ctcstype.cc
@@ -187,13 +187,57 @@ static SaAisErrorT ctcstype_ccb_completed_cb(CcbUtilOperationData_t *opdata) {
           opdata, "Modification of SaAmfCtCsType not supported");
       break;
     case CCBUTIL_DELETE:
+      AVD_CTCS_TYPE *ctcstype;
+      AVD_COMP_TYPE *comp_type;
+      AVD_COMP *comp;
+      CcbUtilOperationData_t *t_opData;
+
+      ctcstype = ctcstype_db->find(Amf::to_string(&opdata->objectName));
+      if (ctcstype != nullptr) {
+        std::string cst_name, ct_name;
+        avsv_sanamet_init(Amf::to_string(&opdata->objectName),
+                          cst_name, "safCSType=");
+        avsv_sanamet_init(cst_name, ct_name, "safVersion");
+        TRACE("'%s'", ct_name.c_str());
+        comp_type = comptype_db->find(ct_name);
+        if ((comp_type) && (nullptr != comp_type->list_of_comp)) {
+          /* check whether there exists a delete operation for
+           * each of the Comp in the comp_type list in the current CCB
+           */
+          bool comp_exist = false;
+          TRACE("SaAmfCompType '%s' has components", comp_type->name.c_str());
+          comp = comp_type->list_of_comp;
+          while (comp != nullptr) {
+            TRACE("%s", osaf_extended_name_borrow(&comp->comp_info.name));
+            t_opData = ccbutil_getCcbOpDataByDN(opdata->ccbId,
+                                                &comp->comp_info.name);
+            TRACE("%p", t_opData);
+            if ((t_opData == nullptr) ||
+                (t_opData->operationType != CCBUTIL_DELETE)) {
+              TRACE("OperationType: %p", t_opData);
+              comp_exist = true;
+              break;
+            }
+            comp = comp->comp_type_list_comp_next;
+          }
+          if (comp_exist == true) {
+            rc = SA_AIS_ERR_BAD_OPERATION;
+            report_ccb_validation_error(opdata, "SaAmfCompType '%s' is in use",
+                                        comp_type->name.c_str());
+            goto done;
+          }
+        } else {
+          TRACE("SaAmfCompType '%p'. SaAmfCompType '%s' has no components",
+                comp_type, ct_name.c_str());
+        }
+      }
       rc = SA_AIS_OK;
       break;
     default:
       osafassert(0);
       break;
   }
-
+done:
   TRACE_LEAVE2("%u", rc);
   return rc;
 }

___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
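The rule that the patch above adds to the CCB "completed" callback can be looked at in isolation: deleting the type object is accepted only if every component that still maps to it is deleted in the same CCB. The sketch below illustrates just that rule under simplifying assumptions. `MayDeleteType`, the use of plain strings for DNs and the two containers are hypothetical stand-ins, not amfd data structures.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration (not amfd code).
// `components_of_type`: DNs of the components that still reference the type.
// `deleted_in_ccb`: DNs that have a delete operation in the current CCB.
// Returns true when the type may be deleted, i.e. no component would be left
// behind pointing at a type object that no longer exists.
inline bool MayDeleteType(const std::vector<std::string>& components_of_type,
                          const std::set<std::string>& deleted_in_ccb) {
  for (const std::string& dn : components_of_type) {
    if (deleted_in_ccb.count(dn) == 0) {
      // This component survives the CCB, so the type is still in use.
      return false;
    }
  }
  return true;
}
```

In the patch, the "type is still in use" outcome corresponds to returning SA_AIS_ERR_BAD_OPERATION with a validation error, which makes IMM reject the CCB instead of letting amfd dereference a missing object later.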
[devel] [PATCH 1/3] osaf: add function to return takeover request expiry time [#3029]
---
 src/osaf/consensus/consensus.cc | 4 ++++
 src/osaf/consensus/consensus.h  | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/src/osaf/consensus/consensus.cc b/src/osaf/consensus/consensus.cc
index 0bebab2..814885e 100644
--- a/src/osaf/consensus/consensus.cc
+++ b/src/osaf/consensus/consensus.cc
@@ -207,6 +207,10 @@ bool Consensus::PrioritisePartitionSize() const {
   return prioritise_partition_size_;
 }

+uint32_t Consensus::TakeoverValidTime() const {
+  return takeover_valid_time_;
+}
+
 std::string Consensus::CurrentActive() const {
   TRACE_ENTER();
   if (use_consensus_ == false) {
diff --git a/src/osaf/consensus/consensus.h b/src/osaf/consensus/consensus.h
index eb12b2c..1fabf90 100644
--- a/src/osaf/consensus/consensus.h
+++ b/src/osaf/consensus/consensus.h
@@ -62,6 +62,8 @@ class Consensus {

   bool PrioritisePartitionSize() const;

+  uint32_t TakeoverValidTime() const;
+
   // Determine if plugin is telling us to self-fence due to loss
   // of connectivity to the KV store
   bool SelfFence(const std::string& request) const;
--
2.7.4

___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
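The accessor added above is a plain const getter. Elsewhere in this thread the related patches describe a period of 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds, and the fmd control block comments state that its timer values are expressed in hundredths of a second. A minimal sketch of that arithmetic follows. `ConsensusSketch` and the two helper functions are hypothetical and exist only for this example; how fmd and amfd actually consume the accessor is not shown here.

```cpp
#include <cstdint>

// Hypothetical miniature of the Consensus class: only the accessor from the
// patch, with the value injected through the constructor for the example.
class ConsensusSketch {
 public:
  explicit ConsensusSketch(std::uint32_t takeover_valid_time)
      : takeover_valid_time_(takeover_valid_time) {}

  std::uint32_t TakeoverValidTime() const { return takeover_valid_time_; }

 private:
  std::uint32_t takeover_valid_time_;
};

// Twice the takeover request validity, in seconds. Computed in 64 bits so a
// large configured value cannot overflow.
inline std::uint64_t SupervisionPeriodSeconds(const ConsensusSketch& c) {
  return 2ULL * c.TakeoverValidTime();
}

// The same period in hundredths of a second (500 means 5 seconds).
inline std::uint64_t SupervisionPeriodCentiseconds(const ConsensusSketch& c) {
  return SupervisionPeriodSeconds(c) * 100ULL;
}
```

With a validity of 20 seconds, the period is 40 seconds, or 4000 in hundredths of a second.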
[devel] [PATCH 0/3] Review Request for amfd: improve controller failover behavior [#3029]
Summary: osaf: add function to return takeover request expiry time [#3029]
Review request for Ticket(s): 3029
Peer Reviewer(s): Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3029
Base revision: 4f86e371d28a385f689011a0effef8aaae65e713
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area          Impact y/n
---------------------------------
 Docs                      n
 Build system              n
 RPM/packaging             n
 Configuration files       n
 Startup scripts           n
 SAF services              y
 OpenSAF services          y
 Core libraries            n
 Samples                   n
 Tests                     n
 Other                     n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 1f48477cdcd92356cd446ad81741f9373724be7c
Author: Gary Lee
Date: Wed, 3 Jul 2019 16:19:17 +1000

amfd: improve controller failover behavior [#3029]

If consensus service is enabled, only perform node failover after peer controller has self-fenced (after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds).

This also means if node failover delay is set to a large value, we do not unnecessarily wait too long before failing over assignments previously assigned to the peer controller.

Remove unused fmd_conf_file variable.

Change some LOG_ER calls to LOG_WA.

revision 5e03fc3e30920989080f6617ca404f7f60f4a8cc
Author: Gary Lee
Date: Wed, 3 Jul 2019 16:19:10 +1000

fmd: add active promotion supervision timer [#3029]

Add supervision timer so controller will reboot if it cannot obtain consensus lock within the allocation period (2 * FMS_TAKEOVER_REQUEST_VALID_TIME). The peer controller can then safely perform a node failover after this period of time.

revision c2a9e9d8712952526660efe678daee39f85d1d68
Author: Gary Lee
Date: Wed, 3 Jul 2019 15:34:36 +1000

osaf: add function to return takeover request expiry time [#3029]

Complete diffstat:
------------------
 src/amf/amfd/cb.h                  |  1 -
 src/amf/amfd/clm.cc                |  4 +-
 src/amf/amfd/main.cc               |  1 -
 src/amf/amfd/ndfsm.cc              |  8 ++--
 src/amf/amfd/ndproc.cc             | 19 ++
 src/amf/amfd/node_state.cc         | 23 +--
 src/amf/amfd/node_state_machine.cc | 19 ++
 src/amf/amfd/node_state_machine.h  |  2 +
 src/amf/amfd/proc.h                |  1 +
 src/fm/fmd/fm_cb.h                 |  2 +
 src/fm/fmd/fm_main.cc              | 14 ++-
 src/fm/fmd/fm_rda.cc               | 78 ++
 src/osaf/consensus/consensus.cc    |  4 ++
 src/osaf/consensus/consensus.h     |  2 +
 14 files changed, 134 insertions(+), 44 deletions(-)

Testing Commands:
-----------------
1) Ensure a 2N application is active on the standby controller, and standby on the active controller
2) Isolate active & standby controller

Testing, Expected Results:
--------------------------
amfd should fail over the 2N application only after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds

Conditions of Submission:
-------------------------
Ack from reviewer

Arch        Built   Started   Linux distro
-------------------------------------------
mips          n        n
mips64        n        n
x86           n        n
x86_64        y        y
powerpc       n        n
powerpc64     n        n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content into a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; Instead you should place your content in a public tre
[devel] [PATCH 2/3] fmd: add active promotion supervision timer [#3029]
Add supervision timer so controller will reboot if it cannot obtain consensus lock within the allocation period (2* FMS_TAKEOVER_REQUEST_VALID_TIME). The peer controller can then safely perform a node failover after this period of time. --- src/fm/fmd/fm_cb.h| 2 ++ src/fm/fmd/fm_main.cc | 14 - src/fm/fmd/fm_rda.cc | 78 +++ 3 files changed, 69 insertions(+), 25 deletions(-) diff --git a/src/fm/fmd/fm_cb.h b/src/fm/fmd/fm_cb.h index 6eb0d54..b5ea5ae 100644 --- a/src/fm/fmd/fm_cb.h +++ b/src/fm/fmd/fm_cb.h @@ -39,6 +39,7 @@ typedef enum { FM_TMR_TYPE_MIN, FM_TMR_PROMOTE_ACTIVE, FM_TMR_ACTIVATION_SUPERVISION, + FM_TMR_CONSENSUS_SERVICE_SUPERVISION, FM_TMR_TYPE_MAX } FM_TMR_TYPE; @@ -83,6 +84,7 @@ struct FM_CB { /* Timers */ FM_TMR promote_active_tmr{}; FM_TMR activation_supervision_tmr{}; + FM_TMR consensus_service_supervision_tmr{}; /* Time in terms of one hundredth of seconds (500 for 5 secs.) */ uint32_t active_promote_tmr_val{}; diff --git a/src/fm/fmd/fm_main.cc b/src/fm/fmd/fm_main.cc index 2eb3c16..4a843cc 100644 --- a/src/fm/fmd/fm_main.cc +++ b/src/fm/fmd/fm_main.cc @@ -59,7 +59,8 @@ static uint32_t fm_get_args(FM_CB *); static uint32_t fms_fms_exchange_node_info(FM_CB *); static uint32_t fms_fms_inform_terminating(FM_CB *fm_cb); static uint32_t fm_nid_notify(uint32_t); -static uint32_t fm_tmr_start(FM_TMR *, SaTimeT); +uint32_t fm_tmr_start(FM_TMR *, SaTimeT); +void fm_tmr_stop(FM_TMR *tmr); static SaAisErrorT get_peer_clm_node_name(NODE_ID); static SaAisErrorT fm_clm_init(); static void fm_mbx_msg_handler(FM_CB *, FM_EVT *); @@ -449,6 +450,8 @@ static uint32_t fm_get_args(FM_CB *fm_cb) { /* Set timer variables */ fm_cb->promote_active_tmr.type = FM_TMR_PROMOTE_ACTIVE; fm_cb->activation_supervision_tmr.type = FM_TMR_ACTIVATION_SUPERVISION; + fm_cb->consensus_service_supervision_tmr.type = +FM_TMR_CONSENSUS_SERVICE_SUPERVISION; char *node_isolation_timeout = getenv("FMS_NODE_ISOLATION_TIMEOUT"); if (node_isolation_timeout != NULL) { @@ -704,6 +707,11 @@ static 
void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT *fm_mbx_evt) { "Activation timer supervision " "expired: no ACTIVE assignment received " "within the time limit"); + } else if (fm_mbx_evt->info.fm_tmr->type == + FM_TMR_CONSENSUS_SERVICE_SUPERVISION) { +opensaf_quick_reboot("Consensus service supervision " + "expired: controller was not promoted " + "within the time limit"); } break; @@ -728,6 +736,10 @@ static void fm_evt_proc_rda_callback(FM_CB *cb, FM_EVT *evt) { uint32_t rc = NCSCC_RC_SUCCESS; TRACE_ENTER2("%d", (int)evt->info.rda_info.role); + if (evt->info.rda_info.role == PCS_RDA_ACTIVE) { +LOG_NO("Controller promoted. Stop supervision timer"); +fm_tmr_stop(_cb->consensus_service_supervision_tmr); + } if (evt->info.rda_info.role != PCS_RDA_ACTIVE && cb->activation_supervision_tmr.status == FM_TMR_RUNNING) { fm_tmr_stop(>activation_supervision_tmr); diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc index d3063ba..0544152 100644 --- a/src/fm/fmd/fm_rda.cc +++ b/src/fm/fmd/fm_rda.cc @@ -23,6 +23,8 @@ #include "osaf/consensus/consensus.h" #include "rde/agent/rda_papi.h" +extern uint32_t fm_tmr_start(FM_TMR *tmr, SaTimeT period); +extern void fm_tmr_stop(FM_TMR *tmr); extern void rda_cb(uint32_t cb_hdl, PCS_RDA_CB_INFO *cb_info, PCSRDA_RETURN_CODE error_code); / @@ -64,6 +66,47 @@ done: return rc; } +void promote_node(FM_CB *fm_cb) { + TRACE_ENTER(); + + Consensus consensus_service; + if (consensus_service.PrioritisePartitionSize() == true) { +// Allow topology events to be processed first. The MDS thread may +// be processing MDS down events and updating cluster_size concurrently. +// We need cluster_size to be as accurate as possible, without waiting +// too long for node down events. 
+std::this_thread::sleep_for(std::chrono::seconds(2)); + } + + uint32_t rc; + rc = consensus_service.PromoteThisNode(true, fm_cb->cluster_size); + if (rc != SA_AIS_OK && rc != SA_AIS_ERR_EXIST) { +LOG_ER("Unable to set active controller in consensus service"); +opensaf_quick_reboot("Unable to set active controller " + "in consensus service"); + } else if (rc == SA_AIS_ERR_EXIST) { +// @todo if we don't reboot, we don't seem to recover from this. Can we +// improve? +LOG_ER( +"A controller is already active. We were separated from the " +"cluster?"); +opensaf_quick_reboot("A controller is already active. We were separated " + "from the cluster?"); + } + + PCS_RDA_REQ rda_req; + + /* set the RDA role to active */ +
[devel] [PATCH 3/3] amfd: improve controller failover behavior [#3029]
If consensus service is enabled, only perform node failover after peer controller has self-fenced (after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds). This also means if node failover delay is set to a large value, we do not unnecesarily wait too long before failing over assignments previously assigned to the peer controller. Remove unused fmd_conf_file variable. Change some LOG_ER calls to LOG_WA. --- src/amf/amfd/cb.h | 1 - src/amf/amfd/clm.cc| 4 ++-- src/amf/amfd/main.cc | 1 - src/amf/amfd/ndfsm.cc | 8 src/amf/amfd/ndproc.cc | 19 +++ src/amf/amfd/node_state.cc | 23 --- src/amf/amfd/node_state_machine.cc | 19 +++ src/amf/amfd/node_state_machine.h | 2 ++ src/amf/amfd/proc.h| 1 + 9 files changed, 59 insertions(+), 19 deletions(-) diff --git a/src/amf/amfd/cb.h b/src/amf/amfd/cb.h index 89cf15d..7ac743e 100644 --- a/src/amf/amfd/cb.h +++ b/src/amf/amfd/cb.h @@ -202,7 +202,6 @@ typedef struct cl_cb_tag { AVD_TMR heartbeat_tmr; /* The timer for sending heart beats to nd. */ SaTimeT heartbeat_tmr_period; uint32_t minimum_cluster_size; - std::string fmd_conf_file; uint32_t nodes_exit_cnt; /* The counter to identifies the number of nodes that have exited the membership diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc index aeae939..cfbe36a 100644 --- a/src/amf/amfd/clm.cc +++ b/src/amf/amfd/clm.cc @@ -203,7 +203,7 @@ static void clm_node_exit_complete(SaClmNodeIdT nodeId) { } if (avd_cb->failover_list.count(node->node_info.nodeId) == 0 && -avd_cb->node_failover_delay == 0) { +delay_failover(avd_cb, node->node_info.nodeId) == false) { avd_node_failover(node); avd_node_delete_nodeid(node); } @@ -322,7 +322,7 @@ static void clm_track_cb( LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received", __FUNCTION__, node_name.c_str()); if (avd_cb->failover_list.count(node->node_info.nodeId) == 0 && - avd_cb->node_failover_delay == 0) { + delay_failover(avd_cb, node->node_info.nodeId) == false) { avd_node_delete_nodeid(node); } goto done; diff --git 
a/src/amf/amfd/main.cc b/src/amf/amfd/main.cc index e3d0957..03857a1 100644 --- a/src/amf/amfd/main.cc +++ b/src/amf/amfd/main.cc @@ -582,7 +582,6 @@ static uint32_t initialize(void) { } cb->minimum_cluster_size = base::GetEnv("OSAF_AMF_MIN_CLUSTER_SIZE", uint32_t{2}); - cb->fmd_conf_file = base::GetEnv("FMS_CONF_FILE", ""); node_list_db = new AmfDb; amfnd_svc_db = new std::set; diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc index 7099196..16b2def 100644 --- a/src/amf/amfd/ndfsm.cc +++ b/src/amf/amfd/ndfsm.cc @@ -811,7 +811,7 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) { std::shared_ptr failed_node = cb->failover_list.at(evt->info.node_id); failed_node->MdsDown(); -} else if (cb->node_failover_delay > 0) { +} else if (delay_failover(cb, evt->info.node_id) == true) { LOG_NO("Node '%s' is down. Start failover delay timer", node->node_name.c_str()); @@ -821,10 +821,10 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) { } if (avd_cb->avail_state_avd == SA_AMF_HA_ACTIVE) { - if (cb->node_failover_delay == 0) { + check_quorum(cb); + if (delay_failover(cb, evt->info.node_id) == false) { avd_node_failover(node); } - check_quorum(cb); node->node_info.member = SA_FALSE; // Update standby out of sync if standby sc goes down if (avd_cb->node_id_avd_other == node->node_info.nodeId) { @@ -833,7 +833,7 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) { m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(avd_cb, node, AVSV_CKPT_AVD_NODE_CONFIG); } -} else if (cb->node_failover_delay == 0) { +} else if (delay_failover(cb, evt->info.node_id) == false) { /* Remove dynamic info for node but keep in nodeid tree. * Possibly used at the end of controller failover to * to failover payload nodes. 
diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc index 5f5cbcd..0d30dfe 100644 --- a/src/amf/amfd/ndproc.cc +++ b/src/amf/amfd/ndproc.cc @@ -1277,6 +1277,25 @@ void avd_node_failover(AVD_AVND *node, const bool mw_only) { TRACE_LEAVE(); } +bool delay_failover(const AVD_CL_CB *cb, const SaClmNodeIdT node_id) { + TRACE_ENTER(); + Consensus consensus_service; + bool delay = false; + + if (cb->node_failover_delay > 0) { + delay = true; + } else if (node_id == cb->node_id_avd_other && + consensus_service.IsEnabled() == true && + consensus_service.IsRemoteFencingEnabled() == false) { +// even though node failover delay is set to
Re: [devel] [PATCH 1/1] amf: check null before access to config objects [#3055]
Hi Thang

ack (review only)

Thanks
Gary

On 2/7/19 12:25 pm, thang.d.nguyen wrote:

While a controller is coming up, it creates its config objects from IMM. If an object has already been deleted, the amfd that is coming up can still receive the CCB object delete callback for it. amfd then validates the operation and crashes because it accesses a null pointer.
---
 src/amf/amfd/app.cc      | 17 ++++++++++-------
 src/amf/amfd/apptype.cc  | 13 +++++++++++--
 src/amf/amfd/comptype.cc | 10 +++++++++-
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/src/amf/amfd/app.cc b/src/amf/amfd/app.cc
index 424d828..67e5e3e 100644
--- a/src/amf/amfd/app.cc
+++ b/src/amf/amfd/app.cc
@@ -319,13 +319,16 @@ static void app_ccb_apply_cb(CcbUtilOperationData_t *opdata) {
     }
     case CCBUTIL_DELETE:
       app = app_db->find(Amf::to_string(&opdata->objectName));
-      /* by this time all the SGs and SIs under this
-       * app object should have been *DELETED* just
-       * do a sanity check here
-       */
-      osafassert(app->list_of_sg == nullptr);
-      osafassert(app->list_of_si == nullptr);
-      avd_app_delete(app);
+      if ((app != nullptr) || (avd_cb->is_active() == true)) {
+        /* by this time all the SGs and SIs under this
+         * app object should have been *DELETED* just
+         * do a sanity check here
+         */
+        osafassert(app);
+        osafassert(app->list_of_sg == nullptr);
+        osafassert(app->list_of_si == nullptr);
+        avd_app_delete(app);
+      }
       break;
     default:
       osafassert(0);
diff --git a/src/amf/amfd/apptype.cc b/src/amf/amfd/apptype.cc
index c22147f..20c94cb 100644
--- a/src/amf/amfd/apptype.cc
+++ b/src/amf/amfd/apptype.cc
@@ -155,6 +155,12 @@ static SaAisErrorT apptype_ccb_completed_cb(CcbUtilOperationData_t *opdata) {
       break;
     case CCBUTIL_DELETE:
       app_type = avd_apptype_get(object_name);
+      if (app_type == nullptr && avd_cb->is_active() == false) {
+        opdata->userData = nullptr;
+        rc = SA_AIS_OK;
+        break;
+      }
+      osafassert(app_type);
       if (nullptr != app_type->list_of_app) {
         /* check whether there exists a delete operation for
          * each of the App in the app_type list in the current CCB
@@ -201,8 +207,11 @@ static void apptype_ccb_apply_cb(CcbUtilOperationData_t *opdata) {
       apptype_add_to_model(app_type);
       break;
     case CCBUTIL_DELETE:
-      app_type = static_cast<AVD_APP_TYPE *>(opdata->userData);
-      apptype_delete(&app_type);
+      if ((opdata->userData != nullptr) || (avd_cb->is_active() == true)) {
+        app_type = static_cast<AVD_APP_TYPE *>(opdata->userData);
+        osafassert(app_type);
+        apptype_delete(&app_type);
+      }
       break;
     default:
       osafassert(0);
diff --git a/src/amf/amfd/comptype.cc b/src/amf/amfd/comptype.cc
index 38582cc..48a333e 100644
--- a/src/amf/amfd/comptype.cc
+++ b/src/amf/amfd/comptype.cc
@@ -630,7 +630,9 @@ static void comptype_ccb_apply_cb(CcbUtilOperationData_t *opdata) {
       comptype_db_add(comp_type);
       break;
     case CCBUTIL_DELETE:
-      comptype_delete(static_cast<AVD_COMP_TYPE *>(opdata->userData));
+      if ((opdata->userData != nullptr) || (avd_cb->is_active() == true)) {
+        comptype_delete(static_cast<AVD_COMP_TYPE *>(opdata->userData));
+      }
       break;
     case CCBUTIL_MODIFY:
       ccb_apply_modify_hdlr(opdata);
@@ -802,6 +804,12 @@ static SaAisErrorT comptype_ccb_completed_cb(CcbUtilOperationData_t *opdata) {
       break;
     case CCBUTIL_DELETE:
       comp_type = comptype_db->find(Amf::to_string(&opdata->objectName));
+      if (comp_type == nullptr && avd_cb->is_active() == false) {
+        rc = SA_AIS_OK;
+        opdata->userData = nullptr;
+        break;
+      }
+      osafassert(comp_type);
       if (nullptr != comp_type->list_of_comp) {
         /* check whether there exists a delete operation for
          * each of the Comp in the comp_type list in the current CCB

___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
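The guard that this patch repeats in each delete callback reduces to a three-way decision: an object that is found gets deleted, an object that is missing on a non-active director is tolerated, and an object that is missing on the active director is still treated as a programming error (the patch keeps osafassert() for that case). The sketch below shows only that decision. `DeleteAction`, `DecideDelete` and the map standing in for the object database are hypothetical names for this example, not amfd code.

```cpp
#include <map>
#include <string>

// Hypothetical outcome of a delete callback in this sketch.
enum class DeleteAction {
  kDelete,  // object found: run the normal delete path
  kIgnore,  // object missing on a non-active director: skip quietly
  kAbort    // object missing on the active director: invariant violated
};

// `db` stands in for an object database keyed by DN; `is_active` for the
// director's HA role. Both are assumptions made for illustration.
inline DeleteAction DecideDelete(const std::map<std::string, int>& db,
                                 const std::string& dn, bool is_active) {
  const bool found = db.find(dn) != db.end();
  if (found) {
    return DeleteAction::kDelete;
  }
  return is_active ? DeleteAction::kAbort : DeleteAction::kIgnore;
}
```

Keeping the "abort" branch for the active role preserves the original sanity check where it is still meaningful, while the director that is still starting up no longer dereferences a null pointer.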
Re: [devel] [PATCH 1/1] utils: Use a fence daemon as an alternative to payload reboot fencing [#3048]
Hi Hans Looks good, ack (review only). One very, very minor comment: # systemd services managed by fenced. Separate service names by whitespace, e.g. "opensafd" SERVICES_TO_FENCE="opensafd" I guess you could put a second service in the example :-) Thanks Gary On 5/6/19 6:36 pm, Hans Nordebäck wrote: --- src/fm/Makefile.am| 6 +- src/fm/fmd/fm_amf.cc | 14 + src/fm/fmd/tipc_server.cc | 93 ++ src/fm/fmd/tipc_server.h | 45 +++ tools/devel/fenced/Makefile | 63 tools/devel/fenced/README_TOOLS | 15 + tools/devel/fenced/command.cc | 134 tools/devel/fenced/command.h | 43 +++ tools/devel/fenced/cpp_macros.h | 33 ++ tools/devel/fenced/fenced.conf| 17 + tools/devel/fenced/fenced_main.cc | 179 +++ tools/devel/fenced/node_state_file.cc | 87 ++ tools/devel/fenced/node_state_file.h | 41 +++ tools/devel/fenced/node_state_hdlr.cc | 54 tools/devel/fenced/node_state_hdlr.h | 45 +++ tools/devel/fenced/node_state_hdlr_factory.cc | 66 tools/devel/fenced/node_state_hdlr_factory.h | 35 +++ tools/devel/fenced/node_state_hdlr_pl.cc | 292 ++ tools/devel/fenced/node_state_hdlr_pl.h | 60 tools/devel/fenced/node_state_hdlr_sc.cc | 42 +++ tools/devel/fenced/node_state_hdlr_sc.h | 41 +++ tools/devel/fenced/osaffenced.service | 14 + tools/devel/fenced/service.cc | 53 tools/devel/fenced/service.h | 42 +++ tools/devel/fenced/timer.cc | 62 tools/devel/fenced/timer.h| 53 tools/devel/fenced/watchdog.cc| 37 +++ tools/devel/fenced/watchdog.h | 39 +++ 28 files changed, 1703 insertions(+), 2 deletions(-) create mode 100644 src/fm/fmd/tipc_server.cc create mode 100644 src/fm/fmd/tipc_server.h create mode 100755 tools/devel/fenced/Makefile create mode 100644 tools/devel/fenced/README_TOOLS create mode 100644 tools/devel/fenced/command.cc create mode 100644 tools/devel/fenced/command.h create mode 100644 tools/devel/fenced/cpp_macros.h create mode 100644 tools/devel/fenced/fenced.conf create mode 100644 tools/devel/fenced/fenced_main.cc create mode 100644 tools/devel/fenced/node_state_file.cc create mode 100644 
tools/devel/fenced/node_state_file.h create mode 100644 tools/devel/fenced/node_state_hdlr.cc create mode 100644 tools/devel/fenced/node_state_hdlr.h create mode 100644 tools/devel/fenced/node_state_hdlr_factory.cc create mode 100644 tools/devel/fenced/node_state_hdlr_factory.h create mode 100644 tools/devel/fenced/node_state_hdlr_pl.cc create mode 100644 tools/devel/fenced/node_state_hdlr_pl.h create mode 100644 tools/devel/fenced/node_state_hdlr_sc.cc create mode 100644 tools/devel/fenced/node_state_hdlr_sc.h create mode 100644 tools/devel/fenced/osaffenced.service create mode 100644 tools/devel/fenced/service.cc create mode 100644 tools/devel/fenced/service.h create mode 100644 tools/devel/fenced/timer.cc create mode 100644 tools/devel/fenced/timer.h create mode 100644 tools/devel/fenced/watchdog.cc create mode 100644 tools/devel/fenced/watchdog.h diff --git a/src/fm/Makefile.am b/src/fm/Makefile.am index 0f254b94f..325847ae9 100644 --- a/src/fm/Makefile.am +++ b/src/fm/Makefile.am @@ -20,7 +20,8 @@ noinst_HEADERS += \ src/fm/fmd/fm_cb.h \ src/fm/fmd/fm_evt.h \ src/fm/fmd/fm_mds.h \ - src/fm/fmd/fm_mem.h + src/fm/fmd/fm_mem.h \ + src/fm/fmd/tipc_server.h osaf_execbin_PROGRAMS += bin/osaffmd nodist_pkgclccli_SCRIPTS += \ @@ -44,7 +45,8 @@ bin_osaffmd_SOURCES = \ src/fm/fmd/fm_amf.cc \ src/fm/fmd/fm_main.cc \ src/fm/fmd/fm_mds.cc \ - src/fm/fmd/fm_rda.cc + src/fm/fmd/fm_rda.cc \ + src/fm/fmd/tipc_server.cc bin_osaffmd_LDADD = \ lib/libSaAmf.la \ diff --git a/src/fm/fmd/fm_amf.cc b/src/fm/fmd/fm_amf.cc index e99f3ba7e..8cf284f97 100644 --- a/src/fm/fmd/fm_amf.cc +++ b/src/fm/fmd/fm_amf.cc @@ -34,6 +34,12 @@ **/ #include "fm.h" +#include "tipc_server.h" + +namespace { +TIPCServer tipc_srv; +} + extern uint32_t gl_fm_hdl; uint32_t fm_amf_init(FM_AMF_CB *fm_amf_cb); @@ -151,6 +157,11 @@ void fm_saf_CSI_set_callback(SaInvocationT invocation, const SaNameT *compName, } else { fm_cb->amf_state = new_haState; fm_cb->csi_assigned = true; + if (new_haState == 
SA_AMF_HA_ACTIVE) { +tipc_srv.publish(); + } else { +tipc_srv.unpublish(); + } } error = saAmfResponse(fm_amf_cb->amf_hdl, invocation, error); } @@
[devel] [PATCH 0/1] Review Request for amfd: prevent infinite loop V3 [#3050]
Summary: amfd: prevent infinite loop [#3050] Review request for Ticket(s): 3050 Peer Reviewer(s): Hans, Minh Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-3050 Base revision: 68efc6010fda86d62300a687bbd8c52cba232479 Personal repository: git://git.code.sf.net/u/userid-2226215/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesy OpenSAF servicesn Core libraries n Samples n Tests n Other n Comments (indicate scope for each "y" above): - revision 67404028af391330860c8edb45fc0442fb90a283 Author: Gary Lee Date: Thu, 20 Jun 2019 12:07:57 +1000 amfd: prevent infinite loop [#3050] In handle_event_in_failover_state(), we iterate through queue_evt in a while loop, but process_event() can insert items into the queue inside the loop, and we may end up never exiting the while loop. Complete diffstat: -- src/amf/amfd/main.cc | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) Testing Commands: - See ticket Testing, Expected Results: -- amfd does not go into an infinite loop and get terminated by the watchdog Conditions of Submission: - ack from anyone Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. 
internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] amfd: prevent infinite loop [#3050]
In handle_event_in_failover_state(), we iterate through queue_evt in a while loop, but process_event() can insert items into the queue inside the loop, and we may end up never exiting the while loop.
---
 src/amf/amfd/main.cc | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/amf/amfd/main.cc b/src/amf/amfd/main.cc
index 50daa59..e3d0957 100644
--- a/src/amf/amfd/main.cc
+++ b/src/amf/amfd/main.cc
@@ -406,12 +406,18 @@ static void handle_event_in_failover_state(AVD_EVT *evt) {
     /* Dequeue, all the messages from the queue and process them now */
-
-    while (!cb->evt_queue.empty()) {
+    auto size_before_loop = cb->evt_queue.size();
+    std::queue<AVD_EVT_QUEUE *>::size_type count = 0;
+    while (count < size_before_loop) {
+      // note: process_event() may insert items into
+      // the queue, so terminate loop when we have
+      // processed all the original elements
+      // to avoid infinite loop
       AVD_EVT_QUEUE *queue_evt = cb->evt_queue.front();
       cb->evt_queue.pop();
       process_event(cb, queue_evt->evt);
       delete queue_evt;
+      ++count;
     }

     /* Walk through all the nodes to check if any of the nodes state is
--
2.7.4

___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
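The technique in this patch is to drain only the elements that were in the queue when the loop started, so that work enqueued by the handler is left for a later pass. A self-contained sketch of that idea follows. `Event`, `ProcessEvent` and `DrainSnapshot` are hypothetical stand-ins for this example and are not the amfd types or functions.

```cpp
#include <cstddef>
#include <queue>

// Hypothetical stand-in for an amfd event.
struct Event {
  int id;
  bool requeue;  // true if handling this event enqueues a follow-up event
};

// Hypothetical handler: it may push new work onto the queue while it runs,
// which is what lets a `while (!queue.empty())` loop run forever.
inline void ProcessEvent(std::queue<Event>& queue, const Event& evt) {
  if (evt.requeue) {
    queue.push(Event{evt.id + 100, true});
  }
}

// Processes only the events that were queued when the call started and
// returns how many were processed. Events added by ProcessEvent() stay in
// the queue for the next call.
inline std::size_t DrainSnapshot(std::queue<Event>& queue) {
  const std::size_t size_before_loop = queue.size();
  std::size_t count = 0;
  while (count < size_before_loop) {
    // Each iteration removes exactly one element and the handler only adds
    // elements, so the queue cannot be empty here.
    const Event evt = queue.front();
    queue.pop();
    ProcessEvent(queue, evt);
    ++count;
  }
  return count;
}
```

Because the bound is an element count rather than an iterator or reference into the container, it stays valid no matter how the underlying container reallocates while events are pushed.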
Re: [devel] [PATCH 1/1] amfd: prevent infinite loop [#3050]
Hi I think I have to re-work this. 23.3.3.4 of the C++11 standard says: Effects: An erase operation that erases the last element of a deque invalidates only the past-the-end iterator and all iterators and references to the erased elements. So I've probably done the wrong thing here. On 19/6/19 1:24 pm, Gary Lee wrote: In handle_event_in_failover_state(), we iterate through queue_evt in a while loop, but process_event() can insert items into the queue inside the loop, and we may end up never exiting the while loop. --- src/amf/amfd/cb.h | 3 ++- src/amf/amfd/main.cc | 13 + src/amf/amfd/ndfsm.cc | 4 ++-- src/amf/amfd/ndproc.cc | 4 ++-- 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/amf/amfd/cb.h b/src/amf/amfd/cb.h index 89cf15d..4418db6 100644 --- a/src/amf/amfd/cb.h +++ b/src/amf/amfd/cb.h @@ -38,6 +38,7 @@ #include #include +#include #include #include #include @@ -166,7 +167,7 @@ typedef struct cl_cb_tag { std::queue nd_msg_queue_list{}; /* Event Queue to hold the events during fail-over */ - std::queue evt_queue{}; + std::deque evt_queue{}; /* * MBCSv related variables. */ diff --git a/src/amf/amfd/main.cc b/src/amf/amfd/main.cc index 50daa59..d22bcb6 100644 --- a/src/amf/amfd/main.cc +++ b/src/amf/amfd/main.cc @@ -395,7 +395,7 @@ static void handle_event_in_failover_state(AVD_EVT *evt) { /* Enqueue this event */ queue_evt = new AVD_EVT_QUEUE(); queue_evt->evt = evt; -cb->evt_queue.push(queue_evt); +cb->evt_queue.push_back(queue_evt); } std::map::const_iterator it = @@ -407,9 +407,14 @@ static void handle_event_in_failover_state(AVD_EVT *evt) { /* Dequeue, all the messages from the queue and process them now */ -while (!cb->evt_queue.empty()) { - AVD_EVT_QUEUE *queue_evt = cb->evt_queue.front(); - cb->evt_queue.pop(); +// get ref to end of queue, to make sure we don't get stuck +// iterating through the deque, as events may be added into +// evt_queue inside the loop (to be refactored?) 
+auto end_iter = cb->evt_queue.end(); +auto iter = cb->evt_queue.begin(); +while (iter != end_iter) { + AVD_EVT_QUEUE *queue_evt = *iter++; + cb->evt_queue.pop_front(); process_event(cb, queue_evt->evt); delete queue_evt; } diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc index 8c8f3c5..b763c79 100644 --- a/src/amf/amfd/ndfsm.cc +++ b/src/amf/amfd/ndfsm.cc @@ -69,7 +69,7 @@ void avd_process_state_info_queue(AVD_CL_CB *cb) { for (i = 0; i < queue_size; i++) { queue_evt = cb->evt_queue.front(); osafassert(queue_evt->evt); -cb->evt_queue.pop(); +cb->evt_queue.pop_front(); TRACE("rcv_evt: %u", queue_evt->evt->rcv_evt); @@ -95,7 +95,7 @@ void avd_process_state_info_queue(AVD_CL_CB *cb) { delete queue_evt->evt; delete queue_evt; } else { - cb->evt_queue.push(queue_evt); + cb->evt_queue.push_back(queue_evt); } } diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc index 5f5cbcd..433b00a 100644 --- a/src/amf/amfd/ndproc.cc +++ b/src/amf/amfd/ndproc.cc @@ -350,7 +350,7 @@ void avd_nd_sisu_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; -cb->evt_queue.push(state_info_evt); +cb->evt_queue.push_back(state_info_evt); } else { LOG_WA( "Ignore this sisu_state_info message since node sync window has closed"); @@ -392,7 +392,7 @@ void avd_nd_compcsi_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; -cb->evt_queue.push(state_info_evt); +cb->evt_queue.push_back(state_info_evt); } else { LOG_WA( "Ignore this compcsi_state_info message since node sync window has closed"); ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
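A note on why the cached `end()` iterator in the quoted patch is a problem: an insertion at either end of a `std::deque` invalidates all iterators to that deque, so `end_iter` can no longer be relied on once `process_event()` calls `push_back()`. One way to avoid holding iterators across that call is to move the pending events into a local container first. This is only a sketch of an alternative, not the fix that was submitted; `PendingEvent` and `drain_by_swap` are hypothetical names.

```cpp
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical stand-in for the queued event type.
struct PendingEvent {
  int id;
};

// Move everything currently pending into a local batch, then process the
// batch. Events queued by the handler land in `pending`, which is a
// different container, so no iterator into `batch` is invalidated.
inline std::size_t drain_by_swap(
    std::deque<PendingEvent> &pending,
    const std::function<void(std::deque<PendingEvent> &,
                             const PendingEvent &)> &process) {
  std::deque<PendingEvent> batch;
  batch.swap(pending);  // pending is now empty; batch owns the old events
  for (const PendingEvent &evt : batch) {
    process(pending, evt);  // may push_back into pending
  }
  return batch.size();
}
```

The trade-off compared with a counter-based loop is one extra container per call, in exchange for never touching the shared queue while it is being iterated.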
[devel] [PATCH 0/1] Review Request for amfd: prevent infinite loop V2 [#3050]
Summary: amfd: prevent infinite loop [#3050]
Review request for Ticket(s): 3050
Peer Reviewer(s): Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3050
Base revision: 68efc6010fda86d62300a687bbd8c52cba232479
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area          Impact y/n
---------------------------------
Docs                   n
Build system           n
RPM/packaging          n
Configuration files    n
Startup scripts        n
SAF services           y
OpenSAF services       n
Core libraries         n
Samples                n
Tests                  n
Other                  n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 7455f8da651fb14838140da1c80fe0bf2db443fd
Author: Gary Lee
Date: Wed, 19 Jun 2019 13:12:35 +1000

amfd: prevent infinite loop [#3050]

In handle_event_in_failover_state(), we iterate through queue_evt in a while loop, but process_event() can insert items into the queue inside the loop, and we may end up never exiting the while loop.

Complete diffstat:
------------------
 src/amf/amfd/cb.h      |  3 ++-
 src/amf/amfd/main.cc   | 13 +++++++++----
 src/amf/amfd/ndfsm.cc  |  4 ++--
 src/amf/amfd/ndproc.cc |  4 ++--
 4 files changed, 15 insertions(+), 9 deletions(-)

Testing Commands:
-----------------
See ticket for reproduction steps.

Testing, Expected Results:
--------------------------
amfd does not go into an infinite loop and get terminated by the watchdog

Conditions of Submission:
-------------------------
ack from reviewer

Arch       Built  Started  Linux distro
---------------------------------------
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect the IPC mechanism, and you don't present any results for an in-service upgradability test.
___ Your changes affect the user manual and documentation, but your patch series does not contain the patch that updates the Doxygen manual.
[devel] [PATCH 1/1] amfd: prevent infinite loop [#3050]
In handle_event_in_failover_state(), we iterate through queue_evt in a while loop, but process_event() can insert items into the queue inside the loop, and we may end up never exiting the while loop. --- src/amf/amfd/cb.h | 3 ++- src/amf/amfd/main.cc | 13 + src/amf/amfd/ndfsm.cc | 4 ++-- src/amf/amfd/ndproc.cc | 4 ++-- 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/amf/amfd/cb.h b/src/amf/amfd/cb.h index 89cf15d..4418db6 100644 --- a/src/amf/amfd/cb.h +++ b/src/amf/amfd/cb.h @@ -38,6 +38,7 @@ #include #include +#include #include #include #include @@ -166,7 +167,7 @@ typedef struct cl_cb_tag { std::queue nd_msg_queue_list{}; /* Event Queue to hold the events during fail-over */ - std::queue evt_queue{}; + std::deque evt_queue{}; /* * MBCSv related variables. */ diff --git a/src/amf/amfd/main.cc b/src/amf/amfd/main.cc index 50daa59..d22bcb6 100644 --- a/src/amf/amfd/main.cc +++ b/src/amf/amfd/main.cc @@ -395,7 +395,7 @@ static void handle_event_in_failover_state(AVD_EVT *evt) { /* Enqueue this event */ queue_evt = new AVD_EVT_QUEUE(); queue_evt->evt = evt; -cb->evt_queue.push(queue_evt); +cb->evt_queue.push_back(queue_evt); } std::map::const_iterator it = @@ -407,9 +407,14 @@ static void handle_event_in_failover_state(AVD_EVT *evt) { /* Dequeue, all the messages from the queue and process them now */ -while (!cb->evt_queue.empty()) { - AVD_EVT_QUEUE *queue_evt = cb->evt_queue.front(); - cb->evt_queue.pop(); +// get ref to end of queue, to make sure we don't get stuck +// iterating through the deque, as events may be added into +// evt_queue inside the loop (to be refactored?) 
+auto end_iter = cb->evt_queue.end(); +auto iter = cb->evt_queue.begin(); +while (iter != end_iter) { + AVD_EVT_QUEUE *queue_evt = *iter++; + cb->evt_queue.pop_front(); process_event(cb, queue_evt->evt); delete queue_evt; } diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc index 8c8f3c5..b763c79 100644 --- a/src/amf/amfd/ndfsm.cc +++ b/src/amf/amfd/ndfsm.cc @@ -69,7 +69,7 @@ void avd_process_state_info_queue(AVD_CL_CB *cb) { for (i = 0; i < queue_size; i++) { queue_evt = cb->evt_queue.front(); osafassert(queue_evt->evt); -cb->evt_queue.pop(); +cb->evt_queue.pop_front(); TRACE("rcv_evt: %u", queue_evt->evt->rcv_evt); @@ -95,7 +95,7 @@ void avd_process_state_info_queue(AVD_CL_CB *cb) { delete queue_evt->evt; delete queue_evt; } else { - cb->evt_queue.push(queue_evt); + cb->evt_queue.push_back(queue_evt); } } diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc index 5f5cbcd..433b00a 100644 --- a/src/amf/amfd/ndproc.cc +++ b/src/amf/amfd/ndproc.cc @@ -350,7 +350,7 @@ void avd_nd_sisu_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; -cb->evt_queue.push(state_info_evt); +cb->evt_queue.push_back(state_info_evt); } else { LOG_WA( "Ignore this sisu_state_info message since node sync window has closed"); @@ -392,7 +392,7 @@ void avd_nd_compcsi_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; -cb->evt_queue.push(state_info_evt); +cb->evt_queue.push_back(state_info_evt); } else { LOG_WA( "Ignore this compcsi_state_info message since node sync window has closed"); -- 2.7.4 ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] amfd: do not queue sync messages from 'lost' nodes [#3050]
Hi Minh

On 11/6/19 10:33 am, Minh Hon Chau wrote:
> Hi Gary,
>
> Those variables, e.g. node_sync_window_closed, have been used before the headless sync completes. If there is a failover during the headless sync, the new active will start the headless sync again, so those variables have not needed to be checkpointed. But here the scenario happens in split brain, in which the new active is in a separated network instead of coming from headless, so I guess we do need to checkpoint it, but should the checkpoint be done after the headless sync?

I will checkpoint node_sync_window_closed in a new version. As you pointed out, using the timer alone isn't sufficient, as sync messages could arrive before the active controller's amfnd has sent node_up (which is what starts the timer).

> And the change in timer.h does not seem closely related to this ticket?

The values in the timer structure aren't initialized at startup, so fields like is_active have indeterminate values. It would be good just to set them to known values.

Thanks
Gary
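The point about the timer structure can be illustrated with default member initializers. In C++, scalar members of a default-initialized object with automatic or dynamic storage duration hold indeterminate values unless the type gives them initializers. The sketch below uses a hypothetical `DemoTimer`, not the real `AVD_TMR`, to show the effect of adding `{}` to each member.

```cpp
#include <cstdint>
#include <string>

// Hypothetical timer record; field names only loosely mirror the thread.
struct DemoTimer {
  std::uint32_t node_id{};    // value-initialized to 0
  std::string dep_si_name{};  // empty string
  bool is_active{};           // false
};

// A plain local declaration now yields known values for every member,
// because default initialization applies the member initializers.
inline DemoTimer make_timer() {
  DemoTimer tmr;
  return tmr;
}
```

Without the `{}` initializers, reading `tmr.is_active` in `make_timer()` would be reading an indeterminate value, which matches the "random values" described above.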
Re: [devel] [PATCH 1/1] amfd: disallow delete of CtCs object if Ct maps to comp [#3028]
Hi Phuc

Some comments below.

Thanks
Gary

On 23/5/19 4:48 pm, phuc.h.chau wrote:

Amfd crashes when su is unlocked. The reason for the crash is that in the function avd_snd_susi_msg(), get_comp_capability() is called with csi and comp as input parameters. In get_comp_capability(), there is no CtCs object available, so ctcstype_db->find returns NULL to ctcs_type. While accessing ctcs_type->saAmfCtCompCapability, amfd crashes because ctcs_type is NULL.
---
 src/amf/amfd/ctcstype.cc | 65 +++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/src/amf/amfd/ctcstype.cc b/src/amf/amfd/ctcstype.cc
index 5dffdae..3f30ebc 100644
--- a/src/amf/amfd/ctcstype.cc
+++ b/src/amf/amfd/ctcstype.cc
@@ -28,6 +28,10 @@
 AmfDb *ctcstype_db = nullptr;
+static void find_ct_name_from_association(const std::string& haystack,
+ std::string *dn,
+ const char *needle);
+
 static void ctcstype_db_add(AVD_CTCS_TYPE *ctcstype) {
 unsigned int rc = ctcstype_db->insert(ctcstype->name, ctcstype);
 osafassert(rc == NCSCC_RC_SUCCESS);
@@ -187,16 +191,75 @@ static SaAisErrorT ctcstype_ccb_completed_cb(CcbUtilOperationData_t *opdata) {
 opdata, "Modification of SaAmfCtCsType not supported");
 break;
 case CCBUTIL_DELETE:
+ AVD_CTCS_TYPE *ctcstype;
+ AVD_COMP_TYPE *comp_type;
+ AVD_COMP *comp;
+ CcbUtilOperationData_t *t_opData;
+
+ ctcstype = ctcstype_db->find(Amf::to_string(&opdata->objectName));
+ if (ctcstype != nullptr) {
+std::string ct_name;
+find_ct_name_from_association(Amf::to_string(&opdata->objectName),
+ &ct_name, ",safVersion");
+TRACE("'%s'", ct_name.c_str());
+comp_type = comptype_db->find(ct_name);
+if ((comp_type) && (nullptr != comp_type->list_of_comp)) {
+ /* check whether there exists a delete operation for
+ * each of the Comp in the comp_type list in the current CCB
+ */
+ bool comp_exist = false;
+ TRACE("SaAmfCompType '%s' has components", comp_type->name.c_str());
+ comp = comp_type->list_of_comp;
+ while (comp != nullptr) {
+TRACE("%s", osaf_extended_name_borrow(&comp->comp_info.name));
+t_opData = ccbutil_getCcbOpDataByDN(opdata->ccbId,
+&comp->comp_info.name);
+TRACE("%p", t_opData);
+if ((t_opData == nullptr) ||
+(t_opData->operationType != CCBUTIL_DELETE)) {
+ TRACE("Here %p", t_opData);

[Gary] Maybe replace "Here" with a more useful description.

+ comp_exist = true;
+ break;
+}
+comp = comp->comp_type_list_comp_next;
+ }
+ if (comp_exist == true) {
+rc = SA_AIS_ERR_BAD_OPERATION;
+report_ccb_validation_error(opdata, "SaAmfCompType '%s' is in use",
+comp_type->name.c_str());
+goto done;
+ }
+} else {
+TRACE("SaAmfCompType '%p'. SaAmfCompType '%s' has no components",
+ comp_type, ct_name.c_str());
+ }
+ }
 rc = SA_AIS_OK;
 break;
 default:
 osafassert(0);
 break;
 }
-
+done:
 TRACE_LEAVE2("%u", rc);
 return rc;
 }

[Gary] avsv_sanamet_init() should already do what you need below.

+/**
+* Initialize a DN by searching for needle in haystack
+* where two times safVersion comes.
+* @param haystack
+* @param dn
+* @param needle
+* @note: "safSupportedCsType=safVersion=1\,
+* safCSType=AmfDemo1,safVersion=1,safCompType=AmfDemo1"
+*/
+static void find_ct_name_from_association(const std::string& haystack,
+ std::string *dn,
+ const char *needle) {
+ std::string::size_type pos = haystack.find(needle);
+ *dn = haystack.substr(pos + 1);
+ TRACE("dn %s", (*dn).c_str());
+}
 static void ctcstype_ccb_apply_cb(CcbUtilOperationData_t *opdata) {
 AVD_CTCS_TYPE *ctcstype;
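One detail in the quoted helper is worth spelling out: if `needle` is not found, `find()` returns `std::string::npos`, `pos + 1` wraps around to 0, and `substr(0)` silently returns the whole input. The sketch below shows the same extraction with that case handled explicitly. `extract_parent_dn` is a hypothetical name and signature, not an OpenSAF function, and it is not meant to replace the existing helper Gary refers to.

```cpp
#include <string>

// Hypothetical helper: copy everything after the first character of the
// first occurrence of `needle` into *dn. Returns false and leaves *dn
// untouched when `needle` does not occur in `haystack`.
inline bool extract_parent_dn(const std::string &haystack,
                              const std::string &needle, std::string *dn) {
  const std::string::size_type pos = haystack.find(needle);
  if (pos == std::string::npos) {
    return false;  // avoid npos + 1 wrapping around to 0
  }
  *dn = haystack.substr(pos + 1);  // skip the leading ',' of the needle
  return true;
}
```

For the DN given in the patch comment and the needle `",safVersion"`, this yields `safVersion=1,safCompType=AmfDemo1`, because the escaped `\,` in the first RDN is followed by `safCSType`, not `safVersion`.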
Re: [devel] [PATCH 1/1] amfnd: fix error reading from deallocated memory [#2568]
Hi Thanh I will push on your behalf. Thanks Gary On 5/6/19 12:29 pm, Thanh Nguyen wrote: Invalid read is from the following - avnd_evt_mds_ava_dn_evh() (amf/amfnd/comp.cc) - avsv_create_association_class_dn() (amf/common/util.c) Other changes are to fix cppcheck error report --- src/amf/amfnd/comp.cc | 17 + src/amf/common/util.c | 6 +++--- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/src/amf/amfnd/comp.cc b/src/amf/amfnd/comp.cc index 38b9224..857c1dc 100644 --- a/src/amf/amfnd/comp.cc +++ b/src/amf/amfnd/comp.cc @@ -428,8 +428,10 @@ uint32_t avnd_evt_mds_ava_dn_evh(AVND_CB *cb, AVND_EVT *evt) { entry from the cbk list and delete the cbq */ m_AVND_COMP_CBQ_INV_GET(comp, comp->term_cbq_inv_value, cbk_rec); comp->term_cbq_inv_value = 0; + uint32_t opq_hdl = 0; + if (cbk_rec) opq_hdl = cbk_rec->opq_hdl; rc = avnd_comp_clc_fsm_run(cb, comp, AVND_COMP_CLC_PRES_FSM_EV_TERM_SUCC); - if (cbk_rec) avnd_comp_cbq_rec_pop_and_del(cb, comp, cbk_rec->opq_hdl, false); + if (cbk_rec) avnd_comp_cbq_rec_pop_and_del(cb, comp, opq_hdl, false); goto done; } /* found the matching comp; trigger error processing */ @@ -2228,9 +2230,7 @@ uint32_t avnd_amf_resp_send(AVND_CB *cb, AVSV_AMF_API_TYPE type, AVND_MSG msg; AVSV_ND2ND_AVND_MSG *avnd_msg; uint32_t rc = NCSCC_RC_SUCCESS; - MDS_DEST i_to_dest; AVSV_NDA_AVA_MSG *temp_ptr = nullptr; - NODE_ID node_id = 0; MDS_SYNC_SND_CTXT temp_ctxt; TRACE_ENTER(); @@ -2267,8 +2267,8 @@ uint32_t avnd_amf_resp_send(AVND_CB *cb, AVSV_AMF_API_TYPE type, msg.info.avnd->type = AVND_AVND_AVA_MSG; msg.type = AVND_MSG_AVND; /* Send it to AvND */ -node_id = m_NCS_NODE_ID_FROM_MDS_DEST(*dest); -i_to_dest = avnd_get_mds_dest_from_nodeid(cb, node_id); +NODE_ID node_id = m_NCS_NODE_ID_FROM_MDS_DEST(*dest); +MDS_DEST i_to_dest = avnd_get_mds_dest_from_nodeid(cb, node_id); rc = avnd_avnd_mds_send(cb, i_to_dest, ); } else { /* now send the response */ @@ -2646,7 +2646,8 @@ void avnd_comp_cmplete_all_assignment(AVND_CB *cb, AVND_COMP *comp) { */ 
temp_csi = m_AVND_COMPDB_REC_CSI_GET_FIRST(*comp); -if (cbk->cbk_info->param.csi_set.ha != temp_csi->si->curr_state) { +if (temp_csi && + (cbk->cbk_info->param.csi_set.ha != temp_csi->si->curr_state)) { avnd_comp_cbq_rec_pop_and_del(cb, comp, cbk->opq_hdl, true); continue; } @@ -2788,7 +2789,7 @@ uint32_t comp_restart_initiate(AVND_COMP *comp) { rc = avnd_comp_curr_info_del(cb, it.second); if (NCSCC_RC_SUCCESS != rc) goto done; - // unregister the contained comp +// unregister the contained comp rc = avnd_comp_unregister_contained(cb, it.second); if (NCSCC_RC_SUCCESS != rc) goto done; @@ -2956,7 +2957,7 @@ void avnd_comp_pres_state_set(const AVND_CB *cb, AVND_COMP *comp, (SA_AMF_PRESENCE_ORPHANED == prv_st { if (cb->is_avd_down == false) { avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, saAmfCompPresenceState_ID, - comp->name.c_str(), comp->pres); + comp->name, comp->pres); } } diff --git a/src/amf/common/util.c b/src/amf/common/util.c index ec76c32..d17b766 100644 --- a/src/amf/common/util.c +++ b/src/amf/common/util.c @@ -240,12 +240,12 @@ void avsv_create_association_class_dn(const SaNameT *child_dn, } if (dn) { + TRACE("dn: %s", buf); osaf_extended_name_steal(buf, dn); } - TRACE_LEAVE2("child_dn: %s parent_dn: %s dn: %s", + TRACE_LEAVE2("child_dn: %s parent_dn: %s", child_dn_ptr ? child_dn_ptr : "no child dn", - parent_dn_ptr ? parent_dn_ptr : "no parent dn", - buf); + parent_dn_ptr ? parent_dn_ptr : "no parent dn"); } void avsv_sanamet_init_from_association_dn(const SaNameT *haystack, SaNameT *dn, ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
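The first hunk of this patch copies `opq_hdl` out of the callback record before the state machine runs, presumably because the record can be freed during that call. The pattern can be sketched in isolation as follows. All names here (`CallbackRecord`, `Component`, `run_fsm`, `terminate_component`) are hypothetical and only model the ordering, not the amfnd data structures.

```cpp
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical callback record holding a handle needed for later cleanup.
struct CallbackRecord {
  std::uint32_t opq_hdl;
};

struct Component {
  std::map<std::uint32_t, std::unique_ptr<CallbackRecord>> records;
};

// Stand-in for a state-machine step that may free callback records.
inline void run_fsm(Component &comp) { comp.records.clear(); }

// Returns the handle that the later cleanup step should use (0 if none).
inline std::uint32_t terminate_component(Component &comp, std::uint32_t key) {
  auto it = comp.records.find(key);
  const CallbackRecord *rec =
      (it != comp.records.end()) ? it->second.get() : nullptr;
  // Copy the field while the record is still alive.
  const std::uint32_t opq_hdl = rec ? rec->opq_hdl : 0;
  run_fsm(comp);  // rec may dangle from here on; it is not dereferenced again
  return opq_hdl;
}
```

The key point is that only the copied value is used after `run_fsm()`; the pointer itself may still be compared against null, as the patch does, but must not be dereferenced.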
[devel] [PATCH 0/1] Review Request for amfd: do not queue sync messages from 'lost' nodes [#3050]
Summary: amfd: do not queue sync messages from 'lost' nodes [#3050]
Review request for Ticket(s): 3050
Peer Reviewer(s): Hans, Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3050
Base revision: 135b0b8862da9a036553c5db02062edb278089aa
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area          Impact y/n
---------------------------------
Docs                   n
Build system           n
RPM/packaging          n
Configuration files    n
Startup scripts        n
SAF services           y
OpenSAF services       n
Core libraries         n
Samples                n
Tests                  n
Other                  n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 9d64d3c1d386f1019103d12588ab46fa830ee793
Author: Gary Lee
Date: Wed, 5 Jun 2019 13:49:45 +1000

amfd: do not queue sync messages from 'lost' nodes [#3050]

The 'lost' nodes will be rebooted, thus there is no need to queue sync messages from these nodes. In addition, node_sync_window_closed is not reliable as it's not checkpointed. We should remove all uses of it in another ticket? Instead, check if the timer is running.

Complete diffstat:
------------------
 src/amf/amfd/cb.h      |  2 ++
 src/amf/amfd/ndproc.cc | 30 ++++++++++++++++++++++--------
 src/amf/amfd/timer.h   | 12 ++++++------
 3 files changed, 30 insertions(+), 14 deletions(-)

Testing Commands:
-----------------
See ticket for reproduction steps.

Testing, Expected Results:
--------------------------
Sync messages should be discarded and not put back into the queue.

2019-06-05 12:52:31.833 SC-2 osafamfd[254]: NO Receive message with event type:12, msg_type:31, from node:2030f, msg_id:0
2019-06-05 12:52:31.834 SC-2 osafamfd[254]: WA sisu_state_info messages received from lost node (2030f)
2019-06-05 12:52:31.834 SC-2 osafamfd[254]: NO Receive message with event type:13, msg_type:32, from node:2030f, msg_id:0
2019-06-05 12:52:31.834 SC-2 osafamfd[254]: WA compcsi_state_info messages received from lost node (2030f)

Conditions of Submission:
-------------------------
Ack from anyone

Arch       Built  Started  Linux distro
---------------------------------------
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time, confusing the threaded patch review.
___ Your changes affect the IPC mechanism, and you don't present any results for an in-service upgradability test.
___ Your changes affect the user manual and documentation, but your patch series does not contain the patch that updates the Doxygen manual.
[devel] [PATCH 1/1] amfd: do not queue sync messages from 'lost' nodes [#3050]
The 'lost' nodes will be rebooted, thus there is no need to queue sync messages from these nodes. In addition, node_sync_window_closed is not reliable as it's not check pointed. We should remove all uses of it in another ticket? Instead, check if the timer is running. --- src/amf/amfd/cb.h | 2 ++ src/amf/amfd/ndproc.cc | 30 ++ src/amf/amfd/timer.h | 12 ++-- 3 files changed, 30 insertions(+), 14 deletions(-) diff --git a/src/amf/amfd/cb.h b/src/amf/amfd/cb.h index 89cf15d..8902d78 100644 --- a/src/amf/amfd/cb.h +++ b/src/amf/amfd/cb.h @@ -237,6 +237,8 @@ typedef struct cl_cb_tag { */ bool active_services_exist; bool all_nodes_synced; + // @todo this should be checkpointed to standby? otherwise + // after a controller failover, it will still be false? bool node_sync_window_closed; /* diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc index 5f5cbcd..20008d9 100644 --- a/src/amf/amfd/ndproc.cc +++ b/src/amf/amfd/ndproc.cc @@ -345,19 +345,26 @@ void avd_nd_sisu_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { evt->info.avnd_msg->msg_info.n2d_nd_sisu_state_info.node_id, evt->info.avnd_msg->msg_info.n2d_nd_sisu_state_info.msg_id); - if (cb->node_sync_window_closed == false) { + const SaClmNodeIdT node_id = +evt->info.avnd_msg->msg_info.n2d_nd_sisu_state_info.node_id; + + if (cb->failover_list.find(node_id) != cb->failover_list.end()) { +// ignore msg +LOG_WA("sisu_state_info messages received from lost node (%x)", + node_id); + } else if (cb->node_sync_tmr.is_active == true) { AVD_EVT_QUEUE *state_info_evt = new AVD_EVT_QUEUE(); state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; cb->evt_queue.push(state_info_evt); +return; } else { LOG_WA( -"Ignore this sisu_state_info message since node sync window has closed"); -avsv_dnd_msg_free(n2d_msg); + "Ignore this sisu_state_info message since node sync window has closed"); } - TRACE_LEAVE(); + avsv_dnd_msg_free(n2d_msg); } /* @@ -387,19 +394,26 
@@ void avd_nd_compcsi_state_info_evh(AVD_CL_CB *cb, AVD_EVT *evt) { evt->info.avnd_msg->msg_info.n2d_nd_csicomp_state_info.node_id, evt->info.avnd_msg->msg_info.n2d_nd_csicomp_state_info.msg_id); - if (cb->node_sync_window_closed == false) { + const SaClmNodeIdT node_id = +evt->info.avnd_msg->msg_info.n2d_nd_csicomp_state_info.node_id; + + if (cb->failover_list.find(node_id) != cb->failover_list.end()) { +// ignore msg +LOG_WA("compcsi_state_info messages received from lost node (%x)", + node_id); + } else if (cb->node_sync_tmr.is_active == true) { AVD_EVT_QUEUE *state_info_evt = new AVD_EVT_QUEUE(); state_info_evt->evt = new AVD_EVT{}; state_info_evt->evt->rcv_evt = evt->rcv_evt; state_info_evt->evt->info.avnd_msg = n2d_msg; cb->evt_queue.push(state_info_evt); +return; } else { LOG_WA( -"Ignore this compcsi_state_info message since node sync window has closed"); -avsv_dnd_msg_free(n2d_msg); + "Ignore this compcsi_state_info message since node sync window has closed"); } - TRACE_LEAVE(); + avsv_dnd_msg_free(n2d_msg); } /** diff --git a/src/amf/amfd/timer.h b/src/amf/amfd/timer.h index 5316879..6db04c7 100644 --- a/src/amf/amfd/timer.h +++ b/src/amf/amfd/timer.h @@ -52,12 +52,12 @@ typedef enum avd_tmr_type { /* AVD Timer definition */ typedef struct avd_tmr_tag { - tmr_t tmr_id; - AVD_TMR_TYPE type; - SaClmNodeIdT node_id; - std::string spons_si_name; - std::string dep_si_name; - bool is_active; + tmr_t tmr_id{}; + AVD_TMR_TYPE type{AVD_TMR_MAX}; + SaClmNodeIdT node_id{}; + std::string spons_si_name{}; + std::string dep_si_name{}; + bool is_active{}; } AVD_TMR; /* macro to start the cluster init timer. The cb structure -- 2.7.4 ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
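Both handlers changed by this patch apply the same three-way decision to an incoming sync message: drop it if the sender is in the failover list, queue it if the node sync timer is running, otherwise drop it because the sync window has closed. A minimal sketch of that decision is shown below. The container type of `failover_list` is not visible in the patch, so `std::set` is an assumption, and `SyncAction` and `classify_sync_msg` are hypothetical names.

```cpp
#include <cstdint>
#include <set>

// Hypothetical outcome of triaging one state-info sync message.
enum class SyncAction { kDropLostNode, kQueue, kDropWindowClosed };

// Decide what to do with a sync message from `node_id`.
// `failover_nodes` models the set of 'lost' nodes (container type assumed);
// `sync_timer_active` models cb->node_sync_tmr.is_active.
inline SyncAction classify_sync_msg(
    const std::set<std::uint32_t> &failover_nodes, bool sync_timer_active,
    std::uint32_t node_id) {
  if (failover_nodes.find(node_id) != failover_nodes.end()) {
    return SyncAction::kDropLostNode;  // node will be rebooted anyway
  }
  if (sync_timer_active) {
    return SyncAction::kQueue;  // still inside the sync window
  }
  return SyncAction::kDropWindowClosed;
}
```

In the patch itself the queue branch returns early, so only the two drop branches fall through to `avsv_dnd_msg_free(n2d_msg)`; the sketch leaves message ownership out and models only the branch selection.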
Re: [devel] [PATCH 1/1] amfnd: fix error reading from deallocated memory [#2568]
Hi Thanh ack (review only). Thanks On 4/6/19 8:48 am, Thanh Nguyen wrote: Invalid read is from the following - avnd_evt_mds_ava_dn_evh() (amf/amfnd/comp.cc) - avsv_create_association_class_dn() (amf/common/util.c) Other changes are to fix cppcheck error report --- src/amf/amfnd/comp.cc | 16 src/amf/common/util.c | 6 +++--- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/src/amf/amfnd/comp.cc b/src/amf/amfnd/comp.cc index 38b9224..facbace 100644 --- a/src/amf/amfnd/comp.cc +++ b/src/amf/amfnd/comp.cc @@ -428,8 +428,9 @@ uint32_t avnd_evt_mds_ava_dn_evh(AVND_CB *cb, AVND_EVT *evt) { entry from the cbk list and delete the cbq */ m_AVND_COMP_CBQ_INV_GET(comp, comp->term_cbq_inv_value, cbk_rec); comp->term_cbq_inv_value = 0; + uint32_t opq_hdl = cbk_rec? cbk_rec->opq_hdl: 0; rc = avnd_comp_clc_fsm_run(cb, comp, AVND_COMP_CLC_PRES_FSM_EV_TERM_SUCC); - if (cbk_rec) avnd_comp_cbq_rec_pop_and_del(cb, comp, cbk_rec->opq_hdl, false); + if (cbk_rec) avnd_comp_cbq_rec_pop_and_del(cb, comp, opq_hdl, false); goto done; } /* found the matching comp; trigger error processing */ @@ -2228,9 +2229,7 @@ uint32_t avnd_amf_resp_send(AVND_CB *cb, AVSV_AMF_API_TYPE type, AVND_MSG msg; AVSV_ND2ND_AVND_MSG *avnd_msg; uint32_t rc = NCSCC_RC_SUCCESS; - MDS_DEST i_to_dest; AVSV_NDA_AVA_MSG *temp_ptr = nullptr; - NODE_ID node_id = 0; MDS_SYNC_SND_CTXT temp_ctxt; TRACE_ENTER(); @@ -2267,8 +2266,8 @@ uint32_t avnd_amf_resp_send(AVND_CB *cb, AVSV_AMF_API_TYPE type, msg.info.avnd->type = AVND_AVND_AVA_MSG; msg.type = AVND_MSG_AVND; /* Send it to AvND */ -node_id = m_NCS_NODE_ID_FROM_MDS_DEST(*dest); -i_to_dest = avnd_get_mds_dest_from_nodeid(cb, node_id); +NODE_ID node_id = m_NCS_NODE_ID_FROM_MDS_DEST(*dest); +MDS_DEST i_to_dest = avnd_get_mds_dest_from_nodeid(cb, node_id); rc = avnd_avnd_mds_send(cb, i_to_dest, ); } else { /* now send the response */ @@ -2646,7 +2645,8 @@ void avnd_comp_cmplete_all_assignment(AVND_CB *cb, AVND_COMP *comp) { */ temp_csi = 
m_AVND_COMPDB_REC_CSI_GET_FIRST(*comp); -if (cbk->cbk_info->param.csi_set.ha != temp_csi->si->curr_state) { +if (temp_csi && + (cbk->cbk_info->param.csi_set.ha != temp_csi->si->curr_state)) { avnd_comp_cbq_rec_pop_and_del(cb, comp, cbk->opq_hdl, true); continue; } @@ -2788,7 +2788,7 @@ uint32_t comp_restart_initiate(AVND_COMP *comp) { rc = avnd_comp_curr_info_del(cb, it.second); if (NCSCC_RC_SUCCESS != rc) goto done; - // unregister the contained comp +// unregister the contained comp rc = avnd_comp_unregister_contained(cb, it.second); if (NCSCC_RC_SUCCESS != rc) goto done; @@ -2956,7 +2956,7 @@ void avnd_comp_pres_state_set(const AVND_CB *cb, AVND_COMP *comp, (SA_AMF_PRESENCE_ORPHANED == prv_st { if (cb->is_avd_down == false) { avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, saAmfCompPresenceState_ID, - comp->name.c_str(), comp->pres); + comp->name, comp->pres); } } diff --git a/src/amf/common/util.c b/src/amf/common/util.c index ec76c32..d17b766 100644 --- a/src/amf/common/util.c +++ b/src/amf/common/util.c @@ -240,12 +240,12 @@ void avsv_create_association_class_dn(const SaNameT *child_dn, } if (dn) { + TRACE("dn: %s", buf); osaf_extended_name_steal(buf, dn); } - TRACE_LEAVE2("child_dn: %s parent_dn: %s dn: %s", + TRACE_LEAVE2("child_dn: %s parent_dn: %s", child_dn_ptr ? child_dn_ptr : "no child dn", - parent_dn_ptr ? parent_dn_ptr : "no parent dn", - buf); + parent_dn_ptr ? parent_dn_ptr : "no parent dn"); } void avsv_sanamet_init_from_association_dn(const SaNameT *haystack, SaNameT *dn, ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] mds: use new TIPC getsockopt to log receive queue utilization [#3038]
Hi Hans ack (review only) Thanks On 20/5/19 10:27 pm, Hans Nordebäck wrote: --- 00-README.conf | 14 +++ src/base/Makefile.am | 1 + src/base/statistics.h| 88 + src/mds/Makefile.am | 8 +- src/mds/mds_dt_tipc.c| 3 + src/mds/mds_tipc_recvq_stats.cc | 29 + src/mds/mds_tipc_recvq_stats.h | 32 + src/mds/mds_tipc_recvq_stats_impl.cc | 178 +++ src/mds/mds_tipc_recvq_stats_impl.h | 39 ++ 9 files changed, 390 insertions(+), 2 deletions(-) create mode 100644 src/base/statistics.h create mode 100644 src/mds/mds_tipc_recvq_stats.cc create mode 100644 src/mds/mds_tipc_recvq_stats.h create mode 100644 src/mds/mds_tipc_recvq_stats_impl.cc create mode 100644 src/mds/mds_tipc_recvq_stats_impl.h diff --git a/00-README.conf b/00-README.conf index 8f20e5209..da1825f06 100644 --- a/00-README.conf +++ b/00-README.conf @@ -737,3 +737,17 @@ initiate a 'self-fencing' by rebooting the node, if it determines the node should no longer be active according to the consensus service, to prevent a split-brain situation. +TIPC receive queue utilization +== + +If setting the environment variable MDS_RECVQ_STATS_LOG_FREQ_SEC in a service config +file enables TIPC receive queue utilisation statistics. The argument is how often the +statistics will be written to syslog. 
+ +Example amfd.conf: + +export MDS_RECVQ_STATS_LOG_FREQ_SEC=5 + +then every 5 seconds a log record is written: + +May 20 12:23:30 SC-1 local0.notice osafamfd[545]: NO TIPC receive queue utilization (in %): min: 3.86 max: 4.38 mean: 4.15 std dev: 0.18 diff --git a/src/base/Makefile.am b/src/base/Makefile.am index ce93562e5..025fb86a2 100644 --- a/src/base/Makefile.am +++ b/src/base/Makefile.am @@ -157,6 +157,7 @@ noinst_HEADERS += \ src/base/saf_error.h \ src/base/saf_mem.h \ src/base/sprr_dl_api.h \ + src/base/statistics.h \ src/base/string_parse.h \ src/base/sysf_exc_scr.h \ src/base/sysf_ipc.h \ diff --git a/src/base/statistics.h b/src/base/statistics.h new file mode 100644 index 0..9ce980fc1 --- /dev/null +++ b/src/base/statistics.h @@ -0,0 +1,88 @@ +/* -*- OpenSAF -*- + * + * (C) Copyright 2019 The OpenSAF Foundation + * Copyright Ericsson AB 2019 - All Rights Reserved. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed + * under the GNU Lesser General Public License Version 2.1, February 1999. + * The complete license can be accessed from the following location: + * http://opensource.org/licenses/lgpl-license.php + * See the Copying file included with the OpenSAF distribution for full + * licensing terms. + * + * Author(s): Ericsson AB + * + */ + +#ifndef STATISTICS_H_ +#define STATISTICS_H_ + +#include + +namespace base { + +class Statistics { + public: + void clear() { +n_ = 0; + } + + void push(double x) { +n_++; + +// See Knuth, Art Of Computer Programming, Volume 2. The Seminumerical Algorithms, 4.2.2. 
Accuracy of Floating Point Arithmetic, +// using the recurrence formulas: +// M1 = x1, Mk = Mk-1 + (xk - Mk-1) / k (15) +// S1 = 0, Sk = Sk-1 + (xk - Mk-1) * (xk - Mk) (16) +// for 2 <= k <= n, sqrt(Sn/(n-1) +if (n_ == 1) { + prev_m_ = current_m_ = x; + prev_s_ = 0; + min_ = x; + max_ = x; +} else { + current_m_ = prev_m_ + (x - prev_m_) / n_; + current_s_ = prev_s_ + (x - prev_m_) * (x - current_m_); + + if (x > max_) max_ = x; + if (x < min_) min_ = x; + prev_m_ = current_m_; + prev_s_ = current_s_; +} + } + + double mean() const { +return (n_ > 0) ? current_m_ : 0; + } + + double variance() const { +return (n_ > 1) ? current_s_ / (n_ - 1) : 0; + } + + double std_dev() const { +return sqrt(variance()); + } + + double min() const { +return min_; + } + double max() const { +return max_; + } + + private: + int n_{0}; + double prev_m_{0}; + double current_m_{0}; + double prev_s_{0}; + double current_s_{0}; + double min_{0}; + double max_{0}; +}; + +} // namespace base + +#endif // STATISTICS_H_ + diff --git a/src/mds/Makefile.am b/src/mds/Makefile.am index 3724d2ea8..2d7b652e9 100644 --- a/src/mds/Makefile.am +++ b/src/mds/Makefile.am @@ -46,8 +46,12 @@ lib_libopensaf_core_la_SOURCES += \ src/mds/ncs_vda.c if ENABLE_TIPC_TRANSPORT -noinst_HEADERS += src/mds/mds_dt_tipc.h -lib_libopensaf_core_la_SOURCES += src/mds/mds_dt_tipc.c +noinst_HEADERS += src/mds/mds_dt_tipc.h \ + src/mds/mds_tipc_recvq_stats.h \ + src/mds/mds_tipc_recvq_stats_impl.h +lib_libopensaf_core_la_SOURCES += src/mds/mds_dt_tipc.c \ + src/mds/mds_tipc_recvq_stats.cc
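The recurrence quoted in the statistics.h comment above (Knuth's formulas 15 and 16) updates the mean and the sum of squared deviations one sample at a time, so no sample history has to be stored. The following is a minimal standalone sketch of that recurrence, not the code from the patch; the class name RunningStats and its method names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical standalone restatement of the recurrence quoted in the patch:
//   M1 = x1, Mk = Mk-1 + (xk - Mk-1) / k
//   S1 = 0,  Sk = Sk-1 + (xk - Mk-1) * (xk - Mk)
// The sample standard deviation is sqrt(Sn / (n - 1)).
class RunningStats {
 public:
  void Push(double x) {
    ++n_;
    if (n_ == 1) {
      // First sample: it is the mean, the minimum and the maximum.
      mean_ = min_ = max_ = x;
      s_ = 0.0;
      return;
    }
    const double prev_mean = mean_;
    mean_ = prev_mean + (x - prev_mean) / n_;
    s_ += (x - prev_mean) * (x - mean_);
    if (x < min_) min_ = x;
    if (x > max_) max_ = x;
  }

  int Count() const { return n_; }
  double Mean() const { return n_ > 0 ? mean_ : 0.0; }
  double Variance() const { return n_ > 1 ? s_ / (n_ - 1) : 0.0; }
  double StdDev() const { return std::sqrt(Variance()); }
  double Min() const { return min_; }
  double Max() const { return max_; }

 private:
  int n_ = 0;
  double mean_ = 0.0;
  double s_ = 0.0;
  double min_ = 0.0;
  double max_ = 0.0;
};

// Usage check: the samples 1, 2, 3 have mean 2 and sample variance 1.
void ExampleRunningStats() {
  const double samples[] = {1.0, 2.0, 3.0};
  RunningStats stats;
  for (double x : samples) stats.Push(x);
  assert(std::fabs(stats.Mean() - 2.0) < 1e-12);
  assert(std::fabs(stats.Variance() - 1.0) < 1e-12);
}
```

Keeping only the running mean and the running sum avoids the cancellation that the naive "sum of squares minus square of sum" formula can suffer when the samples are close together, as receive queue percentages tend to be.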
[devel] [PATCH 0/1] Review Request for rded: improve self-fencing response time [#3039]
Summary: rded: improve self-fencing response time [#3039]
Review request for Ticket(s): 3039
Peer Reviewer(s): Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3039
Base revision: 1bff38564b69175fa4a0ea2cb1d40bd432581bd6
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
Docs                 n
Build system         n
RPM/packaging        n
Configuration files  n
Startup scripts      n
SAF services         n
OpenSAF services     y
Core libraries       n
Samples              n
Tests                n
Other                n

Comments (indicate scope for each "y" above):

revision f8b4a473feafd23ce9d130a8ad245c5da75ab9b4
Author: Gary Lee
Date: Mon, 27 May 2019 09:54:40 +1000

rded: improve self-fencing response time [#3039]

When connectivity to consensus service is lost, it is recorded in a state
variable. When all RDE peers are lost, the node will now self-fence
immediately.

Complete diffstat:
 src/rde/rded/rde_cb.h    |  5 +
 src/rde/rded/rde_main.cc | 18 --
 src/rde/rded/role.cc     | 24
 src/rde/rded/role.h      |  3 +++
 4 files changed, 48 insertions(+), 2 deletions(-)

Testing Commands:
 'export FMS_RELAXED_NODE_PROMOTION=1' in fmd.conf
 Block cluster from accessing consensus service
 Reboot standby SC

Testing, Expected Results:
 Active SC should self-fence immediately after noticing peer RDE is down

Conditions of Submission:
 ack, or in 7 days

Arch       Built  Started  Linux distro
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n

Reviewer Checklist:
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.
___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)
___ Your computer has a badly configured date and time; confusing the
    threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.
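The behaviour summarised in the commit message above (remember that the consensus service was lost, then self-fence once the last RDE peer disappears) can be restated as a pure predicate. This is only a sketch based on the conditions visible in the patch posted later in this thread; the FencingInputs struct and the function name are hypothetical, and the real code queries the Consensus and Role objects instead of taking a snapshot.

```cpp
#include <cassert>

enum class ConsensusState { kUnknown = 0, kConnected, kDisconnected };

// Hypothetical snapshot of the values the controller-down handler consults.
struct FencingInputs {
  bool is_active;               // this node currently holds the active role
  bool consensus_enabled;       // a consensus service is configured
  bool relaxed_node_promotion;  // relaxed node promotion is enabled
  bool peer_present;            // at least one peer RDE is still visible
  bool remote_fencing_enabled;  // another node is expected to fence us
  ConsensusState consensus_state;
};

// True when the active node should reboot itself right after losing its
// last peer, instead of waiting for the next consensus service check.
bool ShouldSelfFenceOnPeerLoss(const FencingInputs& in) {
  return in.is_active && in.consensus_enabled &&
         in.consensus_state == ConsensusState::kDisconnected &&
         in.relaxed_node_promotion && !in.peer_present &&
         !in.remote_fencing_enabled;
}

// Usage check: losing the last peer while disconnected triggers fencing,
// a still-present peer does not.
void ExampleSelfFence() {
  FencingInputs in{true, true, true, false, false,
                   ConsensusState::kDisconnected};
  assert(ShouldSelfFenceOnPeerLoss(in));
  in.peer_present = true;
  assert(!ShouldSelfFenceOnPeerLoss(in));
}
```

Writing the decision as a side-effect-free function is a design choice of this sketch only; it makes the combination of flags easy to test without rebooting anything.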
[devel] [PATCH 1/1] rded: improve self-fencing response time [#3039]
When connectivity to consensus service is lost, it is recorded in a state variable. When all RDE peers are lost, the node will now self-fence immediately. --- src/rde/rded/rde_cb.h| 5 + src/rde/rded/rde_main.cc | 18 -- src/rde/rded/role.cc | 24 src/rde/rded/role.h | 3 +++ 4 files changed, 48 insertions(+), 2 deletions(-) diff --git a/src/rde/rded/rde_cb.h b/src/rde/rded/rde_cb.h index 9a0919c..e35fdab 100644 --- a/src/rde/rded/rde_cb.h +++ b/src/rde/rded/rde_cb.h @@ -18,6 +18,7 @@ #ifndef RDE_RDED_RDE_CB_H_ #define RDE_RDED_RDE_CB_H_ +#include #include #include #include "base/osaf_utility.h" @@ -37,6 +38,8 @@ enum class State {kNotActive = 0, kNotActiveSeenPeer, kActiveElected, kActiveElectedSeenPeer, kActiveFailover}; +enum class ConsensusState {kUnknown = 0, kConnected, kDisconnected}; + struct RDE_CONTROL_BLOCK { SYSF_MBX mbx; NCSCONTEXT task_handle; @@ -49,6 +52,8 @@ struct RDE_CONTROL_BLOCK { // used for discovering peer controllers, regardless of their role std::set peer_controllers{}; State state{State::kNotActive}; + std::atomic consensus_service_state{ConsensusState::kUnknown}; + std::atomic state_refresh_thread_started{false}; // consensus service }; enum RDE_MSG_TYPE { diff --git a/src/rde/rded/rde_main.cc b/src/rde/rded/rde_main.cc index 456d2ce..1a7e587 100644 --- a/src/rde/rded/rde_main.cc +++ b/src/rde/rded/rde_main.cc @@ -178,6 +178,19 @@ static void handle_mbx_event() { case RDE_MSG_CONTROLLER_DOWN: rde_cb->peer_controllers.erase(msg->fr_node_id); TRACE("peer_controllers: size %zu", rde_cb->peer_controllers.size()); + if (role->role() == PCS_RDA_ACTIVE) { +Consensus consensus_service; +if (consensus_service.IsEnabled() == true && +rde_cb->consensus_service_state == ConsensusState::kDisconnected && +consensus_service.IsRelaxedNodePromotionEnabled() == true && +role->IsPeerPresent() == false) { +LOG_NO("Lost connectivity to consensus service. 
No peer present"); +if (consensus_service.IsRemoteFencingEnabled() == false) { +opensaf_quick_reboot("Lost connectivity to consensus service. " + "Rebooting this node"); +} +} + } break; case RDE_MSG_TAKEOVER_REQUEST_CALLBACK: { rde_cb->monitor_takeover_req_thread_running = false; @@ -214,7 +227,7 @@ static void handle_mbx_event() { if (consensus_service.IsRelaxedNodePromotionEnabled() == true) { if (rde_cb->state == State::kActiveElected) { TRACE("Relaxed mode is enabled"); -TRACE(" No peer SC yet seen, ignore consensus service failure"); +TRACE("No peer SC yet seen, ignore consensus service failure"); // if relaxed node promotion is enabled, and we have yet to see // a peer SC after being promoted, tolerate consensus service // not working @@ -227,13 +240,14 @@ static void handle_mbx_event() { // we have seen the peer, and peer is still connected, tolerate // consensus service not working fencing_required = false; +rde_cb->consensus_service_state = ConsensusState::kDisconnected; } } if (fencing_required == true) { LOG_NO("Lost connectivity to consensus service"); if (consensus_service.IsRemoteFencingEnabled() == false) { opensaf_quick_reboot("Lost connectivity to consensus service. 
" - "Rebooting this node"); + "Rebooting this node"); } } } diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc index 3effc25..b8c8157 100644 --- a/src/rde/rded/role.cc +++ b/src/rde/rded/role.cc @@ -215,6 +215,18 @@ timespec* Role::Poll(timespec* ts) { is_candidate).detach(); } } + } else if (role_ == PCS_RDA_ACTIVE) { +RDE_CONTROL_BLOCK* cb = rde_get_control_block(); +if (cb->consensus_service_state == ConsensusState::kUnknown || +cb->consensus_service_state == ConsensusState::kDisconnected) { + // consensus service was previously disconnected, refresh state + Consensus consensus_service; + if (consensus_service.IsEnabled() == true && +cb->state_refresh_thread_started == false) { +cb->state_refresh_thread_started = true; +std::thread(::RefreshConsensusState, this, cb).detach(); + } +} } return timeout; } @@ -351,3 +363,15 @@ void Role::PromoteNodeLate() { this, cb->cluster_members.size(), true).detach(); } + +void Role::RefreshConsensusState(RDE_CONTROL_BLOCK* cb) { + TRACE_ENTER(); + + Consensus consensus_service; + if (consensus_service.IsWritable()
Re: [devel] [PATCH 1/1] amfnd: reboot to recovery if msg id received by amfd mismatch with msg id sent by amfnd [#3040]
Hi Thang Looks good to me. Nagu, any comments? Thanks Gary On 15/5/19 12:14 am, thang.d.nguyen wrote: During SC failover, message received on ACTIVE AMFD can not be checked point to AMFD on STANDBY SC. But the AMFND still process the message ack for that message then it remove from queue. STANDBY SC takes ACTIVE and mismatch message id b/w AMFD and AMFND on new ACTIVE. As consequence, clm track start can not invoked to update cluster member nodes if these nodes was rebooted. In this case, amfnd need rebooting automatically to recovery it. --- src/amf/amfnd/verify.cc | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/src/amf/amfnd/verify.cc b/src/amf/amfnd/verify.cc index 5726ad9..ddb1d15 100644 --- a/src/amf/amfnd/verify.cc +++ b/src/amf/amfnd/verify.cc @@ -116,12 +116,14 @@ uint32_t avnd_evt_avd_verify_evh(AVND_CB *cb, AVND_EVT *evt) { avnd_diq_rec_del(cb, rec); continue; } else { + if ((rcv_id + 1) == (*((uint32_t *)(>msg.info.avd->msg_info))) && + (msg_found == false)) { +msg_found = true; + } avnd_diq_rec_send(cb, rec); TRACE_1("AVND record %u sent, upon fail-over", *((uint32_t *)(>msg.info.avd->msg_info))); - - msg_found = true; } ++iter; } @@ -129,9 +131,12 @@ uint32_t avnd_evt_avd_verify_evh(AVND_CB *cb, AVND_EVT *evt) { if ((cb->snd_msg_id != info->rcv_id_cnt) && (msg_found == false)) { /* Log error, seems to be some problem.*/ LOG_EM( -"AVND record not found, after failover, snd_msg_id = %u, receive id = %u", -cb->snd_msg_id, info->rcv_id_cnt); -return NCSCC_RC_FAILURE; +"AVND record not found for msg id = %u", (rcv_id + 1)); +opensaf_reboot( +avnd_cb->node_info.nodeId, +osaf_extended_name_borrow(_cb->node_info.executionEnvironment), +"AVND record not found, after failover"); +exit(0); } /* ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
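The check discussed in this patch can be sketched as follows: after a fail-over the node director walks its queue of unacknowledged messages and looks for the record whose id is exactly one more than the last id the new active director reports having received. This is a simplified illustration under stated assumptions, not the amfnd code: the queue is modelled as plain message ids, records with an id at or below rcv_id are assumed to be already processed and are dropped, id wrap-around is ignored, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class VerifyResult { kInSync, kRebootRequired };

// pending:    ids of messages sent but not yet acknowledged (assumption).
// snd_msg_id: last id this node sent.
// rcv_id:     last id the new active director says it received.
VerifyResult VerifyAfterFailover(std::deque<uint32_t>& pending,
                                 uint32_t snd_msg_id, uint32_t rcv_id) {
  bool next_expected_found = false;
  for (auto it = pending.begin(); it != pending.end();) {
    if (*it <= rcv_id) {
      // Assumed to be processed by the director already: drop the record.
      it = pending.erase(it);
      continue;
    }
    if (*it == rcv_id + 1) next_expected_found = true;
    ++it;  // a real node director would resend the record here
  }
  // The director is behind us, yet the next message it needs is gone:
  // the two sides can no longer be brought back in sync by resending.
  if (snd_msg_id != rcv_id && !next_expected_found) {
    return VerifyResult::kRebootRequired;
  }
  return VerifyResult::kInSync;
}

// Usage check: with rcv_id 2, the record with id 3 must still be queued.
void ExampleVerify() {
  std::deque<uint32_t> complete = {3, 4, 5};
  assert(VerifyAfterFailover(complete, 5, 2) == VerifyResult::kInSync);
  std::deque<uint32_t> gap = {4, 5};
  assert(VerifyAfterFailover(gap, 5, 2) == VerifyResult::kRebootRequired);
}
```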
[devel] [PATCH 0/1] Review Request for base: strip leading and trailing quotes [#3041]
Summary: base: strip leading and trailing quotes [#3041]
Review request for Ticket(s): 3041
Peer Reviewer(s): Hans, Minh, Vu
Pull request to:
Affected branch(es): develop
Development branch: ticket-3041
Base revision: 55466efcacc6d83f104ee747ebc189688ccc2de1
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
Docs                 n
Build system         n
RPM/packaging        n
Configuration files  n
Startup scripts      n
SAF services         n
OpenSAF services     n
Core libraries       y
Samples              n
Tests                n
Other                n

Comments (indicate scope for each "y" above):

revision 6bd164279a2fbd881c4700566960f3ede728f4df
Author: Gary Lee
Date: Fri, 17 May 2019 22:09:05 +1000

base: strip leading and trailing quotes [#3041]

ConfigFileReader enables runtime 'reload' of .conf files. However, if the
environment variable is surrounded by quotes, it adds the quotes to the
value which is not the expected behaviour.

export FOO="foo"

FOO should contain just foo, not "foo".

Complete diffstat:
 src/base/config_file_reader.cc  | 15 +++
 src/osaf/consensus/consensus.cc |  1 +
 2 files changed, 16 insertions(+)

Testing Commands:
 pkill -SIGUSR2 osaffmd (turn on tracing)
 pkill -SIGHUP osaffmd (reload)

Testing, Expected Results:
 Check osaffmd trace:

 <143>1 2019-05-17T20:11:56.865293+10:00 SC-1 osaffmd 188 osaffmd [meta sequenceId="13"] 188:osaf/consensus/consensus.cc:298 TR Setting 'FMS_HA_ENV_HEALTHCHECK_KEY' to 'Default'

 It should not say '"Default"', but 'Default'

Conditions of Submission:
 Ack from anyone

Arch       Built  Started  Linux distro
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n
[devel] [PATCH 1/1] base: strip leading and trailing quotes [#3041]
ConfigFileReader enables runtime 'reload' of .conf files. However, if the
environment variable is surrounded by quotes, it adds the quotes to the
value which is not the expected behaviour.

export FOO="foo"

FOO should contain just foo, not "foo".
---
 src/base/config_file_reader.cc  | 15 +++
 src/osaf/consensus/consensus.cc |  1 +
 2 files changed, 16 insertions(+)

diff --git a/src/base/config_file_reader.cc b/src/base/config_file_reader.cc
index 63cad7d..0132547 100644
--- a/src/base/config_file_reader.cc
+++ b/src/base/config_file_reader.cc
@@ -36,6 +36,18 @@ static void trim(std::string& str) {
   right_trim(str);
 }

+static void strip_quotes(std::string& str) {
+  // trim leading and trailing quotes
+  if (str.front() == '"' ||
+      str.front() == '\'') {
+    str.erase(0, 1);  // delete first char
+  }
+  if (str.back() == '"' ||
+      str.back() == '\'') {
+    str.pop_back();  // delete last char
+  }
+}
+
 ConfigFileReader::SettingsMap ConfigFileReader::ParseFile(
     const std::string& filename) {
   const std::string prefix("export");
@@ -80,6 +92,9 @@ ConfigFileReader::SettingsMap ConfigFileReader::ParseFile(
     std::string value = line.substr(equal + 1);
     trim(value);

+    strip_quotes(key);
+    strip_quotes(value);
+
     map[key] = value;
   }
   file.close();
diff --git a/src/osaf/consensus/consensus.cc b/src/osaf/consensus/consensus.cc
index 480f7d2..0bebab2 100644
--- a/src/osaf/consensus/consensus.cc
+++ b/src/osaf/consensus/consensus.cc
@@ -295,6 +295,7 @@ bool Consensus::ReloadConfiguration() {
       continue;
     }
     int rc;
+    TRACE("Setting '%s' to '%s'", kv.first.c_str(), kv.second.c_str());
     rc = setenv(kv.first.c_str(), kv.second.c_str(), 1);
     osafassert(rc == 0);
   }
--
2.7.4
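One point a reviewer might raise about strip_quotes: calling front() or back() on an empty std::string is undefined behaviour, and a line such as "export FOO=" would presumably yield an empty value (whether ParseFile can actually pass an empty string is an assumption, not something this thread confirms). The sketch below is a hypothetical guarded variant, not part of the submitted patch; the name StripQuotes is chosen only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical guarded variant of the patch's strip_quotes(): removes at most
// one leading and one trailing quote character and tolerates empty input.
void StripQuotes(std::string& str) {
  auto is_quote = [](char c) { return c == '"' || c == '\''; };
  if (!str.empty() && is_quote(str.front())) {
    str.erase(0, 1);  // delete first char
  }
  // Re-check emptiness: the input may have been a single quote character.
  if (!str.empty() && is_quote(str.back())) {
    str.pop_back();  // delete last char
  }
}

// Usage check: quoted values lose their quotes, empty values are left alone.
void ExampleStripQuotes() {
  std::string value = "\"foo\"";
  StripQuotes(value);
  assert(value == "foo");
  std::string empty;
  StripQuotes(empty);
  assert(empty.empty());
}
```

Like the original, this variant does not require the leading and trailing quotes to match, so a value such as "foo' is also stripped on both sides.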
Re: [devel] [PATCH 1/1] amfnd: don't attempt su failover if active controller is rebooting [#3035]
Hi Alex ack (review only) Gary On 8/5/19 5:46 am, Jones, Alex wrote: In N+M model CSI-remove responses can get lost if active controller reboots. In this case SG will be stuck in unstable state, and standby will never get assignments. We are the active controller, active for N+M, SU failover is set, and failfast on termination failure is set for the nodes. If a component in the SU crashes, and another component fails during cleanup, the node does failfast. It currently attempts to do su failover in this case, but the csi-remove responses from the payload can get lost because we are rebooting. They eventually show up on the new active, but we get message-id errors. Set a flag when the active controller is about to reboot. If the flag is set, then don't do SU failover. Let the new active take care of the failover. --- src/amf/amfd/node.cc | 1 + src/amf/amfd/node.h | 1 + src/amf/amfd/sgproc.cc | 7 +++ src/amf/amfd/util.cc | 3 +++ 4 files changed, 12 insertions(+) diff --git a/src/amf/amfd/node.cc b/src/amf/amfd/node.cc index 7fc764f22..b8d8a7d77 100644 --- a/src/amf/amfd/node.cc +++ b/src/amf/amfd/node.cc @@ -121,6 +121,7 @@ void AVD_AVND::initialize() { clm_pend_inv = {}; clm_change_start_preceded = {}; recvr_fail_sw = {}; + actv_ctrl_reboot_in_progress = {}; admin_ng = {}; } diff --git a/src/amf/amfd/node.h b/src/amf/amfd/node.h index ecee5c591..dbe48dc43 100644 --- a/src/amf/amfd/node.h +++ b/src/amf/amfd/node.h @@ -140,6 +140,7 @@ class AVD_AVND { CLM completed cb. 
*/ bool recvr_fail_sw; /* to indicate there was node reboot because of node failover/switchover.*/ + bool actv_ctrl_reboot_in_progress; AVD_AMF_NG *admin_ng; /* points to the nodegroup on which admin operation is going on.*/ uint16_t node_up_msg_count; /* to count of node_up msg that director had diff --git a/src/amf/amfd/sgproc.cc b/src/amf/amfd/sgproc.cc index 1537acac3..7c8d9a558 100644 --- a/src/amf/amfd/sgproc.cc +++ b/src/amf/amfd/sgproc.cc @@ -478,6 +478,13 @@ static uint32_t sg_su_failover_func(AVD_SU *su) { goto done; } + if (su->su_on_node->actv_ctrl_reboot_in_progress) { + TRACE("'%s' is already going down, so not doing SU failover", + su->name.c_str()); + rc = NCSCC_RC_SUCCESS; + goto done; + } + su->set_oper_state(SA_AMF_OPERATIONAL_DISABLED); su->set_readiness_state(SA_AMF_READINESS_OUT_OF_SERVICE); if (su->saAmfSUAdminState == SA_AMF_ADMIN_LOCKED) diff --git a/src/amf/amfd/util.cc b/src/amf/amfd/util.cc index 14a4e0485..0dc3e99e3 100644 --- a/src/amf/amfd/util.cc +++ b/src/amf/amfd/util.cc @@ -1802,6 +1802,9 @@ void avd_d2n_reboot_snd(AVD_AVND *node) { if (avd_d2n_msg_snd(avd_cb, node, d2n_msg) != NCSCC_RC_SUCCESS) { LOG_ER("%s: snd to %x failed", __FUNCTION__, node->node_info.nodeId); d2n_msg_free(d2n_msg); + } else if (node->node_info.nodeId == avd_cb->node_id_avd) { + TRACE("rebooting active amf director which is ourself"); + node->actv_ctrl_reboot_in_progress = true; } } -- 2.17.2 Notice: This e-mail together with any attachments may contain information of Ribbon Communications Inc. that is confidential and/or proprietary for the sole use of the intended recipient. Any review, disclosure, reliance or distribution by others or forwarding without express permission is strictly prohibited. If you are not the intended recipient, please notify the sender immediately and then delete all copies, including any attachments. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] mbc: prevent infinite peer_up message loop [#3021]
Hi I will push this on Wednesday if there are no comments. Thanks Gary On 26/3/19 1:16 pm, Gary Lee wrote: If the active and standby SCs are split into network partitions, it is possible a RED_UP never arrives even though we have already received MBC PEER_UP. The service using MBC will then get stuck in an infinite loop and probably fail health checks. To cater for 'normal' race conditions between MDS topology and data messages, allow only up to 255 loops. If this is exceeded, the msg will be discarded. --- src/mbc/mbcsv_evt_msg.h | 2 ++ src/mbc/mbcsv_peer.c| 10 ++ 2 files changed, 12 insertions(+) diff --git a/src/mbc/mbcsv_evt_msg.h b/src/mbc/mbcsv_evt_msg.h index f11a553..9eef747 100644 --- a/src/mbc/mbcsv_evt_msg.h +++ b/src/mbc/mbcsv_evt_msg.h @@ -197,6 +197,8 @@ typedef struct mbcsv_evt { MBCSV_EVT_MDS_SUBSCR_INFO mds_sub_evt; } info; + uint32_t hops; + } MBCSV_EVT; /*** diff --git a/src/mbc/mbcsv_peer.c b/src/mbc/mbcsv_peer.c index b45904f..1d4b257 100644 --- a/src/mbc/mbcsv_peer.c +++ b/src/mbc/mbcsv_peer.c @@ -826,6 +826,15 @@ uint32_t mbcsv_process_peer_up_info(MBCSV_EVT *msg, CKPT_INST *ckpt, memcpy(evt, msg, sizeof(MBCSV_EVT)); TRACE_4("Still RED_UP event not arrived of the peer"); + if (evt->hops < 255) { + ++evt->hops; + } else { + LOG_WA("RED_UP missing, discarding peer up"); + m_NCS_UNLOCK(_cb.peer_list_lock, + NCS_LOCK_WRITE); + m_MMGR_FREE_MBCSV_EVT(evt); + return NCSCC_RC_FAILURE; + } /* Again post the event, till RED_UP event arrives */ if (NCSCC_RC_SUCCESS != @@ -833,6 +842,7 @@ uint32_t mbcsv_process_peer_up_info(MBCSV_EVT *msg, CKPT_INST *ckpt, TRACE_LEAVE2("ipc send failed"); m_NCS_UNLOCK(_cb.peer_list_lock, NCS_LOCK_WRITE); + m_MMGR_FREE_MBCSV_EVT(evt); return NCSCC_RC_FAILURE; } ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for mbc: prevent infinite peer_up message loop [#3021]
Summary: mbc: prevent infinite peer_up message loop [#3021]
Review request for Ticket(s): 3021
Peer Reviewer(s): Canh, Anders, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3021
Base revision: 7f68859e0dc70179eff72515f28bc69ffd1ab208
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
Docs                 n
Build system         n
RPM/packaging        n
Configuration files  n
Startup scripts      n
SAF services         n
OpenSAF services     n
Core libraries       y
Samples              n
Tests                n
Other                n

Comments (indicate scope for each "y" above):

revision 4825d97b7e9565daae7b36aaba7a7c8717ff627c
Author: Gary Lee
Date: Tue, 26 Mar 2019 13:08:16 +1100

mbc: prevent infinite peer_up message loop [#3021]

If the active and standby SCs are split into network partitions, it is
possible a RED_UP never arrives even though we have already received
MBC PEER_UP. The service using MBC will then get stuck in an infinite
loop and probably fail health checks.

To cater for 'normal' race conditions between MDS topology and data
messages, allow only up to 255 loops. If this is exceeded, the msg will
be discarded.

Complete diffstat:
 src/mbc/mbcsv_evt_msg.h |  2 ++
 src/mbc/mbcsv_peer.c    | 10 ++
 2 files changed, 12 insertions(+)

Testing Commands:
 Ran test that splits SCs and reproduces the reported issue

Testing, Expected Results:
 No more amfd coredumps due to watchdog timeouts

Conditions of Submission:
 Ack from reviewer

Arch       Built  Started  Linux distro
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n
[devel] [PATCH 1/1] mbc: prevent infinite peer_up message loop [#3021]
If the active and standby SCs are split into network partitions, it is
possible a RED_UP never arrives even though we have already received
MBC PEER_UP. The service using MBC will then get stuck in an infinite
loop and probably fail health checks.

To cater for 'normal' race conditions between MDS topology and data
messages, allow only up to 255 loops. If this is exceeded, the msg will
be discarded.
---
 src/mbc/mbcsv_evt_msg.h |  2 ++
 src/mbc/mbcsv_peer.c    | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/src/mbc/mbcsv_evt_msg.h b/src/mbc/mbcsv_evt_msg.h
index f11a553..9eef747 100644
--- a/src/mbc/mbcsv_evt_msg.h
+++ b/src/mbc/mbcsv_evt_msg.h
@@ -197,6 +197,8 @@ typedef struct mbcsv_evt {
     MBCSV_EVT_MDS_SUBSCR_INFO mds_sub_evt;
   } info;

+  uint32_t hops;
+
 } MBCSV_EVT;

 /***
diff --git a/src/mbc/mbcsv_peer.c b/src/mbc/mbcsv_peer.c
index b45904f..1d4b257 100644
--- a/src/mbc/mbcsv_peer.c
+++ b/src/mbc/mbcsv_peer.c
@@ -826,6 +826,15 @@ uint32_t mbcsv_process_peer_up_info(MBCSV_EVT *msg, CKPT_INST *ckpt,
       memcpy(evt, msg, sizeof(MBCSV_EVT));
       TRACE_4("Still RED_UP event not arrived of the peer");
+      if (evt->hops < 255) {
+        ++evt->hops;
+      } else {
+        LOG_WA("RED_UP missing, discarding peer up");
+        m_NCS_UNLOCK(_cb.peer_list_lock,
+                     NCS_LOCK_WRITE);
+        m_MMGR_FREE_MBCSV_EVT(evt);
+        return NCSCC_RC_FAILURE;
+      }
       /* Again post the event, till RED_UP event arrives */
       if (NCSCC_RC_SUCCESS !=
@@ -833,6 +842,7 @@ uint32_t mbcsv_process_peer_up_info(MBCSV_EVT *msg, CKPT_INST *ckpt,
         TRACE_LEAVE2("ipc send failed");
         m_NCS_UNLOCK(_cb.peer_list_lock,
                      NCS_LOCK_WRITE);
+        m_MMGR_FREE_MBCSV_EVT(evt);
         return NCSCC_RC_FAILURE;
       }
--
2.7.4
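The hop counter introduced by this patch bounds how often an event may be posted back onto its own mailbox. A minimal sketch of that pattern follows, assuming a simple in-memory queue in place of the MBCSv mailbox; the Event struct, kMaxHops constant and RepostOrDiscard function are hypothetical names, and only the limit of 255 comes from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical event carrying the re-post counter added by the patch.
struct Event {
  uint32_t hops = 0;
};

constexpr uint32_t kMaxHops = 255;  // limit taken from the patch

// Re-queues the event while its hop budget lasts. Returns false once the
// budget is exhausted, in which case the caller should discard the event.
bool RepostOrDiscard(Event evt, std::deque<Event>& mailbox) {
  if (evt.hops >= kMaxHops) {
    return false;  // the awaited RED_UP never arrived: give up
  }
  ++evt.hops;
  mailbox.push_back(evt);
  return true;
}

// Usage check: an event that is never satisfied is re-posted exactly
// kMaxHops times and then dropped, so the loop terminates.
void ExampleHopLimit() {
  std::deque<Event> mailbox;
  Event evt;
  uint32_t reposts = 0;
  while (RepostOrDiscard(evt, mailbox)) {
    evt = mailbox.front();
    mailbox.pop_front();
    ++reposts;
  }
  assert(reposts == kMaxHops);
  assert(mailbox.empty());
}
```

Carrying the counter inside the event, rather than in shared state, means each delayed PEER_UP gets its own budget, which seems to be the intent of adding hops to MBCSV_EVT.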
[devel] [PATCH 0/1] Review Request for osaf: ensure an error is returned if takeover_request fails [#3023]
Summary: osaf: ensure an error is returned if takeover_request fails [#3023]
Review request for Ticket(s): 3023
Peer Reviewer(s): Hans, Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3023
Base revision: 819801c5414f73bfbdb3f4101958981ae1d29bb3
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
Docs                 n
Build system         n
RPM/packaging        n
Configuration files  n
Startup scripts      n
SAF services         n
OpenSAF services     n
Core libraries       y
Samples              n
Tests                n
Other                n

Comments (indicate scope for each "y" above):

revision 7034e7149d0cd4e74078287c516fc33fad21076f
Author: Gary Lee
Date: Tue, 26 Mar 2019 10:51:52 +1100

osaf: ensure an error is returned if takeover_request fails [#3023]

if we cannot read the result of a takeover_request, ensure we return an error

Complete diffstat:
 src/osaf/consensus/consensus.cc | 2 ++
 1 file changed, 2 insertions(+)

Testing Commands:
 ran regression tests

Testing, Expected Results:
 OK

Conditions of Submission:
 ack from any reviewer

Arch       Built  Started  Linux distro
mips       n      n
mips64     n      n
x86        n      n
x86_64     y      y
powerpc    n      n
powerpc64  n      n
[devel] [PATCH 1/1] osaf: ensure an error is returned if takeover_request fails [#3023]
if we cannot read the result of a takeover_request, ensure we return an error
---
 src/osaf/consensus/consensus.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/osaf/consensus/consensus.cc b/src/osaf/consensus/consensus.cc
index cf307b3..480f7d2 100644
--- a/src/osaf/consensus/consensus.cc
+++ b/src/osaf/consensus/consensus.cc
@@ -433,6 +433,8 @@ SaAisErrorT Consensus::CreateTakeoverRequest(const std::string& current_owner,
     return rc;
   }
 
+  // in case takeover request cannot be read
+  rc = SA_AIS_ERR_FAILED_OPERATION;
   // wait up to max_takeover_retry seconds for request to be answered
   retries = 0;
   while (retries < max_takeover_retry_) {
--
2.7.4
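The effect of the two added lines can be illustrated with a minimal, self-contained sketch. It is not the OpenSAF implementation: `Status`, `ReadFn` and `WaitForTakeoverResult` are hypothetical stand-ins, and the sleep between retries that the real loop presumably performs is left out. The point is only that the status is set to an error before the retry loop, so a loop that never manages to read a result cannot leave an earlier success value in `rc`.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Stand-in for the SaAisErrorT values used in the patch (hypothetical enum).
enum class Status { kOk, kFailedOperation };

// Hypothetical reader: yields a result once the request has been answered,
// or std::nullopt if the request could not be read on this attempt.
using ReadFn = std::function<std::optional<Status>()>;

// Polls up to max_retries times. The status starts out as an error, so if
// every read fails the caller never sees a stale success value.
Status WaitForTakeoverResult(const ReadFn& read, int max_retries) {
  assert(max_retries >= 0);
  Status rc = Status::kFailedOperation;  // in case the request cannot be read
  for (int retries = 0; retries < max_retries; ++retries) {
    if (auto result = read()) {
      rc = *result;
      break;
    }
  }
  return rc;
}
```

With a reader that always fails, the function returns the error status; with a reader that answers, it returns the answer unchanged.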
[devel] [PATCH 1/1] osaf: improve response time in etcd3.plugin [#3016]
If the initial call to watch the takeover request in etcd3.plugin is made when etcd has already been shut down (for example, when etcd is running locally and the node is being shut down), the plugin should return 0 with a fake takeover request to ensure rded shuts down promptly. Otherwise, it will keep calling watch, delaying node shutdown.
---
 src/osaf/consensus/plugins/etcd3.plugin | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/osaf/consensus/plugins/etcd3.plugin b/src/osaf/consensus/plugins/etcd3.plugin
index acccd98..d926885 100644
--- a/src/osaf/consensus/plugins/etcd3.plugin
+++ b/src/osaf/consensus/plugins/etcd3.plugin
@@ -357,9 +357,16 @@ watch() {
         return 0
       fi
     done
+  else
+    # etcd down?
+    if [ "$watch_key" == "$takeover_request" ]; then
+      hostname=`cat $node_name_file`
+      echo "$hostname SC-0 1000 UNDEFINED"
+      return 0
+    else
+      return 1
+    fi
   fi
-
-  return 1
 }
 
 # argument parsing
--
2.7.4
[devel] [PATCH 0/1] Review Request for osaf: improve response time in etcd3.plugin [#3016]
Summary: osaf: improve response time in etcd3.plugin [#3016]
Review request for Ticket(s): 3016
Peer Reviewer(s): Thuan
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3016
Base revision: 0a3f48cfaf9f443c405cfd7122904c5cbe607226
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
--------------------------------
Docs                     n
Build system             n
RPM/packaging            n
Configuration files      n
Startup scripts          n
SAF services             n
OpenSAF services         n
Core libraries           n
Samples                  y
Tests                    n
Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision ce0af7444b489620bc3f1a5ba5d876f563167b00
Author: Gary Lee
Date: Tue, 12 Mar 2019 11:20:35 +1100

osaf: improve response time in etcd3.plugin [#3016]

If the initial call to watch the takeover request in etcd3.plugin is made when etcd has already been shut down (for example, when etcd is running locally and the node is being shut down), the plugin should return 0 with a fake takeover request to ensure rded shuts down promptly. Otherwise, it will keep calling watch, delaying node shutdown.

Complete diffstat:
------------------
 src/osaf/consensus/plugins/etcd3.plugin | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built  Started  Linux distro
----------------------------------------
mips        n      n
mips64      n      n
x86         n      n
x86_64      y      y
powerpc     n      n
powerpc64   n      n
Re: [devel] [PATCH 1/1] dtm: Fix dtm close socket due to duplication of adding node IP info [#2984]
Hi Canh

One minor comment: KEY_TYPES should probably be called KeyTypes. Also, can you make it an enum class, rather than a plain enum?

Thanks
Gary

On 7/3/19 12:53 am, Hans Nordebäck wrote:

Hi Canh,

ack, review only. I think it would be good to move the refactoring part to a separate ticket, though.

/BR Hans

On 12/18/18 08:25, Canh Van Truong wrote:

During cluster start, one node (node 1) broadcasts an up message to the other nodes. A remote node (node 2) receives this message and connects to node 1 (connect()). Similarly, node 1 connects to node 2 after node 2 broadcasts its up message.

Besides calling connect() to node 1, node 2 also adds the IP and ID info of node 1 to its database. Before that happens, however, node 2 may also accept the connection coming from node 1, and the accept path adds the node ID of node 1 as well. The node ID info of node 1 is therefore added twice to the database on node 2. This causes the socket connection to be closed and the node to restart.

The patch retrieves the node from the database by node IP instead of node ID when processing a connection. This rejects the duplicate connection between the two nodes as well as the duplicate addition of the node IP to the database.
---
 src/dtm/dtmnd/dtm.h               | 11 --
 src/dtm/dtmnd/dtm_inter_trans.cc  |  3 +-
 src/dtm/dtmnd/dtm_node.cc         |  2 +-
 src/dtm/dtmnd/dtm_node_db.cc      | 79 ---
 src/dtm/dtmnd/dtm_node_sockets.cc | 20 ++
 5 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/src/dtm/dtmnd/dtm.h b/src/dtm/dtmnd/dtm.h
index 28c811e65..a06b8f503 100644
--- a/src/dtm/dtmnd/dtm.h
+++ b/src/dtm/dtmnd/dtm.h
@@ -45,6 +45,11 @@ typedef enum {
   DTM_MBX_MSG_TYPE = 5,
 } MBX_POST_TYPES;
 
+typedef enum {
+  DTM_NODE_ID_KEY_TYPE = 0,
+  DTM_NODE_IP_KEY_TYPE = 2,
+} KEY_TYPES;
+
 typedef struct dtm_rcv_msg_elem {
   void *next;
   MBX_POST_TYPES type;
@@ -99,10 +104,10 @@ typedef struct dtm_snd_msg_elem {
 
 extern void node_discovery_process(void *arg);
 extern uint32_t dtm_cb_init(DTM_INTERNODE_CB *dtms_cb);
-extern DTM_NODE_DB *dtm_node_get_by_id(uint32_t nodeid);
+extern DTM_NODE_DB *dtm_node_get(uint8_t *key, KEY_TYPES type);
 extern DTM_NODE_DB *dtm_node_getnext_by_id(uint32_t node_id);
-extern uint32_t dtm_node_add(DTM_NODE_DB *node, int i);
-extern uint32_t dtm_node_delete(DTM_NODE_DB *nnode, int i);
+extern uint32_t dtm_node_add(DTM_NODE_DB *node, KEY_TYPES type);
+extern uint32_t dtm_node_delete(DTM_NODE_DB *nnode, KEY_TYPES type);
 extern DTM_NODE_DB *dtm_node_new(const DTM_NODE_DB *new_node);
 extern void dtm_print_config(DTM_INTERNODE_CB *config);
 extern int dtm_read_config(DTM_INTERNODE_CB *config,
diff --git a/src/dtm/dtmnd/dtm_inter_trans.cc b/src/dtm/dtmnd/dtm_inter_trans.cc
index 9d8335466..9b4194614 100644
--- a/src/dtm/dtmnd/dtm_inter_trans.cc
+++ b/src/dtm/dtmnd/dtm_inter_trans.cc
@@ -235,9 +235,10 @@ static uint32_t dtm_internode_snd_msg_common(DTM_NODE_DB *node, uint8_t *buffer,
 uint32_t dtm_internode_snd_msg_to_node(uint8_t *buffer, uint16_t len,
                                        NODE_ID node_id) {
   DTM_NODE_DB *node = nullptr;
+  uint8_t *key = reinterpret_cast<uint8_t *>(&node_id);
 
   TRACE_ENTER();
-  node = dtm_node_get_by_id(node_id);
+  node = dtm_node_get(key, DTM_NODE_ID_KEY_TYPE);
 
   if (nullptr != node) {
     if (NCSCC_RC_SUCCESS != dtm_internode_snd_msg_common(node, buffer, len)) {
diff --git a/src/dtm/dtmnd/dtm_node.cc b/src/dtm/dtmnd/dtm_node.cc
index de2f94738..72506f262 100644
--- a/src/dtm/dtmnd/dtm_node.cc
+++ b/src/dtm/dtmnd/dtm_node.cc
@@ -125,7 +125,7 @@ uint32_t dtm_process_node_info(DTM_INTERNODE_CB *dtms_cb, DTM_NODE_DB *node,
       memcpy(node->node_name, data, nodename_len);
       node->node_name[nodename_len] = '\0';
       node->comm_status = true;
-      if (dtm_node_add(node, 0) != NCSCC_RC_SUCCESS) {
+      if (dtm_node_add(node, DTM_NODE_ID_KEY_TYPE) != NCSCC_RC_SUCCESS) {
         LOG_ER(
             "DTM: A node already exists in the cluster with similar "
             "configuration (possible duplicate IP address and/or node id), please "
diff --git a/src/dtm/dtmnd/dtm_node_db.cc b/src/dtm/dtmnd/dtm_node_db.cc
index 1c9da4dac..1038f0918 100644
--- a/src/dtm/dtmnd/dtm_node_db.cc
+++ b/src/dtm/dtmnd/dtm_node_db.cc
@@ -123,24 +123,49 @@ uint32_t dtm_cb_init(DTM_INTERNODE_CB *dtms_cb) {
 }
 
 /**
- * Retrieve node from node db by nodeid
+ * Retrieve node from node db
  *
- * @param nodeid
+ * @param key
+ * @param i
  *
- * @return NCSCC_RC_SUCCESS
- * @return NCSCC_RC_FAILURE
+ * @return node
  *
  */
-DTM_NODE_DB *dtm_node_get_by_id(uint32_t nodeid) {
+DTM_NODE_DB *dtm_node_get(uint8_t *key, KEY_TYPES type) {
   TRACE_ENTER();
   DTM_INTERNODE_CB *dtms_cb = dtms_gl_cb;
+  DTM_NODE_DB *node = nullptr;
 
-  DTM_NODE_DB *node = reinterpret_cast<DTM_NODE_DB *>(ncs_patricia_tree_get(
-      &dtms_cb->nodeid_tree, reinterpret_cast<uint8_t *>(&nodeid)));
-  if (node !=
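Two ideas from this thread, the scoped enum requested in the review and the lookup by node IP described in the patch, can be sketched together as follows. This is an illustration under stated assumptions only: `KeyTypes`, its enumerator names, `Node`, `NodeDb` and `RegisterConnection` are hypothetical, and `std::map` stands in for the patricia trees used by dtmnd.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Scoped enum in the style suggested in the review; the enumerator names
// are illustrative and not taken from the patch.
enum class KeyTypes { kNodeId, kNodeIp };

struct Node {
  uint32_t node_id;
  std::string node_ip;
};

// Toy node database with one index per key type.
class NodeDb {
 public:
  // Returns false if a node with the same key is already indexed.
  bool Add(const Node& node, KeyTypes type) {
    if (type == KeyTypes::kNodeId)
      return by_id_.emplace(node.node_id, node).second;
    return by_ip_.emplace(node.node_ip, node).second;
  }

  // Returns the node indexed under this IP, or nullptr if there is none.
  const Node* GetByIp(const std::string& ip) const {
    auto it = by_ip_.find(ip);
    return it == by_ip_.end() ? nullptr : &it->second;
  }

 private:
  std::map<uint32_t, Node> by_id_;
  std::map<std::string, Node> by_ip_;
};

// Looking the peer up by IP before adding it rejects a second connection
// attempt for the same peer instead of indexing it twice.
bool RegisterConnection(NodeDb& db, const Node& peer) {
  assert(!peer.node_ip.empty());
  if (db.GetByIp(peer.node_ip) != nullptr) return false;  // duplicate
  return db.Add(peer, KeyTypes::kNodeIp);
}
```

Because `KeyTypes` is an enum class, a call such as `db.Add(node, 0)` no longer compiles, which is the kind of call the patch replaces with a named key type.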
Re: [devel] [PATCH 1/1] imm: fix racing in sending discard-node during network split [#3012]
Hi Vu

Ack (review only)

Thanks

On 25/2/19, 6:30 pm, "Vu Minh Nguyen" wrote:

When the cluster is split into two partitions while one node, such as PL-3, stays connected to both, the IMMND on PL-3 receives discard-node messages from both the active IMMD in partition #1 and the standby IMMD in partition #2. That race later caused the IMMND on PL-3 to crash because of a mismatch found at finalize-sync.

This patch makes a minor change in the standby IMMD: rather than sending the discard-node message while still in the standby role, it puts the message in a queue and broadcasts it only when the standby is assigned the active role.
---
 src/imm/immd/immd_proc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/imm/immd/immd_proc.c b/src/imm/immd/immd_proc.c
index c16232d2d..69e23f2d3 100644
--- a/src/imm/immd/immd_proc.c
+++ b/src/imm/immd/immd_proc.c
@@ -778,7 +778,7 @@ uint32_t immd_process_immnd_down(IMMD_CB *cb, IMMD_IMMND_INFO_NODE *immnd_info,
         }
     }
 
-    if (active || !cb->immd_remote_up) {
+    if (active) {
         /*
         ** HAFE - Let IMMND subscribe for IMMND up/down events instead?
         ** ABT - Not for now. IMMND up/down are only subscribed by
--
2.19.2
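The behaviour described in the commit message (queue the discard-node message while standby, broadcast it only after promotion) can be sketched as below. This is not the IMMD code: `Director`, `OnNodeDown` and `OnPromotedToActive` are hypothetical names, and appending to a vector stands in for the actual broadcast.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical director that defers discard-node broadcasts while standby.
class Director {
 public:
  explicit Director(bool active) : active_(active) {}

  // Called when a node goes down: always queue, broadcast only if active.
  void OnNodeDown(uint32_t node_id) {
    pending_.push(node_id);
    if (active_) Flush();
  }

  // Called when the standby is assigned the active role.
  void OnPromotedToActive() {
    active_ = true;
    Flush();
  }

  const std::vector<uint32_t>& broadcasts() const { return broadcasts_; }

 private:
  // Sends every queued discard-node message in arrival order.
  void Flush() {
    assert(active_);
    while (!pending_.empty()) {
      broadcasts_.push_back(pending_.front());  // stands in for the broadcast
      pending_.pop();
    }
  }

  bool active_;
  std::queue<uint32_t> pending_;
  std::vector<uint32_t> broadcasts_;
};
```

A standby instance therefore stays silent while another director may still be active, which is what keeps a node connected to both partitions from receiving the message twice.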
Re: [devel] [PATCH 2/2] rded: do not send SUCCESS to main thread [#3008]
Hi Hans

Without the return statement, RDE could potentially proceed with setting itself to active, etc. We didn't notice this because opensaf_reboot() has the following, but we're no longer calling that:

  if (use_fallback) {
    /* Wait for the alarm signal we set up earlier. */
    for (;;) pause();
  }

Probably a better fix is to add something similar to opensaf_quick_reboot().

Thanks
Gary

On 20/2/19 11:54 pm, Hans Nordebäck wrote:

Hi Gary,

a question: why were the returns added?

/BR HansN

On 2/19/19 05:10, Gary Lee wrote:

do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to main thread if lock cannot be obtained
---
 src/rde/rded/role.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index 06e93c6..3effc25 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -114,6 +114,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
     LOG_ER("Unable to set active controller in consensus service");
     opensaf_quick_reboot("Unable to set active controller "
                          "in consensus service");
+    return;
   }
 
   RDE_CONTROL_BLOCK* cb = rde_get_control_block();
@@ -135,6 +136,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
       LOG_ER("Unable to set active controller in consensus service");
       opensaf_quick_reboot("Unable to set active controller in "
                            "consensus service");
+      return;
     }
     std::this_thread::sleep_for(std::chrono::seconds(1));
   }
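The control-flow issue Gary describes can be shown with a small sketch in which the reboot request returns to the caller instead of blocking. All names here (`Hooks`, `acquire_lock`, `request_reboot`, `send_promotion_success`) are hypothetical and only model the shape of `Role::PromoteNode`; they are not the rded API.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical hooks so the control flow can be exercised without rebooting.
struct Hooks {
  std::function<bool()> acquire_lock;                      // consensus lock
  std::function<void(const std::string&)> request_reboot;  // may return
  std::function<void()> send_promotion_success;            // to main thread
};

void PromoteNode(const Hooks& hooks) {
  assert(hooks.acquire_lock && hooks.request_reboot &&
         hooks.send_promotion_success);
  if (!hooks.acquire_lock()) {
    hooks.request_reboot("Unable to set active controller");
    return;  // without this, execution falls through to the success path
  }
  hooks.send_promotion_success();
}
```

If the reboot call never returned, as with the pause() loop quoted above, the early return would be redundant; it only matters once the call can come back to the caller.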
[devel] [PATCH 0/2] Review Request for fmd: improve failover response time [#3008]
Summary: fmd: improve failover response time V2 [#3008]
Review request for Ticket(s): 3008
Peer Reviewer(s): Hans, Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3008
Base revision: 5766361568498f8a496d87d8daafe9bffbd75ed9
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area        Impact y/n
--------------------------------
Docs                     n
Build system             n
RPM/packaging            n
Configuration files      n
Startup scripts          n
SAF services             y
OpenSAF services         n
Core libraries           n
Samples                  n
Tests                    n
Other                    n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 8ccffc2cd9cd117578227e9cd49421e5c578fec6
Author: Gary Lee
Date: Tue, 19 Feb 2019 14:57:53 +1100

rded: do not send SUCCESS to main thread [#3008]

do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to main thread if lock cannot be obtained

revision 28e17d107f4a079155e03d9f875a3c0262ea19f5
Author: Gary Lee
Date: Tue, 19 Feb 2019 14:57:53 +1100

fmd: improve failover response time [#3008]

Improve failover response time if split brain prevention is enabled but FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 0. Also, return immediately if node promotion fails to avoid sending active role to RDA.

Complete diffstat:
------------------
 src/fm/fmd/fm_rda.cc | 14 +++++++++-----
 src/rde/rded/role.cc |  2 ++
 2 files changed, 11 insertions(+), 5 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built  Started  Linux distro
----------------------------------------
mips        n      n
mips64      n      n
x86         n      n
x86_64      y      y
powerpc     n      n
powerpc64   n      n
[devel] [PATCH 2/2] rded: do not send SUCCESS to main thread [#3008]
do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to main thread if lock cannot be obtained
---
 src/rde/rded/role.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index 06e93c6..3effc25 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -114,6 +114,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
     LOG_ER("Unable to set active controller in consensus service");
     opensaf_quick_reboot("Unable to set active controller "
                          "in consensus service");
+    return;
   }
 
   RDE_CONTROL_BLOCK* cb = rde_get_control_block();
@@ -135,6 +136,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
       LOG_ER("Unable to set active controller in consensus service");
       opensaf_quick_reboot("Unable to set active controller in "
                            "consensus service");
+      return;
     }
     std::this_thread::sleep_for(std::chrono::seconds(1));
   }
--
2.7.4
[devel] [PATCH 1/2] fmd: improve failover response time [#3008]
Improve failover response time if split brain prevention is enabled but FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 0. Also, return immediately if node promotion fails to avoid sending active role to RDA.
---
 src/fm/fmd/fm_rda.cc | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index 504757c..d3063ba 100644
--- a/src/fm/fmd/fm_rda.cc
+++ b/src/fm/fmd/fm_rda.cc
@@ -88,17 +88,20 @@ uint32_t fm_rda_set_role(FM_CB *fm_cb, PCS_RDA_ROLE role) {
     Consensus consensus_service;
     if (consensus_service.IsEnabled() == true) {
-      // Allow topology events to be processed first. The MDS thread may
-      // be processing MDS down events and updating cluster_size concurrently.
-      // We need cluster_size to be as accurate as possible, without waiting
-      // too long for node down events.
-      std::this_thread::sleep_for(std::chrono::seconds(4));
+      if (consensus_service.PrioritisePartitionSize() == true) {
+        // Allow topology events to be processed first. The MDS thread may
+        // be processing MDS down events and updating cluster_size concurrently.
+        // We need cluster_size to be as accurate as possible, without waiting
+        // too long for node down events.
+        std::this_thread::sleep_for(std::chrono::seconds(4));
+      }
 
       rc = consensus_service.PromoteThisNode(true, fm_cb->cluster_size);
       if (rc != SA_AIS_OK && rc != SA_AIS_ERR_EXIST) {
         LOG_ER("Unable to set active controller in consensus service");
         opensaf_quick_reboot("Unable to set active controller "
                              "in consensus service");
+        return NCSCC_RC_FAILURE;
       } else if (rc == SA_AIS_ERR_EXIST) {
         // @todo if we don't reboot, we don't seem to recover from this. Can we
         // improve?
@@ -107,6 +110,7 @@ uint32_t fm_rda_set_role(FM_CB *fm_cb, PCS_RDA_ROLE role) {
                "cluster?");
         opensaf_quick_reboot("A controller is already active. We were separated "
                              "from the cluster?");
+        return NCSCC_RC_FAILURE;
       }
     }
 
--
2.7.4
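The response-time improvement in this patch comes from making the 4-second wait conditional. A minimal sketch of that decision follows, with the sleep itself factored out so the logic can be checked in isolation. `ConsensusSettings` and `PromotionDelay` are hypothetical names; only the 4-second value and the two conditions are taken from the diff.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical settings mirroring the two checks in the patch.
struct ConsensusSettings {
  bool enabled;                    // split brain prevention is enabled
  bool prioritise_partition_size;  // FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE
};

// Returns how long to wait for topology events before attempting promotion.
// The wait only matters when partition size is taken into account, because
// it exists to let cluster_size settle.
std::chrono::seconds PromotionDelay(
    const ConsensusSettings& s,
    std::chrono::seconds full_delay = std::chrono::seconds(4)) {
  assert(full_delay >= std::chrono::seconds::zero());
  if (s.enabled && s.prioritise_partition_size) {
    return full_delay;
  }
  return std::chrono::seconds::zero();
}
```

A caller would pass the result to `std::this_thread::sleep_for` before attempting promotion; with partition-size prioritisation switched off, the failover proceeds without the fixed delay.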