Re: [devel] [PATCH 1/1] ntf: fix ntfd remove client in standby node while not finalize in active node [#2705]
Hi Canh, Please see my comment marked with [Minh] Thanks, Minh On 06/12/17 21:21, Canh Van Truong wrote: Hi aMinh, Please see my comment. Regards Canh *From:*Minh Hon Chau [mailto:minh.c...@dektech.com.au] *Sent:* Wednesday, December 6, 2017 7:20 AM *To:* Canh Van Truong*Cc:* opensaf-devel@lists.sourceforge.net *Subject:* Re: [PATCH 1/1] ntf: fix ntfd remove client in standby node while not finalize in active node [#2705] Hi Canh, I got a build error with 32 bit src/ntf/ntfd/ntfs_com.c: In function ‘sendNtfaDownUpdate’: src/ntf/ntfd/ntfs_com.c:562:2: error: format ‘%ld’ expects argument of type ‘long int’, but argument 6 has type ‘MDS_DEST’ [-Werror=format=] TRACE_ENTER2("mdsDest: %ld", mdsDest); ^ cc1: all warnings being treated as errors make[2]: *** [src/ntf/ntfd/bin_osafntfd-ntfs_com.o] Error 1 And another question, with this patch the standby NTFD will no longer delete client in proc_ntfa_updn_mds_msg(), it will memorize the client to down_list. So in the situation of this ticket, the standby will have sequence of msg as below: 1 -> receive checkpoint of finalize 2 -> receive checkpoint of client_down 3 -> receive checkpoint of initialize 4 -> process proc_ntfa_updn_mds_msg, which comes from MDS and sent to mailbox before checkpoint of finalize. So we just add the client to down_list, that will avoid the error in this ticket, but we are adding a client that had been down? [Canh] Yes, the agent down will be add to list and no longer removed from list in this case. But the client belong to agents down always are removed from database(clientMap…) in right way. The down_list is just helpful in case failover. I will increase the priority of MDS_DOWN to VERY_HIGH before putting in to mailbox. And I keep the standby ntfd remove the client at checkpoint of client_down, not in process proc_ntfa_updn_mds_msg How do you think about it? [Minh] Since you introduce the down_list, so when client instance is removed, its item in down_list must not existed. Let's see the above sequence, if next step there will be a failover, the patch will remove the client (this is new client from next initialize() but same mds_dest) because it is present in down_list. The result is that the client will be removed unexpectedly. If you increase the priority of MDS_DOWN before putting in mailbox, if it works, proc_ntfa_updn_mds_msg should be processed before checkpoint of finalize(). So then with just increasing the priority, the original problem would be solved without this patch? Thanks, Minh On 01/12/17 17:17, Canh Van Truong wrote: The issue happen because the clients are removed in both active and standby node when getting NCSMDS_DOWN event. In standby node, ntfd get NCSMDS_DOWN event is slower than next initialize request. This cause the ntfd will removed all client from data base including new client of next initialze. Any action relate to this client will fail. The fixing is that when getting NCSMDS_DOWN event, ntfd remove client in active node but does not remove client in standby node. At standby node, ntfd will remove client when process the checkpoint of NCSMDS_DOWN event. --- src/ntf/ntfd/NtfAdmin.cc | 89 +++ src/ntf/ntfd/NtfAdmin.h | 11 -- src/ntf/ntfd/ntfs_com.c | 15 src/ntf/ntfd/ntfs_com.h | 4 +++ src/ntf/ntfd/ntfs_evt.c | 17 +++-- src/ntf/ntfd/ntfs_mbcsv.c | 39 +++-- 6 files changed, 169 insertions(+), 6 deletions(-) diff --git a/src/ntf/ntfd/NtfAdmin.cc b/src/ntf/ntfd/NtfAdmin.cc index dad00383d..aa616d7ee 100644 --- a/src/ntf/ntfd/NtfAdmin.cc +++ b/src/ntf/ntfd/NtfAdmin.cc @@ -467,6 +467,80 @@ void NtfAdmin::clientRemoveMDS(MDS_DEST mds_dest) { } /** + * Checking if the ntf agent with MDS_DEST is valid + * + * @param agent_dest + */ +bool NtfAdmin::is_valid_ntf_agent(MDS_DEST agent_dest) { + TRACE_ENTER(); + ClientMap::iterator it; + bool valid = false; + for (it = clientMap.begin(); it != clientMap.end(); it++) { + NtfClient *client = it->second; + if (client->getMdsDest() == agent_dest) { + valid = true; + break; + } + } + TRACE_LEAVE2("The validation of ntfa: %d", valid); + return valid; +} + +/** + * Add the ntfa down to the list. This is helpful to remember the + * list of ntfa to process in case failover. + * + * @param agent_dest + */ +void NtfAdmin::AddNtfAgentDown(MDS_DEST agent_dest) { + TRACE_ENTER2(" Add ntfa down (%ld) to the list", agent_dest); + + if (is_valid_ntf_agent(agent_dest)) { + MDS_DEST *mds_dest = new MDS_DEST; + *mds_dest = agent_dest; + ntfa_down_list.push_back(mds_dest); + } +} +
Re: [devel] [PATCH 1/1] mds: show extra tipc port id when install_tipc invoked [#2732]
Hi Vu, ack, review and tested. Perhaps we should document what each field represents in the m_MDS_LOG_NOTIFY and remove the redundant server_inst? /Thanks HansN Från: Vu Minh NguyenSkickat: den 6 december 2017 17:08:16 Till: Hans Nordebäck; Zoran Milinkovic Kopia: opensaf-devel@lists.sourceforge.net; Vu Minh Nguyen Ämne: [PATCH 1/1] mds: show extra tipc port id when install_tipc invoked [#2732] Add extra information - tipc port id toghether with service id when install tipc invoked. Such information will help find out which OpenSAF services causes tipc trouble such as TIPC_ERR_OVERLOAD. --- src/mds/mds_dt_tipc.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/src/mds/mds_dt_tipc.c b/src/mds/mds_dt_tipc.c index b3237a0..0466c6b 100644 --- a/src/mds/mds_dt_tipc.c +++ b/src/mds/mds_dt_tipc.c @@ -163,6 +163,22 @@ uint32_t mdtm_global_frag_num; const unsigned int MAX_RECV_THRESHOLD = 30; +static bool get_tipc_port_id(int sock, uint32_t* port_id) { + struct sockaddr_tipc addr; + socklen_t sz = sizeof(addr); + + memset(, 0, sizeof(addr)); + *port_id = 0; + if (0 > getsockname(sock, (struct sockaddr *), )) { + syslog(LOG_ERR, "MDTM:TIPC Failed to get socket name, err: %s", + strerror(errno)); + return false; + } + + *port_id = addr.addr.id.ref; + return true; +} + /* Function NAME: mdtm_tipc_init @@ -182,11 +198,7 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, uint32_t *mds_tipc_ref) NCS_PATRICIA_PARAMS pat_tree_params; - struct sockaddr_tipc addr; - socklen_t sz = sizeof(addr); - memset(_cb, 0, sizeof(tipc_cb)); - *mds_tipc_ref = 0; /* Added to assist the shutdown bug */ mdtm_ref_hdl_list_hdr = NULL; @@ -225,19 +237,14 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, uint32_t *mds_tipc_ref) } /* Code for getting the self tipc random number */ - memset(, 0, sizeof(addr)); - if (0 > getsockname(tipc_cb.BSRsock, (struct sockaddr *), )) { - syslog(LOG_ERR, - "MDTM:TIPC Failed to get the BSR Sockname err :%s", - strerror(errno)); + if (!get_tipc_port_id(tipc_cb.BSRsock, mds_tipc_ref)) { close(tipc_cb.Dsock); close(tipc_cb.BSRsock); return NCSCC_RC_FAILURE; } - *mds_tipc_ref = addr.addr.id.ref; tipc_cb.adest = ((uint64_t)(nodeid)) << 32; - tipc_cb.adest |= addr.addr.id.ref; + tipc_cb.adest |= *mds_tipc_ref; tipc_cb.node_id = nodeid; get_adest_details(tipc_cb.adest, tipc_cb.adest_details); @@ -1860,8 +1867,11 @@ uint32_t mds_mdtm_svc_install_tipc(PW_ENV_ID pwe_id, MDS_SVC_ID svc_id, server_addr.addr.nameseq.lower = server_inst; server_addr.addr.nameseq.upper = server_inst; - m_MDS_LOG_INFO("MDTM: install_tipc : <%u,%u,%u>", server_type, - server_inst, server_inst); + uint32_t port_id = 0; + get_tipc_port_id(tipc_cb.BSRsock, _id); + m_MDS_LOG_NOTIFY("MDTM: install_tipc : <%u,%u,%u,%u>", +server_type, port_id, +server_inst, server_inst); if (0 != bind(tipc_cb.BSRsock, (struct sockaddr *)_addr, sizeof(server_addr))) { -- 1.9.1 -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for clmd: add dynamically created EEs to PLM entity group on standby [#2730]
Summary: clmd: add dynamically created EEs to PLM entity group on standby [#2730] Review request for Ticket(s): 2730 Peer Reviewer(s): Anders, Hans, Mathi, Ravi Pull request to: Affected branch(es): develop Development branch: ticket-2730 Base revision: 9ab54933456632260be87c2c763bd36b1ab7e5d2 Personal repository: git://git.code.sf.net/u/trguitar/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesy OpenSAF servicesn Core libraries n Samples n Tests n Other n NOTE: Patch(es) contain lines longer than 80 characers Comments (indicate scope for each "y" above): - revision be1f5e166884737b1786b4eab3a47f82c54e47f8 Author: Alex JonesDate: Wed, 6 Dec 2017 11:58:33 -0500 clmd: add dynamically created EEs to PLM entity group on standby [#2730] If EEs and corresponding CLM nodes are dynamically created, after a middleware si-swap when the former standby has become active, then one of those EEs is rebooted, clmd has not enabled PLM readiness state tracking on the EE and will not know when it comes back. Thus, the node will not be allowed back into the cluster because it thinks it is not a member. The dynamically created EE is not being added to the PLM entity group on the standby. Add the dynamically created EE to the PLM entity group on the standby. Complete diffstat: -- src/clm/clmd/clms_evt.c | 2 +- src/clm/clmd/clms_imm.c | 2 +- src/clm/clmd/clms_mbcsv.c | 20 +++- 3 files changed, 21 insertions(+), 3 deletions(-) Testing Commands: - 1) dynamically create a CLM node with an EE 2) middleware si-swap 3) reboot the EE node Testing, Expected Results: -- It should come back into the cluster Conditions of Submission: - Dec 12, or ack from developer Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
[devel] [PATCH 1/1] clmd: add dynamically created EEs to PLM entity group on standby [#2730]
If EEs and corresponding CLM nodes are dynamically created, after a middleware si-swap when the former standby has become active, then one of those EEs is rebooted, clmd has not enabled PLM readiness state tracking on the EE and will not know when it comes back. Thus, the node will not be allowed back into the cluster because it thinks it is not a member. The dynamically created EE is not being added to the PLM entity group on the standby. Add the dynamically created EE to the PLM entity group on the standby. --- src/clm/clmd/clms_evt.c | 2 +- src/clm/clmd/clms_imm.c | 2 +- src/clm/clmd/clms_mbcsv.c | 20 +++- 3 files changed, 21 insertions(+), 3 deletions(-) diff --git a/src/clm/clmd/clms_evt.c b/src/clm/clmd/clms_evt.c index 4d7010d..b65036d 100644 --- a/src/clm/clmd/clms_evt.c +++ b/src/clm/clmd/clms_evt.c @@ -992,7 +992,7 @@ static uint32_t proc_mds_node_evt(CLMSV_CLMS_EVT *evt) if (delete_existing_nodedown_records(node_id) == true) { TRACE_LEAVE(); return rc; - } else if (node->member == SA_FALSE) { + } else if (node->member == SA_FALSE && node->admin_state != SA_CLM_ADMIN_UNLOCKED) { /* One possibility is that an admin operation has made * this a non-member */ TRACE_LEAVE(); diff --git a/src/clm/clmd/clms_imm.c b/src/clm/clmd/clms_imm.c index 6809ce8..c245f67 100644 --- a/src/clm/clmd/clms_imm.c +++ b/src/clm/clmd/clms_imm.c @@ -2291,7 +2291,7 @@ SaAisErrorT clms_node_ccb_apply_cb(CcbUtilOperationData_t *opdata) rc = saPlmEntityGroupRemove(clms_cb->ent_group_hdl, entityNames, 1); if (rc != SA_AIS_OK) { - LOG_ER("saPlmEntityGroupAdd FAILED rc = %d", + LOG_ER("saPlmEntityGroupRemove FAILED rc = %d", rc); return rc; } diff --git a/src/clm/clmd/clms_mbcsv.c b/src/clm/clmd/clms_mbcsv.c index 47e4494..6976b03 100644 --- a/src/clm/clmd/clms_mbcsv.c +++ b/src/clm/clmd/clms_mbcsv.c @@ -282,6 +282,9 @@ static uint32_t ckpt_proc_node_csync_rec(CLMS_CB *cb, CLMS_CKPT_REC *data) CLMSV_CKPT_NODE *param = >param.node_csync_rec; CLMS_CLUSTER_NODE *node = NULL, *tmp_node = NULL; uint32_t rc = NCSCC_RC_SUCCESS; +#ifdef ENABLE_AIS_PLM + SaNameT *entityNames = NULL; +#endif TRACE_ENTER2("node_name:%s", param->node_name.value); @@ -315,6 +318,21 @@ static uint32_t ckpt_proc_node_csync_rec(CLMS_CB *cb, CLMS_CKPT_REC *data) LOG_ER("Patricia add failed"); } } +#ifdef ENABLE_AIS_PLM + /* Add it to the plm entity group */ + entityNames = >ee_name; + if (clms_cb->reg_with_plm == SA_TRUE) { + SaAisErrorT aisrc = saPlmEntityGroupAdd( + clms_cb->ent_group_hdl, + entityNames, + 1, + SA_PLM_GROUP_SINGLE_ENTITY); + if (aisrc != SA_AIS_OK) { + LOG_ER("saPlmEntityGroupAdd FAILED rc = %d", + aisrc); + } + } +#endif } TRACE_LEAVE(); return NCSCC_RC_SUCCESS; @@ -357,7 +375,7 @@ static uint32_t ckpt_proc_node_del_rec(CLMS_CB *cb, CLMS_CKPT_REC *data) rc = saPlmEntityGroupRemove(clms_cb->ent_group_hdl, entityNames, 1); if (rc != SA_AIS_OK) { - LOG_ER("saPlmEntityGroupAdd FAILED rc = %d", rc); + LOG_ER("saPlmEntityGroupRemove FAILED rc = %d", rc); return rc; } } -- 2.9.5 -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for mds: show extra tipc port id when install_tipc invoked [#2732]
Summary: mds: show extra tipc port id when install_tipc invoked [#2732] Review request for Ticket(s): 2732 Peer Reviewer(s): Hans, Zoran Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-2732 Base revision: 9ab54933456632260be87c2c763bd36b1ab7e5d2 Personal repository: git://git.code.sf.net/u/winhvu/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesn OpenSAF servicesn Core libraries y Samples n Tests n Other n Comments (indicate scope for each "y" above): - *** EXPLAIN/COMMENT THE PATCH SERIES HERE *** revision 1b816354f4905e073038d9fb9b986d54b1d387f2 Author: Vu Minh NguyenDate: Wed, 6 Dec 2017 23:02:24 +0700 mds: show extra tipc port id when install_tipc invoked [#2732] Add extra information - tipc port id toghether with service id when install tipc invoked. Such information will help find out which OpenSAF services causes tipc trouble such as TIPC_ERR_OVERLOAD. Complete diffstat: -- src/mds/mds_dt_tipc.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) Testing Commands: - *** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES *** Testing, Expected Results: -- *** PASTE COMMAND OUTPUTS / TEST RESULTS *** Conditions of Submission: - *** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC *** Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 n n powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] mds: show extra tipc port id when install_tipc invoked [#2732]
Add extra information - tipc port id toghether with service id when install tipc invoked. Such information will help find out which OpenSAF services causes tipc trouble such as TIPC_ERR_OVERLOAD. --- src/mds/mds_dt_tipc.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/src/mds/mds_dt_tipc.c b/src/mds/mds_dt_tipc.c index b3237a0..0466c6b 100644 --- a/src/mds/mds_dt_tipc.c +++ b/src/mds/mds_dt_tipc.c @@ -163,6 +163,22 @@ uint32_t mdtm_global_frag_num; const unsigned int MAX_RECV_THRESHOLD = 30; +static bool get_tipc_port_id(int sock, uint32_t* port_id) { + struct sockaddr_tipc addr; + socklen_t sz = sizeof(addr); + + memset(, 0, sizeof(addr)); + *port_id = 0; + if (0 > getsockname(sock, (struct sockaddr *), )) { + syslog(LOG_ERR, "MDTM:TIPC Failed to get socket name, err: %s", + strerror(errno)); + return false; + } + + *port_id = addr.addr.id.ref; + return true; +} + /* Function NAME: mdtm_tipc_init @@ -182,11 +198,7 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, uint32_t *mds_tipc_ref) NCS_PATRICIA_PARAMS pat_tree_params; - struct sockaddr_tipc addr; - socklen_t sz = sizeof(addr); - memset(_cb, 0, sizeof(tipc_cb)); - *mds_tipc_ref = 0; /* Added to assist the shutdown bug */ mdtm_ref_hdl_list_hdr = NULL; @@ -225,19 +237,14 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, uint32_t *mds_tipc_ref) } /* Code for getting the self tipc random number */ - memset(, 0, sizeof(addr)); - if (0 > getsockname(tipc_cb.BSRsock, (struct sockaddr *), )) { - syslog(LOG_ERR, - "MDTM:TIPC Failed to get the BSR Sockname err :%s", - strerror(errno)); + if (!get_tipc_port_id(tipc_cb.BSRsock, mds_tipc_ref)) { close(tipc_cb.Dsock); close(tipc_cb.BSRsock); return NCSCC_RC_FAILURE; } - *mds_tipc_ref = addr.addr.id.ref; tipc_cb.adest = ((uint64_t)(nodeid)) << 32; - tipc_cb.adest |= addr.addr.id.ref; + tipc_cb.adest |= *mds_tipc_ref; tipc_cb.node_id = nodeid; get_adest_details(tipc_cb.adest, tipc_cb.adest_details); @@ -1860,8 +1867,11 @@ uint32_t mds_mdtm_svc_install_tipc(PW_ENV_ID pwe_id, MDS_SVC_ID svc_id, server_addr.addr.nameseq.lower = server_inst; server_addr.addr.nameseq.upper = server_inst; - m_MDS_LOG_INFO("MDTM: install_tipc : <%u,%u,%u>", server_type, - server_inst, server_inst); + uint32_t port_id = 0; + get_tipc_port_id(tipc_cb.BSRsock, _id); + m_MDS_LOG_NOTIFY("MDTM: install_tipc : <%u,%u,%u,%u>", +server_type, port_id, +server_inst, server_inst); if (0 != bind(tipc_cb.BSRsock, (struct sockaddr *)_addr, sizeof(server_addr))) { -- 1.9.1 -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] ntf: fix ntfd remove client in standby node while not finalize in active node [#2705]
Hi aMinh, Please see my comment. Regards Canh From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: Wednesday, December 6, 2017 7:20 AM To: Canh Van TruongCc: opensaf-devel@lists.sourceforge.net Subject: Re: [PATCH 1/1] ntf: fix ntfd remove client in standby node while not finalize in active node [#2705] Hi Canh, I got a build error with 32 bit src/ntf/ntfd/ntfs_com.c: In function ‘sendNtfaDownUpdate’: src/ntf/ntfd/ntfs_com.c:562:2: error: format ‘%ld’ expects argument of type ‘long int’, but argument 6 has type ‘MDS_DEST’ [-Werror=format=] TRACE_ENTER2("mdsDest: %ld", mdsDest); ^ cc1: all warnings being treated as errors make[2]: *** [src/ntf/ntfd/bin_osafntfd-ntfs_com.o] Error 1 And another question, with this patch the standby NTFD will no longer delete client in proc_ntfa_updn_mds_msg(), it will memorize the client to down_list. So in the situation of this ticket, the standby will have sequence of msg as below: 1 -> receive checkpoint of finalize 2 -> receive checkpoint of client_down 3 -> receive checkpoint of initialize 4 -> process proc_ntfa_updn_mds_msg, which comes from MDS and sent to mailbox before checkpoint of finalize. So we just add the client to down_list, that will avoid the error in this ticket, but we are adding a client that had been down? [Canh] Yes, the agent down will be add to list and no longer removed from list in this case. But the client belong to agents down always are removed from database(clientMap…) in right way. The down_list is just helpful in case failover. I will increase the priority of MDS_DOWN to VERY_HIGH before putting in to mailbox. And I keep the standby ntfd remove the client at checkpoint of client_down, not in process proc_ntfa_updn_mds_msg How do you think about it? Thanks, Minh On 01/12/17 17:17, Canh Van Truong wrote: The issue happen because the clients are removed in both active and standby node when getting NCSMDS_DOWN event. In standby node, ntfd get NCSMDS_DOWN event is slower than next initialize request. This cause the ntfd will removed all client from data base including new client of next initialze. Any action relate to this client will fail. The fixing is that when getting NCSMDS_DOWN event, ntfd remove client in active node but does not remove client in standby node. At standby node, ntfd will remove client when process the checkpoint of NCSMDS_DOWN event. --- src/ntf/ntfd/NtfAdmin.cc | 89 +++ src/ntf/ntfd/NtfAdmin.h | 11 -- src/ntf/ntfd/ntfs_com.c | 15 src/ntf/ntfd/ntfs_com.h | 4 +++ src/ntf/ntfd/ntfs_evt.c | 17 +++-- src/ntf/ntfd/ntfs_mbcsv.c | 39 +++-- 6 files changed, 169 insertions(+), 6 deletions(-) diff --git a/src/ntf/ntfd/NtfAdmin.cc b/src/ntf/ntfd/NtfAdmin.cc index dad00383d..aa616d7ee 100644 --- a/src/ntf/ntfd/NtfAdmin.cc +++ b/src/ntf/ntfd/NtfAdmin.cc @@ -467,6 +467,80 @@ void NtfAdmin::clientRemoveMDS(MDS_DEST mds_dest) { } /** + * Checking if the ntf agent with MDS_DEST is valid + * + * @param agent_dest + */ +bool NtfAdmin::is_valid_ntf_agent(MDS_DEST agent_dest) { + TRACE_ENTER(); + ClientMap::iterator it; + bool valid = false; + for (it = clientMap.begin(); it != clientMap.end(); it++) { +NtfClient *client = it->second; +if (client->getMdsDest() == agent_dest) { + valid = true; + break; +} + } + TRACE_LEAVE2("The validation of ntfa: %d", valid); + return valid; +} + +/** + * Add the ntfa down to the list. This is helpful to remember the + * list of ntfa to process in case failover. + * + * @param agent_dest + */ +void NtfAdmin::AddNtfAgentDown(MDS_DEST agent_dest) { + TRACE_ENTER2(" Add ntfa down (%ld) to the list", agent_dest); + + if (is_valid_ntf_agent(agent_dest)) { +MDS_DEST *mds_dest = new MDS_DEST; +*mds_dest = agent_dest; +ntfa_down_list.push_back(mds_dest); + } +} + +/** + * Remove the ntfa down from the list + * + * @param agent_dest + */ +void NtfAdmin::RemoveNtfAgentDownFromList(MDS_DEST agent_dest) { + TRACE_ENTER(); + std::list::iterator it; + for (it = ntfa_down_list.begin(); it != ntfa_down_list.end(); ++it) { +MDS_DEST *mds_dest = *it; +if (*mds_dest == agent_dest) { + ntfa_down_list.erase(it); + TRACE(" Remove ntfa down (%ld) from the list", agent_dest); + delete mds_dest; + return; +} + } +} + +/** + * Process to clear all ntfa down records in ntfd. The old client of + * agent down is removed from database. + * + * @param agent_dest + */ +void NtfAdmin::ProcessNtfAgentDownList() { + TRACE_ENTER(); + std::list::iterator it = ntfa_down_list.begin();; + while(it != ntfa_down_list.end()) { +MDS_DEST *mds_dest = *it; +it = ntfa_down_list.erase(it); +clientRemoveMDS(*mds_dest); +delete mds_dest; + } + TRACE_LEAVE(); + return; +} + +/** * The node object where the client who had the subscription is notified * so it can