Re: [devel] [PATCH 0/2] Review Request for fmd: improve failover response time [#3008]
Hi Gary,

Ack for code review. There are still a few other places that call opensaf_quick_reboot; those can be revisited later.

Thanks,
Minh

> Summary: fmd: improve failover response time V2 [#3008]
> Review request for Ticket(s): 3008
> Peer Reviewer(s): Hans, Minh
> Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
> Affected branch(es): develop
> Development branch: ticket-3008
> Base revision: 5766361568498f8a496d87d8daafe9bffbd75ed9
> Personal repository: git://git.code.sf.net/u/userid-2226215/review
>
> Impacted area          Impact y/n
> ---------------------------------
>  Docs                    n
>  Build system            n
>  RPM/packaging           n
>  Configuration files     n
>  Startup scripts         n
>  SAF services            y
>  OpenSAF services        n
>  Core libraries          n
>  Samples                 n
>  Tests                   n
>  Other                   n
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> revision 8ccffc2cd9cd117578227e9cd49421e5c578fec6
> Author: Gary Lee
> Date: Tue, 19 Feb 2019 14:57:53 +1100
>
> rded: do not send SUCCESS to main thread [#3008]
>
> do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to
> main thread if lock cannot be obtained
>
> revision 28e17d107f4a079155e03d9f875a3c0262ea19f5
> Author: Gary Lee
> Date: Tue, 19 Feb 2019 14:57:53 +1100
>
> fmd: improve failover response time [#3008]
>
> Improve failover response time if split brain prevention is enabled
> but FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 0.
>
> Also, return immediately if node promotion fails to avoid
> sending active role to RDA.
>
> Complete diffstat:
> ------------------
>  src/fm/fmd/fm_rda.cc | 14 +-
>  src/rde/rded/role.cc |  2 ++
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
> Testing Commands:
> -----------------
> *** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***
>
> Testing, Expected Results:
> --------------------------
> *** PASTE COMMAND OUTPUTS / TEST RESULTS ***
>
> Conditions of Submission:
> -------------------------
> *** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***
>
> Arch        Built   Started   Linux distro
> -------------------------------------------
> mips        n       n
> mips64      n       n
> x86         n       n
> x86_64      y       y
> powerpc     n       n
> powerpc64   n       n
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>     that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>     (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>     Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>     like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>     cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>     too much content into a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>     Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>     commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>     of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>     comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email
>     etc)
>
> ___ Your computer has a badly configured date and time, confusing the
>     threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>     for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>     do not contain the patch that updates the Doxygen manual.
[devel] SA_AIS_ERR_NOT_EXIST error when calling the function saAmfComponentNameGet
Hello all,

I have an AMF application that runs as a High Availability Framework (HAFW) in a 2N redundancy model, providing an active/standby configuration for ConfD. When I call

saAmfComponentNameGet(confd_ha_cb.info.amfHandle, &confd_ha_cb.info.compName)

I get the error SA_AIS_ERR_NOT_EXIST (12). What am I doing wrong here? Why am I getting this error?

I have attached the AppConfig file and my application files for your reference.

Thanks in advance,
Regards.

AppConfig (object DNs with their attribute name/value pairs):

IMM schema: http://www.saforum.org/IMMSchema
xsi:noNamespaceSchemaLocation="SAI-AIS-IMM-XSD-A.01.01.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

safAppType=CONFD_HA
safSgType=CONFD_HA
safSuType=CONFD_HA
safCompType=CONFD_HA
safSvcType=CONFD_HA
safCSType=CONFD_HA

safVersion=1,safSvcType=CONFD_HA

safVersion=1,safAppType=CONFD_HA
    saAmfApptSGTypes                    safVersion=1,safSgType=CONFD_HA

safVersion=1,safSgType=CONFD_HA
    saAmfSgtRedundancyModel             1
    saAmfSgtValidSuTypes                safVersion=1,safSuType=CONFD_HA
    saAmfSgtDefAutoAdjustProb           100
    saAmfSgtDefCompRestartProb          40
    saAmfSgtDefCompRestartMax           10
    saAmfSgtDefSuRestartProb            40
    saAmfSgtDefSuRestartMax             10

safVersion=1,safSuType=CONFD_HA
    saAmfSutIsExternal                  0
    saAmfSutDefSUFailover               1
    saAmfSutProvidesSvcTypes            safVersion=1,safSvcType=CONFD_HA

safVersion=1,safCompType=CONFD_HA
    saAmfCtCompCategory                 1
    saAmfCtSwBundle                     safSmfBundle=CONFD_HA
    saAmfCtDefClcCliTimeout             600
    saAmfCtDefCallbackTimeout           100
    saAmfCtRelPathInstantiateCmd        confd_ha_inst.sh
    saAmfCtDefInstantiateCmdArgv
    saAmfCtRelPathCleanupCmd            confd_ha_cleanup.sh
    saAmfCtDefCleanupCmdArgv
    saAmfCtDefQuiescingCompleteTimeout  500
    saAmfCtDefRecoveryOnError           2
    saAmfCtDefDisableRestart            0
    saAmfCtDefCmdEnv                    AMF_DEMO_VAR1=CT_VALUE1
    saAmfCtDefCmdEnv                    AMF_DEMO_VAR2=CT_VALUE2

safVersion=1,safCSType=CONFD_HA

safMemberCompType=safVersion=1\,safCompType=CONFD_HA,safVersion=1,safSuType=CONFD_HA

safMemberCSType=safVersion=1\,safCSType=CONFD_HA,safVersion=1,safSvcType=CONFD_HA

safSupportedCsType=safVersion=1\,safCSType=CONFD_HA,safVersion=1,safCompType=CONFD_HA
    saAmfCtCompCapability               1

safHealthcheckKey=HC_CONFD_HA,safVersion=1,safCompType=CONFD_HA
    saAmfHctDefPeriod                   150
    saAmfHctDefMaxDuration              600

safApp=CONFD_HA
    saAmfAppType                        safVersion=1,safAppType=CONFD_HA

safSg=SG_CONFD_HA,safApp=CONFD_HA
    saAmfSGType                         safVersion=1,safSgType=CONFD_HA
    saAmfSGSuHostNodeGroup              safAmfNodeGroup=SCs,safAmfCluster=myAmfCluster
    saAmfSGAutoRepair                   0
    saAmfSGAutoAdjust                   0
    saAmfSGNumPrefInserviceSUs          10
    saAmfSGNumPrefAssignedSUs           10

safSi=Si_CONFD_HA,safApp=CONFD_HA
    saAmfSvcType                        safVersion=1,safSvcType=CONFD_HA
    saAmfSIProtectedbySG                safSg=SG_CONFD_HA,safApp=CONFD_HA

safCsi=Csi_CONFD_HA,safSi=Si_CONFD_HA,safApp=CONFD_HA
    saAmfCSType                         safVersion=1,safCSType=CONFD_HA

safSmfBundle=CONFD_HA

safInstalledSwBundle=safSmfBundle=CONFD_HA,safAmfNode=SC-1,safAmfCluster=myAmfCluster
    saAmfNodeSwBundlePathPrefix         /etc/opt/opensaf

safSu=SuT_CONFD_HA_1,safSg=SG_CONFD_HA,safApp=CONFD_HA
    saAmfSUType                         safVersion=1,safSuType=CONFD_HA
    saAmfSURank                         1
    saAmfSUAdminState                   3
    saAmfSUHostNodeOrNodeGroup          safAmfNode=SC-1,safAmfCluster=myAmfCluster

safComp=CompT_CONFD_HA,safSu=SuT_CONFD_HA_1,safSg=SG_CONFD_HA,safApp=CONFD_HA
    saAmfCompType                       safVersion=1,safCompType=CONFD_HA

safSupportedCsType=safVersion=1\,safCSType=CONFD_HA,safComp=CompT_CONFD_HA,safSu=SuT_CONFD_HA_1,safSg=SG_CONFD_HA,safApp=CONFD_HA

safInstalledSwBundle=safSmfBundle=CONFD_HA,safAmfNode=SC-2,safAmfCluster=myAmfCluster
    saAmfNodeSwBundlePathPrefix         /etc/opt/opensaf

safSu=SuT_CONFD_HA_2,safSg=SG_CONFD_HA,safApp=CONFD_HA
    saAmfSUType                         safVersion=1,safSuType=CONFD_HA
    saAmfSURank                         2
    saAmfSUAdminState                   3
    saAmfSUHostNodeOrNodeGroup          safAmfNode=SC-2,safAmfCluster=myAmfCluster

safComp=CompT_CONFD_HA,safSu=SuT_CONFD_HA_2,safSg=SG_CONFD_HA,safApp=CONFD_HA
    saAmfCompType                       safVersion=1,safCompType=CONFD_HA

safSupportedCsType=safVersion=1\,safCSType=CONFD_HA,safComp=CompT_CONFD_HA,safSu=SuT_CONFD_HA_2,safSg=SG_CONFD_HA,safApp=CONFD_HA

/* Integration glue between ConfD and OpenSAF AMF
   Copyright (C) 2009
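One thing that may be worth ruling out, offered as an assumption rather than a diagnosis of the error above: the AMF specification describes SA_AMF_COMPONENT_NAME as an environment variable that AMF sets for the processes it starts, so a process launched by hand (rather than through confd_ha_inst.sh under AMF control) would not have it. The sketch below only inspects the environment; the helper name ComponentNameFromEnv is hypothetical and it does not call any OpenSAF API.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of SA_AMF_COMPONENT_NAME, or an
// empty string when the variable is not present in the environment.
std::string ComponentNameFromEnv() {
  const char* value = std::getenv("SA_AMF_COMPONENT_NAME");
  return value != nullptr ? std::string(value) : std::string();
}
```

Logging the result of ComponentNameFromEnv() at the top of the component's main() shows whether the process was started with the variable set. An empty result would suggest looking at how the process is launched before looking at the AppConfig.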
[devel] [PATCH 0/2] Review Request for fmd: improve failover response time [#3008]
Summary: fmd: improve failover response time V2 [#3008]
Review request for Ticket(s): 3008
Peer Reviewer(s): Hans, Minh
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3008
Base revision: 5766361568498f8a496d87d8daafe9bffbd75ed9
Personal repository: git://git.code.sf.net/u/userid-2226215/review

Impacted area          Impact y/n
---------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 8ccffc2cd9cd117578227e9cd49421e5c578fec6
Author: Gary Lee
Date: Tue, 19 Feb 2019 14:57:53 +1100

rded: do not send SUCCESS to main thread [#3008]

do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to
main thread if lock cannot be obtained

revision 28e17d107f4a079155e03d9f875a3c0262ea19f5
Author: Gary Lee
Date: Tue, 19 Feb 2019 14:57:53 +1100

fmd: improve failover response time [#3008]

Improve failover response time if split brain prevention is enabled
but FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 0.

Also, return immediately if node promotion fails to avoid
sending active role to RDA.

Complete diffstat:
------------------
 src/fm/fmd/fm_rda.cc | 14 +-
 src/rde/rded/role.cc |  2 ++
 2 files changed, 11 insertions(+), 5 deletions(-)

Testing Commands:
-----------------
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***

Testing, Expected Results:
--------------------------
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***

Conditions of Submission:
-------------------------
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***

Arch        Built   Started   Linux distro
-------------------------------------------
mips        n       n
mips64      n       n
x86         n       n
x86_64      y       y
powerpc     n       n
powerpc64   n       n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email
    etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 2/2] rded: do not send SUCCESS to main thread [#3008]
do not send RDE_MSG_ACTIVE_PROMOTION_SUCCESS to
main thread if lock cannot be obtained
---
 src/rde/rded/role.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index 06e93c6..3effc25 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -114,6 +114,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
     LOG_ER("Unable to set active controller in consensus service");
     opensaf_quick_reboot("Unable to set active controller "
                          "in consensus service");
+    return;
   }
 
   RDE_CONTROL_BLOCK* cb = rde_get_control_block();
@@ -135,6 +136,7 @@ void Role::PromoteNode(const uint64_t cluster_size,
       LOG_ER("Unable to set active controller in consensus service");
       opensaf_quick_reboot("Unable to set active controller in "
                            "consensus service");
+      return;
     }
     std::this_thread::sleep_for(std::chrono::seconds(1));
   }
-- 
2.7.4
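The control flow this patch fixes can be illustrated with a small stand-alone sketch. It assumes that the reboot request can hand control back to the caller, which the added return statements suggest but which is not stated in the patch. MainThreadMailbox, RequestReboot and this PromoteNode signature are hypothetical stand-ins, not the rded types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the messages delivered to the main thread.
struct MainThreadMailbox {
  std::vector<std::string> messages;
};

// Hypothetical stand-in for a reboot request that records the reason and
// then returns to its caller instead of terminating the process at once.
void RequestReboot(std::vector<std::string>& reboot_log,
                   const std::string& reason) {
  reboot_log.push_back(reason);
}

// Sketch of the promotion path: on failure, request a reboot and return
// early so that the success message is never queued.
void PromoteNode(bool lock_obtained, MainThreadMailbox& mailbox,
                 std::vector<std::string>& reboot_log) {
  if (!lock_obtained) {
    RequestReboot(reboot_log,
                  "Unable to set active controller in consensus service");
    return;  // without this, control would fall through to the success path
  }
  mailbox.messages.push_back("RDE_MSG_ACTIVE_PROMOTION_SUCCESS");
}
```

Calling PromoteNode(false, mailbox, reboot_log) leaves the mailbox empty and records one reboot reason; removing the return would queue the success message as well, which is the behaviour the commit message describes avoiding.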
[devel] [PATCH 1/2] fmd: improve failover response time [#3008]
Improve failover response time if split brain prevention is enabled
but FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is set to 0.

Also, return immediately if node promotion fails to avoid
sending active role to RDA.
---
 src/fm/fmd/fm_rda.cc | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index 504757c..d3063ba 100644
--- a/src/fm/fmd/fm_rda.cc
+++ b/src/fm/fmd/fm_rda.cc
@@ -88,17 +88,20 @@ uint32_t fm_rda_set_role(FM_CB *fm_cb, PCS_RDA_ROLE role) {
     Consensus consensus_service;
     if (consensus_service.IsEnabled() == true) {
-      // Allow topology events to be processed first. The MDS thread may
-      // be processing MDS down events and updating cluster_size concurrently.
-      // We need cluster_size to be as accurate as possible, without waiting
-      // too long for node down events.
-      std::this_thread::sleep_for(std::chrono::seconds(4));
+      if (consensus_service.PrioritisePartitionSize() == true) {
+        // Allow topology events to be processed first. The MDS thread may
+        // be processing MDS down events and updating cluster_size concurrently.
+        // We need cluster_size to be as accurate as possible, without waiting
+        // too long for node down events.
+        std::this_thread::sleep_for(std::chrono::seconds(4));
+      }
       rc = consensus_service.PromoteThisNode(true, fm_cb->cluster_size);
       if (rc != SA_AIS_OK && rc != SA_AIS_ERR_EXIST) {
         LOG_ER("Unable to set active controller in consensus service");
         opensaf_quick_reboot("Unable to set active controller "
                              "in consensus service");
+        return NCSCC_RC_FAILURE;
       } else if (rc == SA_AIS_ERR_EXIST) {
         // @todo if we don't reboot, we don't seem to recover from this. Can we
         // improve?
@@ -107,6 +110,7 @@ uint32_t fm_rda_set_role(FM_CB *fm_cb, PCS_RDA_ROLE role) {
                "cluster?");
         opensaf_quick_reboot("A controller is already active. We were separated "
                              "from the cluster?");
+        return NCSCC_RC_FAILURE;
       }
     }
-- 
2.7.4
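The timing change in this patch comes down to one decision: wait for topology events only when partition size is going to be used. A minimal sketch of that decision follows, assuming that setting FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE to 0 makes PrioritisePartitionSize() report false. ConsensusStub and SettleDelay are hypothetical names; the real Consensus class is not reproduced here.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in exposing the two queries used in the patch.
struct ConsensusStub {
  bool enabled;
  bool prioritise_partition_size;
  bool IsEnabled() const { return enabled; }
  bool PrioritisePartitionSize() const { return prioritise_partition_size; }
};

// Returns how long to wait before attempting promotion. The 4 second
// settle time is taken from the patch; it applies only when the consensus
// service is enabled and partition size is prioritised.
std::chrono::seconds SettleDelay(const ConsensusStub& consensus) {
  const std::chrono::seconds kTopologySettleTime(4);
  if (consensus.IsEnabled() && consensus.PrioritisePartitionSize()) {
    return kTopologySettleTime;
  }
  return std::chrono::seconds(0);
}
```

A caller would pass the result to std::this_thread::sleep_for before promoting the node. Keeping the decision in a pure function makes it testable without actually sleeping; with the stub configured as {true, false}, the delay is zero, which corresponds to the faster failover the commit message describes.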