Re: [devel] [PATCH 0/4] Review Request for amfd: improve controller failover behavior V2 [#3029]
Hi Gary

ACK (not tested)

Regards
Canh

-----Original Message-----
From: Gary Lee
Sent: Tuesday, July 9, 2019 1:21 PM
To: canh.v.tru...@dektech.com.au; minh.c...@dektech.com.au; hans.nordeb...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Gary Lee
Subject: [PATCH 0/4] Review Request for amfd: improve controller failover behavior V2 [#3029]

Summary: amfd: improve controller failover behavior [#3029]
Review request for Ticket(s): 3029
Peer Reviewer(s): Canh, Minh, Hans
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3029
Base revision: 71852f322b42437f074bfa4c618c021798357143
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        y
 Core libraries          y
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

revision 4feee2b631afa3393ae9e53fd6575c3768861dca
Author: Gary Lee
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: make wait time configurable [#3029]

If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is enabled, make the time
that we wait for MDS node events configurable.

revision 2c419ba5fffb85272f0d15118b561bcfc1de4814
Author: Gary Lee
Date:   Tue, 9 Jul 2019 14:38:49 +1000

amfd: improve controller failover behavior [#3029]

If consensus service is enabled, only perform node failover after the
peer controller has self-fenced (after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME
seconds).

This also means if node failover delay is set to a large value, we do
not unnecessarily wait too long before failing over assignments
previously assigned to the peer controller.

Remove unused fmd_conf_file variable.

Change some LOG_ER calls to LOG_WA.
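The timing rule described in the amfd commit message can be sketched roughly as below. This is only an illustration and not the code in the patch: the function name and parameters are hypothetical, and the fallback to the configured node failover delay when the consensus service is disabled is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper, NOT part of OpenSAF: how long (in seconds) amfd
// would wait before failing over the peer controller's assignments.
// Assumption: with the consensus service disabled, the configured node
// failover delay applies unchanged.
constexpr uint32_t FailoverWaitSeconds(bool consensus_enabled,
                                       uint32_t takeover_request_valid_time,
                                       uint32_t node_failover_delay) {
  // With consensus enabled, the peer is expected to have self-fenced
  // after twice the takeover request validity period, so a large node
  // failover delay no longer holds up the failover.
  return consensus_enabled ? 2 * takeover_request_valid_time
                           : node_failover_delay;
}

// Example values only (not defaults taken from fmd.conf).
static_assert(FailoverWaitSeconds(true, 20, 600) == 40,
              "consensus enabled: wait 2 * valid time");
static_assert(FailoverWaitSeconds(false, 20, 600) == 600,
              "consensus disabled: fall back to node failover delay");
```

With a validity time of 20 seconds and a node failover delay of 600 seconds, the sketch waits 40 seconds when the consensus service is enabled, instead of the full 600.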

revision 7c4fff483477082ca66a26f921a50b3bc1240538
Author: Gary Lee
Date:   Tue, 9 Jul 2019 14:38:49 +1000

fmd: add active promotion supervision timer [#3029]

Add supervision timer so controller will reboot if it cannot obtain
the consensus lock within the allocation period
(2 * FMS_TAKEOVER_REQUEST_VALID_TIME).

The peer controller can then safely perform a node failover after this
period of time.

revision 8b596a228402ff99b26906138daf920c23e965e7
Author: Gary Lee
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: add function to return takeover request expiry time [#3029]

Complete diffstat:
------------------
 src/amf/amfd/cb.h                  |  1 -
 src/amf/amfd/clm.cc                |  4 +-
 src/amf/amfd/main.cc               |  1 -
 src/amf/amfd/ndfsm.cc              |  8 ++--
 src/amf/amfd/ndproc.cc             | 19
 src/amf/amfd/node_state.cc         | 23 +-
 src/amf/amfd/node_state_machine.cc | 19
 src/amf/amfd/node_state_machine.h  |  2 +
 src/amf/amfd/proc.h                |  1 +
 src/fm/fmd/fm_cb.h                 |  2 +
 src/fm/fmd/fm_main.cc              | 14 +-
 src/fm/fmd/fm_rda.cc               | 89 ++
 src/fm/fmd/fmd.conf                |  5 +++
 src/osaf/consensus/consensus.cc    | 13 ++
 src/osaf/consensus/consensus.h     |  4 ++
 src/rde/rded/role.cc               |  4 +-
 16 files changed, 160 insertions(+), 49 deletions(-)

Testing Commands:
-----------------
1) Ensure a 2N application is active on the standby controller, and
   standby on the active controller
2) Isolate active & standby controller

Testing, Expected Results:
--------------------------
amfd should fail over the 2N application only after
2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds

Conditions of Submission:
-------------------------
ack from any reviewer

Arch        Built     Started    Linux distro
-------------------------------------------
mips          n          n
mips64        n          n
x86           n          n
x86_64        y          y
powerpc       n          n
powerpc64     n          n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These