Re: [devel] [PATCH 0/4] Review Request for amfd: improve controller failover behavior V2 [#3029]

2019-07-09 Thread Canh Van Truong
Hi Gary

ACK (not tested)

Regards
Canh

-----Original Message-----
From: Gary Lee  
Sent: Tuesday, July 9, 2019 1:21 PM
To: canh.v.tru...@dektech.com.au; minh.c...@dektech.com.au;
hans.nordeb...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Gary Lee 
Subject: [PATCH 0/4] Review Request for amfd: improve controller failover
behavior V2 [#3029]

Summary: amfd: improve controller failover behavior [#3029]
Review request for Ticket(s): 3029
Peer Reviewer(s): Canh, Minh, Hans 
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3029
Base revision: 71852f322b42437f074bfa4c618c021798357143
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area          Impact y/n
---------------------------------
 Docs                      n
 Build system              n
 RPM/packaging             n
 Configuration files       n
 Startup scripts           n
 SAF services              y
 OpenSAF services          y
 Core libraries            y
 Samples                   n
 Tests                     n
 Other                     n


Comments (indicate scope for each "y" above):
---------------------------------------------

revision 4feee2b631afa3393ae9e53fd6575c3768861dca
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: make wait time configurable [#3029]

If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is enabled,
make the time that we wait for MDS node events configurable.



revision 2c419ba5fffb85272f0d15118b561bcfc1de4814
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

amfd: improve controller failover behavior [#3029]

If consensus service is enabled, only perform node failover
after peer controller has self-fenced
(after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds).

This also means if node failover delay is set to a large value,
we do not unnecessarily wait too long before failing over assignments
previously assigned to the peer controller.

Remove unused fmd_conf_file variable.

Change some LOG_ER calls to LOG_WA.



revision 7c4fff483477082ca66a26f921a50b3bc1240538
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

fmd: add active promotion supervision timer [#3029]

Add a supervision timer so that the controller reboots if it cannot
obtain the consensus lock within the allocation period
(2 * FMS_TAKEOVER_REQUEST_VALID_TIME).

The peer controller can then safely perform a node failover
after this period of time.



revision 8b596a228402ff99b26906138daf920c23e965e7
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: add function to return takeover request expiry time [#3029]



Complete diffstat:
------------------
 src/amf/amfd/cb.h                  |  1 -
 src/amf/amfd/clm.cc                |  4 +-
 src/amf/amfd/main.cc               |  1 -
 src/amf/amfd/ndfsm.cc              |  8 ++--
 src/amf/amfd/ndproc.cc             | 19
 src/amf/amfd/node_state.cc         | 23 +-
 src/amf/amfd/node_state_machine.cc | 19
 src/amf/amfd/node_state_machine.h  |  2 +
 src/amf/amfd/proc.h                |  1 +
 src/fm/fmd/fm_cb.h                 |  2 +
 src/fm/fmd/fm_main.cc              | 14 +-
 src/fm/fmd/fm_rda.cc               | 89 ++
 src/fm/fmd/fmd.conf                |  5 +++
 src/osaf/consensus/consensus.cc    | 13 ++
 src/osaf/consensus/consensus.h     |  4 ++
 src/rde/rded/role.cc               |  4 +-
 16 files changed, 160 insertions(+), 49 deletions(-)


Testing Commands:
-----------------
1) Ensure a 2N application is active on the standby controller
   and standby on the active controller.
2) Isolate the active and standby controllers.


Testing, Expected Results:
--------------------------
amfd should fail over the 2N application only after
2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds.

Conditions of Submission:
-------------------------
ack from any reviewer

Arch        Built   Started   Linux distro
------------------------------------------
mips        n       n
mips64      n       n
x86         n       n
x86_64      y       y
powerpc     n       n
powerpc64   n       n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.

[devel] [PATCH 0/4] Review Request for amfd: improve controller failover behavior V2 [#3029]

2019-07-09 Thread Gary Lee
Summary: amfd: improve controller failover behavior [#3029]
Review request for Ticket(s): 3029
Peer Reviewer(s): Canh, Minh, Hans 
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3029
Base revision: 71852f322b42437f074bfa4c618c021798357143
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area          Impact y/n
---------------------------------
 Docs                      n
 Build system              n
 RPM/packaging             n
 Configuration files       n
 Startup scripts           n
 SAF services              y
 OpenSAF services          y
 Core libraries            y
 Samples                   n
 Tests                     n
 Other                     n


Comments (indicate scope for each "y" above):
---------------------------------------------

revision 4feee2b631afa3393ae9e53fd6575c3768861dca
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: make wait time configurable [#3029]

If FMS_TAKEOVER_PRIORITISE_PARTITION_SIZE is enabled,
make the time that we wait for MDS node events configurable.
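
As an illustration only, a configurable wait time of this kind could be read
as sketched below. This assumes the setting reaches the daemon as an
environment variable; the helper name WaitTimeSeconds and the variable name
used in the usage note are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not from the patch): returns the value of the
// environment variable `name` as a whole number of seconds. Falls back to
// `default_seconds` if the variable is unset, empty, not a plain decimal
// number, or negative.
long WaitTimeSeconds(const char* name, long default_seconds) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return default_seconds;
  char* end = nullptr;
  long parsed = std::strtol(value, &end, 10);
  // Reject trailing garbage such as "5s" and negative values.
  if (*end != '\0' || parsed < 0) return default_seconds;
  return parsed;
}
```

A caller would use something like WaitTimeSeconds("EXAMPLE_MDS_WAIT_TIME", 4),
where both the variable name and the default of 4 are placeholders.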



revision 2c419ba5fffb85272f0d15118b561bcfc1de4814
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

amfd: improve controller failover behavior [#3029]

If consensus service is enabled, only perform node failover
after peer controller has self-fenced
(after 2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds).

This also means if node failover delay is set to a large value,
we do not unnecessarily wait too long before failing over assignments
previously assigned to the peer controller.

Remove unused fmd_conf_file variable.

Change some LOG_ER calls to LOG_WA.
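
The timing rule in this commit message can be sketched roughly as follows.
This is a minimal illustration, not the amfd code: the function name
FailoverWait and its parameters are hypothetical, and it assumes that with
the consensus service enabled the wait is exactly the self-fencing window,
independent of the configured node failover delay.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Hypothetical sketch (not from the patch): how long to wait before failing
// over assignments of a lost peer controller.
// Assumption: with the consensus service enabled, the wait is the peer's
// self-fencing window, 2 * FMS_TAKEOVER_REQUEST_VALID_TIME; otherwise the
// configured node failover delay applies.
seconds FailoverWait(bool consensus_enabled, seconds takeover_valid_time,
                     seconds node_failover_delay) {
  assert(takeover_valid_time.count() >= 0);
  if (consensus_enabled) return 2 * takeover_valid_time;
  return node_failover_delay;
}
```

For example, with a valid time of 20 seconds and a node failover delay of
600 seconds, this sketch waits 40 seconds when consensus is enabled, which
matches the point about not waiting unnecessarily long.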



revision 7c4fff483477082ca66a26f921a50b3bc1240538
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

fmd: add active promotion supervision timer [#3029]

Add a supervision timer so that the controller reboots if it cannot
obtain the consensus lock within the allocation period
(2 * FMS_TAKEOVER_REQUEST_VALID_TIME).

The peer controller can then safely perform a node failover
after this period of time.
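
A rough sketch of the supervision check described above, under stated
assumptions: the type PromotionSupervision and its members are hypothetical,
the reboot itself is left out, and treating the exact deadline instant as
expired is a choice made for this example rather than something the commit
message specifies.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical sketch (not from the patch) of supervising an active
// promotion attempt.
struct PromotionSupervision {
  Clock::time_point started;                 // when promotion began
  std::chrono::seconds takeover_valid_time;  // FMS_TAKEOVER_REQUEST_VALID_TIME

  // End of the allocation period: start + 2 * FMS_TAKEOVER_REQUEST_VALID_TIME.
  Clock::time_point Deadline() const {
    assert(takeover_valid_time.count() >= 0);
    return started + 2 * takeover_valid_time;
  }

  // True if the consensus lock is still not held once the allocation period
  // has elapsed, in which case the controller should reboot.
  bool ShouldReboot(bool lock_obtained, Clock::time_point now) const {
    return !lock_obtained && now >= Deadline();
  }
};
```

Because the controller gives up after the same 2 * valid-time window that the
peer waits for, the peer can treat the end of that window as the point at
which failover is safe.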



revision 8b596a228402ff99b26906138daf920c23e965e7
Author: Gary Lee 
Date:   Tue, 9 Jul 2019 14:38:49 +1000

osaf: add function to return takeover request expiry time [#3029]



Complete diffstat:
------------------
 src/amf/amfd/cb.h                  |  1 -
 src/amf/amfd/clm.cc                |  4 +-
 src/amf/amfd/main.cc               |  1 -
 src/amf/amfd/ndfsm.cc              |  8 ++--
 src/amf/amfd/ndproc.cc             | 19
 src/amf/amfd/node_state.cc         | 23 +-
 src/amf/amfd/node_state_machine.cc | 19
 src/amf/amfd/node_state_machine.h  |  2 +
 src/amf/amfd/proc.h                |  1 +
 src/fm/fmd/fm_cb.h                 |  2 +
 src/fm/fmd/fm_main.cc              | 14 +-
 src/fm/fmd/fm_rda.cc               | 89 ++
 src/fm/fmd/fmd.conf                |  5 +++
 src/osaf/consensus/consensus.cc    | 13 ++
 src/osaf/consensus/consensus.h     |  4 ++
 src/rde/rded/role.cc               |  4 +-
 16 files changed, 160 insertions(+), 49 deletions(-)


Testing Commands:
-----------------
1) Ensure a 2N application is active on the standby controller
   and standby on the active controller.
2) Isolate the active and standby controllers.


Testing, Expected Results:
--------------------------
amfd should fail over the 2N application only after
2 * FMS_TAKEOVER_REQUEST_VALID_TIME seconds.

Conditions of Submission:
-------------------------
ack from any reviewer

Arch        Built   Started   Linux distro
------------------------------------------
mips        n       n
mips64      n       n
x86         n       n
x86_64      y       y
powerpc     n       n
powerpc64   n       n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
cosmetic code cleanup changes. These