[devel] [PATCH 0/1] Review Request for rde: Add timeout of waiting for peer info [#3263]

2021-06-30 Thread Minh Chau
Summary: rde: Add timeout of waiting for peer info [#3263]
Review request for Ticket(s): 3263
Peer Reviewer(s): Surbhi, Thang, Hieu
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-3263
Base revision: 854a8e03042d6a53a45b903262f5197a52a87525
Personal repository: git://git.code.sf.net/u/minh-chau/review


Impacted area   Impact y/n

 Docsn
 Build systemn
 RPM/packaging   n
 Configuration files n
 Startup scripts n
 SAF servicesy
 OpenSAF servicesn
 Core libraries  n
 Samples n
 Tests   n
 Other   n


Comments (indicate scope for each "y" above):
-
*** EXPLAIN/COMMENT THE PATCH SERIES HERE ***

revision 6b4028c2d5c405a80fcf7ab6297678ae99900b6e
Author: Minh Chau 
Date:   Mon, 28 Jun 2021 13:41:45 +1000

rde: Add timeout of waiting for peer info [#3263]

This ticket revisit the waiting for peer info and
fix the problem of disordered peer_up and peer info
in the commit d1593b03b3c9bec292b14dde65264c261760bf46



Complete diffstat:
--
 src/rde/rded/rde_main.cc |  1 +
 src/rde/rded/role.cc | 63 +++-
 src/rde/rded/role.h  |  7 ++
 3 files changed, 70 insertions(+), 1 deletion(-)


Testing Commands:
-
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***


Testing, Expected Results:
--
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***


Conditions of Submission:
-
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***


Arch  Built StartedLinux distro
---
mipsn  n
mips64  n  n
x86 n  n
x86_64  n  n
powerpc n  n
powerpc64   n  n


Reviewer Checklist:
---
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer have a badly configured date and time; confusing the
the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
do not contain the patch that updates the Doxygen manual.



___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


[devel] [PATCH 1/1] rde: Add timeout of waiting for peer info [#3263]

2021-06-30 Thread Minh Chau
This ticket revisit the waiting for peer info and
fix the problem of disordered peer_up and peer info
in the commit d1593b03b3c9bec292b14dde65264c261760bf46
---
 src/rde/rded/rde_main.cc |  1 +
 src/rde/rded/role.cc | 63 +++-
 src/rde/rded/role.h  |  7 +
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/src/rde/rded/rde_main.cc b/src/rde/rded/rde_main.cc
index 8ed6b046e..33dd645e2 100644
--- a/src/rde/rded/rde_main.cc
+++ b/src/rde/rded/rde_main.cc
@@ -125,6 +125,7 @@ static void handle_mbx_event() {
 }
 case RDE_MSG_PEER_DOWN:
   LOG_NO("Peer down on node 0x%x", msg->fr_node_id);
+  role->RemovePeer(msg->fr_node_id);
   break;
 case RDE_MSG_NEW_ACTIVE_CALLBACK: {
   const std::string my_node = base::Conf::NodeName();
diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index 3732be449..344702e63 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -196,9 +196,13 @@ Role::Role(NODE_ID own_node_id)
   discover_peer_timeout_{base::GetEnv("RDE_DISCOVER_PEER_TIMEOUT",
   kDefaultDiscoverPeerTimeout)},
   pre_active_script_timeout_{base::GetEnv(
-  "RDE_PRE_ACTIVE_SCRIPT_TIMEOUT", kDefaultPreActiveScriptTimeout)} {}
+  "RDE_PRE_ACTIVE_SCRIPT_TIMEOUT", kDefaultPreActiveScriptTimeout)},
+  received_peer_info_{true},
+  peer_info_wait_time_{},
+  peer_info_wait_timeout_ {kDefaultWaitPeerInfoTimeout} {}
 
 timespec* Role::Poll(timespec* ts) {
+  TRACE_ENTER();
   timespec* timeout = nullptr;
   if (role_ == PCS_RDA_UNDEFINED) {
 timespec now = base::ReadMonotonicClock();
@@ -238,6 +242,25 @@ timespec* Role::Poll(timespec* ts) {
 cb->state_refresh_thread_started = true;
 std::thread(::RefreshConsensusState, this, cb).detach();
   }
+  if (consensus_service.IsEnabled() == false) {
+// We are already ACTIVE, and has just discovered a new node
+// which makes the election_end_time_ reset
+if (received_peer_info_ == false) {
+  timespec now = base::ReadMonotonicClock();
+  if (peer_info_wait_time_ >= now) {
+*ts = peer_info_wait_time_ - now;
+timeout = ts;
+  } else {
+// Timeout but haven't received peer info
+// The peer RDE could be in ACTIVE
+// thus self-fence to avoid split-brain risk
+LOG_ER("Discovery peer up without peer info. Risk in split-brain,"
+"rebooting this node");
+opensaf_quick_reboot("Probable split-brain due to "
+"unknown RDE peer info");
+  }
+}
+  }
 }
   }
   return timeout;
@@ -251,12 +274,25 @@ void Role::ExecutePreActiveScript() {
 }
 
 void Role::AddPeer(NODE_ID node_id) {
+  TRACE_ENTER();
   auto result = known_nodes_.insert(node_id);
   if (result.second) {
 ResetElectionTimer();
+if (role_ == PCS_RDA_ACTIVE) {
+  ResetPeerInfoWaitTimer();
+  received_peer_info_ = false;
+}
   }
 }
 
+void Role::RemovePeer(NODE_ID node_id) {
+  TRACE_ENTER();
+  if (received_peer_info_ == false && role_ != PCS_RDA_ACTIVE) {
+StopPeerInfoWaitTimer();
+  }
+  known_nodes_.erase(node_id);
+}
+
 // call from main thread only
 bool Role::IsCandidate() {
   TRACE_ENTER();
@@ -330,10 +366,24 @@ uint32_t Role::SetRole(PCS_RDA_ROLE new_role) {
 }
 
 void Role::ResetElectionTimer() {
+  TRACE_ENTER();
   election_end_time_ = base::ReadMonotonicClock() +
base::MillisToTimespec(discover_peer_timeout_);
 }
 
+void Role::ResetPeerInfoWaitTimer() {
+  TRACE_ENTER();
+  LOG_NO("Start/restart waiting peer info timer");
+  peer_info_wait_time_ = base::ReadMonotonicClock() +
+   base::MillisToTimespec(peer_info_wait_timeout_);
+}
+
+void Role::StopPeerInfoWaitTimer() {
+  TRACE_ENTER();
+  // Turn off peer_info_timer
+  received_peer_info_ = true;
+}
+
 uint32_t Role::UpdateMdsRegistration(PCS_RDA_ROLE new_role,
  PCS_RDA_ROLE old_role) {
   uint32_t rc = NCSCC_RC_SUCCESS;
@@ -357,6 +407,7 @@ uint32_t Role::UpdateMdsRegistration(PCS_RDA_ROLE new_role,
 
 void Role::SetPeerState(PCS_RDA_ROLE node_role, NODE_ID node_id,
 uint64_t peer_promote_pending) {
+  TRACE_ENTER();
   if (role() == PCS_RDA_UNDEFINED) {
 bool give_up = false;
 RDE_CONTROL_BLOCK *cb = rde_get_control_block();
@@ -372,6 +423,14 @@ void Role::SetPeerState(PCS_RDA_ROLE node_role, NODE_ID 
node_id,
 }
 if (node_role == PCS_RDA_ACTIVE || node_role == PCS_RDA_STANDBY ||
 give_up) {
+  // broadcast QUIESCED role to all peers to stop their waiting peer
+  // info timer
+  rde_msg peer_info_req;
+  peer_info_req.type = RDE_MSG_PEER_INFO_RESP;
+  peer_info_req.info.peer_info.ha_role = PCS_RDA_QUIESCED;
+  peer_info_req.info.peer_info.promote_pending = 0;
+  rde_mds_broadcast(_info_req);
+