Hi Thang,

Ack from me.

Thanks

Minh

On 11/6/19 4:11 pm, Minh Hon Chau wrote:
Hi Thang,

I can see it's a race in main thread that how amfd processes the mds down and clm callback.

Node is going down

<143>1 2019-06-11T15:16:42.157517+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38507"] 275:amf/amfd/mds.cc:398 >> avd_mds_svc_evt <143>1 2019-06-11T15:16:42.157526+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38508"] 275:amf/amfd/mds.cc:459 TR avnd 2030f000000bd down <143>1 2019-06-11T15:16:42.157535+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38509"] 275:amf/amfd/mds.cc:0 << avd_mds_svc_evt

<143>1 2019-06-11T15:16:48.332481+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38623"] 272:amf/amfd/clm.cc:226 >> clm_track_cb: '0' '4' '1' <143>1 2019-06-11T15:16:48.33249+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38624"] 272:amf/amfd/clm.cc:242 TR numberOfMembers:'4', numberOfItems:'1' <143>1 2019-06-11T15:16:48.3325+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38625"] 272:amf/amfd/clm.cc:248 TR i = 0, node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3 <143>1 2019-06-11T15:16:48.33251+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38626"] 272:amf/amfd/clm.cc:332 TR Node Left: rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855 <143>1 2019-06-11T15:16:48.332519+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38627"] 272:amf/amfd/clm.cc:188 >> clm_node_exit_complete: 2030f <143>1 2019-06-11T15:16:48.332534+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38628"] 272:amf/amfd/ndproc.cc:1267 >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster' <143>1 2019-06-11T15:16:48.332545+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="38629"] 272:amf/amfd/ndfsm.cc:1153 >> avd_node_mark_absent

Node is coming up again

<143>1 2019-06-11T15:16:48.34867+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39826"] 272:amf/amfd/clm.cc:226 >> clm_track_cb: '0' '4' '1' <143>1 2019-06-11T15:16:48.348674+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39827"] 272:amf/amfd/clm.cc:242 TR numberOfMembers:'5', numberOfItems:'1' <143>1 2019-06-11T15:16:48.348678+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39828"] 272:amf/amfd/clm.cc:248 TR i = 0, node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:2 <143>1 2019-06-11T15:16:48.348685+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39829"] 272:amf/amfd/node.cc:53 TR added node 131855 <143>1 2019-06-11T15:16:48.348689+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39830"] 272:amf/amfd/clm.cc:417 TR Node Joined 'safNode=PL-3,safCluster=myClmCluster' '36'

Now amfd processes the mds down in main thread, its a race here then the @node_info.member set to FALSE

<143>1 2019-06-11T15:16:48.351948+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39971"] 272:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2030f, 0x558e549a1650 <143>1 2019-06-11T15:16:48.351954+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39972"] 272:amf/amfd/ndproc.cc:1267 >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster' <143>1 2019-06-11T15:16:48.351959+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="39973"] 272:amf/amfd/ndfsm.cc:1153 >> avd_node_mark_absent

Now the mds up comes, node_up come, but the node is not a clm member

<143>1 2019-06-11T15:16:48.701771+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40552"] 275:amf/amfd/mds.cc:398 >> avd_mds_svc_evt <143>1 2019-06-11T15:16:48.701791+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40553"] 275:amf/amfd/mds.cc:0 << avd_mds_svc_evt

<143>1 2019-06-11T15:16:48.706254+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40560"] 272:amf/amfd/ndfsm.cc:743 >> avd_mds_avnd_up_evh <143>1 2019-06-11T15:16:48.706271+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40561"] 275:amf/amfd/ndmsg.cc:389 << avd_n2d_msg_rcv <143>1 2019-06-11T15:16:48.706288+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40562"] 272:amf/amfd/ndfsm.cc:757 TR amfnd on 2030f is up <143>1 2019-06-11T15:16:48.706298+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40563"] 272:amf/amfd/ndfsm.cc:0 << avd_mds_avnd_up_evh

<143>1 2019-06-11T15:16:48.707145+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40596"] 272:amf/amfd/ndfsm.cc:275 >> avd_node_up_evh: from 2030f, safAmfNode=PL-3,safAmfCluster=myAmfCluster <143>1 2019-06-11T15:16:48.707153+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40597"] 272:amf/amfd/ndfsm.cc:292 TR leds_set 0 <143>1 2019-06-11T15:16:48.70716+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40598"] 272:amf/amfd/ndfsm.cc:308 TR node_id '2030f' not in failover_list. <141>1 2019-06-11T15:16:48.707185+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40599"] 272:amf/amfd/ndfsm.cc:232 NO Received node_up from 2030f: msg_id 1 <140>1 2019-06-11T15:16:48.7072+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40600"] 272:amf/amfd/ndfsm.cc:387 WA Not a Cluster Member dropping the msg <143>1 2019-06-11T15:16:48.707206+10:00 SC-1 osafamfd 272 osafamfd [meta sequenceId="40601"] 272:amf/amfd/ndfsm.cc:533 << avd_node_up_evh

The patch looks OK, I'm doing a bit more testing and will get back to you.

Thanks

Minh

On 7/5/19 5:41 am, thang.d.nguyen wrote:
When the PBE hung, amfd can process the events with
below order when a node was started then stop then started
- clm_track_cb for node down event
- clm_track_cb for second node up event
- avd_mds_avnd_down_evh was called to process amfnd down event

And it cause the node can not join the cluster.
---
  src/amf/amfd/ndfsm.cc | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index 8c8f3c5..7099196 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -800,6 +800,13 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
        daemon_exit();
      }
  +    if (node->node_state == AVD_AVND_STATE_ABSENT) {
+      TRACE("Ignore '%s' amfnd down event since node state absent",
+            node->node_name.c_str());
+      TRACE_LEAVE();
+      return;
+    }
+
      if (cb->failover_list.find(evt->info.node_id) != cb->failover_list.end()) {
        std::shared_ptr<NodeStateMachine> failed_node =
          cb->failover_list.at(evt->info.node_id);


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to