- Description has changed:
Diff:
~~~~
--- old
+++ new
@@ -1,13 +1,13 @@
-After the active SC rebooted, the standby SC executed failover to active. The
new active SC notified a PL left cluster but that PL was still in cluster. The
reason is the connection between the standby SC and that PL was dropped in the
past, but that PL still connected with the active SC. It led the standby SC
considered that PL is down regardless the connection was established after
that. The standby SC only removes a down PL when it receives a check point from
the active SC. However, the active SC will not send that check point because it
still connect with the PL. During failover, the standby SC will notify all
recorded down nodes left cluster.
+After the active SC rebooted, the standby SC executed failover to active. The
new active SC notified a PL left cluster but that PL was still in cluster. The
reason is the connection between the standby SC and that PL was dropped in the
past, but that PL still connected with the active SC. It led the standby SC
considered that PL absented regardless the connection was established after
that. The standby SC only change the PL state when it receives a check point
from the active SC. However, the active SC will not send that check point
because it still connect with the PL. During failover, the standby SC will
notify all recorded absent nodes left cluster.
<pre>
- down list:PL-3 down list:PL-3
-SC-1(Act)----SC-2(Stb) SC-1(Act)----SC-2(Stb)
SC-1(Act)----SC-2(Stb)
- \ / \ \ /
- \ / \ \ /
- PL-3 PL-3 PL-3
+ absent nodes:PL-3 absent
nodes:PL-3
+SC-1(Act)----SC-2(Stb) SC-1(Act)----SC-2(Stb) SC-1(Act)----SC-2(Stb)
+ \ / \ \ /
+ \ / \ \ /
+ PL-3 PL-3 PL-3
- down list:PL-3,SC-1
- SC-1(Down) SC-2(Stb) SC-1(Stb)----SC-2(Atc)
+ absent nodes:PL-3,SC-1
+ SC-1(Down) SC-2(Stb) SC-1(Stb)----SC-2(Act)
/ \ /
/ \ /
PL-3 PL-3
~~~~
- **Component**: clm --> amf
- **Part**: - --> d
---
**[tickets:#3309] clm: the payload node unexpectedly left cluster right after failover**
**Status:** accepted
**Milestone:** 5.22.04
**Created:** Thu Feb 24, 2022 03:57 AM UTC by Hieu Hong Hoang
**Last Updated:** Thu Feb 24, 2022 03:57 AM UTC
**Owner:** Hieu Hong Hoang
After the active SC rebooted, the standby SC failed over to active. The new
active SC then reported that a PL had left the cluster, although that PL was
still in the cluster. The cause is that the connection between the standby SC
and that PL had been dropped earlier, while the PL remained connected to the
active SC. As a result, the standby SC recorded that PL as absent and kept that
record even after the connection was re-established. The standby SC only
changes a PL's state when it receives a checkpoint from the active SC, but the
active SC never sends that checkpoint because it is still connected to the PL.
During failover, the standby SC reports every node recorded as absent as having
left the cluster.
<pre>
                              absent nodes:PL-3          absent nodes:PL-3
SC-1(Act)----SC-2(Stb)     SC-1(Act)----SC-2(Stb)     SC-1(Act)----SC-2(Stb)
     \        /                 \                          \        /
      \      /                   \                          \      /
        PL-3                       PL-3                       PL-3

        absent nodes:PL-3,SC-1
SC-1(Down)    SC-2(Stb)     SC-1(Stb)----SC-2(Act)
               /                 \        /
              /                   \      /
        PL-3                       PL-3
</pre>
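
The scenario above can be replayed with a small model. This is only a sketch of the behaviour as described in this ticket, not OpenSAF code: the class `StandbyView`, its methods, and the `clear_on_reconnect` switch are hypothetical names introduced for illustration, and the switch shows one conceivable remedy rather than the actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyView:
    """Hypothetical model of the standby SC's absent-node bookkeeping."""

    absent: set = field(default_factory=set)

    def on_lost_contact(self, node: str) -> None:
        # The standby records a node as absent when its own link to it drops.
        self.absent.add(node)

    def on_established_contact(self, node: str, clear_on_reconnect: bool) -> None:
        # As described in the ticket, reconnecting does not clear the record.
        # clear_on_reconnect=True models an assumed remedy for comparison.
        if clear_on_reconnect:
            self.absent.discard(node)

    def on_checkpoint_present(self, node: str) -> None:
        # Only a checkpoint from the active SC changes the recorded state.
        self.absent.discard(node)

    def failover(self) -> list:
        # Every node still recorded as absent is reported as having left.
        return sorted(self.absent)


def simulate(clear_on_reconnect: bool) -> list:
    """Replay the ticket's sequence and return the nodes reported as left."""
    sc2 = StandbyView()
    sc2.on_lost_contact("PL-3")  # SC-2 loses contact with PL-3
    sc2.on_established_contact("PL-3", clear_on_reconnect)  # link comes back
    # No checkpoint arrives: SC-1 never lost PL-3, so it has nothing to send.
    sc2.on_lost_contact("SC-1")  # the active SC reboots
    return sc2.failover()


if __name__ == "__main__":
    print(simulate(clear_on_reconnect=False))  # ['PL-3', 'SC-1']
    print(simulate(clear_on_reconnect=True))   # ['SC-1']
```

In the first run PL-3 is reported as having left although it is reachable, which matches the log analysis below.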
Log analysis:
* SC-2 (standby SC) lost contact with PL-3
2022-02-23 09:03:24.114 SC-2 osafdtmd[320]: NO Lost contact with 'PL-3'
* SC-2 (standby SC) re-established contact with PL-3
2022-02-23 09:03:24.513 SC-2 osafdtmd[320]: NO Established contact with 'PL-3'
* SC-2 finished the failover:
2022-02-23 09:03:25.582 SC-2 osafamfd[422]: NO FAILOVER StandBy --> Active DONE!
* SC-2 reported that PL-3 left the cluster:
2022-02-23 09:03:25.679 SC-2 osafamfd[422]: NO Node 'PL-3' left the cluster
* State of nodes:
safAmfNode=PL-3,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=DISABLED(2)
safAmfNode=PL-4,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=PL-5,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=SC-1,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=SC-2,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
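
To spot the affected node quickly, a listing like the one above can be scanned for nodes whose operational state is not ENABLED. This is a minimal sketch that assumes the text layout shown in this ticket (a `safAmfNode=` line followed by its attribute lines); the function name `disabled_nodes` is hypothetical and the real command output may be indented or contain further fields.

```python
def disabled_nodes(amf_state_output: str) -> list:
    """Return the names of nodes whose saAmfNodeOperState is not ENABLED."""
    result = []
    current = None
    for raw in amf_state_output.splitlines():
        line = raw.strip()
        if line.startswith("safAmfNode="):
            # DN looks like safAmfNode=PL-3,safAmfCluster=myAmfCluster
            current = line.split(",", 1)[0].split("=", 1)[1]
        elif line.startswith("saAmfNodeOperState=") and current is not None:
            state = line.split("=", 1)[1]
            if not state.startswith("ENABLED"):
                result.append(current)
    return result


# Excerpt copied from the node state listing in this ticket.
SAMPLE = """\
safAmfNode=PL-3,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=DISABLED(2)
safAmfNode=PL-4,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
"""

if __name__ == "__main__":
    print(disabled_nodes(SAMPLE))  # ['PL-3']
```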
Steps to reproduce:
1. Drop the connection between the standby SC-2 and PL-3
2. Reconnect SC-2 with PL-3
3. Execute "immdump" on a node (immd on the standby SC-2 will then remove
PL-3 from its list of detached nodes)
4. Reboot the active SC-1
5. Execute "amf-state node" on a node