[tickets] [opensaf:tickets] #2971 amf: standby amfd crash during failover to become active

2018-11-25 Thread Gary Lee via Opensaf-tickets
- **status**: accepted --> review



---

** [tickets:#2971] amf: standby amfd crash during failover to become active**

**Status:** review
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 01:15 AM UTC by Thuan
**Last Updated:** Mon Nov 26, 2018 03:11 AM UTC
**Owner:** Gary Lee


PL-9 was deleted from the cluster, but the standby amfd somehow still kept the node.
When a failover then happened, the standby amfd crashed as follows:
~~~
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'SC-1' left the cluster
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active DONE!
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'PL-9' left the cluster
Nov 20 04:09:14 SC-2 osafamfd[5079]: src/amf/amfd/sgproc.cc:2187: avd_node_down_mw_susi_failover: Assertion 'avnd->list_of_ncs_su.empty() != true' failed.
~~~
The root cause is the amfnd down event handled on SC-2 conflicting with the node checkpoint update received from SC-1:
~~~
<143>1 2018-11-24T14:43:17.870243+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7238"] 261:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x563eacfe3b50
<143>1 2018-11-24T14:43:17.870254+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7239"] 261:amf/amfd/ndfsm.cc:853 << avd_mds_avnd_down_evh

<143>1 2018-11-24T14:43:17.874433+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22818"] 285:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x5601d0d9cb90
<143>1 2018-11-24T14:43:17.874439+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22819"] 285:amf/amfd/ndproc.cc:1235 >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:17.874443+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22820"] 285:amf/amfd/ndfsm.cc:1149 >> avd_node_mark_absent

<141>1 2018-11-24T14:43:17.88228+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22908"] 285:amf/amfd/ndfsm.cc:1154 NO Node 'PL-5' left the cluster
<143>1 2018-11-24T14:43:17.882284+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22909"] 285:mbc/mbcsv_api.c:798 >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified

<143>1 2018-11-24T14:43:17.882637+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22943"] 285:amf/amfd/ndfsm.cc:1168 << avd_node_mark_absent

<143>1 2018-11-24T14:43:17.900529+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7564"] 261:amf/amfd/ckpt_updt.cc:49 >> avd_ckpt_node: update - 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:17.900575+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7577"] 261:amf/amfd/ckpt_updt.cc:78 << avd_ckpt_node: 1

<143>1 2018-11-24T14:43:39.417927+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8716"] 261:amf/amfd/node.cc:500 >> node_ccb_completed_delete_hdlr: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:39.417932+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8717"] 261:amf/amfd/imm.cc:2306 TR Node 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' is still cluster member
~~~


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #2972 amfd: unsafe access to cb

2018-11-25 Thread Gary Lee via Opensaf-tickets
develop:

commit 8c5f9ab333231b093489e60071083a1452b93d0e
Author: thuan.tran 
Date:   Mon Nov 26 16:14:36 2018 +1100

amf: should not run check_nodes_after_reinit_imm() out of main [#2972]


---

** [tickets:#2972] amfd: unsafe access to cb**

**Status:** review
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 07:23 AM UTC by Gary Lee
**Last Updated:** Mon Nov 26, 2018 03:36 AM UTC
**Owner:** Thuan


The fix for [#2949] seems to have an issue: it calls 
avd_check_nodes_after_reinit_imm(), which can modify variables in cb from 
outside the main thread.

We should call avd_check_nodes_after_reinit_imm() after getting 
AVD_IMM_REINITIALIZED in the main thread.
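
The proposal above can be illustrated with a minimal sketch, assuming a simple mailbox between a worker thread and the main loop. Only the names AVD_IMM_REINITIALIZED and avd_check_nodes_after_reinit_imm() come from the ticket; `Mailbox`, `ControlBlock`, `check_nodes_after_reinit()` and the other helpers are hypothetical stand-ins, not amfd's actual types or APIs. The worker only posts an event, and the function that mutates cb runs on the main thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical event set; only AVD_IMM_REINITIALIZED is named in the ticket.
enum class EventType { AVD_IMM_REINITIALIZED, SHUTDOWN };

// Hypothetical stand-in for amfd's control block (cb).
struct ControlBlock {
  int nodes_checked = 0;        // must only be written by the main thread
  std::thread::id last_writer;  // records which thread wrote last
};

// Thread-safe event queue: any thread may post, the main thread waits.
class Mailbox {
 public:
  void post(EventType ev) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(ev);
    }
    ready_.notify_one();
  }

  EventType wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !events_.empty(); });
    EventType ev = events_.front();
    events_.pop();
    return ev;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<EventType> events_;
};

// Stand-in for avd_check_nodes_after_reinit_imm(): it mutates cb.
void check_nodes_after_reinit(ControlBlock& cb) {
  ++cb.nodes_checked;
  cb.last_writer = std::this_thread::get_id();
}

// Worker thread: after re-initializing IMM (elided here) it never touches
// cb itself; it only notifies the main thread.
void imm_reinit_worker(Mailbox& mbx) {
  mbx.post(EventType::AVD_IMM_REINITIALIZED);
  mbx.post(EventType::SHUTDOWN);
}

// Main loop: the only place where cb is modified.
void main_loop(Mailbox& mbx, ControlBlock& cb) {
  for (;;) {
    EventType ev = mbx.wait();
    if (ev == EventType::SHUTDOWN) return;
    if (ev == EventType::AVD_IMM_REINITIALIZED) check_nodes_after_reinit(cb);
  }
}

// Runs the scenario and reports whether cb was updated exactly once,
// and by the thread that ran main_loop().
bool run_demo() {
  Mailbox mbx;
  ControlBlock cb;
  std::thread worker(imm_reinit_worker, std::ref(mbx));
  main_loop(mbx, cb);
  worker.join();
  return cb.nodes_checked == 1 &&
         cb.last_writer == std::this_thread::get_id();
}
```

Because cb is written by one thread only, this shape needs no lock around cb itself; the mailbox is the single synchronization point.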




[tickets] [opensaf:tickets] #2972 amfd: unsafe access to cb

2018-11-25 Thread Gary Lee via Opensaf-tickets
- **status**: review --> fixed



---

** [tickets:#2972] amfd: unsafe access to cb**

**Status:** fixed
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 07:23 AM UTC by Gary Lee
**Last Updated:** Mon Nov 26, 2018 05:18 AM UTC
**Owner:** Thuan




[tickets] [opensaf:tickets] #2975 amfd: Msg d->nd is popped out of queue if failed to send

2018-11-25 Thread Minh Hon Chau via Opensaf-tickets
- **status**: unassigned --> accepted
- **assigned_to**: Minh Hon Chau



---

** [tickets:#2975] amfd: Msg d->nd is popped out of queue if failed to send**

**Status:** accepted
**Milestone:** 5.18.12
**Created:** Mon Nov 26, 2018 03:56 AM UTC by Minh Hon Chau
**Last Updated:** Mon Nov 26, 2018 03:56 AM UTC
**Owner:** Minh Hon Chau


In the function avd_d2n_msg_dequeue, if amfd fails to send a message to amfnd, 
it still pops and deletes the message, so the message never reaches amfnd.
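
One way to avoid losing the message is to pop it only after a successful send. The ticket does not say how the fix is implemented, so the following is a minimal sketch under that assumption. `Msg`, `SendFn`, `drain_queue()` and `run_retry_demo()` are hypothetical names for illustration and do not reflect the real avd_d2n_msg_dequeue signature.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>

// Hypothetical message type standing in for an amfd -> amfnd message.
struct Msg {
  std::string payload;
};

// Hypothetical send hook: returns true when the message was delivered.
using SendFn = std::function<bool(const Msg&)>;

// Sends queued messages in order. On the first failed send it stops and
// leaves that message at the head of the queue so a later call can retry.
// Returns the number of messages actually sent.
std::size_t drain_queue(std::queue<Msg>& q, const SendFn& send) {
  std::size_t sent = 0;
  while (!q.empty()) {
    if (!send(q.front())) break;  // keep the message, preserve ordering
    q.pop();                      // pop only after a successful send
    ++sent;
  }
  return sent;
}

// Simulates a link that is down on the first attempt and up on the second.
bool run_retry_demo() {
  std::queue<Msg> q;
  q.push(Msg{"msg-1"});
  q.push(Msg{"msg-2"});

  bool link_up = false;
  SendFn send = [&link_up](const Msg&) { return link_up; };

  std::size_t first = drain_queue(q, send);  // nothing sent, nothing lost
  bool kept = (first == 0 && q.size() == 2);

  link_up = true;
  std::size_t second = drain_queue(q, send);  // both messages delivered
  return kept && second == 2 && q.empty();
}
```

Stopping at the first failure, rather than skipping ahead, keeps the messages in their original order for the retry.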





[tickets] [opensaf:tickets] #2975 amfd: Msg d->nd is popped out of queue if failed to send

2018-11-25 Thread Minh Hon Chau via Opensaf-tickets



---

** [tickets:#2975] amfd: Msg d->nd is popped out of queue if failed to send**

**Status:** unassigned
**Milestone:** 5.18.12
**Created:** Mon Nov 26, 2018 03:56 AM UTC by Minh Hon Chau
**Last Updated:** Mon Nov 26, 2018 03:56 AM UTC
**Owner:** nobody




[tickets] [opensaf:tickets] #2972 amfd: unsafe access to cb

2018-11-25 Thread Thuan via Opensaf-tickets
- **status**: fixed --> review



---

** [tickets:#2972] amfd: unsafe access to cb**

**Status:** review
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 07:23 AM UTC by Gary Lee
**Last Updated:** Mon Nov 26, 2018 02:57 AM UTC
**Owner:** Thuan




[tickets] [opensaf:tickets] #2971 amf: standby amfd crash during failover to become active

2018-11-25 Thread Thuan via Opensaf-tickets
- Description has changed:

Diff:



--- old
+++ new
@@ -1,5 +1,4 @@
 PL-9 was deleted from cluster, but somehow standby amfd still keep the node.
-The most possible reason is that standby amfd miss node delete apply callback by somehow.
 Then when failover happen, standby amfd crash as following:
 ~~~
 Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active
@@ -8,3 +7,23 @@
 Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'PL-9' left the cluster
 Nov 20 04:09:14 SC-2 osafamfd[5079]: src/amf/amfd/sgproc.cc:2187: avd_node_down_mw_susi_failover: Assertion 'avnd->list_of_ncs_su.empty() != true' failed.
 ~~~
+The root cause is amfnd down on SC-2 vs checkpoint from SC-1
+~~~
+<143>1 2018-11-24T14:43:17.870243+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7238"] 261:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x563eacfe3b50
+<143>1 2018-11-24T14:43:17.870254+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7239"] 261:amf/amfd/ndfsm.cc:853 << avd_mds_avnd_down_evh
+
+<143>1 2018-11-24T14:43:17.874433+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22818"] 285:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x5601d0d9cb90
+<143>1 2018-11-24T14:43:17.874439+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22819"] 285:amf/amfd/ndproc.cc:1235 >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
+<143>1 2018-11-24T14:43:17.874443+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22820"] 285:amf/amfd/ndfsm.cc:1149 >> avd_node_mark_absent
+
+<141>1 2018-11-24T14:43:17.88228+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22908"] 285:amf/amfd/ndfsm.cc:1154 NO Node 'PL-5' left the cluster
+<143>1 2018-11-24T14:43:17.882284+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22909"] 285:mbc/mbcsv_api.c:798 >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
+
+<143>1 2018-11-24T14:43:17.882637+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22943"] 285:amf/amfd/ndfsm.cc:1168 << avd_node_mark_absent
+
+<143>1 2018-11-24T14:43:17.900529+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7564"] 261:amf/amfd/ckpt_updt.cc:49 >> avd_ckpt_node: update - 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
+<143>1 2018-11-24T14:43:17.900575+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7577"] 261:amf/amfd/ckpt_updt.cc:78 << avd_ckpt_node: 1
+
+<143>1 2018-11-24T14:43:39.417927+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8716"] 261:amf/amfd/node.cc:500 >> node_ccb_completed_delete_hdlr: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
+<143>1 2018-11-24T14:43:39.417932+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8717"] 261:amf/amfd/imm.cc:2306 TR Node 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' is still cluster member
+~~~






---

** [tickets:#2971] amf: standby amfd crash during failover to become active**

**Status:** accepted
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 01:15 AM UTC by Thuan
**Last Updated:** Sat Nov 24, 2018 08:48 AM UTC
**Owner:** Gary Lee



[tickets] [opensaf:tickets] #2972 amfd: unsafe access to cb

2018-11-25 Thread Thuan via Opensaf-tickets
- **status**: review --> fixed



---

** [tickets:#2972] amfd: unsafe access to cb**

**Status:** fixed
**Milestone:** 5.18.12
**Created:** Fri Nov 23, 2018 07:23 AM UTC by Gary Lee
**Last Updated:** Fri Nov 23, 2018 12:48 PM UTC
**Owner:** Thuan

