[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-07 Thread Nagendra Kumar
- **status**: review --> fixed
- **Comment**:

changeset:   8656:a56101161326
branch:      opensaf-5.0.x
parent:      8651:a90faf589254
user:        Nagendra Kumar
date:        Tue Mar 07 13:18:45 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

changeset:   8657:a203318fb21e
branch:      opensaf-5.1.x
parent:      8652:a7c62f1de1a3
user:        Nagendra Kumar
date:        Tue Mar 07 13:19:02 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

changeset:   8658:136a8f432da6
tag:         tip
parent:      8655:45be1e612ab6
user:        Nagendra Kumar
date:        Tue Mar 07 13:19:16 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

[staging:a56101]
[staging:a20331]
[staging:136a8f]




---

** [tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** fixed
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Tue Mar 07, 2017 07:21 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) 
(548.6 kB; application/x-compressed)


An amfnd coredump was seen on PL-5 while the cluster was shutting down. The backtrace is below:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0  __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1  0x00449cc9 in avsv_dblist_sastring_cmp (key1=, key2=) at util.c:361
        i = 0
        str1 =
        str2 =
#2  0x7f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0, key=0x656d6e6769737361 ) at ncsdlib.c:169
        start_ptr = 0x1ee3168
#3  0x00416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940 <_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
        curr = 0x1ee8060
        prv = 0x1ee3150
        __FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4  0x0040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200, su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc.cc:3161
        rc =
#5  avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING, final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
        csi = 0x0
        __FUNCTION__ = "avnd_comp_clc_st_chng_prc"
        ev = AVND_SU_PRES_FSM_EV_MAX
        is_en =
        rc = 1
#6  0x0040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
        prv_st =
        final_st =
        rc = 1
        __FUNCTION__ = "avnd_comp_clc_fsm_run"
#7  0x0040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>, evt=0x7f9298c0) at clc.cc:414
        __FUNCTION__ = "avnd_evt_clc_resp_evh"
        ev =
        clc_evt = 0x7f9298e0
        comp = 0x1ee8200
        rc = 1
#8  0x0042676f in avnd_evt_process (evt=0x7f9298c0) at main.cc:626
        cb = 0x666940 <_avnd_cb>
        rc = 1
#9  avnd_main_process () at main.cc:577
        ret =
        fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, revents = 0}, {fd = 14, events = 1, revents = 0}, {fd = 0, events = 0, revents = 0}}
        evt = 0x7f9298c0
        __FUNCTION__ = "avnd_main_process"
        result =
        rc =
#10 0x004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
        error = 0
1358    ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
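
One detail of the backtrace can be checked mechanically. The `key` argument in frame #2 (`0x656d6e6769737361`) does not resemble the heap addresses elsewhere in the trace. Read as eight little-endian bytes it is printable ASCII, which may suggest that the record behind `curr` had already been freed and its memory reused for string data. That is an interpretation and not something the ticket states. The helper `bytes_as_ascii` below is a hypothetical name used only for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of OpenSAF): interpret a 64-bit value as
// eight raw bytes in little-endian order, as stored on x86_64, and render
// printable bytes as characters and all other bytes as '.'.
std::string bytes_as_ascii(std::uint64_t value) {
  std::string out;
  for (int i = 0; i < 8; ++i) {
    const unsigned char byte =
        static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    out += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
  }
  return out;
}
```

Calling `bytes_as_ascii(0x656d6e6769737361ULL)` returns `assignme`, so the "pointer" passed to `strcmp` appears to hold text rather than an address.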
In syslog of PL5:

2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=4,safSg=1,safApp=npm_2'

[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-06 Thread Nagendra Kumar
- **status**: assigned --> review
- **Version**:  --> 5.1 GA




[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-02 Thread Minh Hon Chau
I think it will


---

** [tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** assigned
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Thu Mar 02, 2017 07:09 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) 
(548.6 kB; application/x-compressed)


An amfnd coredump was seen on PL-5 while the cluster was shutting down. The backtrace is below:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0  __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1  0x00449cc9 in avsv_dblist_sastring_cmp (key1=, key2=) at util.c:361
        i = 0
        str1 =
        str2 =
#2  0x7f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0, key=0x656d6e6769737361 ) at ncsdlib.c:169
        start_ptr = 0x1ee3168
#3  0x00416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940 <_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
        curr = 0x1ee8060
        prv = 0x1ee3150
        __FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4  0x0040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200, su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc.cc:3161
        rc =
#5  avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING, final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
        csi = 0x0
        __FUNCTION__ = "avnd_comp_clc_st_chng_prc"
        ev = AVND_SU_PRES_FSM_EV_MAX
        is_en =
        rc = 1
#6  0x0040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
        prv_st =
        final_st =
        rc = 1
        __FUNCTION__ = "avnd_comp_clc_fsm_run"
#7  0x0040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>, evt=0x7f9298c0) at clc.cc:414
        __FUNCTION__ = "avnd_evt_clc_resp_evh"
        ev =
        clc_evt = 0x7f9298e0
        comp = 0x1ee8200
        rc = 1
#8  0x0042676f in avnd_evt_process (evt=0x7f9298c0) at main.cc:626
        cb = 0x666940 <_avnd_cb>
        rc = 1
#9  avnd_main_process () at main.cc:577
        ret =
        fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, revents = 0}, {fd = 14, events = 1, revents = 0}, {fd = 0, events = 0, revents = 0}}
        evt = 0x7f9298c0
        __FUNCTION__ = "avnd_main_process"
        result =
        rc =
#10 0x004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
        error = 0
1358    ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
In syslog of PL5:

2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request 'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request 'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request 'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1' from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20

[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-01 Thread Nagendra Kumar
Will this work?

diff --git a/src/amf/amfnd/comp.cc b/src/amf/amfnd/comp.cc
--- a/src/amf/amfnd/comp.cc
+++ b/src/amf/amfnd/comp.cc
@@ -2650,6 +2650,9 @@ void avnd_comp_cmplete_all_csi_rec(AVND_
     /* generate csi-remove-done event... csi may be deleted */
     (void)avnd_comp_csi_remove_done(cb, comp, curr);
 
+   if (curr == nullptr)
+       break;
+
    if (0 == m_AVND_COMPDB_REC_CSI_GET(*comp, curr->name.c_str())) {
        curr =
            (prv) ?




[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-01 Thread Minh Hon Chau
Hi Nagu,

I cannot reproduce this scenario, but I think there is one place we could improve 
to avoid the coredump. It is in avnd_comp_cmplete_all_csi_rec():
~~~
{
...
    /* generate csi-remove-done event... csi may be deleted */
    (void)avnd_comp_csi_remove_done(cb, comp, curr);

    if (0 == m_AVND_COMPDB_REC_CSI_GET(*comp, curr->name.c_str())) {
        curr = (prv) ?
            m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&prv->comp_dll_node)) :
            m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&comp->csi_list));
    } else
...
}
~~~
The call to avnd_comp_csi_remove_done() is recursive and can lead to the deletion 
of the csi. Maybe we could also check that @curr is a non-null pointer in the "if" 
condition that uses m_AVND_COMPDB_REC_CSI_GET?

Thanks,
Minh
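
The hazard described above (a handler that may delete the record currently being visited) can be shown in isolation. The sketch below is a minimal illustration using `std::list`. The names `CsiRec`, `Comp`, `remove_done`, `contains` and `complete_all` are hypothetical stand-ins and not OpenSAF types or functions. It assumes the handler erases only the record it is given. It is not the committed change, which this thread shows only in part. The idea is to copy the record's name before calling the handler, so the follow-up lookup never reads from a record that may already be gone.

```cpp
#include <algorithm>
#include <iterator>
#include <list>
#include <string>

// Hypothetical stand-in for a CSI record (not the real AVND_COMP_CSI_REC).
struct CsiRec {
  std::string name;
  bool removing;
};

// Hypothetical stand-in for a component that owns its CSI records.
struct Comp {
  std::list<CsiRec> csi_list;
};

// Stand-in for a "remove done" handler: it erases the named record.
void remove_done(Comp& comp, const std::string& name) {
  comp.csi_list.remove_if([&](const CsiRec& r) { return r.name == name; });
}

// True if a record with this name is still in the component's list.
bool contains(const Comp& comp, const std::string& name) {
  return std::any_of(comp.csi_list.begin(), comp.csi_list.end(),
                     [&](const CsiRec& r) { return r.name == name; });
}

// Visit every record and complete pending removals. Returns the number of
// records for which the handler was called.
int complete_all(Comp& comp) {
  int handled = 0;
  auto curr = comp.csi_list.begin();
  while (curr != comp.csi_list.end()) {
    // Remember the predecessor; it stays valid if only curr is erased.
    const bool has_prev = (curr != comp.csi_list.begin());
    auto prev = has_prev ? std::prev(curr) : comp.csi_list.end();
    // Copy the key before the handler runs; curr may dangle afterwards.
    const std::string name = curr->name;
    if (curr->removing) {
      remove_done(comp, name);
      ++handled;
    }
    if (!contains(comp, name)) {
      // Record is gone: resume after the predecessor, or at the new head.
      curr = has_prev ? std::next(prev) : comp.csi_list.begin();
    } else {
      ++curr;
    }
  }
  return handled;
}
```

With three records where the first and third are marked `removing`, `complete_all` calls the handler twice and leaves only the middle record in the list. If the real handler can delete records other than the one passed in, the saved predecessor would need the same treatment.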



[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-01 Thread Minh Hon Chau
Hi Nagu,

We have only seen it once so far and did not have a trace. I will try to reproduce 
it on the latest changeset to see if it happens again.

Thanks,
Minh



[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-01 Thread Nagendra Kumar
Hi Minh, I cannot find any clue about the cause of the faults. Could you please 
provide traces, if possible (via email)?
Thanks
-Nagu



[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2017-03-01 Thread Nagendra Kumar
- **status**: unassigned --> assigned
- **assigned_to**: Nagendra Kumar



---

** [tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** assigned
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 02, 2016 04:54 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) 
(548.6 kB; application/x-compressed)



[tickets] [opensaf:tickets] #2213 AMFND: Coredump if suFailover while shutting down

2016-12-01 Thread Minh Hon Chau



---

** [tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 02, 2016 04:54 AM UTC
**Owner:** nobody
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) 
(548.6 kB; application/x-compressed)


Seen amfnd coredump in PL5 with bt as below while cluster is shutting down
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0  __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1  0x00449cc9 in avsv_dblist_sastring_cmp (key1=, 
key2=) at util.c:361
i = 0
str1 = 
str2 = 
#2  0x7f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0, 
key=0x656d6e6769737361 ) at ncsdlib.c:169
start_ptr = 0x1ee3168
#3  0x00416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940 
<_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
curr = 0x1ee8060
prv = 0x1ee3150
__FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4  0x0040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200, 
su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc.cc:3161
rc = 
#5  avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>, 
comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=
SA_AMF_PRESENCE_RESTARTING, 
final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
csi = 0x0
__FUNCTION__ = "avnd_comp_clc_st_chng_prc"
ev = AVND_SU_PRES_FSM_EV_MAX
is_en = 
rc = 1
#6  0x0040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 
<_avnd_cb>, comp=comp@entry=0x1ee8200, ev=
AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
prv_st = 
final_st = 
rc = 1
__FUNCTION__ = "avnd_comp_clc_fsm_run"
#7  0x0040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>, 
evt=0x7f9298c0) at clc.cc:414
__FUNCTION__ = "avnd_evt_clc_resp_evh"
ev = 
clc_evt = 0x7f9298e0
comp = 0x1ee8200
rc = 1
#8  0x0042676f in avnd_evt_process (evt=0x7f9298c0) at main.cc:626
cb = 0x666940 <_avnd_cb>
rc = 1
#9  avnd_main_process () at main.cc:577
ret = 
fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, 
revents = 0}, {fd = 14, events = 1, revents = 
0}, {fd = 0, events = 0, revents = 0}}
evt = 0x7f9298c0
__FUNCTION__ = "avnd_main_process"
result = 
rc = 
#10 0x004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
error = 0
1358	../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
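
A note on frame #2 of the backtrace: the `key` value 0x656d6e6769737361 does not look like the other heap pointers in the trace (0x1ee....). The sketch below decodes it as raw bytes, assuming a little-endian x86_64 host (suggested by the sysdeps/x86_64 path in frame #0). The variable names are only for illustration and are not taken from the amfnd sources.

~~~python
import struct

# Value GDB printed for `key` in frame #2 (ncs_db_link_list_find).
key = 0x656d6e6769737361

# Assumption: little-endian host, so the least significant byte comes first in memory.
raw = struct.pack("<Q", key)

# Interpret the eight bytes as ASCII text.
text = raw.decode("ascii")
print(text)  # prints "assignme"
~~~

The eight bytes read as the text "assignme", so the pointer slot appears to hold string data instead of an address. That would be consistent with the list node being read after its memory was freed or overwritten, though the backtrace alone does not prove it.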
In syslog of PL5:

2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' 
component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' 
component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' 
component restart probation timer started (timeout: 600 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request 
'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request 
'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request 
'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1' 
from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for 
nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20 22:02:13 PL-5 opensafd: