Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)
Hi Zoran,

In TC #16, #17 and #18 there is no amfd crash. I didn't mean that TC #16 failed
because of the one-payload IMM limitation.

Thanks,
Minh

On 11/02/16 19:18, Zoran Milinkovic wrote:
> Hi Minh,
>
> In TC#16, it's written that the test has been done with PL-3 and PL-4. So, it
> is not a case with one payload.
> This problem looks more like AMF than IMM limitation problem.
>
> Thanks,
> Zoran
>
> -Original Message-
> From: minh chau [mailto:minh.c...@dektech.com.au]
> Sent: Thursday, February 11, 2016 5:47 AM
> To: Nagendra Kumar; Hans Nordebäck; Gary Lee; Praveen Malviya
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support
> for cloud resilience [#1620] V2 (delayed failover issue)
>
> Hi Nagu,
>
> There's known limitation in IMM with configuration 2+1 or 1+1, which results
> in IMM reload at second controller restart. We're discussing to avoid the
> crash. Meanwhile I'm also looking at the delayed failover issues which you
> reported on TC#2 #12 #13 #15. We have similar automated tests but they all
> pass, so I guess your test has something special.
> Can you run those tests once and send me syslog + amfd/amfnd traces?
>
> Thanks,
> Minh
> On 10/02/16 23:00, Nagendra Kumar wrote:
>> TC #16. Same configuration as #12: Run SI shutdown and keep sleep of 5
>> sec before saAmfCSIQuiescingComplete and stop controller and then after
>> sleep, reject saAmfCSIQuiescingComplete with SA_AIS_ERR_FAILED_OPERATION.
>> All the assignment from SU1 on PL-3 and SU2 on PL-4 are removed and SI admin
>> state is 2(locked):
>> saAmfSIAdminState SA_UINT32_T 2 (0x2)
>>
>> "Si going into locked state" is different behaviour when controller is up
>> and running and run this test case. In case, controller is available, SI
>> will be in unlocked state and all the assignments will be on SU2 as Act and
>> SU3 as Standby (on PL-4). This need either correction or documentation.
>>
>> TC #17.
>> Same configuration as #12: Run SG shutdown and keep
>> sleep of 5 sec before saAmfCSIQuiescingComplete and stop controller and
>> then after sleep, reject saAmfCSIQuiescingComplete with
>> SA_AIS_ERR_FAILED_OPERATION. Amfnd crashes[Please note that this test case
>> works with controller up]:
>> Syslog and bt:
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO component with QUIESCED/QUIESCING assignment failed
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO recovery action 'comp restart' escalated to 'comp failover'
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO SU failover probation timer started (timeout: 12000 ns)
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'componentFailover'
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'csiSetcallbackFailed' : Recovery is 'componentFailover'
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
>> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: saAmfHAStateGet FAILED - 7
>> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: exiting (caught term signal)
>> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO avnd_di_oper_send() deferred as AMF director is offline
>> Feb 10 11:44:29 PM_PL-3 osafimmnd[15760]: AL AMF Node Director is down, terminate this process
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
>> (gdb) bt
>> #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
>> #1 0x0040a093 in avnd_comp_clc_terming_cleansucc_hdler(avnd_cb_tag*, avnd_comp_tag*) ()
>> #2 0x0040c7d4 in avnd_comp_clc_fsm_run(avnd_cb_tag*, avnd_comp_tag*, avnd_comp_clc_pres_fsm_ev) ()
>> #3 0x0040ce49 in avnd_evt_clc_resp_evh(avnd_cb_tag*, avnd_evt_tag*) ()
>> #4 0x0042133f in avnd_main_process() () at main.cc:667
>> #5 0x00405517 in main () at main.cc:186
>> (gdb) thread apply all bt
>>
>> Thread 4 (Thread 0x7fdaf3c05b00 (LWP 15512)):
>> #0 0x7fdaf2b2b415 in __lll_unlock_wake () from /lib64/libpthread.so.0
>> #1 0x7fdaf2b27ac4 in _L_unlock_553 () from /lib64/libpthread.so.0
>> #2 0x7fdaf2b279f7 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
>> #3 0x7fdaf37edac3 in ncs_os_lock () from /usr/local/lib/libopensaf_core.so.0
>> #4 0x7fdaf37e084d in ncs_ipc_send () from
Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)
Hi Minh,

Thanks. Yes, I will send the traces for the TCs mentioned below after I am done
with some other tests. The devel list has a 400 KB limit on attachments, so I
will see how I can send them; maybe I can attach them to the ticket temporarily.

Thanks
-Nagu

> -Original Message-
> From: minh chau [mailto:minh.c...@dektech.com.au]
> Sent: 11 February 2016 10:17
> To: Nagendra Kumar; hans.nordeb...@ericsson.com;
> gary@dektech.com.au; Praveen Malviya
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support
> for cloud resilience [#1620] V2 (delayed failover issue)
>
> Hi Nagu,
>
> There's known limitation in IMM with configuration 2+1 or 1+1, which results
> in IMM reload at second controller restart. We're discussing to avoid the
> crash. Meanwhile I'm also looking at the delayed failover issues which you
> reported on TC#2 #12 #13 #15. We have similar automated tests but they all
> pass, so I guess your test has something special.
> Can you run those tests once and send me syslog + amfd/amfnd traces?
>
> Thanks,
> Minh
> On 10/02/16 23:00, Nagendra Kumar wrote:
> > TC #16. Same configuration as #12: Run SI shutdown and keep sleep
> > of 5 sec before saAmfCSIQuiescingComplete and stop controller and then
> > after sleep, reject saAmfCSIQuiescingComplete with
> > SA_AIS_ERR_FAILED_OPERATION. All the assignment from SU1 on PL-3 and
> > SU2 on PL-4 are removed and SI admin state is 2(locked):
> > saAmfSIAdminState SA_UINT32_T 2 (0x2)
> >
> > "Si going into locked state" is different behaviour when controller is up
> > and running and run this test case. In case, controller is available, SI
> > will be in unlocked state and all the assignments will be on SU2 as Act
> > and SU3 as Standby (on PL-4). This need either correction or documentation.
> >
> > TC #17.
> > Same configuration as #12: Run SG shutdown and
> > keep sleep of 5 sec before saAmfCSIQuiescingComplete and stop controller
> > and then after sleep, reject saAmfCSIQuiescingComplete with
> > SA_AIS_ERR_FAILED_OPERATION. Amfnd crashes[Please note that this test
> > case works with controller up]:
> > Syslog and bt:
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO component with QUIESCED/QUIESCING assignment failed
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO recovery action 'comp restart' escalated to 'comp failover'
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO SU failover probation timer started (timeout: 12000 ns)
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'componentFailover'
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'csiSetcallbackFailed' : Recovery is 'componentFailover'
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> > Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: saAmfHAStateGet FAILED - 7
> > Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: exiting (caught term signal)
> > Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO avnd_di_oper_send() deferred as AMF director is offline
> > Feb 10 11:44:29 PM_PL-3 osafimmnd[15760]: AL AMF Node Director is down, terminate this process
> >
> > Program terminated with signal 11, Segmentation fault.
> > #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
> > (gdb) bt
> > #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
> > #1 0x0040a093 in avnd_comp_clc_terming_cleansucc_hdler(avnd_cb_tag*, avnd_comp_tag*) ()
> > #2 0x0040c7d4 in avnd_comp_clc_fsm_run(avnd_cb_tag*, avnd_comp_tag*, avnd_comp_clc_pres_fsm_ev) ()
> > #3 0x0040ce49 in avnd_evt_clc_resp_evh(avnd_cb_tag*, avnd_evt_tag*) ()
> > #4 0x0042133f in avnd_main_process() () at main.cc:667
> > #5 0x00405517 in main () at main.cc:186
> > (gdb) thread apply all bt
> >
> > Thread 4 (Thread 0x7fdaf3c05b00 (LWP 15512)):
> > #0 0x7fdaf2b2b415 in __lll_unlock_wake () from /lib64/libpthread.so.0
> > #1 0x7fdaf2b27ac4 in _L_unlock_553 () from /lib64/libpthread.so.0
> > #2 0x7fdaf2b279f7 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
> > #3 0x7fdaf37edac3 in ncs_os_lock () from /usr/local/lib/libopensaf_core.so.0
> > #4 0x7fdaf37e084d in ncs_ipc_send () from /usr/local/lib/libopensaf_core.so.0
> > #5 0x0041eea1 in avnd_evt_send(avnd_cb_tag*,
Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)
Hi Nagu,

There's a known limitation in IMM with configuration 2+1 or 1+1, which results
in an IMM reload at the second controller restart. We're discussing how to
avoid the crash. Meanwhile, I'm also looking at the delayed failover issues
which you reported on TC #2, #12, #13 and #15. We have similar automated tests
but they all pass, so I guess your test has something special.
Can you run those tests once and send me syslog + amfd/amfnd traces?

Thanks,
Minh

On 10/02/16 23:00, Nagendra Kumar wrote:
> TC #16. Same configuration as #12: Run SI shutdown and keep sleep of 5
> sec before saAmfCSIQuiescingComplete and stop controller and then after
> sleep, reject saAmfCSIQuiescingComplete with SA_AIS_ERR_FAILED_OPERATION. All
> the assignment from SU1 on PL-3 and SU2 on PL-4 are removed and SI admin
> state is 2(locked):
> saAmfSIAdminState SA_UINT32_T 2 (0x2)
>
> "Si going into locked state" is different behaviour when controller is up and
> running and run this test case. In case, controller is available, SI will be
> in unlocked state and all the assignments will be on SU2 as Act and SU3 as
> Standby (on PL-4). This need either correction or documentation.
>
> TC #17. Same configuration as #12: Run SG shutdown and keep
> sleep of 5 sec before saAmfCSIQuiescingComplete and stop controller and then
> after sleep, reject saAmfCSIQuiescingComplete with
> SA_AIS_ERR_FAILED_OPERATION.
> Amfnd crashes[Please note that this test case
> works with controller up]:
> Syslog and bt:
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO component with QUIESCED/QUIESCING assignment failed
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO recovery action 'comp restart' escalated to 'comp failover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO SU failover probation timer started (timeout: 12000 ns)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'componentFailover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'csiSetcallbackFailed' : Recovery is 'componentFailover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: saAmfHAStateGet FAILED - 7
> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: exiting (caught term signal)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO avnd_di_oper_send() deferred as AMF director is offline
> Feb 10 11:44:29 PM_PL-3 osafimmnd[15760]: AL AMF Node Director is down, terminate this process
>
> Program terminated with signal 11, Segmentation fault.
> #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
> (gdb) bt
> #0 0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, avnd_comp_tag*) ()
> #1 0x0040a093 in avnd_comp_clc_terming_cleansucc_hdler(avnd_cb_tag*, avnd_comp_tag*) ()
> #2 0x0040c7d4 in avnd_comp_clc_fsm_run(avnd_cb_tag*, avnd_comp_tag*, avnd_comp_clc_pres_fsm_ev) ()
> #3 0x0040ce49 in avnd_evt_clc_resp_evh(avnd_cb_tag*, avnd_evt_tag*) ()
> #4 0x0042133f in avnd_main_process() () at main.cc:667
> #5 0x00405517 in main () at main.cc:186
> (gdb) thread apply all bt
>
> Thread 4 (Thread 0x7fdaf3c05b00 (LWP 15512)):
> #0 0x7fdaf2b2b415 in __lll_unlock_wake () from /lib64/libpthread.so.0
> #1 0x7fdaf2b27ac4 in _L_unlock_553 () from /lib64/libpthread.so.0
> #2 0x7fdaf2b279f7 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
> #3 0x7fdaf37edac3 in ncs_os_lock () from /usr/local/lib/libopensaf_core.so.0
> #4 0x7fdaf37e084d in ncs_ipc_send () from /usr/local/lib/libopensaf_core.so.0
> #5 0x0041eea1 in avnd_evt_send(avnd_cb_tag*, avnd_evt_tag*) ()
> #6 0x0040a2cb in comp_clc_resp_callback(NCS_OS_PROC_EXECUTE_TIMED_CB_INFO*) ()
> #7 0x7fdaf37ecdfb in give_exec_mod_cb () from /usr/local/lib/libopensaf_core.so.0
> #8 0x7fdaf37ecfde in ncs_exec_mod_hdlr () from /usr/local/lib/libopensaf_core.so.0
> #9 0x7fdaf2b247b6 in start_thread () from /lib64/libpthread.so.0
> #10 0x7fdaf20da9cd in clone () from /lib64/libc.so.6
> #11 0x in ?? ()
>
> Thread 3 (Thread 0x7fdaf3c25b00 (LWP 15510)):
> #0 0x7fdaf20d14f6 in poll () from /lib64/libc.so.6
> #1 0x7fdaf3817623 in mdtm_process_recv_events () from /usr/local/lib/libopensaf_core.so.0
> #2 0x7fdaf2b247b6 in start_thread () from /lib64/libpthread.so.0
> #3 0x7fdaf20da9cd in