Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)

2016-02-12 Thread minh chau
Hi Zoran,

In TC #16, #17 and #18, there is no amfd crash. I didn't mean that TC #16
failed because of the one-payload IMM limitation.

Thanks,
Minh
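
Note for readers checking which daemon logged what in the report quoted further down this thread: mail clients wrap long syslog entries, so it helps to rebuild one entry per line first. The following is a minimal Python sketch, not part of the original mail. The function names and the timestamp pattern are assumptions based only on the excerpt in this thread, and any non-syslog text that follows the last entry would be absorbed into it.

```python
import re

# Start of a syslog entry as it appears in this thread, for example
# "Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: " (pattern is an assumption).
ENTRY_START = re.compile(
    r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \S+ (?P<proc>[\w.-]+)\[\d+\]: "
)


def unwrap_syslog(quoted_text):
    """Return one syslog entry per list item from mail-wrapped, '>'-quoted text."""
    # Drop the quote markers and surrounding whitespace on every line.
    lines = [re.sub(r"^(> ?)+", "", line).strip() for line in quoted_text.splitlines()]
    # Join everything into one string so wrapped entries become contiguous.
    flat = " ".join(part for part in lines if part)
    # Cut before every timestamp so each entry stands alone again.
    starts = [m.start() for m in ENTRY_START.finditer(flat)]
    return [flat[a:b].strip() for a, b in zip(starts, starts[1:] + [len(flat)])]


def process_of(entry):
    """Return the process name of a rebuilt entry, or None if it does not match."""
    m = ENTRY_START.match(entry)
    return m.group("proc") if m else None


sample = """\
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO avnd_di_oper_send() deferred as
> AMF director is offline
> Feb 10 11:44:29 PM_PL-3 osafimmnd[15760]: AL AMF Node Director is down,
> terminate this process
"""
entries = unwrap_syslog(sample)
for entry in entries:
    print(process_of(entry), "|", entry)
```

Grouping the rebuilt entries by process name makes it easier to see that the excerpt contains entries from osafamfnd, amf_demo and osafimmnd.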
On 11/02/16 19:18, Zoran Milinkovic wrote:
> Hi Minh,
>
> In TC #16, it's written that the test was done with PL-3 and PL-4, so it is
> not a one-payload case.
> This looks more like an AMF problem than an IMM limitation.
>
> Thanks,
> Zoran

Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)

2016-02-10 Thread Nagendra Kumar
Hi Minh,
Thanks. Yes, I will send the traces for the TCs mentioned below
after I am done with some other tests. The devel list has a 400 KB limit on
attachments, so I will see how I can send them; maybe I can attach them to the
ticket temporarily.

Thanks
-Nagu
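
Note on the 400 KB attachment limit mentioned above: one way to stay under it is to compress a trace and split the result into parts. The following is a minimal Python sketch, not something taken from the original mail. The file name in the usage comment is hypothetical, 1 KB is assumed to mean 1024 bytes, and a 10 KB safety margin is an arbitrary choice. If the list counts the base64-encoded size of an attachment, a smaller chunk_size would be needed.

```python
import gzip
import shutil
from pathlib import Path

LIMIT = 400 * 1024  # list limit from the mail above; 1 KB = 1024 bytes is an assumption


def compress_and_split(trace_path, chunk_size=LIMIT - 10 * 1024):
    """Gzip trace_path and split the .gz file into parts of at most chunk_size bytes."""
    trace_path = Path(trace_path)
    gz_path = trace_path.with_name(trace_path.name + ".gz")

    # Compress the whole trace first; text traces usually shrink considerably.
    with open(trace_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Cut the compressed file into numbered parts.
    parts = []
    with open(gz_path, "rb") as gz:
        index = 0
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            part = gz_path.with_name(f"{gz_path.name}.part{index:02d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


# Usage (hypothetical file name):
#   parts = compress_and_split("osafamfnd.trace")
```

The receiver can concatenate the parts in name order and decompress the result to get the original trace back.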


Re: [devel] FW: [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2 (delayed failover issue)

2016-02-10 Thread minh chau
Hi Nagu,

There's a known limitation in IMM with a 2+1 or 1+1 configuration, which
results in an IMM reload at the second controller restart. We're discussing
how to avoid the crash.
Meanwhile, I'm also looking at the delayed failover issues that you
reported in TC #2, #12, #13 and #15.
We have similar automated tests but they all pass, so I guess your test
does something special.
Can you run those tests once and send me syslog + amfd/amfnd traces?

Thanks,
Minh
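
Note for readers of the report quoted below: it contains two bare numbers, the value 2 for saAmfSIAdminState and the 7 in "saAmfHAStateGet FAILED - 7". The following is a minimal Python sketch for decoding them, not part of the original mail. The enum values are the ones from the SA Forum AIS headers (saAmf.h and saAis.h), with only a subset listed. The helper names are hypothetical, and it is an assumption that the number printed by amf_demo is the SaAisErrorT return code.

```python
import re

# SaAmfAdminStateT values from saAmf.h.
SA_AMF_ADMIN_STATE = {
    1: "SA_AMF_ADMIN_UNLOCKED",
    2: "SA_AMF_ADMIN_LOCKED",
    3: "SA_AMF_ADMIN_LOCKED_INSTANTIATION",
    4: "SA_AMF_ADMIN_SHUTTING_DOWN",
}

# A subset of SaAisErrorT values from saAis.h.
SA_AIS_ERROR = {
    1: "SA_AIS_OK",
    6: "SA_AIS_ERR_TRY_AGAIN",
    7: "SA_AIS_ERR_INVALID_PARAM",
    9: "SA_AIS_ERR_BAD_HANDLE",
    12: "SA_AIS_ERR_NOT_EXIST",
    21: "SA_AIS_ERR_FAILED_OPERATION",
}


def decode_admin_state(attr_line):
    """Map a line like 'saAmfSIAdminState  SA_UINT32_T  2 (0x2)' to its enum name."""
    m = re.search(r"saAmfSIAdminState\s+SA_UINT32_T\s+(\d+)", attr_line)
    if m is None:
        raise ValueError("not an saAmfSIAdminState line")
    value = int(m.group(1))
    return SA_AMF_ADMIN_STATE.get(value, f"unknown ({value})")


def decode_ais_error(code):
    """Map a numeric SaAisErrorT value to its name (subset only)."""
    return SA_AIS_ERROR.get(code, f"unknown ({code})")


print(decode_admin_state("saAmfSIAdminState  SA_UINT32_T  2 (0x2)"))
print(decode_ais_error(7))
```

Under these assumptions the SI is in SA_AMF_ADMIN_LOCKED, and amf_demo received SA_AIS_ERR_INVALID_PARAM from saAmfHAStateGet.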
On 10/02/16 23:00, Nagendra Kumar wrote:
> TC #16. Same configuration as #12: run SI shutdown, sleep for 5 sec before
> calling saAmfCSIQuiescingComplete, and stop the controller; then, after the
> sleep, reject saAmfCSIQuiescingComplete with SA_AIS_ERR_FAILED_OPERATION. All
> the assignments from SU1 on PL-3 and SU2 on PL-4 are removed and the SI admin
> state is 2 (locked):
> saAmfSIAdminState  SA_UINT32_T  2 (0x2)
>
> The SI going into the locked state differs from the behaviour seen when the
> controller is up and running during this test case. When the controller is
> available, the SI will be in the unlocked state and all the assignments will
> be on SU2 as Active and SU3 as Standby (on PL-4). This needs either
> correction or documentation.
>
> TC #17. Same configuration as #12: run SG shutdown, sleep for 5 sec before
> calling saAmfCSIQuiescingComplete, and stop the controller; then, after the
> sleep, reject saAmfCSIQuiescingComplete with SA_AIS_ERR_FAILED_OPERATION.
> Amfnd crashes [please note that this test case works with the controller
> up]:
> Syslog and bt:
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO component with 
> QUIESCED/QUIESCING assignment failed
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO recovery action 'comp restart' 
> escalated to 'comp failover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO SU failover probation timer 
> started (timeout: 12000 ns)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Performing failover of 
> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 
> 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action 
> escalated from 'componentRestart' to 'componentFailover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 
> 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 
> 'csiSetcallbackFailed' : Recovery is 'componentFailover'
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO 
> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => 
> TERMINATING
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO Removed 
> 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: saAmfHAStateGet FAILED - 7
> Feb 10 11:44:29 PM_PL-3 amf_demo[15721]: exiting (caught term signal)
> Feb 10 11:44:29 PM_PL-3 osafamfnd[15508]: NO avnd_di_oper_send() deferred as 
> AMF director is offline
> Feb 10 11:44:29 PM_PL-3 osafimmnd[15760]: AL AMF Node Director is down, 
> terminate this process
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, 
> avnd_comp_tag*) ()
> (gdb) bt
> #0  0x00412b50 in avnd_comp_cmplete_all_assignment(avnd_cb_tag*, 
> avnd_comp_tag*) ()
> #1  0x0040a093 in avnd_comp_clc_terming_cleansucc_hdler(avnd_cb_tag*, 
> avnd_comp_tag*) ()
> #2  0x0040c7d4 in avnd_comp_clc_fsm_run(avnd_cb_tag*, avnd_comp_tag*, 
> avnd_comp_clc_pres_fsm_ev) ()
> #3  0x0040ce49 in avnd_evt_clc_resp_evh(avnd_cb_tag*, avnd_evt_tag*) 
> ()
> #4  0x0042133f in avnd_main_process() () at main.cc:667
> #5  0x00405517 in main () at main.cc:186
> (gdb) thread apply all bt
>
> Thread 4 (Thread 0x7fdaf3c05b00 (LWP 15512)):
> #0  0x7fdaf2b2b415 in __lll_unlock_wake () from /lib64/libpthread.so.0
> #1  0x7fdaf2b27ac4 in _L_unlock_553 () from /lib64/libpthread.so.0
> #2  0x7fdaf2b279f7 in __pthread_mutex_unlock_usercnt () from 
> /lib64/libpthread.so.0
> #3  0x7fdaf37edac3 in ncs_os_lock () from 
> /usr/local/lib/libopensaf_core.so.0
> #4  0x7fdaf37e084d in ncs_ipc_send () from 
> /usr/local/lib/libopensaf_core.so.0
> #5  0x0041eea1 in avnd_evt_send(avnd_cb_tag*, avnd_evt_tag*) ()
> #6  0x0040a2cb in 
> comp_clc_resp_callback(NCS_OS_PROC_EXECUTE_TIMED_CB_INFO*) ()
> #7  0x7fdaf37ecdfb in give_exec_mod_cb () from 
> /usr/local/lib/libopensaf_core.so.0
> #8  0x7fdaf37ecfde in ncs_exec_mod_hdlr () from 
> /usr/local/lib/libopensaf_core.so.0
> #9  0x7fdaf2b247b6 in start_thread () from /lib64/libpthread.so.0
> #10 0x7fdaf20da9cd in clone () from /lib64/libc.so.6
> #11 0x in ?? ()
>
> Thread 3 (Thread 0x7fdaf3c25b00 (LWP 15510)):
> #0  0x7fdaf20d14f6 in poll () from /lib64/libc.so.6
> #1  0x7fdaf3817623 in mdtm_process_recv_events () from 
> /usr/local/lib/libopensaf_core.so.0
> #2  0x7fdaf2b247b6 in start_thread () from /lib64/libpthread.so.0
> #3  0x7fdaf20da9cd in