[tickets] [opensaf:tickets] #2162 AMF: Headless recovery failed if SC failover during headless sync

2017-01-22 Thread Nagendra Kumar
Please find attached the logs for the test case (TC) mentioned in the email.


Attachments:

- [Logs-tc.rar](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/f093e418/d8dc/attachment/Logs-tc.rar) (477.7 kB; application/octet-stream)


---

** [tickets:#2162] AMF: Headless recovery failed if SC failover during headless sync**

**Status:** review
**Milestone:** 5.2.FC
**Labels:** headless recovery 
**Created:** Thu Nov 03, 2016 11:01 AM UTC by Minh Hon Chau
**Last Updated:** Mon Jan 09, 2017 11:24 AM UTC
**Owner:** Minh Hon Chau
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2162/attachment/log.tgz) (1.4 MB; application/x-compressed)


Test steps:
- Set up a 2N assignment: PL4 hosts SU4 (active assignment), PL5 hosts SU5 (standby assignment)
- Stop the SCs
- Stop PL4
- Restart SC1
- Restart SC2
- Since PL4 is stopped, the headless sync will time out after 10 seconds. During these 10 seconds, reboot SC1 to trigger an SC failover

Observation: SC2 becomes the active controller and cold sync completes, but SU5 still has the standby assignment.

When SC2 becomes the active controller, the code that performs headless 
recovery (function failover_absent_assignment()) is not executed. Therefore, 
the transient assignments remain after the SC failover.
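
The gap described above can be sketched as follows. This is only a minimal model under stated assumptions, not the amfd code: apart from failover_absent_assignment(), which the ticket names, every type and member here (ServiceGroup, Controller, Role, recovery_pending, on_sync_timeout, on_role_change) is hypothetical. The idea is that headless recovery is kept as a pending step, so a standby that is promoted before the sync timer has been handled still runs it.

```cpp
#include <vector>

// Hypothetical model of one service group. Only the method name
// failover_absent_assignment() is taken from the ticket.
struct ServiceGroup {
  bool has_transient_assignment = true;  // e.g. a standby whose active SU is gone
  void failover_absent_assignment() { has_transient_assignment = false; }
};

enum class Role { Active, Standby };

// Hypothetical controller model (not the real amfd control block).
struct Controller {
  Role role = Role::Standby;
  bool recovery_pending = true;  // headless recovery has not run yet
  std::vector<ServiceGroup> sgs;

  // Runs recovery once, and only on the active controller.
  void run_headless_recovery() {
    if (role != Role::Active || !recovery_pending) return;
    for (auto& sg : sgs) sg.failover_absent_assignment();
    recovery_pending = false;
  }

  // Headless sync timer expiry (10 seconds in the ticket's scenario).
  // On a standby this is a no-op, so recovery stays pending.
  void on_sync_timeout() { run_headless_recovery(); }

  // Assumption: a controller promoted while recovery is still pending
  // runs the recovery itself instead of relying on the old active.
  void on_role_change(Role new_role) {
    role = new_role;
    run_headless_recovery();
  }
};
```

In this model, a standby controller ignores on_sync_timeout(), and a later on_role_change(Role::Active) clears the transient assignments, which is the behaviour the ticket expects from SC2.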

Logs and traces are attached.


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #2151 osaf: system in not in correct state during Act controller comming up

2017-01-22 Thread Ramesh
- **status**: assigned --> accepted



---

** [tickets:#2151] osaf: system in not in correct state during Act controller comming up**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Mon Oct 31, 2016 10:54 AM UTC by Nagendra Kumar
**Last Updated:** Tue Nov 22, 2016 09:06 AM UTC
**Owner:** Ramesh


Steps to reproduce:
1. Start two controllers (SC-1 active, SC-2 standby) and two payloads. Configure 50 
components on SC-2 and unlock them. Add a 1-second delay to each component's stop 
script.
2. Stop SC-1 and, after that, stop SC-2.
3. While SC-2 is going down, start SC-1.

Observed behaviour:
Because the components take time to stop during 'opensafd stop' on SC-2, Amfnd 
has not yet exited, but all middleware component assignments have been stopped. 
Only Amfnd and Amfd are alive, with a few more components left to stop.
Meanwhile SC-1 has come up as far as Amfd, and since two Amfds are now active, 
the SC-2 Amfd exits with "Duplicate ACTIVE detected, exiting".
Until then, the services, including Amfd, are in a bad state because they cannot 
tell whether this is a headless state or a failover. That is to be expected, since 
the system is halfway between headless and failover.


Expected behaviour:
In my view, FMS should stop and not proceed if the peer is going down. That is, 
FMS on SC-1 should detect that the peer system is going down, and should allow 
SC-1 to proceed only once all peer services are down, i.e. it has received node 
down (maybe cb->immd_down && cb->immnd_down && cb->amfnd_down && 
cb->amfd_down && cb->fm_down).
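
The proposed check could look roughly like the sketch below. It is an illustration only: the five flag names come from the expression above, while the FmControlBlock struct, peer_node_fully_down(), decide_startup() and the peer_shutting_down parameter are hypothetical and do not describe the existing FM code.

```cpp
// Hypothetical subset of the FM control block. The flag names are the
// ones suggested in the ticket; the struct itself is an illustration.
struct FmControlBlock {
  bool immd_down = false;
  bool immnd_down = false;
  bool amfnd_down = false;
  bool amfd_down = false;
  bool fm_down = false;
};

// True only once every listed peer service has been reported down.
inline bool peer_node_fully_down(const FmControlBlock* cb) {
  return cb->immd_down && cb->immnd_down && cb->amfnd_down &&
         cb->amfd_down && cb->fm_down;
}

enum class StartupDecision { Wait, Proceed };

// Assumption: the caller already knows whether the peer controller is
// in the middle of shutting down (how that is detected is out of scope).
inline StartupDecision decide_startup(const FmControlBlock* cb,
                                      bool peer_shutting_down) {
  if (!peer_shutting_down) return StartupDecision::Proceed;
  return peer_node_fully_down(cb) ? StartupDecision::Proceed
                                  : StartupDecision::Wait;
}
```

With this shape, SC-1 would keep waiting while SC-2 still has Amfnd and Amfd alive, and would continue only after the last of the five services has gone down.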







[tickets] [opensaf:tickets] #2096 AMF : SG in unstable state for fault in component during admin unlock (headless)

2017-01-22 Thread Minh Hon Chau
Hi Srikanth,

This ticket was reported before #1725 part 2 was pushed (on 31/10/2016). 
Could you please repeat the test to see whether it still happens?

Thanks,
Minh


---

** [tickets:#2096] AMF : SG in unstable state for fault in component during admin unlock (headless)**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Wed Oct 05, 2016 08:08 AM UTC by Srikanth R
**Last Updated:** Wed Oct 05, 2016 08:08 AM UTC
**Owner:** nobody
**Attachments:**

- [2096.tgz](https://sourceforge.net/p/opensaf/tickets/2096/attachment/2096.tgz) (4.6 MB; application/x-compressed-tar)


Environment:

Changeset: 7997 5.1.FC
Setup: 5-node setup with 2 controllers, headless feature enabled and PBE disabled.
Application: 2N application with 2 SUs and 4 SIs, without SI-SI dependencies.

Steps performed:

The SG moved to an unstable state after a component fault that occurred while 
an admin unlock operation was in progress on the SG and the headless state was 
invoked. The steps performed were:

-> The application is brought up initially and the SIs are fully assigned.

-> The lock, lock-in, unlock-in and unlock operations are then performed on the 
SG, with a sufficient time gap between them.

-> During the unlock operation on the SG, component 2 of SU1 did not respond to 
the active assignment, and the headless scenario is invoked.

  3148 12:34:05 10/05/2016 NO safApp=safAmfService "Admin op "UNLOCK" 
initiated for 'safSg=TestApp_SG1,safApp=TestApp_TwoN', invocation: 
1683627180042"
  3149 12:34:05 10/05/2016 NO safApp=safAmfService 
"safSg=TestApp_SG1,safApp=TestApp_TwoN AdmState LOCKED => UNLOCKED"

-> After the headless state is reached, component 2 faulted with a CSI set 
callback timeout.

Oct  5 12:34:33 SYSTEST-PLD-1 osafamfnd[2626]: NO 
'safComp=COMP2,safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' faulted 
due to 'csiSetcallbackTimeout' : Recovery is 'componentRestart'


-> After the controllers rejoined the cluster, SU2 did not get any assignments.

-> Further operations on the SG were rejected because the SG was not in the STABLE state.
  3202 12:40:59 10/05/2016 NO safApp=safAmfService "Admin op "LOCK" 
initiated for 'safSg=TestApp_SG1,safApp=TestApp_TwoN', invocation: 
1696512081921"
  3203 12:40:59 10/05/2016 NO safApp=safAmfService "Admin op invocation: 
1696512081921, err: 'SG not in STABLE state 
(safSg=TestApp_SG1,safApp=TestApp_TwoN)'"
  3204 12:40:59 10/05/2016 NO safApp=safAmfService "Admin op done for 
invocation: 1696512081921, result 6"


Logs :

The traces of SC-1 (active controller both before and after headless) 
and PL-3 (which hosts SU1) are attached.




[tickets] [opensaf:tickets] #2179 AMF: Update PR/README for headless feature limitation

2017-01-22 Thread Minh Hon Chau
One point to be added to the documentation, from ticket #1902:
"Node failover, node switchover, and node failfast during the SC absence period 
will be delayed until an SC comes back, because the related configuration 
attributes (saAmfNodeAutoRepair, saAmfNodeFailfastOnTerminationFailure, and 
saAmfNodeFailfastOnInstantiationFailure), which determine whether a node 
reboots, are currently stored at amfd"
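
The limitation quoted above amounts to deferring node-level recoveries while no SC is present. The sketch below is a simplified model under stated assumptions, not amfnd's implementation: the Recovery enum, the NodeDirectorModel struct and all of its members are hypothetical, and it assumes that component-level and SU-level recoveries can proceed locally during headless.

```cpp
#include <deque>
#include <vector>

// Hypothetical recovery kinds; the node-level ones are those named in
// the quoted documentation text.
enum class Recovery {
  ComponentRestart,
  ComponentFailover,
  SuFailover,
  NodeFailover,
  NodeSwitchover,
  NodeFailfast
};

// Node-level recoveries depend on attributes stored at amfd
// (saAmfNodeAutoRepair and the two failfast attributes).
inline bool needs_amfd_config(Recovery r) {
  return r == Recovery::NodeFailover || r == Recovery::NodeSwitchover ||
         r == Recovery::NodeFailfast;
}

// Hypothetical node-director model (not the real amfnd).
struct NodeDirectorModel {
  bool sc_absent = false;
  std::deque<Recovery> deferred;   // waiting for an SC to come back
  std::vector<Recovery> executed;  // carried out, in order

  void request(Recovery r) {
    if (sc_absent && needs_amfd_config(r)) {
      deferred.push_back(r);  // delayed until an SC returns
      return;
    }
    executed.push_back(r);
  }

  // Replays the delayed recoveries in their original order.
  void on_sc_return() {
    sc_absent = false;
    while (!deferred.empty()) {
      executed.push_back(deferred.front());
      deferred.pop_front();
    }
  }
};
```

For example, with sc_absent set, a SuFailover request is executed immediately while a NodeFailfast request stays in the deferred queue until on_sc_return() is called.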


---

** [tickets:#2179] AMF: Update PR/README for headless feature limitation**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Tue Nov 08, 2016 02:23 PM UTC by Minh Hon Chau
**Last Updated:** Mon Jan 23, 2017 01:41 AM UTC
**Owner:** Minh Hon Chau


Update documents suggested by Praveen:
"
1) AMFD sends an assignment message because of an admin operation or fault 
recovery, but the system becomes headless and the message does not reach AMFND.
2) Similarly, AMFND sends an assignment response, but it does not reach AMFD 
because the system becomes headless.
These are cases where AMFD may need to self-trigger the FSM, which is not 
possible today. There are also cases where AMFD could not update IMM for 
attributes such as SG FSM state and SUSI FSM state before the system became 
headless. In these cases too, recovery is not possible after headless.
"




[tickets] [opensaf:tickets] #2179 AMF: Update PR/README for headless feature limitation

2017-01-22 Thread Minh Hon Chau
- **status**: assigned --> accepted



---

** [tickets:#2179] AMF: Update PR/README for headless feature limitation**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Tue Nov 08, 2016 02:23 PM UTC by Minh Hon Chau
**Last Updated:** Tue Nov 08, 2016 02:23 PM UTC
**Owner:** Minh Hon Chau






[tickets] [opensaf:tickets] #2210 AMFD: Loss of RT attribute update before headless

2017-01-22 Thread Minh Hon Chau
- **status**: assigned --> accepted



---

** [tickets:#2210] AMFD: Loss of RT attribute update before headless**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Mon Nov 28, 2016 10:18 PM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 16, 2016 04:13 AM UTC
**Owner:** Minh Hon Chau


An IMM runtime (RT) update of saAmfSIAdminState in AMFD has been seen to be lost 
just before the cluster goes headless. This results in a coredump after headless.

One scenario is:
- Issue amf-admin shutdown on the SI, and delay the CSI quiescing callback
- Stop the SCs, then release the CSI quiescing callback
- Restart the SCs

Observation: saAmfSIAdminState is read as UNLOCKED while the related SUSI was 
QUIESCED, and amfd coredumps as below

~~~
Thread 1 (Thread 0x7fec174a0780 (LWP 493)):
#0  0x004fbfd5 in SG_2N::node_fail_si_oper (this=0x24109d0, su=0x2413440) at sg_2n_fsm.cc:3102
        s_susi = 0x8f5000b
        susi_temp = 0x5fa169
        o_su = 0x2417f98
        __FUNCTION__ = "node_fail_si_oper"
        cb = 0x919240 <_control_block>
#1  0x004fe69c in SG_2N::node_fail (this=0x24109d0, cb=0x919240 <_control_block>, su=0x2413440) at sg_2n_fsm.cc:3469
        a_susi = 0x1
        s_susi = 0x7fffedecd2d0
        o_su = 0x5a50bd
        flag = 2
        __FUNCTION__ = "node_fail"
        su_ha_state = 0
#2  0x00513010 in AVD_SG::failover_absent_assignment (this=0x24109d0) at sg.cc:2273
        su = @0x2411330: 0x2413440
        __for_range = std::vector of length 2, capacity 2 = {0x2413440, 0x24111e0}
        __for_begin =
        __for_end =
        __FUNCTION__ = "failover_absent_assignment"
#3  0x0043be65 in avd_cluster_tmr_init_evh (cb=0x919240 <_control_block>, evt=0x7fec04000df0) at cluster.cc:103
        i_sg = 0x24109d0
        it = {first = "safSg=1,safApp=osaftest", second = }
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = 0x240f9b0
~~~





[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless

2017-01-22 Thread Minh Hon Chau
- **status**: assigned --> fixed
- **assigned_to**: Minh Hon Chau -->  nobody 
- **Comment**:

The component failover problem during headless is caused by the issue in ticket 
#2233, which is under review, so this ticket (#1902) is being closed.
One point to be added to the documentation ticket #2179:
"Node failover, node switchover, and node failfast during the SC absence period 
will be delayed until an SC comes back, because the related configuration 
attributes (saAmfNodeAutoRepair, saAmfNodeFailfastOnTerminationFailure, and 
saAmfNodeFailfastOnInstantiationFailure), which determine whether a node 
reboots, are currently stored at amfd"



---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** fixed
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Mon Dec 19, 2016 04:03 AM UTC
**Owner:** nobody


If a comp/su failover occurs during headless, amfnd escalates to a node reboot. 
When no node failover escalation is configured for the faulty comp/su, this 
unexpectedly impacts other comp/su that are up and running.

2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safComp=AmfDemo2,safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' faulted due 
to 'avaDown' : Recovery is 'suFailover'
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO Terminating components of 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon'(abruptly & unordered)
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State INSTANTIATED => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: Rebooting OpenSAF NodeId = 132111 EE 
Name = , Reason: Can't perform recovery while controllers are down. Recovery is 
node failfast., OwnNodeId = 132111, SupervisionTime = 60
2016-06-29 21:30:07 PL-4 opensaf_reboot: Rebooting local node; timeout=60

This ticket removes the unexpected reboot caused by a failover during headless, 
which is listed as a limitation in the OpenSAF AMF documentation.





[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-01-22 Thread Minh Hon Chau
- **status**: accepted --> review



---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Wed Jan 18, 2017 12:20 AM UTC
**Owner:** Minh Hon Chau


This issue occurs when component failover recovery takes place while a node is being locked.

**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. SI dependencies: 
safSi=AmfDemoTwon2 depends on safSi=AmfDemoTwon1, which depends on safSi=AmfDemoTwon
2- Bring up the 2N app: SU4 has the active assignment, SU5 has the standby assignment
3- Lock PL4
4- Delay the CSI remove callback in the component of SU4 by a few seconds
5- Delay the quiesced CSI set callback in the component of SU5 by a few seconds
6- When SU5 finishes its active assignment, SU4 receives the assignment removal 
from amfd. In the meantime, a component failover report is triggered by the 
component of SU5.
7- SU5 now receives the quiesced CSI set callback from amfd
8- Release both callbacks from steps 4 and 5

**Observation:**
The SG is unstable; the failed SU (SU5) cannot be repaired and no entities can be locked or unlocked.

When amfd processes the quiesced assignment response in the REALIGN state, it 
takes no action:
> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >> 
> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' 
> act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR 
> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in 
> safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >> 
> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR 
> su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] << 
> avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << 
> susi_success_sg_realign: rc:1

In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but it is 
currently IN_SERVICE.
SU5 is first reported as OUT_OF_SERVICE by the su_oper_state[DISABLED] message 
sent as part of the component failover report:
Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2

The failed component is then instantiated again and generates another message, 
su_oper_state[ENABLED], which sets SU5 back to IN_SERVICE:
Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1

SU5 should be OUT_OF_SERVICE while amfd orchestrates component failover 
recovery, which starts with the QUIESCED assignment of SU5. If re-instantiation 
of the failed component completes sooner, as in this test, the SG FSM follows an 
unexpected sequence.
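
The ordering problem can be replayed with a small model. This is a sketch under stated assumptions and not the fix under review: SuModel, its members and the hold_during_failover switch are hypothetical. The OperState values mirror state:1 (ENABLED) and state:2 (DISABLED) from the avd_su_oper_state_evh traces above.

```cpp
// Values follow the trace: state:1 is ENABLED, state:2 is DISABLED.
enum class OperState { Enabled = 1, Disabled = 2 };
enum class Readiness { InService, OutOfService };

// Hypothetical model of SU5 as seen by amfd.
struct SuModel {
  Readiness readiness = Readiness::InService;
  bool failover_in_progress = false;  // hypothetical latch
  bool enable_deferred = false;       // ENABLED seen while latched

  // hold_during_failover = false reproduces the reported behaviour;
  // true shows one possible way to keep the SU out of service.
  void on_oper_state(OperState s, bool hold_during_failover) {
    if (s == OperState::Disabled) {
      readiness = Readiness::OutOfService;
      failover_in_progress = true;
      enable_deferred = false;
      return;
    }
    if (hold_during_failover && failover_in_progress) {
      enable_deferred = true;  // remember it, apply after recovery
      return;
    }
    readiness = Readiness::InService;
  }

  // Called when component failover recovery has finished.
  void on_failover_complete() {
    failover_in_progress = false;
    if (enable_deferred) {
      readiness = Readiness::InService;
      enable_deferred = false;
    }
  }
};
```

Feeding Disabled and then Enabled with hold_during_failover set to false leaves the SU IN_SERVICE before the quiesced response is processed, which matches the trace. With the latch enabled, the SU stays OUT_OF_SERVICE until on_failover_complete() is called.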


