[tickets] [opensaf:tickets] #2025 Cluster reset happened during headless as CLMNA faulted due to healthCheckcallbackTimeout

2016-09-20 Thread Anders Widell
- **Milestone**: 4.7.2 --> 5.0.2



---

** [tickets:#2025] Cluster reset happened during headless as CLMNA faulted due 
to healthCheckcallbackTimeout**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Mon Sep 12, 2016 07:32 AM UTC by Ritu Raj
**Last Updated:** Mon Sep 12, 2016 08:18 AM UTC
**Owner:** nobody
**Attachments:**

- 
[PL-4.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2025/attachment/PL-4.tar.bz2)
 (38.4 kB; application/x-bzip)
- 
[PL-5.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2025/attachment/PL-5.tar.bz2)
 (59.0 kB; application/x-bzip)
- 
[SC-1.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2025/attachment/SC-1.tar.bz2)
 (160.7 kB; application/x-bzip)
- 
[SC-2.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2025/attachment/SC-2.tar.bz2)
 (107.4 kB; application/x-bzip)
- 
[SC-3.tar.bz2](https://sourceforge.net/p/opensaf/tickets/2025/attachment/SC-3.tar.bz2)
 (109.8 kB; application/x-bzip)


# Environment details
OS: SUSE 64-bit
Changeset: 7997 (5.1.FC)
Setup: 5 nodes (3 controllers and 2 payloads) with the headless feature enabled &
1 PBE with 10K objects

# Summary
Cluster reset happened during headless as CLMNA faulted due to
healthCheckcallbackTimeout.

# Steps followed & observed behaviour
1. Invoked headless by killing the Active, followed by the Standby and the Spare
controller, maintaining a gap of 6 seconds between controller reboots.

2. After a couple of failovers, CLMNA faulted on PL-4 and PL-5 due to
healthCheckcallbackTimeout, and a cluster reset happened (syslog excerpt below; a
minimal sketch of the healthcheck flow involved follows the excerpt).

Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover probation timer 
started (timeout: 12000 ns)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO Performing failover of 
'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' (SU failover count: 1)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 
'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' recovery action escalated 
from 'componentFailover' to 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 
'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: ER 
safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: Rebooting OpenSAF NodeId = 
132111 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 132111, SupervisionTime = 60
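
For reference, 'healthCheckcallbackTimeout' means that AMF invoked the component's
healthcheck callback (AMF-invoked variant) and did not get the corresponding
saAmfResponse() back within the configured maximum duration, so it faulted the
component and applied the recommended recovery, which the log shows escalating from
componentFailover to suFailover and finally node failfast. The sketch below shows the
minimal shape of such a healthcheck in a SAF AMF component; the healthcheck key, AMF
version and recommended recovery are illustrative assumptions, not taken from the
CLMNA sources:

```c
#include <poll.h>
#include <saAis.h>
#include <saAmf.h>

static SaAmfHandleT amf_hdl;

/* AMF-invoked healthcheck: osafamfnd calls this periodically and expects the
 * saAmfResponse() below within the configured maximum duration. If the
 * process cannot dispatch and answer in time, AMF records
 * healthCheckcallbackTimeout and applies the recommended (possibly
 * escalated) recovery, as seen in the syslog above. */
static void healthcheck_cb(SaInvocationT inv, const SaNameT *comp,
                           SaAmfHealthcheckKeyT *key)
{
    (void)comp;
    (void)key;
    saAmfResponse(amf_hdl, inv, SA_AIS_OK);
}

int main(void)
{
    SaVersionT ver = {'B', 1, 1};
    SaAmfCallbacksT cbs = {0};
    SaNameT comp_name;
    SaSelectionObjectT sel_obj;
    /* Hypothetical healthcheck key, for illustration only. */
    SaAmfHealthcheckKeyT hc_key = {.key = "MyHC", .keyLen = 4};

    cbs.saAmfHealthcheckCallback = healthcheck_cb;
    if (saAmfInitialize(&amf_hdl, &cbs, &ver) != SA_AIS_OK)
        return 1;
    saAmfComponentNameGet(amf_hdl, &comp_name);
    saAmfComponentRegister(amf_hdl, &comp_name, NULL);

    /* Ask AMF to start driving the healthcheck (AMF-invoked variant). */
    saAmfHealthcheckStart(amf_hdl, &comp_name, &hc_key,
                          SA_AMF_HEALTHCHECK_AMF_INVOKED,
                          SA_AMF_COMPONENT_FAILOVER);

    saAmfSelectionObjectGet(amf_hdl, &sel_obj);
    for (;;) {
        struct pollfd fds = {.fd = (int)sel_obj, .events = POLLIN};
        poll(&fds, 1, -1);
        /* If the main loop blocks elsewhere and never reaches this
         * dispatch, the callback above is never run and the
         * healthcheck times out. */
        saAmfDispatch(amf_hdl, SA_DISPATCH_ALL);
    }
}
```

The fault in the log above therefore indicates that CLMNA on PL-4/PL-5 did not answer
its healthcheck in time while the controllers were down; clmna traces (see note 3
below) would be needed to see where it was stuck.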


Notes:
1. There is a time gap between the systems.
With respect to PL-4 (Sep 10 17:52:46 SCALE_SLOT-74), the corresponding times on the
other systems are:
Sep 27 18:46:53: SC-1
Oct 03 10:02:54: SC-2
Oct 03 10:26:44: SC-3
Sep 10 17:54:46: PL-5
No syslog was logged on the controllers during this time.

2. Syslogs of SC-1, SC-2, SC-3, PL-4 and PL-5 are attached.
3. clmnd traces were not enabled.


---



[tickets] [opensaf:tickets] #2025 Cluster reset happened during headless as CLMNA faulted due to healthCheckcallbackTimeout

2016-09-12 Thread Ritu Raj
- **summary**: Cluster reset happened during headless as CLMNA faulted due to 
csiSetcallbackTimeout --> Cluster reset happened during headless as CLMNA 
faulted due to healthCheckcallbackTimeout
- Description has changed:

Diff:



--- old
+++ new
@@ -4,13 +4,13 @@
 Setup : 5 nodes ( 3 controllers and 2 payloads with headless feature enabled & 
1PBE with 10K objects
 
 #Summary :
-Cluster reset happend  during headless as CLMNA  faulted due to 
csiSetcallbackTimeout 
+Cluster reset happend  during headless as CLMNA  faulted due to 
healthCheckcallbackTimeout
 
 #Steps followed & Observed behaviour
 1. Invoked headless by killing Active followed by Standby and Spare Controller,
 maintaining gap of 6 sec between controller reboot
 
-2. After couple of failover, CLMNA faulted on PL-4 and PL-5 due to 
csiSetcallbackTimeout, and cluster reset happened.
+2. After couple of failover, CLMNA faulted on PL-4 and PL-5 due to 
healthCheckcallbackTimeout, and cluster reset happened.
 
 Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover probation timer 
started (timeout: 12000 ns)
 Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO Performing failover of 
'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' (SU failover count: 1)






---

** [tickets:#2025] Cluster reset happened during headless as CLMNA faulted due 
to healthCheckcallbackTimeout**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Mon Sep 12, 2016 07:32 AM UTC by Ritu Raj
**Last Updated:** Mon Sep 12, 2016 07:32 AM UTC
**Owner:** nobody
