Hans,
       The log string "faulted due to 6 -rcvr=9" translates as follows: 6 is
AVND_ERR_SRC_CBK_HC_TIMEOUT (the AMF health check callback timed out) and 9
is AVSV_ERR_RCVR_SU_FAILOVER. It says that because the component did not
respond to the health check in time, the SU of that component failed. This
will only happen when the system is heavily loaded and MAS and other
components do not get enough time to respond within 2 sec (the health check
value of the MAS and MQD components), especially in a PC environment. You
need to fine-tune these components in the BOM file according to your
configuration. It may happen during failover.
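
If you want to decode such strings while going through the logs, a minimal
sketch along these lines may help. The helper names (decode_fault, FAULT_RE,
ERR_SRC, RECOVERY) are hypothetical and not part of OpenSAF, the pattern is
modelled only on the single log line in this thread, and the tables contain
just the two codes explained above; the other enum values are not guessed.

```python
import re

# Only the two codes explained in this mail are listed. The OpenSAF enums
# contain more values, which are deliberately left out rather than guessed.
ERR_SRC = {6: "AVND_ERR_SRC_CBK_HC_TIMEOUT"}
RECOVERY = {9: "AVSV_ERR_RCVR_SU_FAILOVER"}

# Hypothetical pattern, modelled on the one log line quoted in this thread.
FAULT_RE = re.compile(r"faulted due to\s+(\d+)\s+-rcvr=(\d+)")


def decode_fault(line):
    """Return (error source, recovery) names, or None if no fault record is found."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    src, rcvr = int(match.group(1)), int(match.group(2))
    return (
        ERR_SRC.get(src, f"unknown error source {src}"),
        RECOVERY.get(rcvr, f"unknown recovery {rcvr}"),
    )


if __name__ == "__main__":
    sample = ("safComp=CompT_MQD,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 "
              "faulted due to 6 -rcvr=9")
    print(decode_fault(sample))
```

Codes that are not in the tables are reported as unknown instead of being
mapped to a guessed name.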

There shouldn't be any connection between DRBD and OpenSAF failover
timings. Actually, OpenSAF has control over DRBD for failover using
PDRBD.

In this case DRBD had nothing to do with it.
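
The ordering of the three syslog entries you quoted can be checked with a
short sketch like the one below. The timestamps are copied from your
excerpt; the year is not in the log and is assumed to be 2007 from the mail
date, the message texts are abbreviated, and the function name
seconds_since_first is hypothetical. Month parsing with %b assumes an
English or C locale.

```python
from datetime import datetime

# Timestamps copied from the syslog excerpt quoted in this thread.
# Message texts are shortened for readability.
EVENTS = [
    ("Oct 31 10:13:11", "TIPC: Lost contact with <1.1.47>"),
    ("Oct 31 10:13:13", "AvSv: CompT_MQD faulted due to 6 -rcvr=9"),
    ("Oct 31 10:13:20", "drbd0: PingAck did not arrive in time"),
]


def seconds_since_first(events, year=2007):
    """Return each event's offset in seconds from the first event."""
    # Syslog omits the year, so an assumed year is prepended before parsing.
    parsed = [
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp, _ in events
    ]
    return [(t - parsed[0]).total_seconds() for t in parsed]


if __name__ == "__main__":
    for offset, (_, message) in zip(seconds_since_first(EVENTS), EVENTS):
        print(f"+{offset:4.0f}s  {message}")
```

The MQD fault is logged 2 seconds after TIPC loses contact, which would be
consistent with the 2 sec health check value, and 7 seconds before the DRBD
message appears.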

Please send all the logs, collected using the attached script.
I would also like to know how you are testing failover.

Failover is decided by AvSv only, which maintains a heartbeat with its
peer AVD over TIPC. The RDF (the reference implementation in OpenSAF), on
the other hand, maintains a TCP/IP connection with its peer RDE to detect
the blade status.
For details of what it does in case of failover, refer to section B.1 of
6806800D52A_OpenSAF_Platform_Control_Services_PR.pdf.


Regards
-Nagendra


> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> Sent: Wednesday, October 31, 2007 3:15 PM
> To: [email protected]
> Subject: [Users] Help with controller fail-over issues?
> 
> I am testing controller fail-over. It does not work very well. 
> I get this in the syslog of the standby:
> 
> Oct 31 10:13:11 SC_2_1 kernel: TIPC: Lost contact with <1.1.47>
> Oct 31 10:13:13 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot
> -safComp=CompT_MQD,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 faulted due to 6
> -rcvr=9
> Oct 31 10:13:20 SC_2_1 kernel: drbd0: PingAck did not arrive in time.
> 
> As you can see, I get the MQD error at a time when I have no 
> disk partition since DRBD has not performed a fail-over yet. 
> And the result is that the fail-over does not work, the 
> active reboots and becomes active again. The standby reboots 
> and becomes standby again.
> 
> So what does the part "faulted due to 6 -rcvr=9" mean?
> 
> There is nothing in the MQD log files. No core dump, nothing. 
> I have seen similar problems in other processes such as MAS, 
> DTS when testing fail-over so I guess there is a general problem.
> 
> Could the problem be that OpenSAF fails-over before DRBD fails-over?
> 
> Thanks,
> Hans
> _______________________________________________
> Users mailing list
> [email protected]
> http://list.opensaf.org/maillist/listinfo/users
> 

Attachment: collect_logs_controller.sh
