Yep. I extended the healthcheck period and duration for MQD a lot (20s
and 10s), and fail-over works.

But that is not a good solution. So which MQD thread is executing the
healthcheck dispatch, and does it do I/O in the same thread?

If my theory is right, no blocking I/O should be performed in that thread.
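
To illustrate the theory, here is a minimal sketch in Python. It is not MQD code: `slow_write`, `healthcheck_latency` and `HEALTHCHECK_TIMEOUT` are hypothetical names, and the sleep only stands in for a write to a replicated partition that is temporarily unavailable. It measures how long a healthcheck request waits when the dispatch thread does the I/O itself, compared with handing the I/O to a separate worker thread.

```python
import queue
import threading
import time

HEALTHCHECK_TIMEOUT = 0.2  # seconds; hypothetical value for this sketch


def slow_write(delay):
    """Stand-in for a blocking write to an unavailable replicated partition."""
    time.sleep(delay)


def healthcheck_latency(offload_io, io_delay):
    """Return the seconds between a healthcheck request and its answer."""
    requests = queue.Queue()
    answered = threading.Event()

    def dispatch_loop():
        if offload_io:
            # Hand the blocking write to a worker so dispatch stays responsive.
            threading.Thread(target=slow_write, args=(io_delay,),
                             daemon=True).start()
        else:
            # Blocking write in the dispatch thread delays the healthcheck.
            slow_write(io_delay)
        requests.get()   # pick up the pending healthcheck request
        answered.set()   # "respond" to the healthcheck

    start = time.monotonic()
    requests.put("healthcheck")
    threading.Thread(target=dispatch_loop, daemon=True).start()
    answered.wait()
    return time.monotonic() - start


if __name__ == "__main__":
    for offload in (False, True):
        latency = healthcheck_latency(offload_io=offload, io_delay=0.3)
        verdict = "in time" if latency < HEALTHCHECK_TIMEOUT else "TIMEOUT"
        print(f"offload_io={offload}: {latency:.3f}s -> {verdict}")
```

If the real dispatch thread behaves like the `offload_io=False` case, a longer healthcheck timeout only hides the problem for as long as the I/O stall stays shorter than the timeout.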

Thanks,
Hans

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> Sent: den 31 oktober 2007 12:14
> To: Kumar Nagendra-G20235; [email protected]
> Subject: Re: [Users] Help with controller fail-over issues?
> 
> Thanks, see below.
> 
> /Hans
> 
> > -----Original Message-----
> > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]
> 
> > The log string "faulted due to 6 -rcvr=9" translates as follows:
> > 6 is AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback times
> > out) and 9 is AVSV_ERR_RCVR_SU_FAILOVER. It says that since the
> > component was not able to respond to the health check, the SU of
> > that component failed. This will only happen when the system is
> > heavily loaded and MAS and other components are not getting time to
> > respond within 2 sec (the health check value of the MAS and MQD
> > components), especially in a PC environment. You need to fine-tune
> > these components in the BOM file as per your configuration. It may
> > happen during failover.
> 
> But what if the component is "hanging" in an I/O operation?
> An operation that will not succeed before the healthcheck
> timeout since the replicated partition is not available.
> 
> I could try to increase the healthcheck timeout for MQD a lot 
> and see what happens.
> 
> > There shouldn't be any connection between DRBD and OpenSAF
> > failover timings.
> 
> But the replicated partition is unavailable for some time!
> OpenSAF configuration (pssv_store) and logs are stored there.
> A lot of writing to the replicated partition will take place
> during fail-over, I assume. I think there is a connection
> between DRBD and OpenSAF fail-over.
> 
> > Actually, openSAF has control over DRBD for failover using PDRBD.
> 
> Not currently in our setup.
> 
> > Send all the logs using the script attached.
> 
> I have no interesting logs to send. The DTS logs are empty 
> from the time of failover (since the replicated partition is 
> unavailable?).
> 
> > I would like to know what is your approach to test failover.
> 
> I just did 'pkill cpd' on the active controller.
> 
> _______________________________________________
> Users mailing list
> [email protected]
> http://list.opensaf.org/maillist/listinfo/users
> 
