Could the problem be that, on my system, the nidlog & stdout
directories are on the replicated partition?
I could change that and give it a try.
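
A minimal sketch of that theory, for illustration only. It is not MQD or
OpenSAF code: the function names are hypothetical, the sleeps stand in for
a write that hangs while the replicated partition is unavailable, and the
timings are scaled down from the 2 s healthcheck value mentioned in the
quoted mail. It only shows that a write done in the thread that answers
the healthcheck delays the answer, while a write handed to a worker thread
does not.

```python
import threading
import time

HC_TIMEOUT = 0.2  # assumed stand-in for the 2 s healthcheck timeout
IO_STALL = 0.5    # assumed stand-in for a write that blocks on the partition


def blocking_write():
    """Simulate a log write that hangs on the unavailable partition."""
    time.sleep(IO_STALL)


def healthcheck_answered_in_time(offload_io):
    """Return True if the healthcheck answer would be sent within HC_TIMEOUT."""
    start = time.monotonic()
    if offload_io:
        # The dispatch thread hands the write to a worker and stays responsive.
        worker = threading.Thread(target=blocking_write, daemon=True)
        worker.start()
    else:
        # The dispatch thread performs the write itself and stalls.
        blocking_write()
    # At this point the component would send its healthcheck response.
    elapsed = time.monotonic() - start
    return elapsed < HC_TIMEOUT


if __name__ == "__main__":
    print("I/O in dispatch thread:", healthcheck_answered_in_time(offload_io=False))
    print("I/O in worker thread:  ", healthcheck_answered_in_time(offload_io=True))
```

If the real component behaves like the first case, moving nidlog & stdout
off the replicated partition should have the same effect as the longer
healthcheck timeout.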

/Hans

> -----Original Message-----
> From: Hans Feldt 
> Sent: 31 October 2007 12:39
> To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> Subject: RE: [Users] Help with controller fail-over issues?
> 
> Yep. I extended the healthcheck period and duration for MQD a 
> lot (20s and 10s), and fail-over works.
> 
> But that is not a good solution, so which MQD thread executes the
> healthcheck dispatch, and does it do I/O in the same thread?
> 
> If my theory is right, no I/O can be performed in that thread.
> 
> Thanks,
> Hans
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED]] On Behalf Of Hans Feldt
> > Sent: 31 October 2007 12:14
> > To: Kumar Nagendra-G20235; [email protected]
> > Subject: Re: [Users] Help with controller fail-over issues?
> > 
> > Thanks, see below.
> > 
> > /Hans
> > 
> > > -----Original Message-----
> > > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]]
> > 
> > > The log string "faulted due to 6 -rcvr=9" translates to:
> > > 6 being AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback times
> > > out), 9 being AVSV_ERR_RCVR_SU_FAILOVER. It says that since the
> > > component was not able to respond to the health check, the SU
> > > of that component failed. This will only happen when the system is
> > > heavily loaded and MAS and other components are not getting time to
> > > respond in 2 sec (health check value of MAS, MQD components), especially
> > > in a PC environment. You need to fine-tune these components in the BOM
> > > file as per your configuration. It may happen during failover.
> > 
> > But what if the component is "hanging" in an I/O operation?
> > An operation that will not succeed before the healthcheck timeout, since
> > the replicated partition is not available.
> > 
> > I could try to increase the healthcheck timeout for MQD a lot and see
> > what happens.
> > 
> > > There shouldn't be any connection between DRBD and OpenSAF failover
> > > timings.
> > 
> > But the replicated partition is unavailable for some time!
> > OpenSAF configuration (pssv_store) and logs are stored there.
> > A lot of writing to the replicated partition will take place during
> > fail-over, I assume. I think there is a connection between DRBD and
> > OpenSAF fail-over.
> > 
> > > Actually, OpenSAF has control over DRBD for failover using PDRBD.
> > 
> > Not currently in our setup.
> > 
> > > Send all the logs using the script attached.
> > 
> > I have no interesting logs to send. The DTS logs are empty from the 
> > time of failover (since the replicated partition is unavailable?).
> > 
> > > I would like to know what your approach to testing failover is.
> > 
> > I just did 'pkill cpd' on the active controller.
> > 
> > _______________________________________________
> > Users mailing list
> > [email protected]
> > http://list.opensaf.org/maillist/listinfo/users
> > 
