Could the problem be that on my system the nidlog & stdout directories are on the replicated partition? I could change that and give it a try.
/Hans

> -----Original Message-----
> From: Hans Feldt
> Sent: 31 October 2007 12:39
> To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> Subject: RE: [Users] Help with controller fail-over issues?
>
> Yep. I extended the healthcheck period and duration for MQD a
> lot (20s and 10s), and fail-over works.
>
> But that is not good, so which MQD thread is executing the
> healthcheck dispatch, and does it do I/O in the same thread?
>
> If my theory is right, no I/O can be performed in that thread.
>
> Thanks,
> Hans
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> > Sent: 31 October 2007 12:14
> > To: Kumar Nagendra-G20235; [email protected]
> > Subject: Re: [Users] Help with controller fail-over issues?
> >
> > Thanks, see below.
> >
> > /Hans
> >
> > > -----Original Message-----
> > > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]
> > >
> > > The log string "faulted due to 6 -rcvr=9" translates to:
> > > 6 being AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback
> > > times out), 9 being AVSV_ERR_RCVR_SU_FAILOVER. It says that since
> > > the component was not able to respond to the health check, the SU
> > > of that component failed. This will only happen when the system is
> > > heavily loaded and MAS and other components are not getting time to
> > > respond in 2 sec (health check value of MAS, MQD components),
> > > especially in a PC environment. You need to fine tune these
> > > components in the BOM file as per your configuration. It may
> > > happen during failover.
> >
> > But what if the component is "hanging" in an I/O operation?
> > An operation that will not succeed before the healthcheck timeout,
> > since the replicated partition is not available.
> >
> > I could try to increase the healthcheck timeout for MQD a lot and
> > see what happens.
> >
> > > There shouldn't be any connection between DRBD and OpenSAF
> > > failover timings.
> >
> > But the replicated partition is unavailable for some time!
> > OpenSAF configuration (pssv_store) and logs are stored there.
> > A lot of writing to the replicated partition will take place during
> > fail-over, I assume. I think there is a connection between DRBD and
> > OpenSAF fail-over.
> >
> > > Actually, OpenSAF has control over DRBD for failover using PDRBD.
> >
> > Not currently in our setup.
> >
> > > Send all the logs using the script attached.
> >
> > I have no interesting logs to send. The DTS logs are empty from the
> > time of failover (since the replicated partition is unavailable?).
> >
> > > I would like to know what is your approach to test failover.
> >
> > I just did 'pkill cpd' on the active controller.
> >
> > _______________________________________________
> > Users mailing list
> > [email protected]
> > http://list.opensaf.org/maillist/listinfo/users
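
To make the theory in this thread concrete, here is a minimal Python sketch of the suspected failure mode. It is not OpenSAF code and uses no OpenSAF API: run_component, slow_write and the "write"/"healthcheck" work items are hypothetical names, and time.sleep stands in for a write to a replicated partition that is temporarily unavailable. It only shows that a healthcheck answer can miss its deadline when blocking I/O runs in the same thread that dispatches the healthcheck callback, and that moving the I/O to another thread avoids this.

```python
import queue
import threading
import time


def run_component(io_delay, hc_timeout, offload_io):
    """Simulate one dispatch thread that handles a write and then a healthcheck.

    Returns True if the healthcheck was answered within hc_timeout seconds.
    All names here are hypothetical; this is not the MQD implementation.
    """
    requests = queue.Queue()
    answered = threading.Event()

    def slow_write():
        # Stand-in for I/O that blocks while the partition is unavailable.
        time.sleep(io_delay)

    def dispatch_loop():
        while True:
            item = requests.get()
            if item == "write":
                if offload_io:
                    # Hand the blocking I/O to a separate thread.
                    threading.Thread(target=slow_write, daemon=True).start()
                else:
                    # Blocking I/O in the dispatch thread itself.
                    slow_write()
            elif item == "healthcheck":
                answered.set()
                return

    threading.Thread(target=dispatch_loop, daemon=True).start()
    requests.put("write")
    requests.put("healthcheck")
    return answered.wait(timeout=hc_timeout)


if __name__ == "__main__":
    print("same thread:", run_component(0.5, 0.1, offload_io=False))
    print("offloaded:  ", run_component(0.5, 0.1, offload_io=True))
```

In this sketch, raising hc_timeout above io_delay also lets the first case pass, which mirrors the observation that fail-over works once the MQD healthcheck values are increased.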
