Hi Stefan,

Thanks for your suggestions! In fact, I had a similar guess. I ran the same test with the JVM socket timeout increased from 10 seconds to 30 seconds, and the problem no longer seems to occur (I will do further load testing to confirm).
However, my major concern is that if such a case does happen, we expect controller 1 to be disabled and all traffic to be directed to controller 2. Right now controller 1 still accepts connections and only throws errors when executing a statement, which seems strange.

Regards
Francis

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Stefan Lischke
Sent: Wednesday, December 19, 2007 6:51 PM
To: Sequoia general mailing list
Subject: Re: [Sequoia] Controller can be connected but failed executing statement

Hi Francis,

*sarcasm* This looks like normal Sequoia behavior ;-) *sarcasm*

But have you looked at the load on the controller machines? If the load is too high and the Appia timeout is too short, you get this "failed in group" behavior (Problem 1).

But Problem 2 is really bad; I will try to explain it. If Controller 2 says Controller 1 failed, it still accepts write requests, and those write requests are not distributed to the failed controller. So the DBs are not in sync anymore. Now when the load gets lower, Controller 1 automatically rejoins (it does not tell you about this) and the following writes are distributed again. Then you have a lot of problems, like yours.

I think this problem is by design, because there is no split-brain scenario. If one controller was lost and even a single write was not distributed to it, the failed controller should NEVER rejoin the cluster. The controllers should work in a split-brain scenario.

This scenario does not only apply to high loads; it is also a big problem when there is a very short network split (you should never use an internet connection between two nodes; always use a cross-linked network connection, or a dedicated network with its own switch).

I tried to talk about this problem some time ago and I opened some bugs, but we never discussed this big issue in this community.
;-(

If it's the load on your system, try changing the following values in appia.xml:

  <chsession name="suspectl">
    <parameter name="suspect_sweep">10000</parameter>
    <parameter name="suspect_time">30000</parameter>
  </chsession>

hth
stefan

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia
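To check or adjust the two suspect values quoted in the message without editing the file by hand, something like the following could work. It is only a minimal sketch: it assumes the `chsession` element looks exactly like the fragment in the message, the helper names `read_suspect_params` and `set_suspect_param` are hypothetical, and nothing here is part of Sequoia or Appia. The thread does not state the unit of the values, so the code treats them as plain integers.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the message; a real appia.xml would contain much more,
# with this element nested somewhere inside it (assumption).
APPIA_FRAGMENT = """
<chsession name="suspectl">
  <parameter name="suspect_sweep">10000</parameter>
  <parameter name="suspect_time">30000</parameter>
</chsession>
"""


def read_suspect_params(xml_text):
    """Return {parameter name: int value} for the 'suspectl' chsession."""
    root = ET.fromstring(xml_text.strip())
    # Element.iter() also yields the root itself, so this works both for
    # the bare fragment and for a document that nests the element.
    for session in root.iter("chsession"):
        if session.get("name") == "suspectl":
            return {p.get("name"): int(p.text) for p in session.findall("parameter")}
    return {}


def set_suspect_param(xml_text, name, value):
    """Return the XML text with one 'suspectl' parameter set to value."""
    root = ET.fromstring(xml_text.strip())
    for session in root.iter("chsession"):
        if session.get("name") != "suspectl":
            continue
        for param in session.findall("parameter"):
            if param.get("name") == name:
                param.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(read_suspect_params(APPIA_FRAGMENT))
    print(set_suspect_param(APPIA_FRAGMENT, "suspect_time", 60000))
```

The sketch returns the modified XML as a string rather than writing a file, so the result can be reviewed before anything on the controller is changed.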
